├── LICENSE ├── README.md ├── examples └── nyc.veml └── schema.json /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 [ Embedditor ] 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # VEML 2 | Vector Embedding Markup Language - markup language designed specifically for annotating and structuring data related to vector embeddings. This could be used to represent, exchange, or store vector embeddings in a structured way that's easily readable by both humans and machines. 3 | 4 |

5 | Embedditor • 6 | Discord • 7 | Try demo on IngestAI 8 |

9 |
10 | ## How the idea was born
11 | While running the [IngestAI](https://ingestai.io) project since February 2023, we have handled many issues reported by thousands of our users. Almost all of them concerned the dataset structure and the ability to influence vector search results.
12 |
13 | # Join Our Community
14 |
15 |
16 |
17 |
18 |
19 | [![Stargazers repo roster for @embedditor/veml](https://reporoster.com/stars/embedditor/veml)](https://github.com/embedditor/veml/stargazers)
20 |
21 | ## VEML Markup
22 | A VEML file is saved with the .veml extension and has the following structure:
23 | 1. "html": an array holding the pure HTML code of each chunk, used to present it to users.
24 | 2. "tokens": an array of the plain-text parts that will be embedded.
25 | 3. "vectors": an array of embeddings, one for every chunk (can be empty if the chunk was disabled).
26 | 4. "meta": an array of meta information for every chunk; it consists of strings with the structure key:value, e.g. link:https://wikipedia.com
27 |
28 | You can see the structure of a VEML file in schema.json in this repository, and examples in the examples folder.
29 |
30 | # Benefits
31 | The implementation of VEML offers numerous advantages, such as:
32 |
33 | 1. Standardization: VEML provides a standardized format for pre-processing and editing vector embeddings.
34 | 2. Interoperability: It ensures better interoperability among different applications and systems that utilize vector embeddings.
35 | 3. Extensibility: Just like XML, VEML has the potential to be extensible, allowing users to add new tags and attributes to represent additional properties or metadata associated with the vector embeddings.
36 | 4. Machine Readability: A well-defined markup language is also easily parseable by machines, ensuring efficient processing and manipulation of vector embeddings by various software applications.
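As a minimal sketch of how the four arrays described under VEML Markup might be read in practice, the Python below parses VEML text, checks that the required fields are arrays, and splits a meta string on its first colon. The names `load_veml` and `parse_meta` are hypothetical and not part of Embedditor. Stripping a leading `/* ... */` comment and requiring `html`, `tokens`, and `vectors` to have equal length are assumptions drawn from the bundled example, not rules stated by the format.

```python
import json
import re


def load_veml(text: str) -> dict:
    """Parse VEML text into a dict, tolerating one leading /* ... */ comment."""
    # The bundled example opens with a block comment, which plain JSON forbids,
    # so drop it before parsing (assumption: a comment appears only at the top).
    body = re.sub(r"^\s*/\*.*?\*/", "", text, count=1, flags=re.DOTALL)
    doc = json.loads(body)
    for key in ("html", "tokens", "vectors", "meta"):
        if not isinstance(doc.get(key), list):
            raise ValueError(f"missing or non-array field: {key}")
    # Assumption: html, tokens and vectors each hold one entry per chunk.
    if not (len(doc["html"]) == len(doc["tokens"]) == len(doc["vectors"])):
        raise ValueError("html, tokens and vectors differ in length")
    return doc


def parse_meta(entry: str) -> tuple[str, str]:
    """Split a "key:value" string on the first colon only, so URLs stay whole."""
    key, sep, value = entry.partition(":")
    if not sep:
        raise ValueError(f"meta entry has no colon: {entry!r}")
    return key, value


# Illustrative data only; the vector values are made up for this sketch.
sample = """/* Structure in VEML */
{"html": ["<p>New York</p>"], "tokens": ["new york"],
 "vectors": [[0.1, -0.2]],
 "meta": ["url:https://en.wikipedia.org/wiki/New_York_City"]}"""

doc = load_veml(sample)
print(parse_meta(doc["meta"][0]))
# prints ('url', 'https://en.wikipedia.org/wiki/New_York_City')
```

Splitting on the first colon matters because most meta values in the examples are URLs, which contain colons of their own.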
37 |
38 | ## VEML Editor
39 | We understand that developing a markup language without an app that supports it is not a good idea, so we created an open-source tool called [Embedditor](https://embedditor.ai). You can get it from GitHub or Docker and run it on your local server to start working with VEML files and the editor.
40 | ## What VEML Markup looks like
41 |
42 | ![1](https://embedditor.ai/images/embedditor_ui_05.png)
--------------------------------------------------------------------------------
/examples/nyc.veml:
--------------------------------------------------------------------------------
1 | /* Structure in VEML: Vector Embedding Markup Language */
2 | {
3 | "html":["New York, often called New York City[a] or NYC, is the most populous city in the United States. With a 2020 population of 8,804,190 distributed over 300.46 square miles (778.2 km2), New York City is the most densely populated major city in the United States and more than twice as populous as Los Angeles, the nation's second-largest city. The city also has a population that is larger than that of 38 individual U.S. states. New York City is located at the southern tip of New York State. The city constitutes the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the U.S. by both population and urban area. With over 20.1 million people in its metropolitan statistical area and 23.5 million in its combined statistical area as of 2020, New York is one of the world's most populous megacities, and over 58 million people live within 250 mi (400 km) of the city.[10] New York City is a global cultural, financial, high-tech, entertainment, glamor,[11] and media center with a significant influence on commerce, health care and life sciences,[12] research, technology, education, politics, tourism, dining, art, fashion, and sports. 
Home to the headquarters of the United Nations, New York is an important center for international diplomacy,[13][14] and is sometimes described as the capital of the world.[15][16]","Situated on one of the world's largest natural harbors and extending into the Atlantic Ocean, New York City comprises five boroughs, each of which is coextensive with a respective county of the state of New York. The five boroughs, which were created in 1898 when local governments were consolidated into a single municipal entity, are: Brooklyn (in Kings County), Queens (in Queens County), Manhattan (in New York County), The Bronx (in Bronx County), and Staten Island (in Richmond County).[17]"], 4 | "tokens":["new york often called nyc is the most populous city in the united states","on one of the worlds largest natural harbors and extending into the atlantic ocean"], 5 | "vectors":[[-0.014112884,0.020366829,0.0059737624,0.006458028,-0.026703792,-0.0039986507,0.04256003,-0.022843502,0.013801571,-0.0044137356],[-0.014112884,0.020366829,0.0059737624,0.006458028,-0.026703792,-0.0039986507,0.04256003,-0.022843502,0.013801571,-0.0044137356]], 6 | "meta":["url:https://en.wikipedia.org/wiki/New_York_City","image:https://upload.wikimedia.org/wikipedia/commons/thumb/7/7a/View_of_Empire_State_Building_from_Rockefeller_Center_New_York_City.jpg"] 7 | } -------------------------------------------------------------------------------- /schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-04/schema#", 3 | "type": "object", 4 | "properties": { 5 | "html": { 6 | "type": "array", 7 | "items": [ 8 | { 9 | "type": "string" 10 | }, 11 | { 12 | "type": "string" 13 | } 14 | ] 15 | }, 16 | "tokens": { 17 | "type": "array", 18 | "items": [ 19 | { 20 | "type": "string" 21 | }, 22 | { 23 | "type": "string" 24 | } 25 | ] 26 | }, 27 | "vectors": { 28 | "type": "array", 29 | "items": [ 30 | { 31 | "type": "array", 32 | 
"items": [ 33 | { 34 | "type": "number" 35 | }, 36 | { 37 | "type": "number" 38 | }, 39 | { 40 | "type": "number" 41 | }, 42 | { 43 | "type": "number" 44 | }, 45 | { 46 | "type": "number" 47 | }, 48 | { 49 | "type": "number" 50 | }, 51 | { 52 | "type": "number" 53 | }, 54 | { 55 | "type": "number" 56 | }, 57 | { 58 | "type": "number" 59 | }, 60 | { 61 | "type": "number" 62 | } 63 | ] 64 | }, 65 | { 66 | "type": "array", 67 | "items": [ 68 | { 69 | "type": "number" 70 | }, 71 | { 72 | "type": "number" 73 | }, 74 | { 75 | "type": "number" 76 | }, 77 | { 78 | "type": "number" 79 | }, 80 | { 81 | "type": "number" 82 | }, 83 | { 84 | "type": "number" 85 | }, 86 | { 87 | "type": "number" 88 | }, 89 | { 90 | "type": "number" 91 | }, 92 | { 93 | "type": "number" 94 | }, 95 | { 96 | "type": "number" 97 | } 98 | ] 99 | } 100 | ] 101 | }, 102 | "meta": { 103 | "type": "array", 104 | "items": [ 105 | { 106 | "type": "string" 107 | }, 108 | { 109 | "type": "string" 110 | } 111 | ] 112 | } 113 | }, 114 | "required": [ 115 | "html", 116 | "tokens", 117 | "vectors", 118 | "meta" 119 | ] 120 | } --------------------------------------------------------------------------------