├── .gitignore ├── LICENSE.txt ├── README.md ├── bias-graph.png ├── conceptnet-numberbatch.png ├── eval-graph.png ├── package.sh └── text_to_uri.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Editor stuff 7 | *~ 8 | *.swp 9 | .idea 10 | 11 | # Large data files 12 | data/*.gz 13 | data/*.h5 14 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (C) 2016 Robyn Speer (rspeer@luminoso.com) and Joshua Chin (joshuarchin@gmail.com) 2 | 3 | The data included here is released under the Creative Commons Attribution-ShareAlike 4 | 4.0 license. See README.md for more details. 5 | 6 | The code is released under the MIT License: 7 | 8 | Permission is hereby granted, free of charge, to any person obtaining a copy of 9 | this software and associated documentation files (the "Software"), to deal in 10 | the Software without restriction, including without limitation the rights to 11 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 12 | of the Software, and to permit persons to whom the Software is furnished to do 13 | so, subject to the following conditions: 14 | 15 | The above copyright notice and this permission notice shall be included in all 16 | copies or substantial portions of the Software. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 19 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 20 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 21 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 22 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 23 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 24 | SOFTWARE. 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![ConceptNet Numberbatch](conceptnet-numberbatch.png) 2 | 3 | 4 | ## The best pre-computed word embeddings you can use 5 | 6 | ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) 7 | that can be used directly as a representation of word meanings or as a starting point 8 | for further machine learning. 9 | 10 | ConceptNet Numberbatch is part of the [ConceptNet](http://conceptnet.io) open 11 | data project. ConceptNet is a knowledge graph that provides lots of ways to 12 | compute with word meanings, one of which is word embeddings, while ConceptNet 13 | Numberbatch is a snapshot of just the word embeddings. 14 | 15 | These embeddings benefit from the fact that they have semi-structured, common 16 | sense knowledge from ConceptNet, giving them a way to learn about words that 17 | isn't _just_ observing them in context. 18 | 19 | Numberbatch is built using an ensemble that combines data from ConceptNet, word2vec, 20 | GloVe, and OpenSubtitles 2016, using a variation on retrofitting. It is 21 | described in the paper [ConceptNet 5.5: An Open Multilingual Graph of General 22 | Knowledge][cn55-paper], presented at AAAI 2017. 23 | 24 | Unlike most embeddings, ConceptNet Numberbatch is **multilingual** from the 25 | ground up. Words in different languages share a common semantic space, and 26 | that semantic space is informed by all of the languages. 
27 | 28 | ### Evaluation and publications 29 | 30 | ConceptNet Numberbatch can be seen as a replacement for other precomputed 31 | embeddings, such as word2vec and GloVe, that do not include the graph-style 32 | knowledge in ConceptNet. Numberbatch outperforms these datasets on benchmarks 33 | of word similarity. 34 | 35 | ConceptNet Numberbatch took first place in both subtasks at SemEval 2017 task 36 | 2, "[Multilingual and Cross-lingual Semantic Word Similarity][semeval17-2]". 37 | Within that task, it was also the first-place system in each of English, 38 | German, Italian, and Spanish. The result is described in our ACL 2017 SemEval 39 | paper, "[Extending Word Embeddings with Multilingual Relational Knowledge][semeval-paper]". 40 | 41 | [cn55-paper]: https://arxiv.org/abs/1612.03975 42 | [semeval17-2]: http://alt.qcri.org/semeval2017/task2/ 43 | [semeval-paper]: https://arxiv.org/abs/1704.03560 44 | 45 | The code and papers were created as a research project of [Luminoso 46 | Technologies, Inc.][luminoso], by Robyn Speer, Joshua Chin, Catherine Havasi, and 47 | Joanna Lowry-Duda. 48 | 49 | ![Graph of performance on English evaluations](eval-graph.png) 50 | 51 | ### Now with more fairness 52 | 53 | Word embeddings are prone to learn human-like stereotypes and prejudices. 54 | ConceptNet Numberbatch 17.04 and later counteract this as part of the build 55 | process, leading to word vectors that are less prejudiced than competitors such 56 | as word2vec and GloVe. See [our blog post on reducing 57 | bias](https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/). 
58 | 59 | ![Graph of biases](bias-graph.png) 60 | 61 | A paper by Chris Sweeney and Maryam Najafian, ["A Transparent Framework for 62 | Evaluating Unintended Demographic Bias in Word Embeddings"][sweeney-paper], 63 | independently evaluates bias in precomputed word embeddings, and finds that 64 | ConceptNet Numberbatch is less likely than GloVe to inherently lead to 65 | demographic discrimination. 66 | 67 | [sweeney-paper]: https://www.aclweb.org/anthology/P19-1162 68 | 69 | ## Code 70 | 71 | Since 2016, the code for building ConceptNet Numberbatch has been part of the [ConceptNet 72 | code base][conceptnet5], in the `conceptnet5.vectors` package. 73 | 74 | The only code contained in _this_ repository is `text_to_uri.py`, which 75 | normalizes natural-language text into the ConceptNet URI representation, 76 | allowing you to look up rows in these tables without requiring the entire 77 | ConceptNet codebase. For all other purposes, please refer to the [ConceptNet 78 | code][conceptnet5]. 79 | 80 | [conceptnet5]: https://github.com/commonsense/conceptnet5 81 | 82 | 83 | ## Out-of-vocabulary strategy 84 | 85 | ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that 86 | helps its performance in the presence of unfamiliar words. The strategy is 87 | implemented in the [ConceptNet code base][conceptnet5]. It can be summarized 88 | as follows: 89 | 90 | - Given an unknown word whose language is not English, try looking up the 91 | equivalently-spelled word in the English embeddings (because English words 92 | tend to end up in text of all languages). 93 | - Given an unknown word, remove a letter from the end, and see if that is 94 | a prefix of known words. If so, average the embeddings of those known words. 95 | - If the prefix is still unknown, continue removing letters from the end until 96 | a known prefix is found. Give up when a single character remains. 
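The steps above can be sketched in a few lines of Python. This is only an illustration against an in-memory dict of vectors; `oov_vector` and `_average` are made-up names for this sketch, not functions from the actual ConceptNet code base.

```python
def _average(vectors):
    """Element-wise mean of equal-length vectors (plain lists of floats)."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def oov_vector(term, language, embeddings):
    """Illustrative sketch of the out-of-vocabulary strategy.

    `embeddings` maps ConceptNet URIs such as '/c/en/cat' to vectors.
    Returns None if no fallback succeeds.
    """
    uri = '/c/{}/{}'.format(language, term)
    if uri in embeddings:
        return embeddings[uri]
    # Step 1: a non-English unknown word may simply be an English word.
    if language != 'en':
        english_uri = '/c/en/{}'.format(term)
        if english_uri in embeddings:
            return embeddings[english_uri]
    # Steps 2-3: strip letters from the end until the remainder is a prefix
    # of known words, then average the vectors of the matching words.
    prefix = term[:-1]
    while len(prefix) > 1:
        target = '/c/{}/{}'.format(language, prefix)
        matches = [vec for known_uri, vec in embeddings.items()
                   if known_uri.startswith(target)]
        if matches:
            return _average(matches)
        prefix = prefix[:-1]
    return None  # gave up: only a single character remained
```

For example, with a vocabulary containing `/c/en/cat` and `/c/en/cats`, an unknown term like `catz` falls back to the average of the vectors whose URIs start with `/c/en/cat`.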
97 | 98 | 99 | ## Downloads 100 | 101 | [ConceptNet Numberbatch 19.08][nb1908-main] is the current recommended download. 102 | 103 | This table lists the downloads and formats available for multiple recent versions: 104 | 105 | | Version | Multilingual | English-only | HDF5 | 106 | | -------- | --------------------------------------- | ----------------------------------------- | ---------------------------- | 107 | | **19.08**| [numberbatch-19.08.txt.gz][nb1908-main] | [numberbatch-en-19.08.txt.gz][nb1908-en] | [19.08/mini.h5][nb1908-mini] | 108 | | 17.06 | [numberbatch-17.06.txt.gz][nb1706-main] | [numberbatch-en-17.06.txt.gz][nb1706-en] | [17.06/mini.h5][nb1706-mini] | 109 | | 17.04 | [numberbatch-17.04.txt.gz][nb1704-main] | [numberbatch-en-17.04b.txt.gz][nb1704-en] | [17.05/mini.h5][nb1704-mini] | 110 | | 17.02 | [numberbatch-17.02.txt.gz][nb1702-main] | [numberbatch-en-17.02.txt.gz][nb1702-en] | | 111 | | 16.09 | | | [16.09/numberbatch.h5][nb1609-h5] | 112 | 113 | The 16.09 version was the version published at AAAI 2017. You can reproduce its results using a Docker snapshot of the conceptnet5 repository. 114 | See the instructions on the [ConceptNet wiki](https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy#reproducing-the-word-embedding-evaluation). 
115 | 116 | [nb1908-main]: https://conceptnet.s3.amazonaws.com/downloads/2019/numberbatch/numberbatch-19.08.txt.gz 117 | [nb1908-en]: https://conceptnet.s3.amazonaws.com/downloads/2019/numberbatch/numberbatch-en-19.08.txt.gz 118 | [nb1908-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/19.08/mini.h5 119 | 120 | [nb1706-main]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz 121 | [nb1706-en]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz 122 | [nb1706-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5 123 | 124 | [nb1704-main]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.04.txt.gz 125 | [nb1704-en]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.04b.txt.gz 126 | [nb1704-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.05/mini.h5 127 | 128 | [nb1702-main]: http://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.02.txt.gz 129 | [nb1702-en]: http://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.02.txt.gz 130 | 131 | [nb1609-h5]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/16.09/numberbatch.h5 132 | 133 | 134 | The .txt.gz files of term vectors are in the text format used by word2vec, GloVe, and fastText. 135 | 136 | The first line of the file contains the dimensions of the matrix: 137 | 138 | 9161912 300 139 | 140 | Each line contains a term label followed by 300 floating-point numbers, 141 | separated by spaces: 142 | 143 | /c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -... 144 | /c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07... 145 | /c/en/absoluteless 0.2740 0.0718 0.1548 0.1118 -0.1669 -0.0216 -0.0508... 146 | /c/en/absolutely 0.0065 -0.1813 0.0335 0.0991 -0.1123 0.0060 -0.0009 0... 
147 | /c/en/absolutely_convergent 0.3752 0.1087 -0.1299 -0.0796 -0.2753 -0.1... 148 | 149 | The HDF5 files are the format that ConceptNet uses internally. They are data 150 | tables that can be loaded into Python using a library such as `pandas` or 151 | `pytables`. 152 | 153 | The "mini.h5" files trade off a little bit of accuracy for a lot of 154 | memory savings, taking up less than 150 MB in RAM, and are used to power the 155 | [ConceptNet API](https://github.com/commonsense/conceptnet5/wiki/API). 156 | 157 | 158 | ## License and attribution 159 | 160 | These vectors are distributed under the [CC-By-SA 4.0][cc-by-sa] license. In 161 | short, if you distribute a transformed or modified version of these vectors, 162 | you must release them under a compatible Share-Alike license and give due 163 | credit to [Luminoso][luminoso]. 164 | 165 | Some suggested text: 166 | 167 | This data contains semantic vectors from ConceptNet Numberbatch, by 168 | Luminoso Technologies, Inc. You may redistribute or modify the 169 | data under the terms of the CC-By-SA 4.0 license. 170 | 171 | [cc-by-sa]: https://creativecommons.org/licenses/by-sa/4.0/ 172 | [luminoso]: http://luminoso.com 173 | 174 | If you build on this data, you should cite it. Here is a straightforward citation: 175 | 176 | > Robyn Speer, Joshua Chin, and Catherine Havasi (2017). "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." In proceedings of AAAI 2017. 
177 | 178 | In BibTeX form, the citation is: 179 | 180 | @inproceedings{speer2017conceptnet, 181 | title = {{ConceptNet} 5.5: An Open Multilingual Graph of General Knowledge}, 182 | url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}, 183 | author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine}, 184 | year = {2017}, 185 | pages = {4444--4451} 186 | } 187 | 188 | This data is itself built on: 189 | 190 | - [ConceptNet 5.7][conceptnet], which contains data from Wiktionary, 191 | WordNet, and many contributors to Open Mind Common Sense projects, 192 | edited by Robyn Speer 193 | 194 | - [GloVe][glove], by Jeffrey Pennington, Richard Socher, and Christopher 195 | Manning 196 | 197 | - [word2vec][], by Tomas Mikolov and Google Research 198 | 199 | - Parallel text from [OpenSubtitles 2016][opensubtitles], by Pierre Lison 200 | and Jörg Tiedemann, analyzed using [fastText][], by Piotr Bojanowski, 201 | Edouard Grave, Armand Joulin, and Tomas Mikolov 202 | 203 | [conceptnet]: http://conceptnet.io/ 204 | [glove]: http://nlp.stanford.edu/projects/glove/ 205 | [word2vec]: https://code.google.com/archive/p/word2vec/ 206 | [opensubtitles]: http://opus.lingfil.uu.se/OpenSubtitles2016.php 207 | [fastText]: https://github.com/facebookresearch/fastText 208 | 209 | 210 | ## Language statistics 211 | 212 | The multilingual data in ConceptNet Numberbatch represents 78 different language 213 | codes, though some have vocabularies with much more coverage than others. The following 214 | table lists the languages and their vocabulary size. 215 | 216 | You may notice a focus on even the smaller and historical languages of Europe, 217 | and under-representation of widely-spoken languages from outside Europe, which 218 | is an effect of the availability of linguistic resources for these languages. 219 | We would like to change this, but it requires finding good source data for 220 | ConceptNet in these under-represented languages. 
221 | 222 | Because Numberbatch contains word forms, inflected languages end up with larger 223 | vocabularies. 224 | 225 | These vocabulary sizes were updated for ConceptNet Numberbatch 19.08. 226 | 227 | | code | language | vocab size | 228 | |:-----|:-------------------------------|-----------:| 229 | | fr | French | 1388686 | 230 | | la | Latin | 855294 | 231 | | es | Spanish | 651859 | 232 | | de | German | 594456 | 233 | | it | Italian | 557743 | 234 | | en | English | 516782 | 235 | | ru | Russian | 455325 | 236 | | zh | Chinese | 307441 | 237 | | fi | Finnish | 267307 | 238 | | pt | Portuguese | 262904 | 239 | | ja | Japanese | 256648 | 240 | | nl | Dutch | 190221 | 241 | | bg | Bulgarian | 178508 | 242 | | sv | Swedish | 167321 | 243 | | pl | Polish | 152949 | 244 | | no | Norwegian Bokmål | 105689 | 245 | | eo | Esperanto | 96255 | 246 | | th | Thai | 95342 | 247 | | sl | Slovenian | 91134 | 248 | | ms | Malay | 90554 | 249 | | cs | Czech | 88613 | 250 | | ca | Catalan | 87508 | 251 | | ar | Arabic | 85325 | 252 | | hu | Hungarian | 74384 | 253 | | se | Northern Sami | 67601 | 254 | | sh | Serbian | 66746 | 255 | | el | Greek | 65905 | 256 | | gl | Galician | 59006 | 257 | | da | Danish | 57119 | 258 | | fa | Persian | 53984 | 259 | | ro | Romanian | 51437 | 260 | | tr | Turkish | 51308 | 261 | | is | Icelandic | 48639 | 262 | | eu | Basque | 44151 | 263 | | ko | Korean | 42106 | 264 | | vi | Vietnamese | 39802 | 265 | | ga | Irish | 36988 | 266 | | grc | Ancient Greek | 36977 | 267 | | uk | Ukrainian | 36851 | 268 | | lv | Latvian | 36333 | 269 | | he | Hebrew | 33435 | 270 | | mk | Macedonian | 33370 | 271 | | ka | Georgian | 32338 | 272 | | hy | Armenian | 29844 | 273 | | sk | Slovak | 29376 | 274 | | lt | Lithuanian | 28826 | 275 | | ast | Asturian | 28401 | 276 | | mg | Malagasy | 26865 | 277 | | et | Estonian | 26525 | 278 | | oc | Occitan | 26095 | 279 | | fil | Filipino | 25088 | 280 | | io | Ido | 25004 | 281 | | hsb | Upper Sorbian | 24852 | 282 
| | hi | Hindi | 23538 | 283 | | te | Telugu | 22173 | 284 | | be | Belarusian | 22117 | 285 | | fro | Old French | 21249 | 286 | | sq | Albanian | 20493 | 287 | | mul | (Multilingual, such as emoji) | 19376 | 288 | | cy | Welsh | 18721 | 289 | | xcl | Classical Armenian | 18420 | 290 | | az | Azerbaijani | 17184 | 291 | | kk | Kazakh | 16979 | 292 | | gd | Scottish Gaelic | 16827 | 293 | | af | Afrikaans | 16132 | 294 | | fo | Faroese | 15973 | 295 | | ang | Old English | 15700 | 296 | | ku | Kurdish | 13804 | 297 | | vo | Volapük | 12731 | 298 | | ta | Tamil | 12690 | 299 | | ur | Urdu | 12006 | 300 | | sw | Swahili | 11150 | 301 | | sa | Sanskrit | 11081 | 302 | | nrf | Norman French | 10048 | 303 | | non | Old Norse | 8536 | 304 | | gv | Manx | 8425 | 305 | | nv | Navajo | 8232 | 306 | | rup | Aromanian | 5107 | 307 | 308 | 309 | ## Referred here from an old version? 310 | 311 | An unpublished paper of ours described the "ConceptNet Vector Ensemble", and refers to 312 | a repository that now redirects here, and an attached store of data that is no 313 | longer hosted. We apologize, but we're not supporting the unpublished paper. 314 | Please use a newer version and use the currently supported 315 | [ConceptNet build process](https://github.com/commonsense/conceptnet5/wiki/Build-process). 316 | 317 | 318 | ## Image credit 319 | 320 | The otter logo was designed by [Christy 321 | Presler](https://thenounproject.com/cnpresler/) for The Noun Project, and is 322 | used under a Creative Commons Attribution license. 
323 | -------------------------------------------------------------------------------- /bias-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/bias-graph.png -------------------------------------------------------------------------------- /conceptnet-numberbatch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/conceptnet-numberbatch.png -------------------------------------------------------------------------------- /eval-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/eval-graph.png -------------------------------------------------------------------------------- /package.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd .. && tar zcvf conceptnet-numberbatch/conceptnet-numberbatch-16.09.tar.gz conceptnet-numberbatch/data/*.gz conceptnet-numberbatch/README.md conceptnet-numberbatch/LICENSE.txt conceptnet-numberbatch/text_to_uri.py 3 | -------------------------------------------------------------------------------- /text_to_uri.py: -------------------------------------------------------------------------------- 1 | """ 2 | This Python module provides just the code from the 'conceptnet5' module that 3 | you need to represent terms, possibly with multiple words, as ConceptNet URIs. 
4 | 5 | It depends on 'wordfreq', a Python 3 library, so it can tokenize multilingual 6 | text consistently: https://pypi.org/project/wordfreq/ 7 | 8 | Example: 9 | 10 | >>> standardized_uri('es', 'ayudar') 11 | '/c/es/ayudar' 12 | >>> standardized_uri('en', 'a test phrase') 13 | '/c/en/test_phrase' 14 | >>> standardized_uri('en', '24 hours') 15 | '/c/en/##_hours' 16 | """ 17 | import wordfreq 18 | import re 19 | 20 | 21 | # English-specific stopword handling 22 | STOPWORDS = ['the', 'a', 'an'] 23 | DROP_FIRST = ['to'] 24 | DOUBLE_DIGIT_RE = re.compile(r'[0-9][0-9]') 25 | DIGIT_RE = re.compile(r'[0-9]') 26 | 27 | 28 | def standardized_uri(language, term): 29 | """ 30 | Get a URI that is suitable to label a row of a vector space, by making sure 31 | that both ConceptNet's and word2vec's normalizations are applied to it. 32 | 33 | 'language' should be a BCP 47 language code, such as 'en' for English. 34 | 35 | If the term already looks like a ConceptNet URI, it will only have its 36 | sequences of digits replaced by #. Otherwise, it will be turned into a 37 | ConceptNet URI in the given language, and then have its sequences of digits 38 | replaced. 39 | """ 40 | if not (term.startswith('/') and term.count('/') >= 2): 41 | term = _standardized_concept_uri(language, term) 42 | return replace_numbers(term) 43 | 44 | 45 | def english_filter(tokens): 46 | """ 47 | Given a list of tokens, remove a small list of English stopwords. This 48 | helps to work with previous versions of ConceptNet, which often provided 49 | phrases such as 'an apple' and assumed they would be standardized to 50 | 'apple'. 
51 | """ 52 | non_stopwords = [token for token in tokens if token not in STOPWORDS] 53 | while non_stopwords and non_stopwords[0] in DROP_FIRST: 54 | non_stopwords = non_stopwords[1:] 55 | if non_stopwords: 56 | return non_stopwords 57 | else: 58 | return tokens 59 | 60 | 61 | def replace_numbers(s): 62 | """ 63 | Replace digits with # in any term where a sequence of two digits appears. 64 | 65 | This operation is applied to text that passes through word2vec, so we 66 | should match it. 67 | """ 68 | if DOUBLE_DIGIT_RE.search(s): 69 | return DIGIT_RE.sub('#', s) 70 | else: 71 | return s 72 | 73 | 74 | def _standardized_concept_uri(language, term): 75 | if language == 'en': 76 | token_filter = english_filter 77 | else: 78 | token_filter = None 79 | language = language.lower() 80 | norm_text = _standardized_text(term, token_filter) 81 | return '/c/{}/{}'.format(language, norm_text) 82 | 83 | 84 | def _standardized_text(text, token_filter): 85 | tokens = simple_tokenize(text.replace('_', ' ')) 86 | if token_filter is not None: 87 | tokens = token_filter(tokens) 88 | return '_'.join(tokens) 89 | 90 | 91 | def simple_tokenize(text): 92 | """ 93 | Tokenize text using the default wordfreq rules. 94 | """ 95 | return wordfreq.tokenize(text, 'xx') 96 | 97 | 98 | --------------------------------------------------------------------------------
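The digit-replacement rule in `text_to_uri.py` can be demonstrated on its own. The sketch below restates just that rule, so the example runs without the `wordfreq` dependency that the full module requires:

```python
import re

# Restatement of the digit-handling rule from text_to_uri.py, kept
# self-contained so this example does not need wordfreq installed.
DOUBLE_DIGIT_RE = re.compile(r'[0-9][0-9]')
DIGIT_RE = re.compile(r'[0-9]')


def replace_numbers(s):
    # Replace every digit with '#', but only in terms that contain two
    # adjacent digits, matching word2vec's preprocessing convention.
    if DOUBLE_DIGIT_RE.search(s):
        return DIGIT_RE.sub('#', s)
    return s
```

So `replace_numbers('/c/en/24_hours')` yields `'/c/en/##_hours'` (matching the docstring example), while a term with only isolated digits, such as `/c/en/3d`, is left unchanged.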