├── .gitignore ├── LICENSE.txt ├── README.md ├── bias-graph.png ├── conceptnet-numberbatch.png ├── eval-graph.png ├── package.sh └── text_to_uri.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Editor stuff 7 | *~ 8 | *.swp 9 | .idea 10 | 11 | # Large data files 12 | data/*.gz 13 | data/*.h5 14 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (C) 2016 Robyn Speer (rspeer@luminoso.com) and Joshua Chin (joshuarchin@gmail.com) 2 | 3 | The data included here is released under the Creative Commons Attribution-ShareAlike 4 | 4.0 license. See README.md for more details. 5 | 6 | The code is released under the MIT License: 7 | 8 | Permission is hereby granted, free of charge, to any person obtaining a copy of 9 | this software and associated documentation files (the "Software"), to deal in 10 | the Software without restriction, including without limitation the rights to 11 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 12 | of the Software, and to permit persons to whom the Software is furnished to do 13 | so, subject to the following conditions: 14 | 15 | The above copyright notice and this permission notice shall be included in all 16 | copies or substantial portions of the Software. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 19 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 20 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 21 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 22 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 23 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 24 | SOFTWARE. 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![ConceptNet Numberbatch](conceptnet-numberbatch.png) 2 | 3 | 4 | ## The best pre-computed word embeddings you can use 5 | 6 | ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) 7 | that can be used directly as a representation of word meanings or as a starting point 8 | for further machine learning. 9 | 10 | ConceptNet Numberbatch is part of the [ConceptNet](http://conceptnet.io) open 11 | data project. ConceptNet is a knowledge graph that provides lots of ways to 12 | compute with word meanings, one of which is word embeddings, while ConceptNet 13 | Numberbatch is a snapshot of just the word embeddings. 14 | 15 | These embeddings benefit from the fact that they have semi-structured, common 16 | sense knowledge from ConceptNet, giving them a way to learn about words that 17 | isn't _just_ observing them in context. 18 | 19 | Numberbatch is built using an ensemble that combines data from ConceptNet, word2vec, 20 | GloVe, and OpenSubtitles 2016, using a variation on retrofitting. It is 21 | described in the paper [ConceptNet 5.5: An Open Multilingual Graph of General 22 | Knowledge][cn55-paper], presented at AAAI 2017. 23 | 24 | Unlike most embeddings, ConceptNet Numberbatch is **multilingual** from the 25 | ground up. Words in different languages share a common semantic space, and 26 | that semantic space is informed by all of the languages. 
27 | 28 | ### Evaluation and publications 29 | 30 | ConceptNet Numberbatch can be seen as a replacement for other precomputed 31 | embeddings, such as word2vec and GloVe, that do not include the graph-style 32 | knowledge in ConceptNet. Numberbatch outperforms these datasets on benchmarks 33 | of word similarity. 34 | 35 | ConceptNet Numberbatch took first place in both subtasks at SemEval 2017 task 36 | 2, "[Multilingual and Cross-lingual Semantic Word Similarity][semeval17-2]". 37 | Within that task, it was also the first-place system in each of English, 38 | German, Italian, and Spanish. The result is described in our ACL 2017 SemEval 39 | paper, "[Extending Word Embeddings with Multilingual Relational Knowledge][semeval-paper]". 40 | 41 | [cn55-paper]: https://arxiv.org/abs/1612.03975 42 | [semeval17-2]: http://alt.qcri.org/semeval2017/task2/ 43 | [semeval-paper]: https://arxiv.org/abs/1704.03560 44 | 45 | The code and papers were created as a research project of [Luminoso 46 | Technologies, Inc.][luminoso], by Robyn Speer, Joshua Chin, Catherine Havasi, and 47 | Joanna Lowry-Duda. 48 | 49 | ![Graph of performance on English evaluations](eval-graph.png) 50 | 51 | ### Now with more fairness 52 | 53 | Word embeddings are prone to learn human-like stereotypes and prejudices. 54 | ConceptNet Numberbatch 17.04 and later counteract this as part of the build 55 | process, leading to word vectors that are less prejudiced than competitors such 56 | as word2vec and GloVe. See [our blog post on reducing 57 | bias](https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/). 
58 | 59 | ![Graph of biases](bias-graph.png) 60 | 61 | A paper by Chris Sweeney and Maryam Najafian, ["A Transparent Framework for 62 | Evaluating Unintended Demographic Bias in Word Embeddings"][sweeney-paper], 63 | independently evaluates bias in precomputed word embeddings, and finds that 64 | ConceptNet Numberbatch is less likely than GloVe to inherently lead to 65 | demographic discrimination. 66 | 67 | [sweeney-paper]: https://www.aclweb.org/anthology/P19-1162 68 | 69 | ## Code 70 | 71 | Since 2016, the code for building ConceptNet Numberbatch has been part of the [ConceptNet 72 | code base][conceptnet5], in the `conceptnet5.vectors` package. 73 | 74 | The only code contained in _this_ repository is `text_to_uri.py`, which 75 | normalizes natural-language text into the ConceptNet URI representation, 76 | allowing you to look up rows in these tables without requiring the entire 77 | ConceptNet codebase. For all other purposes, please refer to the [ConceptNet 78 | code][conceptnet5]. 79 | 80 | [conceptnet5]: https://github.com/commonsense/conceptnet5 81 | 82 | 83 | ## Out-of-vocabulary strategy 84 | 85 | ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that 86 | helps its performance in the presence of unfamiliar words. The strategy is 87 | implemented in the [ConceptNet code base][conceptnet5]. It can be summarized 88 | as follows: 89 | 90 | - Given an unknown word whose language is not English, try looking up the 91 | equivalently-spelled word in the English embeddings (because English words 92 | tend to end up in text of all languages). 93 | - Given an unknown word, remove a letter from the end, and see if that is 94 | a prefix of known words. If so, average the embeddings of those known words. 95 | - If the prefix is still unknown, continue removing letters from the end until 96 | a known prefix is found. Give up when a single character remains. 
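The steps above can be sketched in a few lines of Python. This is only an illustration against an in-memory dict of vectors; `oov_vector` and `_average` are made-up names for this sketch, not functions from the actual ConceptNet code base.

```python
def _average(vectors):
    """Element-wise mean of equal-length vectors (plain lists of floats)."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def oov_vector(term, language, embeddings):
    """Illustrative sketch of the out-of-vocabulary strategy.

    `embeddings` maps ConceptNet URIs such as '/c/en/cat' to vectors.
    Returns None if no fallback succeeds.
    """
    uri = '/c/{}/{}'.format(language, term)
    if uri in embeddings:
        return embeddings[uri]
    # Step 1: a non-English unknown word may simply be an English word.
    if language != 'en':
        english_uri = '/c/en/{}'.format(term)
        if english_uri in embeddings:
            return embeddings[english_uri]
    # Steps 2-3: strip letters from the end until the remainder is a prefix
    # of known words, then average the vectors of the matching words.
    prefix = term[:-1]
    while len(prefix) > 1:
        target = '/c/{}/{}'.format(language, prefix)
        matches = [vec for known_uri, vec in embeddings.items()
                   if known_uri.startswith(target)]
        if matches:
            return _average(matches)
        prefix = prefix[:-1]
    return None  # gave up: only a single character remained
```

For example, with a vocabulary containing `/c/en/cat` and `/c/en/cats`, an unknown term like `catz` falls back to the average of the vectors whose URIs start with `/c/en/cat`.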
97 | 98 | 99 | ## Downloads 100 | 101 | [ConceptNet Numberbatch 19.08][nb1908-main] is the current recommended download. 102 | 103 | This table lists the downloads and formats available for multiple recent versions: 104 | 105 | | Version | Multilingual | English-only | HDF5 | 106 | | -------- | --------------------------------------- | ----------------------------------------- | ---------------------------- | 107 | | **19.08**| [numberbatch-19.08.txt.gz][nb1908-main] | [numberbatch-en-19.08.txt.gz][nb1908-en] | [19.08/mini.h5][nb1908-mini] | 108 | | 17.06 | [numberbatch-17.06.txt.gz][nb1706-main] | [numberbatch-en-17.06.txt.gz][nb1706-en] | [17.06/mini.h5][nb1706-mini] | 109 | | 17.04 | [numberbatch-17.04.txt.gz][nb1704-main] | [numberbatch-en-17.04b.txt.gz][nb1704-en] | [17.05/mini.h5][nb1704-mini] | 110 | | 17.02 | [numberbatch-17.02.txt.gz][nb1702-main] | [numberbatch-en-17.02.txt.gz][nb1702-en] | | 111 | | 16.09 | | | [16.09/numberbatch.h5][nb1609-h5] | 112 | 113 | The 16.09 version was the version published at AAAI 2017. You can reproduce its results using a Docker snapshot of the conceptnet5 repository. 114 | See the instructions on the [ConceptNet wiki](https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy#reproducing-the-word-embedding-evaluation). 
115 | 116 | [nb1908-main]: https://conceptnet.s3.amazonaws.com/downloads/2019/numberbatch/numberbatch-19.08.txt.gz 117 | [nb1908-en]: https://conceptnet.s3.amazonaws.com/downloads/2019/numberbatch/numberbatch-en-19.08.txt.gz 118 | [nb1908-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/19.08/mini.h5 119 | 120 | [nb1706-main]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz 121 | [nb1706-en]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz 122 | [nb1706-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5 123 | 124 | [nb1704-main]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.04.txt.gz 125 | [nb1704-en]: https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.04b.txt.gz 126 | [nb1704-mini]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.05/mini.h5 127 | 128 | [nb1702-main]: http://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.02.txt.gz 129 | [nb1702-en]: http://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.02.txt.gz 130 | 131 | [nb1609-h5]: http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/16.09/numberbatch.h5 132 | 133 | 134 | The .txt.gz files of term vectors are in the text format used by word2vec, GloVe, and fastText. 135 | 136 | The first line of the file contains the dimensions of the matrix: 137 | 138 | 9161912 300 139 | 140 | Each line contains a term label followed by 300 floating-point numbers, 141 | separated by spaces: 142 | 143 | /c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -... 144 | /c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07... 145 | /c/en/absoluteless 0.2740 0.0718 0.1548 0.1118 -0.1669 -0.0216 -0.0508... 146 | /c/en/absolutely 0.0065 -0.1813 0.0335 0.0991 -0.1123 0.0060 -0.0009 0... 
147 | /c/en/absolutely_convergent 0.3752 0.1087 -0.1299 -0.0796 -0.2753 -0.1... 148 | 149 | The HDF5 files are the format that ConceptNet uses internally. They are data 150 | tables that can be loaded into Python using a library such as `pandas` or 151 | `pytables`. 152 | 153 | The "mini.h5" files trade off a little bit of accuracy for a lot of 154 | memory savings, taking up less than 150 MB in RAM, and are used to power the 155 | [ConceptNet API](https://github.com/commonsense/conceptnet5/wiki/API). 156 | 157 | 158 | ## License and attribution 159 | 160 | These vectors are distributed under the [CC-By-SA 4.0][cc-by-sa] license. In 161 | short, if you distribute a transformed or modified version of these vectors, 162 | you must release them under a compatible Share-Alike license and give due 163 | credit to [Luminoso][luminoso]. 164 | 165 | Some suggested text: 166 | 167 | This data contains semantic vectors from ConceptNet Numberbatch, by 168 | Luminoso Technologies, Inc. You may redistribute or modify the 169 | data under the terms of the CC-By-SA 4.0 license. 170 | 171 | [cc-by-sa]: https://creativecommons.org/licenses/by-sa/4.0/ 172 | [luminoso]: http://luminoso.com 173 | 174 | If you build on this data, you should cite it. Here is a straightforward citation: 175 | 176 | > Robyn Speer, Joshua Chin, and Catherine Havasi (2017). "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." In proceedings of AAAI 2017. 
177 | 178 | In BibTeX form, the citation is: 179 | 180 | @inproceedings{speer2017conceptnet, 181 | title = {{ConceptNet} 5.5: An Open Multilingual Graph of General Knowledge}, 182 | url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}, 183 | author = {Speer, Robyn and Chin, Joshua and Havasi, Catherine}, 184 | year = {2017}, 185 | pages = {4444--4451} 186 | } 187 | 188 | This data is itself built on: 189 | 190 | - [ConceptNet 5.7][conceptnet], which contains data from Wiktionary, 191 | WordNet, and many contributors to Open Mind Common Sense projects, 192 | edited by Robyn Speer 193 | 194 | - [GloVe][glove], by Jeffrey Pennington, Richard Socher, and Christopher 195 | Manning 196 | 197 | - [word2vec][], by Tomas Mikolov and Google Research 198 | 199 | - Parallel text from [OpenSubtitles 2016][opensubtitles], by Pierre Lison 200 | and Jörg Tiedemann, analyzed using [fastText][], by Piotr Bojanowski, 201 | Edouard Grave, Armand Joulin, and Tomas Mikolov 202 | 203 | [conceptnet]: http://conceptnet.io/ 204 | [glove]: http://nlp.stanford.edu/projects/glove/ 205 | [word2vec]: https://code.google.com/archive/p/word2vec/ 206 | [opensubtitles]: http://opus.lingfil.uu.se/OpenSubtitles2016.php 207 | [fastText]: https://github.com/facebookresearch/fastText 208 | 209 | 210 | ## Language statistics 211 | 212 | The multilingual data in ConceptNet Numberbatch represents 78 different language 213 | codes, though some have vocabularies with much more coverage than others. The following 214 | table lists the languages and their vocabulary size. 215 | 216 | You may notice a focus on even the smaller and historical languages of Europe, 217 | and under-representation of widely-spoken languages from outside Europe, which 218 | is an effect of the availability of linguistic resources for these languages. 219 | We would like to change this, but it requires finding good source data for 220 | ConceptNet in these under-represented languages. 
221 | 222 | Because Numberbatch contains word forms, inflected languages end up with larger 223 | vocabularies. 224 | 225 | These vocabulary sizes were updated for ConceptNet Numberbatch 19.08. 226 | 227 | | code | language | vocab size | 228 | |:-----|:-------------------------------|-----------:| 229 | | fr | French | 1388686 | 230 | | la | Latin | 855294 | 231 | | es | Spanish | 651859 | 232 | | de | German | 594456 | 233 | | it | Italian | 557743 | 234 | | en | English | 516782 | 235 | | ru | Russian | 455325 | 236 | | zh | Chinese | 307441 | 237 | | fi | Finnish | 267307 | 238 | | pt | Portuguese | 262904 | 239 | | ja | Japanese | 256648 | 240 | | nl | Dutch | 190221 | 241 | | bg | Bulgarian | 178508 | 242 | | sv | Swedish | 167321 | 243 | | pl | Polish | 152949 | 244 | | no | Norwegian Bokmål | 105689 | 245 | | eo | Esperanto | 96255 | 246 | | th | Thai | 95342 | 247 | | sl | Slovenian | 91134 | 248 | | ms | Malay | 90554 | 249 | | cs | Czech | 88613 | 250 | | ca | Catalan | 87508 | 251 | | ar | Arabic | 85325 | 252 | | hu | Hungarian | 74384 | 253 | | se | Northern Sami | 67601 | 254 | | sh | Serbian | 66746 | 255 | | el | Greek | 65905 | 256 | | gl | Galician | 59006 | 257 | | da | Danish | 57119 | 258 | | fa | Persian | 53984 | 259 | | ro | Romanian | 51437 | 260 | | tr | Turkish | 51308 | 261 | | is | Icelandic | 48639 | 262 | | eu | Basque | 44151 | 263 | | ko | Korean | 42106 | 264 | | vi | Vietnamese | 39802 | 265 | | ga | Irish | 36988 | 266 | | grc | Ancient Greek | 36977 | 267 | | uk | Ukrainian | 36851 | 268 | | lv | Latvian | 36333 | 269 | | he | Hebrew | 33435 | 270 | | mk | Macedonian | 33370 | 271 | | ka | Georgian | 32338 | 272 | | hy | Armenian | 29844 | 273 | | sk | Slovak | 29376 | 274 | | lt | Lithuanian | 28826 | 275 | | ast | Asturian | 28401 | 276 | | mg | Malagasy | 26865 | 277 | | et | Estonian | 26525 | 278 | | oc | Occitan | 26095 | 279 | | fil | Filipino | 25088 | 280 | | io | Ido | 25004 | 281 | | hsb | Upper Sorbian | 24852 | 282 
| | hi | Hindi | 23538 | 283 | | te | Telugu | 22173 | 284 | | be | Belarusian | 22117 | 285 | | fro | Old French | 21249 | 286 | | sq | Albanian | 20493 | 287 | | mul | (Multilingual, such as emoji) | 19376 | 288 | | cy | Welsh | 18721 | 289 | | xcl | Classical Armenian | 18420 | 290 | | az | Azerbaijani | 17184 | 291 | | kk | Kazakh | 16979 | 292 | | gd | Scottish Gaelic | 16827 | 293 | | af | Afrikaans | 16132 | 294 | | fo | Faroese | 15973 | 295 | | ang | Old English | 15700 | 296 | | ku | Kurdish | 13804 | 297 | | vo | Volapük | 12731 | 298 | | ta | Tamil | 12690 | 299 | | ur | Urdu | 12006 | 300 | | sw | Swahili | 11150 | 301 | | sa | Sanskrit | 11081 | 302 | | nrf | Norman French | 10048 | 303 | | non | Old Norse | 8536 | 304 | | gv | Manx | 8425 | 305 | | nv | Navajo | 8232 | 306 | | rup | Aromanian | 5107 | 307 | 308 | 309 | ## Referred here from an old version? 310 | 311 | An unpublished paper of ours described the "ConceptNet Vector Ensemble", and refers to 312 | a repository that now redirects here, and an attached store of data that is no 313 | longer hosted. We apologize, but we're not supporting the unpublished paper. 314 | Please use a newer version and use the currently supported 315 | [ConceptNet build process](https://github.com/commonsense/conceptnet5/wiki/Build-process). 316 | 317 | 318 | ## Image credit 319 | 320 | The otter logo was designed by [Christy 321 | Presler](https://thenounproject.com/cnpresler/) for The Noun Project, and is 322 | used under a Creative Commons Attribution license. 
323 | -------------------------------------------------------------------------------- /bias-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/bias-graph.png -------------------------------------------------------------------------------- /conceptnet-numberbatch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/conceptnet-numberbatch.png -------------------------------------------------------------------------------- /eval-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/commonsense/conceptnet-numberbatch/5559f04ccc9f6ff54684901f6ce99efede3fedfd/eval-graph.png -------------------------------------------------------------------------------- /package.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd .. && tar zcvf conceptnet-numberbatch/conceptnet-numberbatch-16.09.tar.gz conceptnet-numberbatch/data/*.gz conceptnet-numberbatch/README.md conceptnet-numberbatch/LICENSE.txt conceptnet-numberbatch/text_to_uri.py 3 | -------------------------------------------------------------------------------- /text_to_uri.py: -------------------------------------------------------------------------------- 1 | """ 2 | This Python module provides just the code from the 'conceptnet5' module that 3 | you need to represent terms, possibly with multiple words, as ConceptNet URIs. 
4 | 5 | It depends on 'wordfreq', a Python 3 library, so it can tokenize multilingual 6 | text consistently: https://pypi.org/project/wordfreq/ 7 | 8 | Example: 9 | 10 | >>> standardized_uri('es', 'ayudar') 11 | '/c/es/ayudar' 12 | >>> standardized_uri('en', 'a test phrase') 13 | '/c/en/test_phrase' 14 | >>> standardized_uri('en', '24 hours') 15 | '/c/en/##_hours' 16 | """ 17 | import wordfreq 18 | import re 19 | 20 | 21 | # English-specific stopword handling 22 | STOPWORDS = ['the', 'a', 'an'] 23 | DROP_FIRST = ['to'] 24 | DOUBLE_DIGIT_RE = re.compile(r'[0-9][0-9]') 25 | DIGIT_RE = re.compile(r'[0-9]') 26 | 27 | 28 | def standardized_uri(language, term): 29 | """ 30 | Get a URI that is suitable to label a row of a vector space, by making sure 31 | that both ConceptNet's and word2vec's normalizations are applied to it. 32 | 33 | 'language' should be a BCP 47 language code, such as 'en' for English. 34 | 35 | If the term already looks like a ConceptNet URI, it will only have its 36 | sequences of digits replaced by #. Otherwise, it will be turned into a 37 | ConceptNet URI in the given language, and then have its sequences of digits 38 | replaced. 39 | """ 40 | if not (term.startswith('/') and term.count('/') >= 2): 41 | term = _standardized_concept_uri(language, term) 42 | return replace_numbers(term) 43 | 44 | 45 | def english_filter(tokens): 46 | """ 47 | Given a list of tokens, remove a small list of English stopwords. This 48 | helps to work with previous versions of ConceptNet, which often provided 49 | phrases such as 'an apple' and assumed they would be standardized to 50 | 'apple'. 
51 | """ 52 | non_stopwords = [token for token in tokens if token not in STOPWORDS] 53 | while non_stopwords and non_stopwords[0] in DROP_FIRST: 54 | non_stopwords = non_stopwords[1:] 55 | if non_stopwords: 56 | return non_stopwords 57 | else: 58 | return tokens 59 | 60 | 61 | def replace_numbers(s): 62 | """ 63 | Replace digits with # in any term where a sequence of two digits appears. 64 | 65 | This operation is applied to text that passes through word2vec, so we 66 | should match it. 67 | """ 68 | if DOUBLE_DIGIT_RE.search(s): 69 | return DIGIT_RE.sub('#', s) 70 | else: 71 | return s 72 | 73 | 74 | def _standardized_concept_uri(language, term): 75 | if language == 'en': 76 | token_filter = english_filter 77 | else: 78 | token_filter = None 79 | language = language.lower() 80 | norm_text = _standardized_text(term, token_filter) 81 | return '/c/{}/{}'.format(language, norm_text) 82 | 83 | 84 | def _standardized_text(text, token_filter): 85 | tokens = simple_tokenize(text.replace('_', ' ')) 86 | if token_filter is not None: 87 | tokens = token_filter(tokens) 88 | return '_'.join(tokens) 89 | 90 | 91 | def simple_tokenize(text): 92 | """ 93 | Tokenize text using the default wordfreq rules. 94 | """ 95 | return wordfreq.tokenize(text, 'xx') 96 | 97 | 98 | --------------------------------------------------------------------------------
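The digit-replacement rule in `text_to_uri.py` can be demonstrated on its own. The sketch below restates just that rule, so the example runs without the `wordfreq` dependency that the full module requires:

```python
import re

# Restatement of the digit-handling rule from text_to_uri.py, kept
# self-contained so this example does not need wordfreq installed.
DOUBLE_DIGIT_RE = re.compile(r'[0-9][0-9]')
DIGIT_RE = re.compile(r'[0-9]')


def replace_numbers(s):
    # Replace every digit with '#', but only in terms that contain two
    # adjacent digits, matching word2vec's preprocessing convention.
    if DOUBLE_DIGIT_RE.search(s):
        return DIGIT_RE.sub('#', s)
    return s
```

So `replace_numbers('/c/en/24_hours')` yields `'/c/en/##_hours'` (matching the docstring example), while a term with only isolated digits, such as `/c/en/3d`, is left unchanged.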