├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── create_word2vec.py └── process_wiki.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Project specific 2 | data/ 3 | out/ 4 | .ipynb_checkpoints/ 5 | 6 | # Byte-compiled / optimized / DLL files 7 | __pycache__/ 8 | *.py[cod] 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Distribution / packaging 14 | .Python 15 | env/ 16 | build/ 17 | develop-eggs/ 18 | dist/ 19 | downloads/ 20 | eggs/ 21 | .eggs/ 22 | lib/ 23 | lib64/ 24 | parts/ 25 | sdist/ 26 | var/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other info into it. 34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | 58 | # Sphinx documentation 59 | docs/_build/ 60 | 61 | # PyBuilder 62 | target/ 63 | 64 | # virtualenv 65 | bin/ 66 | include/ 67 | pip-selfcheck.json 68 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Henk Griffioen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following
conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | LANGUAGE := yo 2 | DATADIR := ./data/$(LANGUAGE)/ 3 | SAVE_WORD_VECTORS := True 4 | 5 | XMLNAME := $(LANGUAGE)wiki-latest-pages-articles.xml.bz2 6 | WIKIURL := https://dumps.wikimedia.org/$(LANGUAGE)wiki/latest/$(XMLNAME) 7 | CORPUSNAME := wiki.$(LANGUAGE).text 8 | MODELPATH := $(DATADIR)model_$(LANGUAGE).word2vec 9 | 10 | $(MODELPATH): $(DATADIR)$(CORPUSNAME) 11 | python create_word2vec.py $(DATADIR)$(CORPUSNAME) $(MODELPATH) $(SAVE_WORD_VECTORS) 12 | 13 | $(DATADIR)$(CORPUSNAME): $(DATADIR)$(XMLNAME) 14 | python process_wiki.py $(DATADIR)$(XMLNAME) $(DATADIR)$(CORPUSNAME) 15 | 16 | $(DATADIR)$(XMLNAME): 17 | mkdir -p $(DATADIR) 18 | wget -P $(DATADIR) $(WIKIURL) 19 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Wiki Word2vec 2 | 3 | 4 | Train a [gensim](https://radimrehurek.com/gensim/) word2vec model on Wikipedia. 
5 | 6 | Most of it is taken from [this blogpost](http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim) and [this discussion](https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw). 7 | This repository was created mostly for trying out `make`; see __The gist__ for the important stuff. 8 | Note that performance depends heavily on corpus size and the chosen parameters (especially for smaller corpora). 9 | The examples and parameters below are cherry-picked. 10 | 11 | 12 | ## Usage 13 | 14 | Look up the code for a language (see [here](https://meta.wikimedia.org/wiki/List_of_Wikipedias)). 15 | 16 | Run `make` with the code as the value for `LANGUAGE` (or change the Makefile). 17 | For instance, try Swahili (sw): 18 | 19 | ```sh 20 | make LANGUAGE=sw 21 | ``` 22 | 23 | ### The gist 24 | 25 | To skip `make`, execute the following shell commands for Swahili: 26 | 27 | ```sh 28 | mkdir -p data/sw/ 29 | wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2 30 | ``` 31 | 32 | Train a model in Python: 33 | 34 | ```python 35 | import multiprocessing 36 | from gensim.corpora.wikicorpus import WikiCorpus 37 | from gensim.models.word2vec import Word2Vec 38 | 39 | wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2', 40 | lemmatize=False, dictionary={}) 41 | sentences = list(wiki.get_texts()) 42 | params = {'size': 200, 'window': 10, 'min_count': 10, 43 | 'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3,} 44 | word2vec = Word2Vec(sentences, **params) 45 | ``` 46 | 47 | #### Example 1 48 | 49 | Try the classic analogy: man is to king as woman is to ... ? 50 | 51 | ```python 52 | female_king = word2vec.most_similar_cosmul(positive='mfalme mwanamke'.split(), 53 | negative='mtu'.split(), topn=5,) 54 | for ii, (word, score) in enumerate(female_king): 55 | print("{}. {} ({:1.2f})".format(ii+1, word, score)) 56 | 57 | 1. malkia (0.97) 58 | 2. kambisi (0.93) 59 | 3. suleimani (0.93) 60 | 4. karolo (0.92) 61 | 5. 
koreshi (0.92) 62 | ``` 63 | 64 | These are, respectively: queen (jackpot!), [Cambyses II](https://en.wikipedia.org/wiki/Cambyses_II) (a Persian king), [Solomon](https://en.wikipedia.org/wiki/Solomon) (king of Israel), [Karolo Mkuu](https://sw.wikipedia.org/wiki/Karolo_Mkuu) (Charlemagne) and [Cyrus](https://en.wikipedia.org/wiki/Cyrus_(name)) (a Persian king). 65 | 66 | 67 | #### Example 2 68 | 69 | What doesn't match: car (gari), train (treni) or meal (mlo)? 70 | 71 | ```python 72 | print(word2vec.doesnt_match('gari treni mlo'.split())) 73 | 74 | mlo 75 | ``` 76 | 77 | 78 | ## Dependencies 79 | 80 | * Python 3 81 | * `pip install gensim` 82 | -------------------------------------------------------------------------------- /create_word2vec.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | USAGE: %(program)s TEXT_INPUT WORD2VEC_OUTPUT TEXT_OUTPUT 6 | 7 | Example script for training a word2vec model. Parameters for word2vec should be 8 | optimized per language. TEXT_OUTPUT, 'true' or 'false': whether the word vectors should also be written to a text file. 9 | """ 10 | 11 | import logging 12 | import multiprocessing 13 | import os 14 | import sys 15 | from gensim.models.word2vec import LineSentence 16 | from gensim.models.word2vec import Word2Vec 17 | 18 | 19 | if __name__ == '__main__': 20 | program = os.path.basename(sys.argv[0]) 21 | logger = logging.getLogger(program) 22 | 23 | logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s') 24 | logging.root.setLevel(level=logging.INFO) 25 | logger.info("Running %s", ' '.join(sys.argv)) 26 | 27 | # Check and process input arguments. 
28 | if len(sys.argv) < 4: 29 | print(globals()['__doc__'] % locals()) 30 | sys.exit(1) 31 | 32 | inp, outp, veco = sys.argv[1:4] 33 | 34 | max_length = 0 35 | with open(inp, 'r', encoding='utf-8') as f: 36 | for line in f: 37 | max_length = max(max_length, len(line.split()))  # article length in words, not characters 38 | logger.info("Max article length: %s words.", max_length) 39 | 40 | params = { 41 | 'size': 400, 42 | 'window': 10, 43 | 'min_count': 10, 44 | 'workers': max(1, multiprocessing.cpu_count() - 1), 45 | 'sample': 1E-5, 46 | } 47 | 48 | word2vec = Word2Vec(LineSentence(inp, max_sentence_length=max_length), 49 | **params) 50 | word2vec.save(outp) 51 | 52 | if veco.lower() == 'true':  # veco is a string from argv, so don't rely on truthiness 53 | word2vec.wv.save_word2vec_format(outp + '.model.txt', binary=False) 54 | -------------------------------------------------------------------------------- /process_wiki.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | USAGE: %(program)s WIKI_XML_DUMP OUTPUT 6 | 7 | Converts articles from a Wikipedia dump to a file containing the texts from the 8 | articles. Each line is one article; articles are separated by newlines. 9 | 10 | Note: doesn't support lemmatization. 11 | 12 | Adapted from: 13 | - http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim 14 | 15 | See also: 16 | - https://github.com/piskvorky/gensim/blob/develop/gensim/scripts/make_wikicorpus.py 17 | """ 18 | 19 | import logging 20 | import os.path 21 | import sys 22 | from gensim.corpora import WikiCorpus 23 | 24 | if __name__ == '__main__': 25 | program = os.path.basename(sys.argv[0]) 26 | logger = logging.getLogger(program) 27 | 28 | logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s') 29 | logging.root.setLevel(level=logging.INFO) 30 | logger.info("Running %s", ' '.join(sys.argv)) 31 | 32 | # Check and process input arguments. 
33 | if len(sys.argv) < 3: 34 | print(globals()['__doc__'] % locals()) 35 | sys.exit(1) 36 | inp, outp = sys.argv[1:3] 37 | 38 | # Lemmatization is only available for English. 39 | # Don't construct a dictionary because we're not using it. 40 | wiki = WikiCorpus(inp, lemmatize=False, dictionary={}) 41 | with open(outp, 'w') as output: 42 | for i, text in enumerate(wiki.get_texts()): 43 | if sys.version_info.major < 3: 44 | output.write(" ".join(map(unicode, text)) + "\n") 45 | else: 46 | output.write(" ".join(text) + "\n") 47 | if (i + 1) % 10000 == 0: 48 | logger.info("Saved %s articles", i + 1) 49 | n = i + 1 50 | 51 | logger.info("Finished saving %s articles", n) 52 | --------------------------------------------------------------------------------
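As a standalone illustration of the corpus format these scripts agree on: process_wiki.py writes one article per line with space-separated tokens, and create_word2vec.py scans that file for the longest article in words so that `LineSentence` (whose `max_sentence_length` is measured in words) never splits an article. A minimal sketch of that scan, independent of gensim (the corpus path in the usage note is hypothetical):

```python
def max_article_length(corpus_path):
    """Longest article in the corpus, measured in words (tokens).

    Expects the format process_wiki.py produces: one article per line,
    tokens separated by single spaces.
    """
    longest = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # Count words with split(); len(line) would count characters
            # and wildly overestimate the length passed to LineSentence.
            longest = max(longest, len(line.split()))
    return longest
```

For example, `max_article_length('data/sw/wiki.sw.text')` would return the word count of the longest Swahili article after running `make LANGUAGE=sw`.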