├── .gitignore
├── LICENSE
├── README.md
├── make_corpus.py
├── tokenization.py
└── train.py

/.gitignore:
--------------------------------------------------------------------------------
# Custom configuration

work/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Masatoshi Suzuki

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Wikipedia Entity Vectors


## Introduction

**Wikipedia Entity Vectors** [1] is a distributed representation of words and named entities (NEs).
The words and NEs are mapped to the same vector space.
The vectors are trained with the skip-gram algorithm, using preprocessed Wikipedia text as the corpus.


## Downloads

Pre-trained vectors can be downloaded from the [Releases](https://github.com/singletongue/WikiEntVec/releases) page.

Several older versions are available at [this site](http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/).


## Specs

Each version of the pre-trained vectors contains three files: `word_vectors.txt`, `entity_vectors.txt`, and `all_vectors.txt`.
These are text files in the word2vec output format, in which the first line declares the vocabulary size and the vector size (number of dimensions), followed by one line per token with its corresponding vector.

`word_vectors.txt` and `entity_vectors.txt` contain vectors for words and NEs, respectively.
In `entity_vectors.txt`, white spaces within NE names are replaced with underscores, as in `United_States`.
`all_vectors.txt` contains vectors of both words and NEs in one file, where each NE token is marked with `##`, as in `##United_States##`
(note that some old versions use square brackets `[]` for marking NE tokens).
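
The released files can be loaded with [gensim](https://radimrehurek.com/gensim/), which this repository also uses for training. The snippet below is a minimal sketch: the `United_States` entity from the examples above is used only for illustration, so substitute tokens that actually occur in the files you downloaded.

```
from gensim.models import KeyedVectors

# Both files are in word2vec text format (first line: vocabulary size and dimensionality).
entity_vectors = KeyedVectors.load_word2vec_format('entity_vectors.txt', binary=False)
all_vectors = KeyedVectors.load_word2vec_format('all_vectors.txt', binary=False)

# entity_vectors.txt stores entity names with underscores and without the ## marks.
print(entity_vectors.most_similar('United_States', topn=5))

# all_vectors.txt mixes words and entities; entity tokens keep the ## marks.
print(all_vectors.most_similar('##United_States##', topn=5))
```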

The pre-trained vectors were trained under the configurations below (see the Manual training section for details):

### `make_corpus.py`

#### Japanese

We used [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) (v0.0.6) for tokenizing Japanese texts.

|Option              |Value                                                  |
|:-------------------|:------------------------------------------------------|
|`--cirrus_file`     |path to `jawiki-20190520-cirrussearch-content.json.gz` |
|`--output_file`     |path to the output file                                |
|`--tokenizer`       |`mecab`                                                |
|`--tokenizer_option`|`-d `                                                  |

### `train.py`

|Option         |Value                                            |
|:--------------|:------------------------------------------------|
|`--corpus_file`|path to a corpus file made with `make_corpus.py` |
|`--output_dir` |path to the output directory                     |
|`--embed_size` |`100`, `200`, or `300`                           |
|`--window_size`|`10`                                             |
|`--sample_size`|`10`                                             |
|`--min_count`  |`10`                                             |
|`--epoch`      |`5`                                              |
|`--workers`    |`20`                                             |


## Manual training

You can manually process a Wikipedia dump file and train a skip-gram model on the preprocessed corpus.


### Requirements

- Python 3.6
- gensim
- logzero
- MeCab and its Python binding (mecab-python3) (optional; required for tokenizing Japanese texts)


### Steps

1. Download a Wikipedia Cirrussearch dump file from [here](https://dumps.wikimedia.org/other/cirrussearch/).
    - Make sure to choose a file named like `**wiki-YYYYMMDD-cirrussearch-content.json.gz`.
2. Clone this repository.
3. Preprocess the downloaded dump file.
    ```
    $ python make_corpus.py --cirrus_file <cirrus_file> --output_file <output_file>
    ```
    If you are processing the Japanese version of Wikipedia, make sure to use the MeCab tokenizer by setting the `--tokenizer mecab` option.
    Otherwise, the text will be tokenized by a simple rule based on regular expressions.
4. Train the model.
    ```
    $ python train.py --corpus_file <corpus_file> --output_dir <output_dir>
    ```

You can configure the options below when training a model.

```
usage: train.py [-h] --corpus_file CORPUS_FILE --output_dir OUTPUT_DIR
                [--embed_size EMBED_SIZE] [--window_size WINDOW_SIZE]
                [--sample_size SAMPLE_SIZE] [--min_count MIN_COUNT]
                [--epoch EPOCH] [--workers WORKERS]

optional arguments:
  -h, --help            show this help message and exit
  --corpus_file CORPUS_FILE
                        Corpus file (.txt)
  --output_dir OUTPUT_DIR
                        Output directory to save embedding files
  --embed_size EMBED_SIZE
                        Dimensionality of the word/entity vectors [100]
  --window_size WINDOW_SIZE
                        Maximum distance between the current and predicted
                        word within a sentence [5]
  --sample_size SAMPLE_SIZE
                        Number of negative samples [5]
  --min_count MIN_COUNT
                        Ignores all words/entities with total frequency lower
                        than this [5]
  --epoch EPOCH         number of training epochs [5]
  --workers WORKERS     Use these many worker threads to train the model [2]
```


## Concepts

There are several methods for learning distributed representations (or embeddings) of words, such as CBOW and skip-gram [2, 3].
These methods train a neural network, in an unsupervised way, to predict the contextual words around a given word in sentences drawn from a large corpus.

However, a couple of problems arise when applying these methods to learning distributed representations of NEs.
One problem is that many NEs consist of multiple words (such as "New York" and "George Washington"), so a simple word-level tokenization of the text is not sufficient.

Other problems are the diversity and ambiguity of NE mentions.
For each NE, several different expressions can be used to mention it.
For example, "USA", "US", "United States", and "America" can all express the same country.
On the other hand, the same words and phrases can refer to different entities.
For example, the word "Mercury" may represent a planet, an element, or even a person (such as "Freddie Mercury", the vocalist of the rock band Queen).
Therefore, in order to learn distributed representations of NEs, one must identify the spans of NEs in the text and recognize which NE each span mentions, so that NEs are not treated as mere sequences of words.

To address these problems, we used Wikipedia as the corpus and utilized its internal hyperlinks to identify mentions of NEs in article text.

For each article in Wikipedia, we performed the following preprocessing.

First, we extracted all hyperlinks (pairs of anchor text and the linked article) from the source text (a.k.a. wikitext) of the article.

Next, for each hyperlink, we replaced the appearances of its anchor text in the article body with a special token representing the linked article.

For instance, if an article has a hyperlink to "Mercury (planet)" assigned to the anchor text "Mercury", we replace all the other appearances of "Mercury" in the same article with the special token `##Mercury_(planet)##`.

Note that the diversity of NE mentions is resolved by replacing possibly diverse anchor texts with special tokens that are unique to each NE.
Moreover, the ambiguity of NE mentions is addressed by making a "one sense per discourse" assumption: we assume that the NE mentioned by a possibly ambiguous phrase can be determined from the context, i.e., the surrounding document.
Under this assumption, the phrase "Mercury" in the above example is replaced with neither `##Mercury_(element)##` nor `##Freddie_Mercury##`, since the article does not contain such mentions as hyperlinks.
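
As a minimal sketch (simplified from the implementation in `make_corpus.py`; redirect resolution and tokenization are omitted), the replacement step looks like this, processing longer anchor texts first so that a shorter anchor never overwrites part of an already inserted entity token:

```
def replace_anchors(text, hyperlinks):
    # hyperlinks maps anchor texts to the titles of the linked articles
    replaced = [False] * len(text)  # characters that already belong to an entity token
    for anchor, entity in sorted(hyperlinks.items(), key=lambda t: len(t[0]), reverse=True):
        entity_token = f'##{entity}##'.replace(' ', '_')
        cursor = 0
        while cursor < len(text) and anchor in text[cursor:]:
            start = text.index(anchor, cursor)
            end = start + len(anchor)
            if not any(replaced[start:end]):
                text = text[:start] + entity_token + text[end:]
                replaced = replaced[:start] + [True] * len(entity_token) + replaced[end:]
                cursor = start + len(entity_token)
            else:
                cursor = end
    return text

# replace_anchors('Mercury is the closest planet to the Sun.', {'Mercury': 'Mercury (planet)'})
# returns '##Mercury_(planet)## is the closest planet to the Sun.'
```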

We used the preprocessed Wikipedia articles as the corpus and applied the skip-gram algorithm to learn distributed representations of words and NEs.
This means that words and NEs are mapped to the same vector space.


## Licenses

The pre-trained vectors are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

The code in this repository is distributed under the MIT License.


## References

[1] Masatoshi Suzuki, Koji Matsuda, Satoshi Sekine, Naoaki Okazaki and Kentaro Inui.
A Joint Neural Model for Fine-Grained Named Entity Classification of Wikipedia Articles.
IEICE Transactions on Information and Systems, Special Section on Semantic Web and Linked Data, Vol. E101-D, No. 1, pp. 73-81, 2018.

[2] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.
Efficient Estimation of Word Representations in Vector Space.
ICLR, 2013.

[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean.
Distributed Representations of Words and Phrases and their Compositionality.
NIPS, 2013.


## Acknowledgments

This work was partially supported by Research and Development on Real World Big Data Integration and Analysis.

--------------------------------------------------------------------------------
/make_corpus.py:
--------------------------------------------------------------------------------
import re
import json
import gzip
import argparse
from collections import OrderedDict

from logzero import logger

from tokenization import RegExpTokenizer, NLTKTokenizer, MeCabTokenizer


regex_spaces = re.compile(r'\s+')
regex_title_paren = re.compile(r' \([^()].+?\)$')
regex_hyperlink = re.compile(r'\[\[(.+?)\]\]')
regex_entity = re.compile(r'##[^#]+?##')


def main(args):
    logger.info('initializing a tokenizer')
    if args.tokenizer == 'regexp':
        tokenizer = RegExpTokenizer(do_lower_case=args.do_lower_case,
                                    preserved_pattern=regex_entity)
    elif args.tokenizer == 'nltk':
        tokenizer = NLTKTokenizer(do_lower_case=args.do_lower_case,
                                  preserved_pattern=regex_entity)
    elif args.tokenizer == 'mecab':
        tokenizer = MeCabTokenizer(mecab_option=args.tokenizer_option,
                                   do_lower_case=args.do_lower_case,
                                   preserved_pattern=regex_entity)
    else:
        raise RuntimeError(f'Invalid tokenizer: {args.tokenizer}')

    redirects = dict()
    if args.do_resolve_redirects:
        # Map the title of each redirect page (namespace 0) to the title of its target article.
        logger.info('loading redirect information')
        with gzip.open(args.cirrus_file, 'rt') as fi:
            for line in fi:
                json_item = json.loads(line)
                if 'title' not in json_item:
                    continue

                if 'redirect' not in json_item:
                    continue

                dst_title = json_item['title']
                redirects[dst_title] = dst_title
                for redirect_item in json_item['redirect']:
                    if redirect_item['namespace'] == 0:
                        src_title = redirect_item['title']
                        redirects.setdefault(src_title, dst_title)

    logger.info('generating corpus for training')
    n_processed = 0
    with gzip.open(args.cirrus_file, 'rt') as fi, \
            open(args.output_file, 'wt') as fo:
        for line in fi:
            json_item = json.loads(line)
            if 'title' not in json_item:
                continue

            title = json_item['title']
            text = regex_spaces.sub(' ', json_item['text'])

            # Collect anchor-to-entity mappings from the article's wikitext;
            # the article's own title (with any disambiguation suffix removed) also acts as an anchor.
            hyperlinks = dict()
            title_without_paren = regex_title_paren.sub('', title)
            hyperlinks.setdefault(title_without_paren, title)
            for match in regex_hyperlink.finditer(json_item['source_text']):
                if '|' in match.group(1):
                    (entity, anchor) = match.group(1).split('|', maxsplit=1)
                else:
                    entity = anchor = match.group(1)

                if '#' in entity:
                    entity = entity[:entity.find('#')]

                anchor = anchor.strip()
                entity = entity.strip()

                if args.do_resolve_redirects:
                    entity = redirects.get(entity, '')

                if len(anchor) > 0 and len(entity) > 0:
                    hyperlinks.setdefault(anchor, entity)

            # Replace anchor occurrences with ##Entity## tokens, longest anchors first;
            # replacement_flags marks characters that already belong to an inserted entity token.
            hyperlinks_sorted = OrderedDict(sorted(
                hyperlinks.items(), key=lambda t: len(t[0]), reverse=True))

            replacement_flags = [0] * len(text)
            for (anchor, entity) in hyperlinks_sorted.items():
                cursor = 0
                while cursor < len(text) and anchor in text[cursor:]:
                    start = text.index(anchor, cursor)
                    end = start + len(anchor)
                    if not any(replacement_flags[start:end]):
                        entity_token = f'##{entity}##'.replace(' ', '_')
                        text = text[:start] + entity_token + text[end:]
                        replacement_flags = replacement_flags[:start] \
                            + [1] * len(entity_token) + replacement_flags[end:]
                        assert len(text) == len(replacement_flags)
                        cursor = start + len(entity_token)
                    else:
                        cursor = end

            text = ' '.join(tokenizer.tokenize(text))

            print(text, file=fo)
            n_processed += 1

            if n_processed <= 10:
                logger.info('*** Example ***')
                example_text = text[:400] + '...' if len(text) > 400 else text
                logger.info(example_text)

            if n_processed % 10000 == 0:
                logger.info(f'processed: {n_processed}')

    if n_processed % 10000 != 0:
        logger.info(f'processed: {n_processed}')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--cirrus_file', type=str, required=True,
                        help='Wikipedia Cirrussearch content dump file (.json.gz)')
    parser.add_argument('--output_file', type=str, required=True,
                        help='output corpus file (.txt)')
    parser.add_argument('--tokenizer', default='regexp',
                        help='tokenizer type [regexp]')
    parser.add_argument('--do_lower_case', action='store_true',
                        help='lowercase words (not applied to NEs)')
    parser.add_argument('--do_resolve_redirects', action='store_true',
                        help='resolve redirects of entity names')
    parser.add_argument('--tokenizer_option', type=str, default='',
                        help='option string passed to the tokenizer')
    args = parser.parse_args()
    main(args)

--------------------------------------------------------------------------------
/tokenization.py:
--------------------------------------------------------------------------------
import re


class BaseTokenizer(object):
    def __init__(self, do_lower_case=False, preserved_pattern=None):
        self.do_lower_case = do_lower_case
        self.preserved_pattern = preserved_pattern

    def tokenize_words(self, text):
        raise NotImplementedError

    def tokenize(self, text):
        if self.preserved_pattern is not None:
            # Split the text on preserved (e.g. ##Entity##) tokens, tokenize the plain-text
            # spans, and keep each preserved token as a single token (never lowercased).
            tokens = []
            split_texts = self.preserved_pattern.split(text)
            matched_texts = \
                [m.group(0) for m in self.preserved_pattern.finditer(text)] + [None]
            assert len(split_texts) == len(matched_texts)
            for (split_text, matched_text) in zip(split_texts, matched_texts):
                if self.do_lower_case:
                    tokens += [t.lower() for t in self.tokenize_words(split_text)]
                else:
                    tokens += self.tokenize_words(split_text)

                if matched_text is not None:
                    tokens += [matched_text]
        else:
            if self.do_lower_case:
                tokens = [t.lower() for t in self.tokenize_words(text)]
            else:
                tokens = self.tokenize_words(text)

        return tokens


class RegExpTokenizer(BaseTokenizer):
    def __init__(self, pattern=r'\w+|\S', do_lower_case=False, preserved_pattern=None):
        super(RegExpTokenizer, self).__init__(do_lower_case, preserved_pattern)
        self.pattern = re.compile(pattern)

    def tokenize_words(self, text):
        tokens = [t.strip() for t in self.pattern.findall(text) if t.strip()]
        return tokens


class NLTKTokenizer(BaseTokenizer):
    def __init__(self, do_lower_case=False, preserved_pattern=None):
        super(NLTKTokenizer, self).__init__(do_lower_case, preserved_pattern)
        from nltk import word_tokenize
        self.nltk_tokenize = word_tokenize

    def tokenize_words(self, text):
        tokens = [t.strip() for t in self.nltk_tokenize(text) if t.strip()]
        return tokens


class MeCabTokenizer(BaseTokenizer):
    def __init__(self, mecab_option='', do_lower_case=False, preserved_pattern=None):
        super(MeCabTokenizer, self).__init__(do_lower_case, preserved_pattern)
        import MeCab
        self.mecab_option = mecab_option
        self.mecab = MeCab.Tagger(self.mecab_option)

    def tokenize_words(self, text):
        tokens = []
        for line in self.mecab.parse(text).split('\n'):
            if line == 'EOS':
                break

            token = line.split('\t')[0].strip()
            if not token:
                continue

            tokens.append(token)

        return tokens

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
import re
import argparse
from pathlib import Path

import logzero
from logzero import logger
from gensim.models.word2vec import LineSentence, Word2Vec


logger_word2vec = logzero.setup_logger(name='gensim.models.word2vec')
logger_base_any2vec = logzero.setup_logger(name='gensim.models.base_any2vec')

regex_entity = re.compile(r'##[^#]+?##')


def main(args):
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    word_vectors_file = output_dir / 'word_vectors.txt'
    entity_vectors_file = output_dir / 'entity_vectors.txt'
    all_vectors_file = output_dir / 'all_vectors.txt'

    logger.info('training the model')
    # sg=1 selects skip-gram; hs=0 together with negative > 0 enables negative sampling
    # (the size/iter keyword arguments follow the gensim 3.x API).
    model = Word2Vec(sentences=LineSentence(args.corpus_file),
                     size=args.embed_size,
                     window=args.window_size,
                     negative=args.sample_size,
                     min_count=args.min_count,
                     workers=args.workers,
                     sg=1,
                     hs=0,
                     iter=args.epoch)

    word_vocab_size = 0
    entity_vocab_size = 0
    for token in model.wv.vocab:
        if regex_entity.match(token):
            entity_vocab_size += 1
        else:
            word_vocab_size += 1

    total_vocab_size = word_vocab_size + entity_vocab_size
    logger.info(f'word vocabulary size: {word_vocab_size}')
    logger.info(f'entity vocabulary size: {entity_vocab_size}')
    logger.info(f'total vocabulary size: {total_vocab_size}')

    logger.info('writing word/entity vectors to files')
    with open(word_vectors_file, 'w') as fo_word, \
            open(entity_vectors_file, 'w') as fo_entity, \
            open(all_vectors_file, 'w') as fo_all:

        # write word2vec headers to each file
        print(word_vocab_size, args.embed_size, file=fo_word)
        print(entity_vocab_size, args.embed_size, file=fo_entity)
        print(total_vocab_size, args.embed_size, file=fo_all)

        # write tokens and vectors, most frequent first
        for (token, _) in sorted(model.wv.vocab.items(), key=lambda t: -t[1].count):
            vector = model.wv[token]

            if regex_entity.match(token):
                # entity tokens go to entity_vectors.txt with the ## marks stripped
                print(token[2:-2], *vector, file=fo_entity)
            else:
                print(token, *vector, file=fo_word)

            print(token, *vector, file=fo_all)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--corpus_file', type=str, required=True,
                        help='Corpus file (.txt)')
    parser.add_argument('--output_dir', type=str, required=True,
                        help='Output directory to save embedding files')
    parser.add_argument('--embed_size', type=int, default=100,
                        help='Dimensionality of the word/entity vectors [100]')
    parser.add_argument('--window_size', type=int, default=5,
                        help='Maximum distance between the current and '
                             'predicted word within a sentence [5]')
    parser.add_argument('--sample_size', type=int, default=5,
                        help='Number of negative samples [5]')
    parser.add_argument('--min_count', type=int, default=5,
                        help='Ignores all words/entities with total frequency lower than this [5]')
    parser.add_argument('--epoch', type=int, default=5,
                        help='number of training epochs [5]')
    parser.add_argument('--workers', type=int, default=2,
                        help='Use these many worker threads to train the model [2]')
    args = parser.parse_args()
    main(args)

--------------------------------------------------------------------------------