├── conec
│   ├── __init__.py
│   ├── context2vec.py
│   └── word2vec.py
├── .gitignore
├── LICENSE
├── README.md
└── examples
    ├── test_analogy.py
    └── test_ner.py
/conec/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.1" 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *~ 3 | *.sw[p|o] 4 | *.py[c|o] 5 | *.py~ 6 | *.csv 7 | *.db 8 | examples/data/ 9 | dist/ 10 | build/ 11 | conec.egg-info/ 12 | MANIFEST.in 13 | setup.py 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Context Encoders (ConEc) 2 | 3 | With this code you can train and evaluate Context Encoders (ConEc), an extension of word2vec that learns word embeddings from large corpora, creates out-of-vocabulary embeddings on the spot, and distinguishes between multiple meanings of words based on their local contexts. 4 | For further details on the model and experiments please refer to the [paper](https://arxiv.org/abs/1706.02496) - and of course if any of this code was helpful for your research, please consider citing it: 5 | ``` 6 | @inproceedings{horn2017conecRepL4NLP, 7 | author = {Horn, Franziska}, 8 | title = {Context encoders as a simple but powerful extension of word2vec}, 9 | booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP}, 10 | year = {2017}, 11 | organization = {Association for Computational Linguistics}, 12 | pages = {10--14} 13 | } 14 | ``` 15 | 16 | The code is intended for research purposes. It should run with both Python 2.7 and Python 3 - though no guarantees (please open an issue if you find a bug)!
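To give an idea of the out-of-vocabulary case: the context encoder can compute an embedding for a word it never saw during training from that word's local context alone. A minimal sketch (assuming a trained `w2v_model` and a fitted `context_model` as constructed in the code snippet under "conec library components" below; the example sentence and target word are made up):
```python
# a new document containing a token that is presumably not in the training vocabulary
tokens = "conec creates embeddings for unseen words on the fly".split()
# local context matrix: one row per distinct token, including out-of-vocabulary tokens
local_context_mat, tok_idx = context_model.get_local_context_matrix(tokens)
# multiplying a token's local context counts with the (length normalized) word2vec
# embeddings yields its ConEc embedding, computed on the spot
oov_emb = local_context_mat[tok_idx["conec"]].dot(w2v_model.wv.vectors_norm)
```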
17 | 18 | ### installation 19 | 20 | You either download the code from here and include the conec folder in your `$PYTHONPATH` or install (the library components only) via pip: 21 | ``` 22 | $ pip install conec 23 | ``` 24 | 25 | ### conec library components 26 | 27 | dependencies: `numpy, scipy` 28 | 29 | - `word2vec.py`: code to train a standard word2vec model, adapted from the corresponding [gensim](https://radimrehurek.com/gensim/) implementation. 30 | - `context2vec.py`: code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings: 31 | 32 | ```python 33 | # get the text for training 34 | sentences = Text8Corpus('data/text8') 35 | # train the word2vec model 36 | w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3) 37 | # get the global context matrix for the text 38 | context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count, window=w2v_model.window, wordlist=w2v_model.wv.index2word) 39 | context_mat = context_model.get_context_matrix(fill_diag=False, norm='max') 40 | # multiply the context matrix with the (length normalized) word2vec embeddings 41 | # to get the context encoder (ConEc) embeddings 42 | conec_emb = context_mat.dot(w2v_model.wv.vectors_norm) 43 | # renormalize so the word embeddings have unit length again 44 | conec_emb = conec_emb / np.array([np.linalg.norm(conec_emb, axis=1)]).T 45 | ``` 46 | 47 | 48 | ### examples 49 | 50 | additional dependencies: `sklearn` 51 | 52 | `test_analogy.py` and `test_ner.py` contain the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper. 53 | 54 | To run the analogy experiment, it is assumed that the [`text8 corpus`](http://mattmahoney.net/dc/text8.zip) or [`1-billion corpus`](http://code.google.com/p/1-billion-word-language-modeling-benchmark/) as well as the [`analogy questions`](https://code.google.com/archive/p/word2vec/) are in a data directory. 55 | 56 | To run the named entity recognition experiment, it is assumed that the corresponding [`training and test files`](http://www.cnts.ua.ac.be/conll2003/ner/) are located in the data/conll2003 directory. 57 | 58 | 59 | If you have any questions please don't hesitate to send me an [email](mailto:cod3licious@gmail.com) and of course if you should find any bugs or want to contribute other improvements, pull requests are very welcome! 60 | -------------------------------------------------------------------------------- /conec/context2vec.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from builtins import object 3 | from collections import defaultdict 4 | from copy import deepcopy 5 | import numpy as np 6 | from scipy.sparse import csr_matrix, lil_matrix 7 | 8 | 9 | class ContextModel(object): 10 | 11 | def __init__(self, sentences, min_count=5, window=5, wordlist=[], progress=1000, forward=True, backward=True): 12 | """ 13 | sentences: list/generator of lists of words 14 | in case this is based on a pretrained word2vec model, give the index2word attribute as wordlist 15 | 16 | Attributes: 17 | - min_count: how often a word has to occur at least 18 | - window: how many words in a word's context should be considered 19 | - word2index: {word:idx} 20 | - index2word: [word1, word2, ...] 
21 | - wcounts: {word: frequency} 22 | - featmat: n_voc x n_voc sparse array with weighted context word counts for every word 23 | - progress: after how many sentences a progress printout should occur (default 1000) 24 | """ 25 | self.progress = progress 26 | self.min_count = min_count 27 | self.window = window 28 | self.forward = forward 29 | self.backward = backward 30 | self.build_windex(sentences, wordlist) 31 | self._get_raw_context_matrix(sentences) 32 | 33 | def build_windex(self, sentences, wordlist=[]): 34 | """ 35 | go through all the sentences and get an overview of all used words and their frequencies 36 | """ 37 | # get an overview of the vocabulary 38 | vocab = defaultdict(int) 39 | for sentence_no, sentence in enumerate(sentences): 40 | if not sentence_no % self.progress: 41 | print("PROGRESS: at sentence #%i, processed %i words and %i unique words" % (sentence_no, sum(vocab.values()), len(vocab))) 42 | for word in sentence: 43 | vocab[word] += 1 44 | print("collected %i unique words from a corpus of %i words and %i sentences" % (len(vocab), sum(vocab.values()), sentence_no + 1)) 45 | # assign a unique index to each word and remove all words with freq < min_count 46 | self.wcounts, self.word2index, self.index2word = {}, {}, [] 47 | if not wordlist: 48 | wordlist = [word for word, c in vocab.items() if c >= self.min_count] 49 | for word in wordlist: 50 | self.word2index[word] = len(self.word2index) 51 | self.index2word.append(word) 52 | self.wcounts[word] = vocab[word] 53 | 54 | def _get_raw_context_matrix(self, sentences): 55 | """ 56 | compute the raw context matrix with weighted counts 57 | it has an entry for every word in the vocabulary 58 | """ 59 | # make the feature matrix 60 | featmat = lil_matrix((len(self.index2word), len(self.index2word)), dtype=float) 61 | for sentence_no, sentence in enumerate(sentences): 62 | if not sentence_no % self.progress: 63 | print("PROGRESS: at sentence #%i" % sentence_no) 64 | sentence = [word if word in self.word2index else None for word in sentence] 65 | # forward pass 66 | if self.forward: 67 | for i, word in enumerate(sentence[:-1]): 68 | if word: 69 | # get all words in the forward window 70 | wwords = sentence[i + 1:min(i + 1 + self.window, len(sentence))] 71 | for j, w in enumerate(wwords, 1): 72 | if w: 73 | featmat[self.word2index[word], self.word2index[w]] += 1. # /j 74 | # backwards pass 75 | if self.backward: 76 | sentence_back = sentence[::-1] 77 | for i, word in enumerate(sentence_back[:-1]): 78 | if word: 79 | # get all words in the forward window of the backwards sentence 80 | wwords = sentence_back[i + 1:min(i + 1 + self.window, len(sentence_back))] 81 | for j, w in enumerate(wwords, 1): 82 | if w: 83 | featmat[self.word2index[word], self.word2index[w]] += 1. 
# /j 84 | print("PROGRESS: through with all the sentences") 85 | self.featmat = csr_matrix(featmat) 86 | 87 | def get_context_matrix(self, fill_diag=True, norm='count'): 88 | """ 89 | for every word in the sentences, create a vector that contains the counts of its context words 90 | (weighted by the distance to it with a max distance of window) 91 | Inputs: 92 | - norm: if the feature matrix should be normalized to contain ones on the diagonal 93 | (--> average context vectors) 94 | - fill_diag: if diagonal of featmat should be filled with word counts 95 | Returns: 96 | - featmat: n_voc x n_voc sparse array with weighted context word counts for every word 97 | """ 98 | featmat = deepcopy(self.featmat) 99 | # fill up the diagonals with the total counts of each word --> similarity matrix 100 | if fill_diag: 101 | featmat = lil_matrix(featmat) 102 | for i, word in enumerate(self.index2word): 103 | featmat[i, i] = self.wcounts[word] 104 | featmat = csr_matrix(featmat) 105 | assert ((featmat - featmat.transpose()).data**2).sum() < 2.220446049250313e-16, "featmat not symmetric" 106 | # possibly normalize by the max counts 107 | if norm in ("count", "max"): 108 | normmat = lil_matrix(featmat.shape, dtype=float) 109 | if norm == "count": 110 | print("normalizing feature matrix by word count") 111 | normmat.setdiag([1. / self.wcounts[word] for word in self.index2word]) 112 | elif norm == "max": 113 | print("normalizing feature matrix by max counts") 114 | normmat.setdiag([1. / v[0] if v[0] else 1. for v in featmat.max(axis=1).toarray()]) 115 | featmat = csr_matrix(normmat) * featmat # row in featmat multiplied by entry on diagonal 116 | return featmat 117 | 118 | def get_local_context_matrix(self, tokens): 119 | """ 120 | compute a local context matrix. it has an entry for every token, even if it is not present in the vocabulary 121 | Inputs: 122 | - tokens: list of words 123 | Returns: 124 | - local_featmat: size len(set(tokens)) x n_vocab 125 | - tok_idx: {word: index} to map the words from the tokens list to an index of the featmat 126 | """ 127 | # for every token we still only need one representation per document 128 | tok_idx = {word: i for i, word in enumerate(set(tokens))} 129 | featmat = lil_matrix((len(tok_idx), len(self.index2word)), dtype=float) 130 | # clean out context words we don't know 131 | known_tokens = [word if word in self.word2index else None for word in tokens] 132 | # forward pass 133 | if self.forward: 134 | for i, word in enumerate(tokens[:-1]): 135 | # get all words in the forward window 136 | wwords = known_tokens[i + 1:min(i + 1 + self.window, len(known_tokens))] 137 | for j, w in enumerate(wwords, 1): 138 | if w: 139 | featmat[tok_idx[word], self.word2index[w]] += 1. / j 140 | # backwards pass 141 | if self.backward: 142 | tokens_back = tokens[::-1] 143 | known_tokens_back = known_tokens[::-1] 144 | for i, word in enumerate(tokens_back[:-1]): 145 | # get all words in the forward window of the backwards sentence, incl. word itself 146 | wwords = known_tokens_back[i + 1:min(i + 1 + self.window, len(known_tokens_back))] 147 | for j, w in enumerate(wwords, 1): 148 | if w: 149 | featmat[tok_idx[word], self.word2index[w]] += 1. / j 150 | featmat = csr_matrix(featmat) 151 | # normalize matrix 152 | normmat = lil_matrix((featmat.shape[0], featmat.shape[0]), dtype=float) 153 | normmat.setdiag([1. / v[0] if v[0] else 1. 
for v in featmat.max(axis=1).toarray()]) 154 | featmat = csr_matrix(normmat) * featmat 155 | return featmat, tok_idx 156 | -------------------------------------------------------------------------------- /examples/test_analogy.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from builtins import object, range 3 | from glob import glob 4 | import pickle as pkl 5 | import logging 6 | from copy import deepcopy 7 | import numpy as np 8 | 9 | from conec import word2vec 10 | from conec import context2vec 11 | 12 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 13 | 14 | 15 | class Text8Corpus(object): 16 | """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip .""" 17 | 18 | def __init__(self, fname): 19 | self.fname = fname 20 | 21 | def __iter__(self): 22 | # the entire corpus is one gigantic line -- there are no sentence marks at all 23 | # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens 24 | sentence, rest, max_sentence_length = [], '', 1000 25 | with open(self.fname) as fin: 26 | while True: 27 | text = rest + fin.read(8192) # avoid loading the entire file (=1 line) into RAM 28 | if text == rest: # EOF 29 | sentence.extend(rest.split()) # return the last chunk of words, too (may be shorter/longer) 30 | if sentence: 31 | yield sentence 32 | break 33 | # the last token may have been split in two... keep it for the next iteration 34 | last_token = text.rfind(' ') 35 | words, rest = (text[:last_token].split(), text[last_token:].strip()) if last_token >= 0 else ([], text) 36 | sentence.extend(words) 37 | while len(sentence) >= max_sentence_length: 38 | yield sentence[:max_sentence_length] 39 | sentence = sentence[max_sentence_length:] 40 | 41 | 42 | class OneBilCorpus(object): 43 | """Iterate over sentences from the "1-billion-word-language-modeling-benchmark" corpus, 44 | downloaded from http://code.google.com/p/1-billion-word-language-modeling-benchmark/ .""" 45 | 46 | def __init__(self): 47 | self.dir = 'data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news*' 48 | 49 | def __iter__(self): 50 | # go file by file 51 | for fname in glob(self.dir): 52 | with open(fname) as f: 53 | yield f.read().lower().split() 54 | 55 | 56 | def analogy(model, a, b, c): 57 | # man:woman as king:x - a:b as c:x - find x 58 | # get embeddings for a, b, and c and multiply with all other words 59 | a_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[a].index]) 60 | b_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[b].index]) 61 | c_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[c].index]) 62 | # add/multiply them as they should 63 | return b_sims - a_sims + c_sims 64 | # return (b_sims*c_sims)/a_sims 65 | 66 | 67 | def accuracy(model, questions, lowercase=True, restrict_vocab=30000): 68 | """ 69 | Compute accuracy of the model. `questions` is a filename where lines are 70 | 4-tuples of words, split into sections by ": SECTION NAME" lines. 71 | See https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt for an example. 72 | 73 | The accuracy is reported (=printed to log and returned as a list) for each 74 | section separately, plus there's one aggregate summary at the end. 
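Candidate answers for each question a:b :: c:? are ranked with the additive offset score returned by the `analogy()` helper above (sim(x, b) - sim(x, a) + sim(x, c)); the highest ranked word that is among the `restrict_vocab` most frequent words and is not one of the three input words counts as the prediction.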
75 | 76 | Use `restrict_vocab` to ignore all questions containing a word whose frequency 77 | is not in the top-N most frequent words (default top 30,000). 78 | 79 | This method corresponds to the `compute-accuracy` script of the original C word2vec. 80 | 81 | """ 82 | ok_vocab = dict(sorted(model.wv.vocab.items(), key=lambda item: -item[1].count)[:restrict_vocab]) 83 | ok_index = set(v.index for v in ok_vocab.values()) 84 | 85 | def log_accuracy(section): 86 | correct, incorrect = section['correct'], section['incorrect'] 87 | if correct + incorrect > 0: 88 | print("%s: %.1f%% (%i/%i)" % (section['section'], 89 | 100.0 * correct / (correct + incorrect), correct, correct + incorrect)) 90 | 91 | sections, section = [], None 92 | for line_no, line in enumerate(open(questions)): 93 | # TODO: use level3 BLAS (=evaluate multiple questions at once), for speed 94 | if line.startswith(': '): 95 | # a new section starts => store the old section 96 | if section: 97 | sections.append(section) 98 | log_accuracy(section) 99 | section = {'section': line.lstrip(': ').strip(), 'correct': 0, 'incorrect': 0} 100 | else: 101 | if not section: 102 | raise ValueError("missing section header before line #%i in %s" % (line_no, questions)) 103 | try: 104 | if lowercase: 105 | a, b, c, expected = [word.lower() for word in line.split()] 106 | else: 107 | a, b, c, expected = [word for word in line.split()] 108 | except: 109 | print("skipping invalid line #%i in %s" % (line_no, questions)) 110 | if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab: 111 | # print "skipping line #%i with OOV words: %s" % (line_no, line) 112 | continue 113 | 114 | ignore = set(model.wv.vocab[v].index for v in [a, b, c]) # indexes of words to ignore 115 | predicted = None 116 | # find the most likely prediction, ignoring OOV words and input words 117 | # for index in np.argsort(model.wv.most_similar(positive=[b, c], negative=[a], topn=False))[::-1]: 118 | for index in np.argsort(analogy(model, a, b, c))[::-1]: 119 | if index in ok_index and index not in ignore: 120 | predicted = model.wv.index2word[index] 121 | # if predicted != expected: 122 | # print "%s: expected %s, predicted %s" % (line.strip(), expected, predicted) 123 | break 124 | section['correct' if predicted == expected else 'incorrect'] += 1 125 | if section: 126 | # store the last section, too 127 | sections.append(section) 128 | log_accuracy(section) 129 | 130 | total = {'section': 'total', 'correct': sum(s['correct'] 131 | for s in sections), 'incorrect': sum(s['incorrect'] for s in sections)} 132 | log_accuracy(total) 133 | sections.append(total) 134 | return sections 135 | 136 | 137 | def accuracy_examples(model): 138 | # just as advertised... 139 | print(model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)) 140 | # "boy" is to "father" as "girl" is to ...? 141 | print(model.wv.most_similar(['girl', 'father'], ['boy'], topn=3)) 142 | more_examples = ["he his she", "big bigger bad", "going went being"] 143 | for example in more_examples: 144 | a, b, x = example.split() 145 | predicted = model.wv.most_similar([x, b], [a])[0][0] 146 | print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)) 147 | # which word doesn't go with the others? 
148 | print(model.wv.doesnt_match("breakfast cereal dinner lunch".split())) 149 | 150 | 151 | def evaluate_google(): 152 | # see https://code.google.com/archive/p/word2vec/ 153 | # load pretrained google embeddings and test 154 | from gensim.models import Word2Vec 155 | model_google = Word2Vec.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True) 156 | _ = accuracy(model_google, "data/questions-words.txt", False) 157 | 158 | 159 | def evaluate_word2vec(corpus, seed=1): 160 | # load and evaluate 161 | fname = "%s_cbow_200_hs0_neg13_seed%i.model" % (corpus, seed) 162 | with open("data/%s" % fname, 'rb') as f: 163 | model = pkl.load(f) 164 | _ = accuracy(model, "data/questions-words.txt") 165 | 166 | 167 | def evaluate_contextenc(corpus, seed=1): 168 | # load word2vec model 169 | print("####### seed = %i" % seed) 170 | fname = "%s_cbow_200_hs0_neg13_seed%i.model" % (corpus, seed) 171 | with open("data/%s" % fname, 'rb') as f: 172 | model_org = pkl.load(f) 173 | # get context matrix 174 | if corpus == 'text8': 175 | sentences = Text8Corpus('data/text8') 176 | elif corpus == '1bil': 177 | sentences = OneBilCorpus() 178 | context_model = context2vec.ContextModel( 179 | sentences, min_count=model_org.min_count, window=model_org.window, wordlist=model_org.wv.index2word) 180 | for fill_diag in [True, False]: 181 | model = deepcopy(model_org) 182 | # build context matrix 183 | print("constructing context matrix for fill_diag: %s" % (fill_diag)) 184 | context_mat = context_model.get_context_matrix(fill_diag, False) 185 | # adapt the word2vec model 186 | print("adapting the word2vec weights - vectors_norm") 187 | model.wv.vectors_norm = context_mat.dot(model.wv.vectors_norm) 188 | # renormalize 189 | model.wv.vectors_norm = model.wv.vectors_norm / np.array([np.linalg.norm(model.wv.vectors_norm, axis=1)]).T 190 | # evaluate 191 | print("evaluating the model") 192 | _ = accuracy(model, "data/questions-words.txt") 193 | 194 | 195 | def train_word2vec(corpus='text8', seed=1, it=10, save_interm=True): 196 | # load text 197 | if corpus == 'text8': 198 | sentences = Text8Corpus('data/text8') 199 | elif corpus == '1bil': 200 | sentences = OneBilCorpus() 201 | 202 | def save_model(model, saven): 203 | # delete the huge stupid table again 204 | table = deepcopy(model.table) 205 | model.table = None 206 | # pickle the entire model to disk, so we can load&resume training later 207 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 208 | # reinstate the table to continue training 209 | model.table = table 210 | 211 | # train the cbow model; default window=5 212 | model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, alpha=0.025, min_alpha=0.01, seed=seed) 213 | for i in range(1, it): 214 | print("####### ITERATION %i ########" % i) 215 | _ = accuracy(model, "data/questions-words.txt") 216 | if save_interm: 217 | save_model(model, "%s_cbow_200_hs0_neg13_seed%i_it%i.model" % (corpus, seed, i)) 218 | model.train(sentences, alpha=0.025, min_alpha=0.01) 219 | save_model(model, "%s_cbow_200_hs0_neg13_seed%i_it%i.model" % (corpus, seed, it)) 220 | print("####### ITERATION %i ########" % it) 221 | _ = accuracy(model, "data/questions-words.txt") 222 | accuracy_examples(model) 223 | 224 | 225 | def main(): 226 | # load the text on which we're training 227 | sentences = Text8Corpus('data/text8') 228 | # this would train the model for 1 iteration 229 | # model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3) 230 | # and we don't need 
the table used for negative sampling (it's huge) 231 | # model.table = None 232 | # however to replicate the results of the paper, you should train the model for 10 iterations 233 | # we set `it' to 3 here to speed up the process, change it to 10 for better accuracies 234 | it = 3 235 | train_word2vec(corpus='text8', seed=3, it=it, save_interm=False) 236 | # since this saves the model (e.g. for training on a cluster), we need to load it again 237 | with open("data/text8_cbow_200_hs0_neg13_seed3_it%i.model" % it, "rb") as f: 238 | model = pkl.load(f) 239 | """ 240 | collected 253854 unique words from a corpus of 17005207 words and 17006 sentences 241 | total of 71290 unique words after removing those with count < 5 242 | training model on 71290 vocabulary and 200 features 243 | training on 16718844 words took 2789.4s, 5994 words/s 244 | """ 245 | # evaluate the accuracy on the analogy task (the results below are after 3 iterations) 246 | _ = accuracy(model, "data/questions-words.txt") 247 | """ 248 | capital-common-countries: 35.8% (181/506) 249 | capital-world: 15.8% (230/1452) 250 | currency: 10.4% (28/268) 251 | city-in-state: 19.1% (300/1571) 252 | family: 73.2% (224/306) 253 | gram1-adjective-to-adverb: 11.2% (85/756) 254 | gram2-opposite: 19.3% (59/306) 255 | gram3-comparative: 57.0% (718/1260) 256 | gram4-superlative: 33.6% (170/506) 257 | gram5-present-participle: 24.2% (240/992) 258 | gram6-nationality-adjective: 62.8% (861/1371) 259 | gram7-past-tense: 27.5% (366/1332) 260 | gram8-plural: 41.3% (410/992) 261 | gram9-plural-verbs: 39.4% (256/650) 262 | total: 33.6% (4128/12268) 263 | """ 264 | # get the global context matrix relying on the same text 265 | context_model = context2vec.ContextModel(sentences, min_count=model.min_count, 266 | window=model.window, wordlist=model.wv.index2word) 267 | # best results on the analogy task when counting the target word in addition to the context words 268 | # --> fill diagonal of the context matrix. 
normalization is irrelevant since we renormalize later 269 | context_mat = context_model.get_context_matrix(fill_diag=True, norm=False) 270 | # adapt the word embeddings of the word2vec model by multiplying them with the context matrix 271 | model.wv.vectors_norm = context_mat.dot(model.wv.vectors_norm) 272 | # renormalize so the word embeddings have unit length again 273 | model.wv.vectors_norm = model.wv.vectors_norm / np.array([np.linalg.norm(model.wv.vectors_norm, axis=1)]).T 274 | # evaluate the model again 275 | _ = accuracy(model, "data/questions-words.txt") 276 | """ 277 | capital-common-countries: 62.3% (315/506) 278 | capital-world: 34.9% (507/1452) 279 | currency: 15.3% (41/268) 280 | city-in-state: 29.2% (458/1571) 281 | family: 72.5% (222/306) 282 | gram1-adjective-to-adverb: 14.0% (106/756) 283 | gram2-opposite: 19.9% (61/306) 284 | gram3-comparative: 54.2% (683/1260) 285 | gram4-superlative: 32.8% (166/506) 286 | gram5-present-participle: 26.7% (265/992) 287 | gram6-nationality-adjective: 56.1% (769/1371) 288 | gram7-past-tense: 25.5% (340/1332) 289 | gram8-plural: 37.1% (368/992) 290 | gram9-plural-verbs: 24.8% (161/650) 291 | total: 36.4% (4462/12268) 292 | """ 293 | 294 | 295 | if __name__ == '__main__': 296 | main() 297 | -------------------------------------------------------------------------------- /examples/test_ner.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from future import standard_library 3 | standard_library.install_aliases() 4 | from builtins import object, range, next 5 | import logging 6 | import pickle as pkl 7 | import re 8 | import unicodedata 9 | from copy import deepcopy 10 | import numpy as np 11 | from scipy.sparse import csr_matrix, lil_matrix 12 | from sklearn.linear_model import LogisticRegression as logreg 13 | 14 | from conec import word2vec 15 | from conec import context2vec 16 | 17 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 18 | 19 | 20 | def clean_conll2003(text, to_lower=False): 21 | # clean the text: no fucked up characters 22 | nfkd_form = unicodedata.normalize("NFKD", text) 23 | text = nfkd_form.encode("ASCII", "ignore").decode("ASCII") 24 | # normalize numbers 25 | text = re.sub(r"[0-9]", "1", text) 26 | if to_lower: 27 | text = text.lower() 28 | return text 29 | 30 | 31 | class CoNLL2003(object): 32 | # collected 20102 unique words from a corpus of 218609 words and 946 sentences 33 | # generator for the conll2003 training data 34 | 35 | def __init__(self, to_lower=False, sources=["data/conll2003/ner/eng.train"]): 36 | self.sources = sources 37 | self.to_lower = to_lower 38 | 39 | def __iter__(self): 40 | """Iterate through all news articles.""" 41 | for fname in self.sources: 42 | tokens = [] 43 | for line in open(fname): 44 | if line.startswith("-DOCSTART- -X- -X-"): 45 | if tokens: 46 | yield tokens 47 | tokens = [] 48 | elif line.strip(): 49 | tokens.append(clean_conll2003(line.split()[0], self.to_lower)) 50 | else: 51 | tokens.append('') 52 | yield tokens 53 | 54 | 55 | def train_word2vec(train_all=False, it=20, seed=1): 56 | # train all models for 20 iterations 57 | # train the word2vec model on a) the training data 58 | sentences = CoNLL2003(to_lower=True) 59 | 60 | def save_model(model, saven): 61 | # delete the huge stupid table again 62 | table = deepcopy(model.table) 63 | model.table = None 64 | # pickle the entire model to disk, so we can 
load&resume training later 65 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 66 | # reinstate the table to continue training 67 | model.table = table 68 | 69 | # train the cbow model; default window=5 70 | alpha = 0.02 71 | model = word2vec.Word2Vec(sentences, min_count=1, mtype='cbow', hs=0, neg=13, vector_size=200, alpha=alpha, min_alpha=alpha, seed=seed) 72 | for i in range(1, it): 73 | print("####### ITERATION %i ########" % (i + 1)) 74 | if not i % 5: 75 | save_model(model, "conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, i)) 76 | alpha /= 2. 77 | alpha = max(alpha, 0.0001) 78 | model.train(sentences, alpha=alpha, min_alpha=alpha) 79 | save_model(model, "conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, it)) 80 | if train_all: 81 | # and b) the training + test data 82 | sentences = CoNLL2003(to_lower=True, sources=[ 83 | "data/conll2003/ner/eng.train", "data/conll2003/ner/eng.testa", "data/conll2003/ner/eng.testb"]) 84 | model = word2vec.Word2Vec(sentences, min_count=1, mtype='cbow', hs=0, neg=13, vector_size=200, seed=seed) 85 | for i in range(19): 86 | model.train(sentences) 87 | # delete the huge stupid table again 88 | model.table = None 89 | # pickle the entire model to disk, so we can load&resume training later 90 | saven = "conll2003_test_20it_cbow_200_hs0_neg13_seed%i.model" % seed 91 | print("saving model") 92 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 93 | 94 | 95 | def make_wordfeat(w): 96 | return [int(w.isalnum()), int(w.isalpha()), int(w.isdigit()), 97 | int(w.islower()), int(w.istitle()), int(w.isupper()), 98 | len(w)] 99 | 100 | 101 | def make_featmat_wordfeat(tokens): 102 | # tokens: list of words 103 | return np.array([make_wordfeat(t) for t in tokens]) 104 | 105 | 106 | class ContextEnc_NER(object): 107 | 108 | def __init__(self, w2v_model, contextm=False, sentences=[], w_local=0.4, context_global_only=False, include_wf=False, to_lower=True, normed=True, renorm=True): 109 | self.clf = None 110 | self.w2v_model = w2v_model 111 | self.rep_idx = {word: i for i, word in enumerate(w2v_model.wv.index2word)} 112 | self.include_wf = include_wf 113 | self.to_lower = to_lower 114 | self.w_local = w_local # how much the local context compared to the global should count 115 | self.context_global_only = context_global_only # if only global context should count (0 if global not available -- not same as w_local=0) 116 | self.normed = normed 117 | self.renorm = renorm 118 | # should we include the context? 
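# (if so, a ContextModel is fit on the given sentences with the word2vec model's vocabulary
# and window size, and its featmat attribute is replaced by the precomputed global context
# matrix, with each row normalized by its maximum count)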
119 | if contextm: 120 | # sentences: depending on what the word2vec model was trained 121 | self.context_model = context2vec.ContextModel( 122 | sentences, min_count=1, window=w2v_model.window, wordlist=w2v_model.wv.index2word) 123 | # --> create a global context matrix 124 | self.context_model.featmat = self.context_model.get_context_matrix(False, 'max') 125 | else: 126 | self.context_model = None 127 | 128 | def make_featmat_rep(self, tokens, local_context_mat=None, tok_idx={}): 129 | """ 130 | Inputs: 131 | - tokens: list of words 132 | Returns: 133 | - featmat: dense feature matrix for every token 134 | """ 135 | # possibly preprocess tokens 136 | if self.to_lower: 137 | pp_tokens = [t.lower() for t in tokens] 138 | else: 139 | pp_tokens = tokens 140 | dim = self.w2v_model.wv.vector_size 141 | if self.include_wf: 142 | dim += 7 143 | featmat = np.zeros((len(tokens), dim), dtype=float) 144 | # index in featmat for all known tokens 145 | idx_featmat = [i for i, t in enumerate(pp_tokens) if t in self.rep_idx] 146 | if self.normed: 147 | rep_mat = deepcopy(self.w2v_model.wv.vectors_norm) 148 | else: 149 | rep_mat = deepcopy(self.w2v_model.vectors) 150 | if self.context_model: 151 | if self.context_global_only: 152 | # make context matrix out of global context vectors only 153 | context_mat = lil_matrix((len(tokens), len(self.rep_idx))) 154 | global_tok_idx = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 155 | context_mat[idx_featmat, :] = self.context_model.featmat[global_tok_idx, :] 156 | else: 157 | # compute the local context matrix 158 | if not tok_idx: 159 | local_context_mat, tok_idx = self.context_model.get_local_context_matrix(pp_tokens) 160 | local_tok_idx = [tok_idx[t] for t in pp_tokens] 161 | context_mat = lil_matrix(local_context_mat[local_tok_idx, :]) 162 | assert context_mat.shape == (len(tokens), len(self.rep_idx)), "context matrix has wrong shape" 163 | # average it with the global context vectors if available 164 | local_global_tok_idx = [tok_idx[t] for t in pp_tokens if t in self.rep_idx] 165 | global_tok_idx = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 166 | context_mat[idx_featmat, :] = self.w_local * lil_matrix(local_context_mat[local_global_tok_idx, :]) + ( 167 | 1. 
- self.w_local) * self.context_model.featmat[global_tok_idx, :] 168 | # multiply context_mat with rep_mat to get featmat (+ normalize) 169 | featmat[:, 0:rep_mat.shape[1]] = csr_matrix(context_mat) * rep_mat 170 | # length normalize the feature vectors 171 | if self.renorm: 172 | fnorm = np.linalg.norm(featmat, axis=1) 173 | featmat[fnorm > 0, :] = featmat[fnorm > 0, :] / np.array([fnorm[fnorm > 0]]).T 174 | else: 175 | # we set the feature matrix with the word2vec embeddings directly; 176 | # tokens not in the original vocab will have a zero representation 177 | idx_repmat = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 178 | featmat[idx_featmat, 0:rep_mat.shape[1]] = rep_mat[idx_repmat, :] 179 | if self.include_wf: 180 | featmat[:, dim - 7:] = make_featmat_wordfeat(tokens) 181 | return featmat 182 | 183 | def train_clf(self, trainfiles): 184 | # tokens: list of words, labels: list of corresponding labels 185 | # go document by document because of local context 186 | final_labels = [] 187 | featmat = [] 188 | for trainfile in trainfiles: 189 | for tokens, labels in yield_tokens_labels(trainfile): 190 | final_labels.extend(labels) 191 | featmat.append(self.make_featmat_rep(tokens)) 192 | featmat = np.vstack(featmat) 193 | print("training classifier") 194 | clf = logreg(class_weight='balanced', random_state=1) 195 | clf.fit(featmat, final_labels) 196 | self.clf = clf 197 | 198 | def find_ne_in_text(self, text, local_context_mat=None, tok_idx={}): 199 | featmat = self.make_featmat_rep(text.strip().split(), local_context_mat, tok_idx) 200 | labels = self.clf.predict(featmat) 201 | # stitch text back together 202 | results = [] 203 | for i, t in enumerate(text.strip().split()): 204 | if results and labels[i] == results[-1][1]: 205 | results[-1] = (results[-1][0] + " " + t, results[-1][1]) 206 | else: 207 | if results: 208 | results.append((' ', 'O')) 209 | results.append((t, labels[i])) 210 | return results 211 | 212 | 213 | def process_wordlabels(word_labels): 214 | # process labels 215 | tokens = [] 216 | labels = [] 217 | for word, l in word_labels: 218 | if word: 219 | if l.startswith("I-") or l.startswith("B-"): 220 | l = l[2:] 221 | tokens.append(word) 222 | labels.append(l) 223 | assert len(tokens) == len(labels), "must have same number of tokens as labels" 224 | return tokens, labels 225 | 226 | 227 | def get_tokens_labels(trainfile): 228 | # read in trainfile to generate training labels 229 | with open(trainfile) as f: 230 | word_labels = [(clean_conll2003(line.split()[0]), line.strip().split()[-1]) if line.strip() 231 | else ('', 'O') for line in f if not line.startswith("-DOCSTART- -X- -X-")] 232 | return process_wordlabels(word_labels) 233 | 234 | 235 | def yield_tokens_labels(trainfile): 236 | # generate tokens and labels for every document 237 | word_labels = [] 238 | for line in open(trainfile): 239 | if line.startswith("-DOCSTART- -X- -X-"): 240 | if word_labels: 241 | yield process_wordlabels(word_labels) 242 | word_labels = [] 243 | elif line.strip(): 244 | word_labels.append((clean_conll2003(line.split()[0]), line.strip().split()[-1])) 245 | else: 246 | word_labels.append(('', 'O')) 247 | yield process_wordlabels(word_labels) 248 | 249 | 250 | def ne_results_2_labels(ne_results): 251 | """ 252 | helper function to transform a list of substrings and labels 253 | into a list of labels for every (white space separated) token 254 | """ 255 | l_list = [] 256 | last_l = '' 257 | for i, (substr, l) in enumerate(ne_results): 258 | if substr == ' ': 259 | continue 260 | 
if not l or l == 'O': 261 | l_out = 'O' 262 | elif l == last_l: 263 | l_out = "B-" + l 264 | else: 265 | l_out = "I-" + l 266 | last_l = l 267 | if (not i) or (substr.startswith(' ') or ne_results[i - 1][0].endswith(' ')): 268 | l_list.append(l_out) 269 | # if there is no space between the previous and last substring, first token gets label 270 | # of longer subsubstr (i.e. either previous or current) 271 | elif i and len(ne_results[i - 1][0].split()[-1]) < len(substr.split()[0]): 272 | l_list.pop() 273 | l_list.append(l_out) 274 | l_list.extend([l_out for n in range(len(substr.split()) - 1)]) 275 | return l_list 276 | 277 | 278 | def apply_conll2003_ner(ner, testfile, outfile): 279 | """ 280 | Inputs: 281 | - ner: named entity classifier with find_ne_in_text method 282 | - testfile: path to the testfile 283 | - outfile: where the output should be saved 284 | """ 285 | documents = CoNLL2003(sources=[testfile], to_lower=True) 286 | documents_it = documents.__iter__() 287 | local_context_mat, tok_idx = None, {} 288 | # read in test file + generate outfile 289 | with open(outfile, 'w') as f_out: 290 | # collect all the words in a sentence and save other rest of the lines 291 | to_write, tokens = [], [] 292 | doc_tokens = [] 293 | for line in open(testfile): 294 | if line.startswith("-DOCSTART- -X- -X-"): 295 | f_out.write("-DOCSTART- -X- -X- O O\n") 296 | # we're at a new document, time for a new local context matrix 297 | if ner.context_model: 298 | doc_tokens = next(documents_it) 299 | local_context_mat, tok_idx = ner.context_model.get_local_context_matrix(doc_tokens) 300 | # outfile: testfile + additional column with predicted label 301 | elif line.strip(): 302 | to_write.append(line.strip()) 303 | tokens.append(clean_conll2003(line.split()[0])) 304 | else: 305 | # end of sentence: find all named entities! 
306 | if to_write: 307 | ne_results = ner.find_ne_in_text(" ".join(tokens), local_context_mat, tok_idx) 308 | assert " ".join(tokens) == "".join(r[0] 309 | for r in ne_results), "returned text doesn't match" # sanity check 310 | l_list = ne_results_2_labels(ne_results) 311 | assert len(l_list) == len(tokens), "Error: %i labels but %i tokens" % (len(l_list), len(tokens)) 312 | for i, line in enumerate(to_write): 313 | f_out.write(to_write[i] + " " + l_list[i] + "\n") 314 | to_write, tokens = [], [] 315 | f_out.write("\n") 316 | 317 | 318 | def log_results(clf_ner, description, filen='', subf=''): 319 | import os 320 | if not os.path.exists('data/conll2003_results'): 321 | os.mkdir('data/conll2003_results') 322 | if not os.path.exists('data/conll2003_results%s' % subf): 323 | os.mkdir('data/conll2003_results%s' % subf) 324 | import subprocess 325 | print("applying to training set") 326 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.train', 'data/conll2003_results%s/eng.out_train.txt' % subf) 327 | print("applying to test set") 328 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.testa', 'data/conll2003_results%s/eng.out_testa.txt' % subf) 329 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.testb', 'data/conll2003_results%s/eng.out_testb.txt' % subf) 330 | # write out results 331 | with open('data/conll2003_results/output_all_%s.txt' % filen, 'a') as f: 332 | f.write('%s\n' % description) 333 | f.write('results on training data\n') 334 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_train.txt' % subf)[1] 335 | f.write(out) 336 | f.write('\n') 337 | f.write('results on testa\n') 338 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_testa.txt' % subf)[1] 339 | f.write(out) 340 | f.write('\n') 341 | f.write('results on testb\n') 342 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_testb.txt' % subf)[1] 343 | f.write(out) 344 | f.write('\n') 345 | f.write('\n') 346 | 347 | 348 | if __name__ == '__main__': 349 | seed = 3 350 | it = 20 351 | train_word2vec(train_all=False, it=it, seed=seed) 352 | # load pretrained word2vec model 353 | with open("data/conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, it), 'rb') as f: 354 | w2v_model = pkl.load(f) 355 | # train a classifier with these word embeddings on the training part 356 | clf_ner = ContextEnc_NER(w2v_model, include_wf=False) 357 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 358 | # apply the classifier to all training and test parts of the CoNLL2003 task, 359 | # run the evaluation script and save the results 360 | log_results(clf_ner, '####### word2vec model, seed: %i, it: %i' % (seed, it), 'word2vec_%i' % seed, '_word2vec_%i_%i' % (seed, it)) 361 | """ 362 | results on training data 363 | processed 204567 tokens with 23499 phrases; found: 38310 phrases; correct: 11537. 364 | accuracy: 84.48%; precision: 30.11%; recall: 49.10%; FB1: 37.33 365 | LOC: precision: 51.57%; recall: 75.06%; FB1: 61.14 10391 366 | MISC: precision: 21.22%; recall: 39.70%; FB1: 27.66 6432 367 | ORG: precision: 18.52%; recall: 29.08%; FB1: 22.63 9924 368 | PER: precision: 25.73%; recall: 45.08%; FB1: 32.76 11563 369 | results on testa 370 | processed 51578 tokens with 5942 phrases; found: 8422 phrases; correct: 2525. 
371 | accuracy: 84.04%; precision: 29.98%; recall: 42.49%; FB1: 35.16 372 | LOC: precision: 52.03%; recall: 66.85%; FB1: 58.52 2360 373 | MISC: precision: 25.25%; recall: 41.54%; FB1: 31.41 1517 374 | ORG: precision: 19.26%; recall: 30.28%; FB1: 23.54 2108 375 | PER: precision: 20.85%; recall: 27.58%; FB1: 23.74 2437 376 | results on testb 377 | processed 46666 tokens with 5648 phrases; found: 7338 phrases; correct: 1960. 378 | accuracy: 82.26%; precision: 26.71%; recall: 34.70%; FB1: 30.19 379 | LOC: precision: 52.07%; recall: 66.49%; FB1: 58.40 2130 380 | MISC: precision: 19.05%; recall: 38.32%; FB1: 25.45 1412 381 | ORG: precision: 19.64%; recall: 22.40%; FB1: 20.93 1894 382 | PER: precision: 11.04%; recall: 12.99%; FB1: 11.94 1902 383 | """ 384 | 385 | # load the text again (same as word2vec model was trained on) to generate the context matrix 386 | sentences = CoNLL2003(to_lower=True) 387 | # only use global context; no rep for out-of-vocab 388 | clf_ner = ContextEnc_NER(w2v_model, contextm=True, sentences=sentences, w_local=0., context_global_only=True) 389 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 390 | # evaluate the results again 391 | log_results(clf_ner, '####### context enc with global context matrix only, seed: %i, it: %i' % (seed, it), 'conec_global_%i' % seed, '_conec_global_%i_%i' % (seed, it)) 392 | 393 | # for the out-of-vocabulary words in the dev and test set, only the local context matrix (based on only the current doc) 394 | # is used to generate the respective word embeddings; where a global context vector is available (for all words in the training set) 395 | # we use a combination of the local and global context, determined by w_local 396 | for w_local in [0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.]: 397 | print(w_local) 398 | clf_ner = ContextEnc_NER(w2v_model, contextm=True, sentences=sentences, w_local=w_local) 399 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 400 | # evaluate the results again 401 | log_results(clf_ner, '####### context enc with a combination of the global and local context matrix (w_local=%.1f), seed: %i, it: %i' % (w_local, seed, it), 'conec_%i_%i' % (round(w_local*10), seed), '_conec_%i_%i_%i' % (round(w_local*10), seed, it)) 402 | """ 403 | results on training data 404 | processed 204567 tokens with 23499 phrases; found: 33708 phrases; correct: 11675. 405 | accuracy: 84.34%; precision: 34.64%; recall: 49.68%; FB1: 40.82 406 | LOC: precision: 57.46%; recall: 75.34%; FB1: 65.20 9361 407 | MISC: precision: 19.56%; recall: 37.14%; FB1: 25.62 6530 408 | ORG: precision: 19.16%; recall: 24.62%; FB1: 21.55 8119 409 | PER: precision: 35.71%; recall: 52.47%; FB1: 42.50 9698 410 | results on testa 411 | processed 51578 tokens with 5942 phrases; found: 8756 phrases; correct: 3244. 412 | accuracy: 85.01%; precision: 37.05%; recall: 54.59%; FB1: 44.14 413 | LOC: precision: 56.96%; recall: 77.74%; FB1: 65.75 2507 414 | MISC: precision: 22.97%; recall: 41.76%; FB1: 29.64 1676 415 | ORG: precision: 20.96%; recall: 28.64%; FB1: 24.20 1832 416 | PER: precision: 38.20%; recall: 56.84%; FB1: 45.69 2741 417 | results on testb 418 | processed 46666 tokens with 5648 phrases; found: 8407 phrases; correct: 2830. 
419 | accuracy: 84.17%; precision: 33.66%; recall: 50.11%; FB1: 40.27 420 | LOC: precision: 53.21%; recall: 74.58%; FB1: 62.11 2338 421 | MISC: precision: 16.29%; recall: 36.32%; FB1: 22.50 1565 422 | ORG: precision: 24.44%; recall: 30.04%; FB1: 26.95 2042 423 | PER: precision: 33.79%; recall: 51.45%; FB1: 40.79 2462 424 | """ 425 | -------------------------------------------------------------------------------- /conec/word2vec.py: -------------------------------------------------------------------------------- 1 | # Original Code by Radim Rehurek 2 | # [Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html] 3 | # see: http://radimrehurek.com/gensim/ 4 | # 5 | # Rewrite by Franziska Horn 6 | 7 | from __future__ import unicode_literals, division, print_function, absolute_import 8 | from builtins import object, range, str 9 | import time 10 | import logging 11 | import heapq 12 | from copy import deepcopy 13 | from math import sqrt 14 | import numpy as np 15 | 16 | logger = logging.getLogger("word2vec") 17 | 18 | 19 | class Vocab(object): 20 | """ 21 | A single vocabulary item, used internally e.g. for constructing binary trees 22 | (incl. both word leaves and inner nodes). 23 | 24 | Possible Fields: 25 | - count: how often the word occurred in the training sentences 26 | - index: the word's index in the embedding 27 | """ 28 | 29 | def __init__(self, **kwargs): 30 | self.count = 0 31 | self.__dict__.update(kwargs) 32 | 33 | def __lt__(self, other): # used for sorting in a priority queue 34 | return self.count < other.count 35 | 36 | def __str__(self): 37 | vals = ['%s:%r' % (key, self.__dict__[key]) for key in sorted(self.__dict__) if not key.startswith('_')] 38 | return "%s(%s)" % (self.__class__.__name__, ', '.join(vals)) 39 | 40 | 41 | class Word2VecEmbeddings(object): 42 | """ 43 | Word2Vec embeddings only - can't be trained further, but enough for all calculations 44 | """ 45 | def __init__(self, vector_size=100): 46 | """ 47 | Initialize Word2Vec embeddings 48 | 49 | Inputs: 50 | - vector_size: (default 100) dimensionality of embedding 51 | """ 52 | self.vector_size = vector_size 53 | self.vectors = np.zeros((0, vector_size)) 54 | self.vectors_norm = None 55 | self.vocab = {} # mapping from a word (string) to a Vocab object 56 | self.index2word = [] # map from a word's matrix index (int) to the word (string) 57 | 58 | def __str__(self): 59 | return "Word2VecEmbeddings(vocab=%s, size=%s)" % (len(self.index2word), self.vector_size) 60 | 61 | def __getitem__(self, word): 62 | """ 63 | Return a word's representations in vector space, as a 1D numpy array. 64 | 65 | Example: 66 | >>> trained_model['woman'] 67 | array([ -1.40128313e-02, ...] 68 | """ 69 | return self.vectors[self.vocab[word].index] 70 | 71 | def __contains__(self, word): 72 | return word in self.vocab 73 | 74 | def build_vocab(self, sentences, min_count=5, thr=0): 75 | """ 76 | Build vocabulary from a sequence of sentences (can be a once-only generator stream). 77 | Each sentence must be a list of strings. 
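Words occurring fewer than `min_count` times are discarded from the vocabulary; if `thr` > 0, each remaining word additionally gets a probability used for sub-sampling frequent words during training.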
78 | 79 | Inputs: 80 | - sentences: List or generator object supplying lists of (preprocessed) words 81 | used to train the model (otherwise train manually with model.train(sentences)) 82 | - min_count: (default 5) how often a word has to occur at least to be taken into the vocab 83 | - thr: (default 0) threshold for computing probabilities for sub-sampling words in training 84 | """ 85 | logger.info("collecting all words and their counts") 86 | sentence_no, vocab = -1, {} 87 | total_words = 0 88 | for sentence_no, sentence in enumerate(sentences): 89 | if not sentence_no % 10000: 90 | logger.info("PROGRESS: at sentence #%i, processed %i words and %i unique words" % 91 | (sentence_no, total_words, len(vocab))) 92 | for word in sentence: 93 | total_words += 1 94 | try: 95 | vocab[word].count += 1 96 | except KeyError: 97 | vocab[word] = Vocab(count=1) 98 | logger.info("collected %i unique words from a corpus of %i words and %i sentences" % 99 | (len(vocab), total_words, sentence_no + 1)) 100 | # assign a unique index to each word 101 | self.vocab, self.index2word = {}, [] 102 | for word, v in vocab.items(): 103 | if v.count >= min_count: 104 | v.index = len(self.vocab) 105 | self.index2word.append(word) 106 | self.vocab[word] = v 107 | logger.info("total of %i unique words after removing those with count < %s" % (len(self.vocab), min_count)) 108 | # add probabilities for sub-sampling (if thr > 0) 109 | if thr > 0: 110 | total_words = float(sum(v.count for v in self.vocab.values())) 111 | for word in self.vocab: 112 | # formula from paper 113 | # self.vocab[word].prob = max(0.,1.-sqrt(thr*total_words/self.vocab[word].count)) 114 | # formula from code 115 | self.vocab[word].prob = (sqrt(self.vocab[word].count / (thr * total_words) 116 | ) + 1.) * (thr * total_words) / self.vocab[word].count 117 | else: 118 | # if prob is 0, word wont get discarded 119 | for word in self.vocab: 120 | self.vocab[word].prob = 0. 121 | 122 | def init_sims(self): 123 | # for convenience (for later similarity computations, etc.), store all 124 | # embeddings additionally as unit length vectors 125 | self.vectors_norm = self.vectors / np.array([np.linalg.norm(self.vectors, axis=1)]).T 126 | 127 | def similarity(self, w1, w2): 128 | """ 129 | Compute cosine similarity between two words. 130 | 131 | Example:: 132 | >>> trained_model.similarity('woman', 'man') 133 | 0.73723527 134 | """ 135 | if self.vectors_norm is None: 136 | self.init_sims() 137 | return np.inner(self.vectors_norm[self.vocab[w1].index], self.vectors_norm[self.vocab[w2].index]) 138 | 139 | def most_similar(self, positive=[], negative=[], topn=10): 140 | """ 141 | Find the top-N most similar words. Positive words contribute positively towards the 142 | similarity, negative words negatively. 143 | 144 | This method computes cosine similarity between a simple mean of the projection 145 | weight vectors of the given words, and corresponds to the `word-analogy` and 146 | `distance` scripts in the original word2vec implementation. 147 | 148 | Example:: 149 | >>> trained_model.most_similar(positive=['woman', 'king'], negative=['man']) 150 | [('queen', 0.50882536), ...] 151 | """ 152 | if self.vectors_norm is None: 153 | self.init_sims() 154 | if isinstance(positive, str) and not negative: 155 | # allow calls like most_similar('dog'), as a shorthand for most_similar(['dog']) 156 | positive = [positive] 157 | 158 | # add weights for each word, if not already present; default to 1.0 for positive and -1.0 for negative words 159 | positive = [(word, 1.) 
if isinstance(word, str) else word for word in positive] 160 | negative = [(word, -1.) if isinstance(word, str) else word for word in negative] 161 | 162 | # compute the weighted average of all words 163 | all_words = set() 164 | mean = np.zeros(self.vector_size) 165 | for word, weight in positive + negative: 166 | try: 167 | mean += weight * self.vectors_norm[self.vocab[word].index] 168 | all_words.add(self.vocab[word].index) 169 | except KeyError: 170 | print("word '%s' not in vocabulary" % word) 171 | if not all_words: 172 | raise ValueError("cannot compute similarity with no input") 173 | dists = np.dot(self.vectors_norm, mean / np.linalg.norm(mean)) 174 | if not topn: 175 | return dists 176 | best = np.argsort(dists)[::-1][:topn + len(all_words)] 177 | # ignore (don't return) words from the input 178 | result = [(self.index2word[sim], dists[sim]) for sim in best if sim not in all_words] 179 | return result[:topn] 180 | 181 | def doesnt_match(self, words): 182 | """ 183 | Which word from the given list doesn't go with the others? 184 | 185 | Example:: 186 | >>> trained_model.doesnt_match("breakfast cereal dinner lunch".split()) 187 | 'cereal' 188 | """ 189 | if self.vectors_norm is None: 190 | self.init_sims() 191 | words = [word for word in words if word in self.vocab] # filter out OOV words 192 | logger.debug("using words %s" % words) 193 | if not words: 194 | raise ValueError("cannot select a word from an empty list") 195 | # which word vector representation is furthest away from the mean? 196 | selection = self.vectors_norm[[self.vocab[word].index for word in words]] 197 | mean = np.mean(selection, axis=0) 198 | sim = np.dot(selection, mean / np.linalg.norm(mean)) 199 | return words[np.argmin(sim)] 200 | 201 | 202 | class Word2Vec(object): 203 | """ 204 | Word2Vec Model, which can be trained and then contains word embedding that can be used for all kinds of cool stuff. 
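Both skip-gram ('sg') and continuous bag-of-words ('cbow') architectures are supported, trained with hierarchical softmax and/or negative sampling; the learned embeddings themselves are stored in the `wv` attribute (a Word2VecEmbeddings instance).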
205 | """ 206 | 207 | def __init__(self, sentences=None, vector_size=100, mtype='sg', hs=1, neg=0, window=5, 208 | thr=0, min_count=5, alpha=0.025, min_alpha=0.0001, seed=1): 209 | """ 210 | Initialize Word2Vec model 211 | 212 | Inputs: 213 | - sentences: (default None) List or generator object supplying lists of (preprocessed) words 214 | used to train the model (otherwise train manually with model.train(sentences)) 215 | - vector_size: (default 100) dimensionality of embedding 216 | - mtype: (default 'sg') type of model: either 'sg' (skipgram) or 'cbow' (bag of words) 217 | - hs: (default 1) if != 0, hierarchical softmax will be used for training the model 218 | - neg: (default 0) if > 0, negative sampling will be used for training the model; 219 | neg specifies the # of noise words 220 | - window: (default 5) max distance of context words from target word in training 221 | - thr: (default 0) threshold for computing probabilities for sub-sampling words in training 222 | - min_count: (default 5) how often a word has to occur at least to be taken into the vocab 223 | - alpha: (default 0.025) initial learning rate 224 | - min_alpha: (default 0.0001) if < alpha, the learning rate will be decreased to min_alpha 225 | - seed: (default 1) random seed (for initializing the embeddings) 226 | """ 227 | assert mtype.lower() in ('sg', 'cbow'), "unknown model, use 'sg' or 'cbow'" 228 | self.wv = Word2VecEmbeddings(vector_size) # stores the actual word2vec embeddings 229 | self.mtype = mtype.lower() 230 | self.hs = hs 231 | self.neg = neg 232 | self.window = window 233 | self.thr = thr 234 | self.min_count = min_count 235 | self.alpha = alpha 236 | self.min_alpha = min_alpha 237 | self.seed = seed 238 | # possibly train model 239 | if sentences: 240 | self.train_setup(sentences) 241 | self.train(sentences) 242 | 243 | def __str__(self): 244 | return "Word2Vec(vocab=%s, size=%s, mtype=%s, hs=%i, neg=%i)" % (len(self.wv.index2word), self.wv.vector_size, self.mtype, self.hs, self.neg) 245 | 246 | def reset_weights(self): 247 | """ 248 | Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary. 249 | """ 250 | np.random.seed(self.seed) 251 | # weights 252 | self.syn1 = np.asarray( 253 | np.random.uniform( 254 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 255 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 256 | size=(len(self.wv.vocab), self.wv.vector_size) 257 | ), 258 | dtype=float 259 | ) 260 | self.syn1neg = np.asarray( 261 | np.random.uniform( 262 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 263 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 264 | size=(len(self.wv.vocab), self.wv.vector_size) 265 | ), 266 | dtype=float 267 | ) 268 | # embedding 269 | self.wv.vectors = np.asarray( 270 | np.random.uniform( 271 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 272 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 273 | size=(len(self.wv.vocab), self.wv.vector_size) 274 | ), 275 | dtype=float 276 | ) 277 | 278 | def _make_table(self, table_size=100000000., power=0.75): 279 | """ 280 | Create a table using stored vocabulary word counts for drawing random words in the negative 281 | sampling training routines. 
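Word indices are entered into the table proportionally to count**power (the smoothed unigram distribution, power=0.75 by default), so that frequent words are drawn more often as noise words.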
282 |         """
283 |         vocab_size = len(self.wv.vocab)
284 |         logger.info("constructing a table with noise distribution from %i words" % vocab_size)
285 |         # table (= list of words) of noise distribution for negative sampling
286 |         self.table = np.zeros(int(table_size), dtype=int)
287 |         # compute sum of all power (Z in paper)
288 |         train_words_pow = float(sum([self.wv.vocab[word].count**power for word in self.wv.vocab]))
289 |         # go through the whole table and fill it up with the word indexes proportional to a word's count**power
290 |         widx = 0
291 |         # normalize count^0.75 by Z
292 |         d1 = self.wv.vocab[self.wv.index2word[widx]].count**power / train_words_pow
293 |         for tidx in range(int(table_size)):
294 |             self.table[tidx] = widx
295 |             if tidx / table_size > d1:
296 |                 widx += 1
297 |                 d1 += self.wv.vocab[self.wv.index2word[widx]].count**power / train_words_pow
298 |             if widx >= vocab_size:
299 |                 widx = vocab_size - 1
300 | 
301 |     def _create_binary_tree(self):
302 |         """
303 |         Create a binary Huffman tree for the hs model using stored vocabulary word counts.
304 |         Frequent words will have shorter binary codes.
305 |         """
306 |         vocab_size = len(self.wv.vocab)
307 |         logger.info("constructing a huffman tree from %i words" % vocab_size)
308 |         # build the huffman tree
309 |         heap = list(self.wv.vocab.values())
310 |         heapq.heapify(heap)
311 |         for i in range(vocab_size - 1):
312 |             min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
313 |             heapq.heappush(heap, Vocab(count=min1.count + min2.count, index=i + vocab_size, left=min1, right=min2))
314 |         # traverse the tree, assigning a binary code to each vocabulary word
315 |         if heap:
316 |             max_depth, stack = 0, [(heap[0], [], [])]
317 |             while stack:
318 |                 node, codes, points = stack.pop()
319 |                 if node.index < vocab_size:
320 |                     # leaf node => store its path from the root
321 |                     node.code, node.point = codes, points
322 |                     max_depth = max(len(codes), max_depth)
323 |                 else:
324 |                     # inner node => continue the traversal
325 |                     points = np.array(list(points) + [node.index - vocab_size], dtype=int)
326 |                     stack.append((node.left, np.array(list(codes) + [0], dtype=int), points))
327 |                     stack.append((node.right, np.array(list(codes) + [1], dtype=int), points))
328 |             logger.info("built huffman tree with maximum node depth %i" % max_depth)
329 | 
330 |     def train_setup(self, sentences):
331 |         """
332 |         Build the vocabulary and, depending on the settings, the Huffman tree and negative sampling table, then initialize the weights before training starts.
333 |         """
334 |         self.wv.build_vocab(sentences, self.min_count, self.thr)
335 |         # add info about each word's Huffman encoding
336 |         if self.hs:
337 |             self._create_binary_tree()
338 |         # build the table for drawing random words (for negative sampling)
339 |         if self.neg:
340 |             self._make_table()
341 |         # initialize layers
342 |         self.reset_weights()
343 | 
344 |     def train_sentence_sg(self, sentence, alpha):
345 |         """
346 |         Update a skip-gram model by training on a single sentence (batch mode!)
347 |         using hierarchical softmax and/or negative sampling.
348 | 
349 |         The sentence is a list of Vocab objects (or None, where the corresponding
350 |         word is not in the vocabulary). Called internally from `Word2Vec.train()`.
351 |         """
352 |         if self.neg:
353 |             # precompute neg noise labels
354 |             labels = np.zeros(self.neg + 1)
355 |             labels[0] = 1.
356 |         for pos, word in enumerate(sentence):
357 |             if not word or (word.prob and word.prob < np.random.rand()):
358 |                 continue  # OOV word in the input sentence or subsampling => skip
359 |             reduced_window = np.random.randint(self.window - 1)
360 |             # now go over all words from the (reduced) window (at once), predicting each one in turn
361 |             start = max(0, pos - self.window + reduced_window)
362 |             word2_indices = [word2.index for pos2, word2 in enumerate(
363 |                 sentence[start:pos + self.window + 1 - reduced_window], start) if (word2 and not (pos2 == pos))]
364 |             if not word2_indices:
365 |                 continue
366 |             l1 = deepcopy(self.wv.vectors[word2_indices])  # len(word2_indices) x layer1_size
367 |             if self.hs:
368 |                 # work on the entire tree at once --> 2d matrix, codelen x layer1_size
369 |                 l2 = deepcopy(self.syn1[word.point])
370 |                 # propagate hidden -> output (len(word2_indices) x codelen)
371 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
372 |                 # vector of error gradients multiplied by the learning rate
373 |                 g = (1. - np.tile(word.code, (len(word2_indices), 1)) - f) * alpha
374 |                 # learn hidden -> output (codelen x layer1_size) batch update
375 |                 self.syn1[word.point] += np.dot(g.T, l1)
376 |                 # learn input -> hidden
377 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
378 |             if self.neg:
379 |                 # use this word (label = 1) + k other random noise words not from the current context (label = 0)
380 |                 word_indices = [word.index]
381 |                 while len(word_indices) < self.neg + 1:
382 |                     w = self.table[np.random.randint(self.table.shape[0])]
383 |                     if not (w == word.index or w in word2_indices):
384 |                         word_indices.append(w)
385 |                 # 2d matrix, k+1 x layer1_size
386 |                 l2 = deepcopy(self.syn1neg[word_indices])
387 |                 # propagate hidden -> output
388 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
389 |                 # vector of error gradients multiplied by the learning rate
390 |                 g = (np.tile(labels, (len(word2_indices), 1)) - f) * alpha
391 |                 # learn hidden -> output (batch update)
392 |                 self.syn1neg[word_indices] += np.dot(g.T, l1)
393 |                 # learn input -> hidden
394 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
395 |         return len([word for word in sentence if word])
396 | 
397 |     def train_sentence_cbow(self, sentence, alpha):
398 |         """
399 |         Update a CBOW model by training on a single sentence
400 |         using hierarchical softmax and/or negative sampling.
401 | 
402 |         The sentence is a list of Vocab objects (or None, where the corresponding
403 |         word is not in the vocabulary). Called internally from `Word2Vec.train()`.
404 |         """
405 |         if self.neg:
406 |             # precompute neg noise labels
407 |             labels = np.zeros(self.neg + 1)
408 |             labels[0] = 1.
409 |         for pos, word in enumerate(sentence):
410 |             if not word or (word.prob and word.prob < np.random.rand()):
411 |                 continue  # OOV word in the input sentence or subsampling => skip
412 |             reduced_window = np.random.randint(self.window - 1)  # how much is SUBTRACTED from the original window
413 |             # get sum of representation from all words in the (reduced) window (if in vocab and not the `word` itself)
414 |             start = max(0, pos - self.window + reduced_window)
415 |             word2_indices = [word2.index for pos2, word2 in enumerate(
416 |                 sentence[start:pos + self.window + 1 - reduced_window], start) if (word2 and not (pos2 == pos))]
417 |             if not word2_indices:
418 |                 # the sum would be all zeros (and the mean NaNs), so there is nothing to train on
419 |                 continue
420 |             l1 = np.sum(self.wv.vectors[word2_indices], axis=0)  # 1 x layer1_size
421 |             if self.hs:
422 |                 # work on the entire tree at once --> 2d matrix, codelen x layer1_size
423 |                 l2 = deepcopy(self.syn1[word.point])
424 |                 # propagate hidden -> output
425 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
426 |                 # vector of error gradients multiplied by the learning rate
427 |                 g = (1. - word.code - f) * alpha
428 |                 # learn hidden -> output
429 |                 self.syn1[word.point] += np.outer(g, l1)
430 |                 # learn input -> hidden, here for all words in the window separately
431 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
432 |             if self.neg:
433 |                 # use this word (label = 1) + k other random noise words not from the current context (label = 0)
434 |                 word_indices = [word.index]
435 |                 while len(word_indices) < self.neg + 1:
436 |                     w = self.table[np.random.randint(self.table.shape[0])]
437 |                     if not (w == word.index or w in word2_indices):
438 |                         word_indices.append(w)
439 |                 # 2d matrix, k+1 x layer1_size
440 |                 l2 = deepcopy(self.syn1neg[word_indices])
441 |                 # propagate hidden -> output
442 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
443 |                 # vector of error gradients multiplied by the learning rate
444 |                 g = (labels - f) * alpha
445 |                 # learn hidden -> output
446 |                 self.syn1neg[word_indices] += np.outer(g, l1)
447 |                 # learn input -> hidden, here for all words in the window separately
448 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
449 |         return len([word for word in sentence if word])
450 | 
451 |     def train(self, sentences, alpha=False, min_alpha=False):
452 |         """
453 |         Update the model's embeddings and weights from a sequence of sentences (can be a once-only generator stream).
454 |         Each sentence must be a list of strings.
455 |         """
456 |         logger.info("training model on a vocabulary of %i words and %i features" % (len(self.wv.vocab), self.wv.vector_size))
457 |         if not self.wv.vocab:
458 |             self.train_setup(sentences)
459 |         if alpha:
460 |             self.alpha = alpha
461 |         if min_alpha:
462 |             self.min_alpha = min_alpha
463 |         # build the table for drawing random words (for negative sampling)
464 |         # (it is usually deleted before saving)
465 |         if self.neg and self.table is None:
466 |             self._make_table()
467 |         start, next_report = time.time(), 20.
468 |         total_words = sum(v.count for v in self.wv.vocab.values())
469 |         word_count = 0
470 |         for sentence_no, sentence in enumerate(sentences):
471 |             # convert input string lists to Vocab objects (or None for OOV words)
472 |             no_oov = [self.wv.vocab.get(word, None) for word in sentence]
473 |             # update the learning rate before every iteration
474 |             alpha = self.min_alpha + (self.alpha - self.min_alpha) * (1. - word_count / total_words)
475 |             # train on the sentence and count how many words we actually trained on
476 |             # (out-of-vocabulary (unknown) words do not count)
477 |             if self.mtype == 'sg':
478 |                 word_count += self.train_sentence_sg(no_oov, alpha)
479 |             elif self.mtype == 'cbow':
480 |                 word_count += self.train_sentence_cbow(no_oov, alpha)
481 |             else:
482 |                 raise RuntimeError("model type %s not known!" % self.mtype)
483 |             # report progress
484 |             elapsed = time.time() - start
485 |             if elapsed >= next_report:
486 |                 logger.info("PROGRESS: at %.2f%% words, alpha %.05f, %.0f words/s" %
487 |                             (100.0 * word_count / total_words, alpha, word_count / elapsed if elapsed else 0.0))
488 |                 next_report = elapsed + 20.  # don't flood the log, wait at least 20 seconds between progress reports
489 |         elapsed = time.time() - start
490 |         logger.info("training on %i words took %.1fs, %.0f words/s" %
491 |                     (word_count, elapsed, word_count / elapsed if elapsed else 0.0))
492 |         # compute the normalized embeddings for later use
493 |         self.wv.init_sims()
494 | 
--------------------------------------------------------------------------------
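
The following is a minimal usage sketch (not part of the repository) for the `Word2Vec` class defined in `conec/word2vec.py` above. It assumes a small, already tokenized toy corpus, and it assumes that the similarity method whose body is shown above is exposed as `most_similar(positive=..., negative=..., topn=...)`, following the gensim convention this code was adapted from; for real experiments, train on a large corpus such as text8 as described in the README.

```python
import logging
from conec import word2vec

# show the logger.info progress messages emitted during training
logging.basicConfig(level=logging.INFO)

# toy corpus: train() expects an iterable of sentences, each a list of strings;
# the repetition ensures every word passes the default min_count=5 threshold
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]] * 500

# train a small skip-gram model with hierarchical softmax (the defaults: hs=1, neg=0)
model = word2vec.Word2Vec(sentences, vector_size=20, mtype='sg', seed=1)

# nearest neighbors by cosine similarity of the length-normalized embeddings
print(model.wv.most_similar(positive=["cat"], topn=3))
# pick the word that doesn't fit with the others
print(model.wv.doesnt_match(["cat", "dog", "on"]))
```

With embeddings trained on a real corpus, the same `model.wv` object can then be combined with the context matrix from `context2vec.py` to obtain the ConEc embeddings, as outlined in the README.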