├── conec
│   ├── __init__.py
│   ├── context2vec.py
│   └── word2vec.py
├── .gitignore
├── LICENSE
├── README.md
└── examples
    ├── test_analogy.py
    └── test_ner.py
/conec/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.1" 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *~ 3 | *.sw[p|o] 4 | *.py[c|o] 5 | *.py~ 6 | *.csv 7 | *.db 8 | examples/data/ 9 | dist/ 10 | build/ 11 | conec.egg-info/ 12 | MANIFEST.in 13 | setup.py 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Context Encoders (ConEc) 2 | 3 | With this code you can train and evaluate Context Encoders (ConEc), an extension of word2vec that learns word embeddings from large corpora, creates out-of-vocabulary embeddings on the spot, and distinguishes between multiple meanings of words based on their local contexts. 4 | For further details on the model and experiments please refer to the [paper](https://arxiv.org/abs/1706.02496) - and of course if any of this code was helpful for your research, please consider citing it: 5 | ``` 6 | @inproceedings{horn2017conecRepL4NLP, 7 | author = {Horn, Franziska}, 8 | title = {Context encoders as a simple but powerful extension of word2vec}, 9 | booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP}, 10 | year = {2017}, 11 | organization = {Association for Computational Linguistics}, 12 | pages = {10--14} 13 | } 14 | ``` 15 | 16 | The code is intended for research purposes. It should run with both Python 2.7 and Python 3 - though no guarantees (please open an issue if you find a bug)!
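To give an idea of the out-of-vocabulary case: the context encoder can compute an embedding for a word it never saw during training from that word's local context alone. A minimal sketch (assuming a trained `w2v_model` and a fitted `context_model` as constructed in the code snippet under "conec library components" below; the example sentence and target word are made up):
```python
# a new document containing a token that is presumably not in the training vocabulary
tokens = "conec creates embeddings for unseen words on the fly".split()
# local context matrix: one row per distinct token, including out-of-vocabulary tokens
local_context_mat, tok_idx = context_model.get_local_context_matrix(tokens)
# multiplying a token's local context counts with the (length normalized) word2vec
# embeddings yields its ConEc embedding, computed on the spot
oov_emb = local_context_mat[tok_idx["conec"]].dot(w2v_model.wv.vectors_norm)
```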
17 | 18 | ### installation 19 | 20 | You either download the code from here and include the conec folder in your `$PYTHONPATH` or install (the library components only) via pip: 21 | ``` 22 | $ pip install conec 23 | ``` 24 | 25 | ### conec library components 26 | 27 | dependencies: `numpy, scipy` 28 | 29 | - `word2vec.py`: code to train a standard word2vec model, adapted from the corresponding [gensim](https://radimrehurek.com/gensim/) implementation. 30 | - `context2vec.py`: code to build a sparse context matrix from a large collection of texts; this context matrix can then be multiplied with the corresponding word2vec embeddings to give the context encoder embeddings: 31 | 32 | ```python 33 | # get the text for training 34 | sentences = Text8Corpus('data/text8') 35 | # train the word2vec model 36 | w2v_model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3) 37 | # get the global context matrix for the text 38 | context_model = context2vec.ContextModel(sentences, min_count=w2v_model.min_count, window=w2v_model.window, wordlist=w2v_model.wv.index2word) 39 | context_mat = context_model.get_context_matrix(fill_diag=False, norm='max') 40 | # multiply the context matrix with the (length normalized) word2vec embeddings 41 | # to get the context encoder (ConEc) embeddings 42 | conec_emb = context_mat.dot(w2v_model.wv.vectors_norm) 43 | # renormalize so the word embeddings have unit length again 44 | conec_emb = conec_emb / np.array([np.linalg.norm(conec_emb, axis=1)]).T 45 | ``` 46 | 47 | 48 | ### examples 49 | 50 | additional dependencies: `sklearn` 51 | 52 | `test_analogy.py` and `test_ner.py` contain the code to replicate the analogy and named entity recognition (NER) experiments discussed in the aforementioned paper. 53 | 54 | To run the analogy experiment, it is assumed that the [`text8 corpus`](http://mattmahoney.net/dc/text8.zip) or [`1-billion corpus`](http://code.google.com/p/1-billion-word-language-modeling-benchmark/) as well as the [`analogy questions`](https://code.google.com/archive/p/word2vec/) are in a data directory. 55 | 56 | To run the named entity recognition experiment, it is assumed that the corresponding [`training and test files`](http://www.cnts.ua.ac.be/conll2003/ner/) are located in the data/conll2003 directory. 57 | 58 | 59 | If you have any questions please don't hesitate to send me an [email](mailto:cod3licious@gmail.com) and of course if you should find any bugs or want to contribute other improvements, pull requests are very welcome! 60 | -------------------------------------------------------------------------------- /conec/context2vec.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from builtins import object 3 | from collections import defaultdict 4 | from copy import deepcopy 5 | import numpy as np 6 | from scipy.sparse import csr_matrix, lil_matrix 7 | 8 | 9 | class ContextModel(object): 10 | 11 | def __init__(self, sentences, min_count=5, window=5, wordlist=[], progress=1000, forward=True, backward=True): 12 | """ 13 | sentences: list/generator of lists of words 14 | in case this is based on a pretrained word2vec model, give the index2word attribute as wordlist 15 | 16 | Attributes: 17 | - min_count: how often a word has to occur at least 18 | - window: how many words in a word's context should be considered 19 | - word2index: {word:idx} 20 | - index2word: [word1, word2, ...] 
21 | - wcounts: {word: frequency} 22 | - featmat: n_voc x n_voc sparse array with weighted context word counts for every word 23 | - progress: after how many sentences a progress printout should occur (default 1000) 24 | """ 25 | self.progress = progress 26 | self.min_count = min_count 27 | self.window = window 28 | self.forward = forward 29 | self.backward = backward 30 | self.build_windex(sentences, wordlist) 31 | self._get_raw_context_matrix(sentences) 32 | 33 | def build_windex(self, sentences, wordlist=[]): 34 | """ 35 | go through all the sentences and get an overview of all used words and their frequencies 36 | """ 37 | # get an overview of the vocabulary 38 | vocab = defaultdict(int) 39 | for sentence_no, sentence in enumerate(sentences): 40 | if not sentence_no % self.progress: 41 | print("PROGRESS: at sentence #%i, processed %i words and %i unique words" % (sentence_no, sum(vocab.values()), len(vocab))) 42 | for word in sentence: 43 | vocab[word] += 1 44 | print("collected %i unique words from a corpus of %i words and %i sentences" % (len(vocab), sum(vocab.values()), sentence_no + 1)) 45 | # assign a unique index to each word and remove all words with freq < min_count 46 | self.wcounts, self.word2index, self.index2word = {}, {}, [] 47 | if not wordlist: 48 | wordlist = [word for word, c in vocab.items() if c >= self.min_count] 49 | for word in wordlist: 50 | self.word2index[word] = len(self.word2index) 51 | self.index2word.append(word) 52 | self.wcounts[word] = vocab[word] 53 | 54 | def _get_raw_context_matrix(self, sentences): 55 | """ 56 | compute the raw context matrix with weighted counts 57 | it has an entry for every word in the vocabulary 58 | """ 59 | # make the feature matrix 60 | featmat = lil_matrix((len(self.index2word), len(self.index2word)), dtype=float) 61 | for sentence_no, sentence in enumerate(sentences): 62 | if not sentence_no % self.progress: 63 | print("PROGRESS: at sentence #%i" % sentence_no) 64 | sentence = [word if word in self.word2index else None for word in sentence] 65 | # forward pass 66 | if self.forward: 67 | for i, word in enumerate(sentence[:-1]): 68 | if word: 69 | # get all words in the forward window 70 | wwords = sentence[i + 1:min(i + 1 + self.window, len(sentence))] 71 | for j, w in enumerate(wwords, 1): 72 | if w: 73 | featmat[self.word2index[word], self.word2index[w]] += 1. # /j 74 | # backwards pass 75 | if self.backward: 76 | sentence_back = sentence[::-1] 77 | for i, word in enumerate(sentence_back[:-1]): 78 | if word: 79 | # get all words in the forward window of the backwards sentence 80 | wwords = sentence_back[i + 1:min(i + 1 + self.window, len(sentence_back))] 81 | for j, w in enumerate(wwords, 1): 82 | if w: 83 | featmat[self.word2index[word], self.word2index[w]] += 1. 
# /j 84 | print("PROGRESS: through with all the sentences") 85 | self.featmat = csr_matrix(featmat) 86 | 87 | def get_context_matrix(self, fill_diag=True, norm='count'): 88 | """ 89 | for every word in the sentences, create a vector that contains the counts of its context words 90 | (weighted by the distance to it with a max distance of window) 91 | Inputs: 92 | - norm: if the feature matrix should be normalized to contain ones on the diagonal 93 | (--> average context vectors) 94 | - fill_diag: if diagonal of featmat should be filled with word counts 95 | Returns: 96 | - featmat: n_voc x n_voc sparse array with weighted context word counts for every word 97 | """ 98 | featmat = deepcopy(self.featmat) 99 | # fill up the diagonals with the total counts of each word --> similarity matrix 100 | if fill_diag: 101 | featmat = lil_matrix(featmat) 102 | for i, word in enumerate(self.index2word): 103 | featmat[i, i] = self.wcounts[word] 104 | featmat = csr_matrix(featmat) 105 | assert ((featmat - featmat.transpose()).data**2).sum() < 2.220446049250313e-16, "featmat not symmetric" 106 | # possibly normalize by the max counts 107 | if norm in ("count", "max"): 108 | normmat = lil_matrix(featmat.shape, dtype=float) 109 | if norm == "count": 110 | print("normalizing feature matrix by word count") 111 | normmat.setdiag([1. / self.wcounts[word] for word in self.index2word]) 112 | elif norm == "max": 113 | print("normalizing feature matrix by max counts") 114 | normmat.setdiag([1. / v[0] if v[0] else 1. for v in featmat.max(axis=1).toarray()]) 115 | featmat = csr_matrix(normmat) * featmat # row in featmat multiplied by entry on diagonal 116 | return featmat 117 | 118 | def get_local_context_matrix(self, tokens): 119 | """ 120 | compute a local context matrix. it has an entry for every token, even if it is not present in the vocabulary 121 | Inputs: 122 | - tokens: list of words 123 | Returns: 124 | - local_featmat: size len(set(tokens)) x n_vocab 125 | - tok_idx: {word: index} to map the words from the tokens list to an index of the featmat 126 | """ 127 | # for every token we still only need one representation per document 128 | tok_idx = {word: i for i, word in enumerate(set(tokens))} 129 | featmat = lil_matrix((len(tok_idx), len(self.index2word)), dtype=float) 130 | # clean out context words we don't know 131 | known_tokens = [word if word in self.word2index else None for word in tokens] 132 | # forward pass 133 | if self.forward: 134 | for i, word in enumerate(tokens[:-1]): 135 | # get all words in the forward window 136 | wwords = known_tokens[i + 1:min(i + 1 + self.window, len(known_tokens))] 137 | for j, w in enumerate(wwords, 1): 138 | if w: 139 | featmat[tok_idx[word], self.word2index[w]] += 1. / j 140 | # backwards pass 141 | if self.backward: 142 | tokens_back = tokens[::-1] 143 | known_tokens_back = known_tokens[::-1] 144 | for i, word in enumerate(tokens_back[:-1]): 145 | # get all words in the forward window of the backwards sentence, incl. word itself 146 | wwords = known_tokens_back[i + 1:min(i + 1 + self.window, len(known_tokens_back))] 147 | for j, w in enumerate(wwords, 1): 148 | if w: 149 | featmat[tok_idx[word], self.word2index[w]] += 1. / j 150 | featmat = csr_matrix(featmat) 151 | # normalize matrix 152 | normmat = lil_matrix((featmat.shape[0], featmat.shape[0]), dtype=float) 153 | normmat.setdiag([1. / v[0] if v[0] else 1. 
for v in featmat.max(axis=1).toarray()]) 154 | featmat = csr_matrix(normmat) * featmat 155 | return featmat, tok_idx 156 | -------------------------------------------------------------------------------- /examples/test_analogy.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from builtins import object, range 3 | from glob import glob 4 | import pickle as pkl 5 | import logging 6 | from copy import deepcopy 7 | import numpy as np 8 | 9 | from conec import word2vec 10 | from conec import context2vec 11 | 12 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 13 | 14 | 15 | class Text8Corpus(object): 16 | """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip .""" 17 | 18 | def __init__(self, fname): 19 | self.fname = fname 20 | 21 | def __iter__(self): 22 | # the entire corpus is one gigantic line -- there are no sentence marks at all 23 | # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens 24 | sentence, rest, max_sentence_length = [], '', 1000 25 | with open(self.fname) as fin: 26 | while True: 27 | text = rest + fin.read(8192) # avoid loading the entire file (=1 line) into RAM 28 | if text == rest: # EOF 29 | sentence.extend(rest.split()) # return the last chunk of words, too (may be shorter/longer) 30 | if sentence: 31 | yield sentence 32 | break 33 | # the last token may have been split in two... keep it for the next iteration 34 | last_token = text.rfind(' ') 35 | words, rest = (text[:last_token].split(), text[last_token:].strip()) if last_token >= 0 else ([], text) 36 | sentence.extend(words) 37 | while len(sentence) >= max_sentence_length: 38 | yield sentence[:max_sentence_length] 39 | sentence = sentence[max_sentence_length:] 40 | 41 | 42 | class OneBilCorpus(object): 43 | """Iterate over sentences from the "1-billion-word-language-modeling-benchmark" corpus, 44 | downloaded from http://code.google.com/p/1-billion-word-language-modeling-benchmark/ .""" 45 | 46 | def __init__(self): 47 | self.dir = 'data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news*' 48 | 49 | def __iter__(self): 50 | # go file by file 51 | for fname in glob(self.dir): 52 | with open(fname) as f: 53 | yield f.read().lower().split() 54 | 55 | 56 | def analogy(model, a, b, c): 57 | # man:woman as king:x - a:b as c:x - find x 58 | # get embeddings for a, b, and c and multiply with all other words 59 | a_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[a].index]) 60 | b_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[b].index]) 61 | c_sims = 1. + np.dot(model.wv.vectors_norm, model.wv.vectors_norm[model.wv.vocab[c].index]) 62 | # add/multiply them as they should 63 | return b_sims - a_sims + c_sims 64 | # return (b_sims*c_sims)/a_sims 65 | 66 | 67 | def accuracy(model, questions, lowercase=True, restrict_vocab=30000): 68 | """ 69 | Compute accuracy of the model. `questions` is a filename where lines are 70 | 4-tuples of words, split into sections by ": SECTION NAME" lines. 71 | See https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt for an example. 72 | 73 | The accuracy is reported (=printed to log and returned as a list) for each 74 | section separately, plus there's one aggregate summary at the end. 
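Candidate answers for each question a:b :: c:? are ranked with the additive offset score returned by the `analogy()` helper above (sim(x, b) - sim(x, a) + sim(x, c)); the highest ranked word that is among the `restrict_vocab` most frequent words and is not one of the three input words counts as the prediction.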
75 | 76 | Use `restrict_vocab` to ignore all questions containing a word whose frequency 77 | is not in the top-N most frequent words (default top 30,000). 78 | 79 | This method corresponds to the `compute-accuracy` script of the original C word2vec. 80 | 81 | """ 82 | ok_vocab = dict(sorted(model.wv.vocab.items(), key=lambda item: -item[1].count)[:restrict_vocab]) 83 | ok_index = set(v.index for v in ok_vocab.values()) 84 | 85 | def log_accuracy(section): 86 | correct, incorrect = section['correct'], section['incorrect'] 87 | if correct + incorrect > 0: 88 | print("%s: %.1f%% (%i/%i)" % (section['section'], 89 | 100.0 * correct / (correct + incorrect), correct, correct + incorrect)) 90 | 91 | sections, section = [], None 92 | for line_no, line in enumerate(open(questions)): 93 | # TODO: use level3 BLAS (=evaluate multiple questions at once), for speed 94 | if line.startswith(': '): 95 | # a new section starts => store the old section 96 | if section: 97 | sections.append(section) 98 | log_accuracy(section) 99 | section = {'section': line.lstrip(': ').strip(), 'correct': 0, 'incorrect': 0} 100 | else: 101 | if not section: 102 | raise ValueError("missing section header before line #%i in %s" % (line_no, questions)) 103 | try: 104 | if lowercase: 105 | a, b, c, expected = [word.lower() for word in line.split()] 106 | else: 107 | a, b, c, expected = [word for word in line.split()] 108 | except: 109 | print("skipping invalid line #%i in %s" % (line_no, questions)) 110 | if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab: 111 | # print "skipping line #%i with OOV words: %s" % (line_no, line) 112 | continue 113 | 114 | ignore = set(model.wv.vocab[v].index for v in [a, b, c]) # indexes of words to ignore 115 | predicted = None 116 | # find the most likely prediction, ignoring OOV words and input words 117 | # for index in np.argsort(model.wv.most_similar(positive=[b, c], negative=[a], topn=False))[::-1]: 118 | for index in np.argsort(analogy(model, a, b, c))[::-1]: 119 | if index in ok_index and index not in ignore: 120 | predicted = model.wv.index2word[index] 121 | # if predicted != expected: 122 | # print "%s: expected %s, predicted %s" % (line.strip(), expected, predicted) 123 | break 124 | section['correct' if predicted == expected else 'incorrect'] += 1 125 | if section: 126 | # store the last section, too 127 | sections.append(section) 128 | log_accuracy(section) 129 | 130 | total = {'section': 'total', 'correct': sum(s['correct'] 131 | for s in sections), 'incorrect': sum(s['incorrect'] for s in sections)} 132 | log_accuracy(total) 133 | sections.append(total) 134 | return sections 135 | 136 | 137 | def accuracy_examples(model): 138 | # just as advertised... 139 | print(model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)) 140 | # "boy" is to "father" as "girl" is to ...? 141 | print(model.wv.most_similar(['girl', 'father'], ['boy'], topn=3)) 142 | more_examples = ["he his she", "big bigger bad", "going went being"] 143 | for example in more_examples: 144 | a, b, x = example.split() 145 | predicted = model.wv.most_similar([x, b], [a])[0][0] 146 | print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)) 147 | # which word doesn't go with the others? 
148 | print(model.wv.doesnt_match("breakfast cereal dinner lunch".split())) 149 | 150 | 151 | def evaluate_google(): 152 | # see https://code.google.com/archive/p/word2vec/ 153 | # load pretrained google embeddings and test 154 | from gensim.models import Word2Vec 155 | model_google = Word2Vec.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True) 156 | _ = accuracy(model_google, "data/questions-words.txt", False) 157 | 158 | 159 | def evaluate_word2vec(corpus, seed=1): 160 | # load and evaluate 161 | fname = "%s_cbow_200_hs0_neg13_seed%i.model" % (corpus, seed) 162 | with open("data/%s" % fname, 'rb') as f: 163 | model = pkl.load(f) 164 | _ = accuracy(model, "data/questions-words.txt") 165 | 166 | 167 | def evaluate_contextenc(corpus, seed=1): 168 | # load word2vec model 169 | print("####### seed = %i" % seed) 170 | fname = "%s_cbow_200_hs0_neg13_seed%i.model" % (corpus, seed) 171 | with open("data/%s" % fname, 'rb') as f: 172 | model_org = pkl.load(f) 173 | # get context matrix 174 | if corpus == 'text8': 175 | sentences = Text8Corpus('data/text8') 176 | elif corpus == '1bil': 177 | sentences = OneBilCorpus() 178 | context_model = context2vec.ContextModel( 179 | sentences, min_count=model_org.min_count, window=model_org.window, wordlist=model_org.wv.index2word) 180 | for fill_diag in [True, False]: 181 | model = deepcopy(model_org) 182 | # build context matrix 183 | print("constructing context matrix for fill_diag: %s" % (fill_diag)) 184 | context_mat = context_model.get_context_matrix(fill_diag, False) 185 | # adapt the word2vec model 186 | print("adapting the word2vec weights - vectors_norm") 187 | model.wv.vectors_norm = context_mat.dot(model.wv.vectors_norm) 188 | # renormalize 189 | model.wv.vectors_norm = model.wv.vectors_norm / np.array([np.linalg.norm(model.wv.vectors_norm, axis=1)]).T 190 | # evaluate 191 | print("evaluating the model") 192 | _ = accuracy(model, "data/questions-words.txt") 193 | 194 | 195 | def train_word2vec(corpus='text8', seed=1, it=10, save_interm=True): 196 | # load text 197 | if corpus == 'text8': 198 | sentences = Text8Corpus('data/text8') 199 | elif corpus == '1bil': 200 | sentences = OneBilCorpus() 201 | 202 | def save_model(model, saven): 203 | # delete the huge stupid table again 204 | table = deepcopy(model.table) 205 | model.table = None 206 | # pickle the entire model to disk, so we can load&resume training later 207 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 208 | # reinstate the table to continue training 209 | model.table = table 210 | 211 | # train the cbow model; default window=5 212 | model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, alpha=0.025, min_alpha=0.01, seed=seed) 213 | for i in range(1, it): 214 | print("####### ITERATION %i ########" % i) 215 | _ = accuracy(model, "data/questions-words.txt") 216 | if save_interm: 217 | save_model(model, "%s_cbow_200_hs0_neg13_seed%i_it%i.model" % (corpus, seed, i)) 218 | model.train(sentences, alpha=0.025, min_alpha=0.01) 219 | save_model(model, "%s_cbow_200_hs0_neg13_seed%i_it%i.model" % (corpus, seed, it)) 220 | print("####### ITERATION %i ########" % it) 221 | _ = accuracy(model, "data/questions-words.txt") 222 | accuracy_examples(model) 223 | 224 | 225 | def main(): 226 | # load the text on which we're training 227 | sentences = Text8Corpus('data/text8') 228 | # this would train the model for 1 iteration 229 | # model = word2vec.Word2Vec(sentences, mtype='cbow', hs=0, neg=13, vector_size=200, seed=3) 230 | # and we don't need 
the table used for negative sampling (it's huge) 231 | # model.table = None 232 | # however to replicate the results of the paper, you should train the model for 10 iterations 233 | # we set `it' to 3 here to speed up the process, change it to 10 for better accuracies 234 | it = 3 235 | train_word2vec(corpus='text8', seed=3, it=it, save_interm=False) 236 | # since this saves the model (e.g. for training on a cluster), we need to load it again 237 | with open("data/text8_cbow_200_hs0_neg13_seed3_it%i.model" % it, "rb") as f: 238 | model = pkl.load(f) 239 | """ 240 | collected 253854 unique words from a corpus of 17005207 words and 17006 sentences 241 | total of 71290 unique words after removing those with count < 5 242 | training model on 71290 vocabulary and 200 features 243 | training on 16718844 words took 2789.4s, 5994 words/s 244 | """ 245 | # evaluate the accuracy on the analogy task (the results below are after 3 iterations) 246 | _ = accuracy(model, "data/questions-words.txt") 247 | """ 248 | capital-common-countries: 35.8% (181/506) 249 | capital-world: 15.8% (230/1452) 250 | currency: 10.4% (28/268) 251 | city-in-state: 19.1% (300/1571) 252 | family: 73.2% (224/306) 253 | gram1-adjective-to-adverb: 11.2% (85/756) 254 | gram2-opposite: 19.3% (59/306) 255 | gram3-comparative: 57.0% (718/1260) 256 | gram4-superlative: 33.6% (170/506) 257 | gram5-present-participle: 24.2% (240/992) 258 | gram6-nationality-adjective: 62.8% (861/1371) 259 | gram7-past-tense: 27.5% (366/1332) 260 | gram8-plural: 41.3% (410/992) 261 | gram9-plural-verbs: 39.4% (256/650) 262 | total: 33.6% (4128/12268) 263 | """ 264 | # get the global context matrix relying on the same text 265 | context_model = context2vec.ContextModel(sentences, min_count=model.min_count, 266 | window=model.window, wordlist=model.wv.index2word) 267 | # best results on the analogy task when counting the target word in addition to the context words 268 | # --> fill diagonal of the context matrix. 
normalization is irrelevant since we renormalize later 269 | context_mat = context_model.get_context_matrix(fill_diag=True, norm=False) 270 | # adapt the word embeddings of the word2vec model by multiplying them with the context matrix 271 | model.wv.vectors_norm = context_mat.dot(model.wv.vectors_norm) 272 | # renormalize so the word embeddings have unit length again 273 | model.wv.vectors_norm = model.wv.vectors_norm / np.array([np.linalg.norm(model.wv.vectors_norm, axis=1)]).T 274 | # evaluate the model again 275 | _ = accuracy(model, "data/questions-words.txt") 276 | """ 277 | capital-common-countries: 62.3% (315/506) 278 | capital-world: 34.9% (507/1452) 279 | currency: 15.3% (41/268) 280 | city-in-state: 29.2% (458/1571) 281 | family: 72.5% (222/306) 282 | gram1-adjective-to-adverb: 14.0% (106/756) 283 | gram2-opposite: 19.9% (61/306) 284 | gram3-comparative: 54.2% (683/1260) 285 | gram4-superlative: 32.8% (166/506) 286 | gram5-present-participle: 26.7% (265/992) 287 | gram6-nationality-adjective: 56.1% (769/1371) 288 | gram7-past-tense: 25.5% (340/1332) 289 | gram8-plural: 37.1% (368/992) 290 | gram9-plural-verbs: 24.8% (161/650) 291 | total: 36.4% (4462/12268) 292 | """ 293 | 294 | 295 | if __name__ == '__main__': 296 | main() 297 | -------------------------------------------------------------------------------- /examples/test_ner.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division, print_function, absolute_import 2 | from future import standard_library 3 | standard_library.install_aliases() 4 | from builtins import object, range, next 5 | import logging 6 | import pickle as pkl 7 | import re 8 | import unicodedata 9 | from copy import deepcopy 10 | import numpy as np 11 | from scipy.sparse import csr_matrix, lil_matrix 12 | from sklearn.linear_model import LogisticRegression as logreg 13 | 14 | from conec import word2vec 15 | from conec import context2vec 16 | 17 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 18 | 19 | 20 | def clean_conll2003(text, to_lower=False): 21 | # clean the text: no fucked up characters 22 | nfkd_form = unicodedata.normalize("NFKD", text) 23 | text = nfkd_form.encode("ASCII", "ignore").decode("ASCII") 24 | # normalize numbers 25 | text = re.sub(r"[0-9]", "1", text) 26 | if to_lower: 27 | text = text.lower() 28 | return text 29 | 30 | 31 | class CoNLL2003(object): 32 | # collected 20102 unique words from a corpus of 218609 words and 946 sentences 33 | # generator for the conll2003 training data 34 | 35 | def __init__(self, to_lower=False, sources=["data/conll2003/ner/eng.train"]): 36 | self.sources = sources 37 | self.to_lower = to_lower 38 | 39 | def __iter__(self): 40 | """Iterate through all news articles.""" 41 | for fname in self.sources: 42 | tokens = [] 43 | for line in open(fname): 44 | if line.startswith("-DOCSTART- -X- -X-"): 45 | if tokens: 46 | yield tokens 47 | tokens = [] 48 | elif line.strip(): 49 | tokens.append(clean_conll2003(line.split()[0], self.to_lower)) 50 | else: 51 | tokens.append('') 52 | yield tokens 53 | 54 | 55 | def train_word2vec(train_all=False, it=20, seed=1): 56 | # train all models for 20 iterations 57 | # train the word2vec model on a) the training data 58 | sentences = CoNLL2003(to_lower=True) 59 | 60 | def save_model(model, saven): 61 | # delete the huge stupid table again 62 | table = deepcopy(model.table) 63 | model.table = None 64 | # pickle the entire model to disk, so we can 
load&resume training later 65 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 66 | # reinstate the table to continue training 67 | model.table = table 68 | 69 | # train the cbow model; default window=5 70 | alpha = 0.02 71 | model = word2vec.Word2Vec(sentences, min_count=1, mtype='cbow', hs=0, neg=13, vector_size=200, alpha=alpha, min_alpha=alpha, seed=seed) 72 | for i in range(1, it): 73 | print("####### ITERATION %i ########" % (i + 1)) 74 | if not i % 5: 75 | save_model(model, "conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, i)) 76 | alpha /= 2. 77 | alpha = max(alpha, 0.0001) 78 | model.train(sentences, alpha=alpha, min_alpha=alpha) 79 | save_model(model, "conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, it)) 80 | if train_all: 81 | # and b) the training + test data 82 | sentences = CoNLL2003(to_lower=True, sources=[ 83 | "data/conll2003/ner/eng.train", "data/conll2003/ner/eng.testa", "data/conll2003/ner/eng.testb"]) 84 | model = word2vec.Word2Vec(sentences, min_count=1, mtype='cbow', hs=0, neg=13, vector_size=200, seed=seed) 85 | for i in range(19): 86 | model.train(sentences) 87 | # delete the huge stupid table again 88 | model.table = None 89 | # pickle the entire model to disk, so we can load&resume training later 90 | saven = "conll2003_test_20it_cbow_200_hs0_neg13_seed%i.model" % seed 91 | print("saving model") 92 | pkl.dump(model, open("data/%s" % saven, 'wb'), -1) 93 | 94 | 95 | def make_wordfeat(w): 96 | return [int(w.isalnum()), int(w.isalpha()), int(w.isdigit()), 97 | int(w.islower()), int(w.istitle()), int(w.isupper()), 98 | len(w)] 99 | 100 | 101 | def make_featmat_wordfeat(tokens): 102 | # tokens: list of words 103 | return np.array([make_wordfeat(t) for t in tokens]) 104 | 105 | 106 | class ContextEnc_NER(object): 107 | 108 | def __init__(self, w2v_model, contextm=False, sentences=[], w_local=0.4, context_global_only=False, include_wf=False, to_lower=True, normed=True, renorm=True): 109 | self.clf = None 110 | self.w2v_model = w2v_model 111 | self.rep_idx = {word: i for i, word in enumerate(w2v_model.wv.index2word)} 112 | self.include_wf = include_wf 113 | self.to_lower = to_lower 114 | self.w_local = w_local # how much the local context compared to the global should count 115 | self.context_global_only = context_global_only # if only global context should count (0 if global not available -- not same as w_local=0) 116 | self.normed = normed 117 | self.renorm = renorm 118 | # should we include the context? 
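# (if so, a ContextModel is fit on the given sentences with the word2vec model's vocabulary
# and window size, and its featmat attribute is replaced by the precomputed global context
# matrix, with each row normalized by its maximum count)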
119 | if contextm: 120 | # sentences: depending on what the word2vec model was trained 121 | self.context_model = context2vec.ContextModel( 122 | sentences, min_count=1, window=w2v_model.window, wordlist=w2v_model.wv.index2word) 123 | # --> create a global context matrix 124 | self.context_model.featmat = self.context_model.get_context_matrix(False, 'max') 125 | else: 126 | self.context_model = None 127 | 128 | def make_featmat_rep(self, tokens, local_context_mat=None, tok_idx={}): 129 | """ 130 | Inputs: 131 | - tokens: list of words 132 | Returns: 133 | - featmat: dense feature matrix for every token 134 | """ 135 | # possibly preprocess tokens 136 | if self.to_lower: 137 | pp_tokens = [t.lower() for t in tokens] 138 | else: 139 | pp_tokens = tokens 140 | dim = self.w2v_model.wv.vector_size 141 | if self.include_wf: 142 | dim += 7 143 | featmat = np.zeros((len(tokens), dim), dtype=float) 144 | # index in featmat for all known tokens 145 | idx_featmat = [i for i, t in enumerate(pp_tokens) if t in self.rep_idx] 146 | if self.normed: 147 | rep_mat = deepcopy(self.w2v_model.wv.vectors_norm) 148 | else: 149 | rep_mat = deepcopy(self.w2v_model.vectors) 150 | if self.context_model: 151 | if self.context_global_only: 152 | # make context matrix out of global context vectors only 153 | context_mat = lil_matrix((len(tokens), len(self.rep_idx))) 154 | global_tok_idx = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 155 | context_mat[idx_featmat, :] = self.context_model.featmat[global_tok_idx, :] 156 | else: 157 | # compute the local context matrix 158 | if not tok_idx: 159 | local_context_mat, tok_idx = self.context_model.get_local_context_matrix(pp_tokens) 160 | local_tok_idx = [tok_idx[t] for t in pp_tokens] 161 | context_mat = lil_matrix(local_context_mat[local_tok_idx, :]) 162 | assert context_mat.shape == (len(tokens), len(self.rep_idx)), "context matrix has wrong shape" 163 | # average it with the global context vectors if available 164 | local_global_tok_idx = [tok_idx[t] for t in pp_tokens if t in self.rep_idx] 165 | global_tok_idx = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 166 | context_mat[idx_featmat, :] = self.w_local * lil_matrix(local_context_mat[local_global_tok_idx, :]) + ( 167 | 1. 
- self.w_local) * self.context_model.featmat[global_tok_idx, :] 168 | # multiply context_mat with rep_mat to get featmat (+ normalize) 169 | featmat[:, 0:rep_mat.shape[1]] = csr_matrix(context_mat) * rep_mat 170 | # length normalize the feature vectors 171 | if self.renorm: 172 | fnorm = np.linalg.norm(featmat, axis=1) 173 | featmat[fnorm > 0, :] = featmat[fnorm > 0, :] / np.array([fnorm[fnorm > 0]]).T 174 | else: 175 | # we set the feature matrix with the word2vec embeddings directly; 176 | # tokens not in the original vocab will have a zero representation 177 | idx_repmat = [self.rep_idx[t] for t in pp_tokens if t in self.rep_idx] 178 | featmat[idx_featmat, 0:rep_mat.shape[1]] = rep_mat[idx_repmat, :] 179 | if self.include_wf: 180 | featmat[:, dim - 7:] = make_featmat_wordfeat(tokens) 181 | return featmat 182 | 183 | def train_clf(self, trainfiles): 184 | # tokens: list of words, labels: list of corresponding labels 185 | # go document by document because of local context 186 | final_labels = [] 187 | featmat = [] 188 | for trainfile in trainfiles: 189 | for tokens, labels in yield_tokens_labels(trainfile): 190 | final_labels.extend(labels) 191 | featmat.append(self.make_featmat_rep(tokens)) 192 | featmat = np.vstack(featmat) 193 | print("training classifier") 194 | clf = logreg(class_weight='balanced', random_state=1) 195 | clf.fit(featmat, final_labels) 196 | self.clf = clf 197 | 198 | def find_ne_in_text(self, text, local_context_mat=None, tok_idx={}): 199 | featmat = self.make_featmat_rep(text.strip().split(), local_context_mat, tok_idx) 200 | labels = self.clf.predict(featmat) 201 | # stitch text back together 202 | results = [] 203 | for i, t in enumerate(text.strip().split()): 204 | if results and labels[i] == results[-1][1]: 205 | results[-1] = (results[-1][0] + " " + t, results[-1][1]) 206 | else: 207 | if results: 208 | results.append((' ', 'O')) 209 | results.append((t, labels[i])) 210 | return results 211 | 212 | 213 | def process_wordlabels(word_labels): 214 | # process labels 215 | tokens = [] 216 | labels = [] 217 | for word, l in word_labels: 218 | if word: 219 | if l.startswith("I-") or l.startswith("B-"): 220 | l = l[2:] 221 | tokens.append(word) 222 | labels.append(l) 223 | assert len(tokens) == len(labels), "must have same number of tokens as labels" 224 | return tokens, labels 225 | 226 | 227 | def get_tokens_labels(trainfile): 228 | # read in trainfile to generate training labels 229 | with open(trainfile) as f: 230 | word_labels = [(clean_conll2003(line.split()[0]), line.strip().split()[-1]) if line.strip() 231 | else ('', 'O') for line in f if not line.startswith("-DOCSTART- -X- -X-")] 232 | return process_wordlabels(word_labels) 233 | 234 | 235 | def yield_tokens_labels(trainfile): 236 | # generate tokens and labels for every document 237 | word_labels = [] 238 | for line in open(trainfile): 239 | if line.startswith("-DOCSTART- -X- -X-"): 240 | if word_labels: 241 | yield process_wordlabels(word_labels) 242 | word_labels = [] 243 | elif line.strip(): 244 | word_labels.append((clean_conll2003(line.split()[0]), line.strip().split()[-1])) 245 | else: 246 | word_labels.append(('', 'O')) 247 | yield process_wordlabels(word_labels) 248 | 249 | 250 | def ne_results_2_labels(ne_results): 251 | """ 252 | helper function to transform a list of substrings and labels 253 | into a list of labels for every (white space separated) token 254 | """ 255 | l_list = [] 256 | last_l = '' 257 | for i, (substr, l) in enumerate(ne_results): 258 | if substr == ' ': 259 | continue 260 | 
if not l or l == 'O': 261 | l_out = 'O' 262 | elif l == last_l: 263 | l_out = "B-" + l 264 | else: 265 | l_out = "I-" + l 266 | last_l = l 267 | if (not i) or (substr.startswith(' ') or ne_results[i - 1][0].endswith(' ')): 268 | l_list.append(l_out) 269 | # if there is no space between the previous and last substring, first token gets label 270 | # of longer subsubstr (i.e. either previous or current) 271 | elif i and len(ne_results[i - 1][0].split()[-1]) < len(substr.split()[0]): 272 | l_list.pop() 273 | l_list.append(l_out) 274 | l_list.extend([l_out for n in range(len(substr.split()) - 1)]) 275 | return l_list 276 | 277 | 278 | def apply_conll2003_ner(ner, testfile, outfile): 279 | """ 280 | Inputs: 281 | - ner: named entity classifier with find_ne_in_text method 282 | - testfile: path to the testfile 283 | - outfile: where the output should be saved 284 | """ 285 | documents = CoNLL2003(sources=[testfile], to_lower=True) 286 | documents_it = documents.__iter__() 287 | local_context_mat, tok_idx = None, {} 288 | # read in test file + generate outfile 289 | with open(outfile, 'w') as f_out: 290 | # collect all the words in a sentence and save other rest of the lines 291 | to_write, tokens = [], [] 292 | doc_tokens = [] 293 | for line in open(testfile): 294 | if line.startswith("-DOCSTART- -X- -X-"): 295 | f_out.write("-DOCSTART- -X- -X- O O\n") 296 | # we're at a new document, time for a new local context matrix 297 | if ner.context_model: 298 | doc_tokens = next(documents_it) 299 | local_context_mat, tok_idx = ner.context_model.get_local_context_matrix(doc_tokens) 300 | # outfile: testfile + additional column with predicted label 301 | elif line.strip(): 302 | to_write.append(line.strip()) 303 | tokens.append(clean_conll2003(line.split()[0])) 304 | else: 305 | # end of sentence: find all named entities! 
306 | if to_write: 307 | ne_results = ner.find_ne_in_text(" ".join(tokens), local_context_mat, tok_idx) 308 | assert " ".join(tokens) == "".join(r[0] 309 | for r in ne_results), "returned text doesn't match" # sanity check 310 | l_list = ne_results_2_labels(ne_results) 311 | assert len(l_list) == len(tokens), "Error: %i labels but %i tokens" % (len(l_list), len(tokens)) 312 | for i, line in enumerate(to_write): 313 | f_out.write(to_write[i] + " " + l_list[i] + "\n") 314 | to_write, tokens = [], [] 315 | f_out.write("\n") 316 | 317 | 318 | def log_results(clf_ner, description, filen='', subf=''): 319 | import os 320 | if not os.path.exists('data/conll2003_results'): 321 | os.mkdir('data/conll2003_results') 322 | if not os.path.exists('data/conll2003_results%s' % subf): 323 | os.mkdir('data/conll2003_results%s' % subf) 324 | import subprocess 325 | print("applying to training set") 326 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.train', 'data/conll2003_results%s/eng.out_train.txt' % subf) 327 | print("applying to test set") 328 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.testa', 'data/conll2003_results%s/eng.out_testa.txt' % subf) 329 | apply_conll2003_ner(clf_ner, 'data/conll2003/ner/eng.testb', 'data/conll2003_results%s/eng.out_testb.txt' % subf) 330 | # write out results 331 | with open('data/conll2003_results/output_all_%s.txt' % filen, 'a') as f: 332 | f.write('%s\n' % description) 333 | f.write('results on training data\n') 334 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_train.txt' % subf)[1] 335 | f.write(out) 336 | f.write('\n') 337 | f.write('results on testa\n') 338 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_testa.txt' % subf)[1] 339 | f.write(out) 340 | f.write('\n') 341 | f.write('results on testb\n') 342 | out = subprocess.getstatusoutput('data/conll2003/ner/bin/conlleval < data/conll2003_results%s/eng.out_testb.txt' % subf)[1] 343 | f.write(out) 344 | f.write('\n') 345 | f.write('\n') 346 | 347 | 348 | if __name__ == '__main__': 349 | seed = 3 350 | it = 20 351 | train_word2vec(train_all=False, it=it, seed=seed) 352 | # load pretrained word2vec model 353 | with open("data/conll2003_train_cbow_200_hs0_neg13_seed%i_it%i.model" % (seed, it), 'rb') as f: 354 | w2v_model = pkl.load(f) 355 | # train a classifier with these word embeddings on the training part 356 | clf_ner = ContextEnc_NER(w2v_model, include_wf=False) 357 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 358 | # apply the classifier to all training and test parts of the CoNLL2003 task, 359 | # run the evaluation script and save the results 360 | log_results(clf_ner, '####### word2vec model, seed: %i, it: %i' % (seed, it), 'word2vec_%i' % seed, '_word2vec_%i_%i' % (seed, it)) 361 | """ 362 | results on training data 363 | processed 204567 tokens with 23499 phrases; found: 38310 phrases; correct: 11537. 364 | accuracy: 84.48%; precision: 30.11%; recall: 49.10%; FB1: 37.33 365 | LOC: precision: 51.57%; recall: 75.06%; FB1: 61.14 10391 366 | MISC: precision: 21.22%; recall: 39.70%; FB1: 27.66 6432 367 | ORG: precision: 18.52%; recall: 29.08%; FB1: 22.63 9924 368 | PER: precision: 25.73%; recall: 45.08%; FB1: 32.76 11563 369 | results on testa 370 | processed 51578 tokens with 5942 phrases; found: 8422 phrases; correct: 2525. 
371 | accuracy: 84.04%; precision: 29.98%; recall: 42.49%; FB1: 35.16 372 | LOC: precision: 52.03%; recall: 66.85%; FB1: 58.52 2360 373 | MISC: precision: 25.25%; recall: 41.54%; FB1: 31.41 1517 374 | ORG: precision: 19.26%; recall: 30.28%; FB1: 23.54 2108 375 | PER: precision: 20.85%; recall: 27.58%; FB1: 23.74 2437 376 | results on testb 377 | processed 46666 tokens with 5648 phrases; found: 7338 phrases; correct: 1960. 378 | accuracy: 82.26%; precision: 26.71%; recall: 34.70%; FB1: 30.19 379 | LOC: precision: 52.07%; recall: 66.49%; FB1: 58.40 2130 380 | MISC: precision: 19.05%; recall: 38.32%; FB1: 25.45 1412 381 | ORG: precision: 19.64%; recall: 22.40%; FB1: 20.93 1894 382 | PER: precision: 11.04%; recall: 12.99%; FB1: 11.94 1902 383 | """ 384 | 385 | # load the text again (same as word2vec model was trained on) to generate the context matrix 386 | sentences = CoNLL2003(to_lower=True) 387 | # only use global context; no rep for out-of-vocab 388 | clf_ner = ContextEnc_NER(w2v_model, contextm=True, sentences=sentences, w_local=0., context_global_only=True) 389 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 390 | # evaluate the results again 391 | log_results(clf_ner, '####### context enc with global context matrix only, seed: %i, it: %i' % (seed, it), 'conec_global_%i' % seed, '_conec_global_%i_%i' % (seed, it)) 392 | 393 | # for the out-of-vocabulary words in the dev and test set, only the local context matrix (based on only the current doc) 394 | # is used to generate the respective word embeddings; where a global context vector is available (for all words in the training set) 395 | # we use a combination of the local and global context, determined by w_local 396 | for w_local in [0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.]: 397 | print(w_local) 398 | clf_ner = ContextEnc_NER(w2v_model, contextm=True, sentences=sentences, w_local=w_local) 399 | clf_ner.train_clf(['data/conll2003/ner/eng.train']) 400 | # evaluate the results again 401 | log_results(clf_ner, '####### context enc with a combination of the global and local context matrix (w_local=%.1f), seed: %i, it: %i' % (w_local, seed, it), 'conec_%i_%i' % (round(w_local*10), seed), '_conec_%i_%i_%i' % (round(w_local*10), seed, it)) 402 | """ 403 | results on training data 404 | processed 204567 tokens with 23499 phrases; found: 33708 phrases; correct: 11675. 405 | accuracy: 84.34%; precision: 34.64%; recall: 49.68%; FB1: 40.82 406 | LOC: precision: 57.46%; recall: 75.34%; FB1: 65.20 9361 407 | MISC: precision: 19.56%; recall: 37.14%; FB1: 25.62 6530 408 | ORG: precision: 19.16%; recall: 24.62%; FB1: 21.55 8119 409 | PER: precision: 35.71%; recall: 52.47%; FB1: 42.50 9698 410 | results on testa 411 | processed 51578 tokens with 5942 phrases; found: 8756 phrases; correct: 3244. 412 | accuracy: 85.01%; precision: 37.05%; recall: 54.59%; FB1: 44.14 413 | LOC: precision: 56.96%; recall: 77.74%; FB1: 65.75 2507 414 | MISC: precision: 22.97%; recall: 41.76%; FB1: 29.64 1676 415 | ORG: precision: 20.96%; recall: 28.64%; FB1: 24.20 1832 416 | PER: precision: 38.20%; recall: 56.84%; FB1: 45.69 2741 417 | results on testb 418 | processed 46666 tokens with 5648 phrases; found: 8407 phrases; correct: 2830. 
419 | accuracy: 84.17%; precision: 33.66%; recall: 50.11%; FB1: 40.27 420 | LOC: precision: 53.21%; recall: 74.58%; FB1: 62.11 2338 421 | MISC: precision: 16.29%; recall: 36.32%; FB1: 22.50 1565 422 | ORG: precision: 24.44%; recall: 30.04%; FB1: 26.95 2042 423 | PER: precision: 33.79%; recall: 51.45%; FB1: 40.79 2462 424 | """ 425 | -------------------------------------------------------------------------------- /conec/word2vec.py: -------------------------------------------------------------------------------- 1 | # Original Code by Radim Rehurek 2 | # [Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html] 3 | # see: http://radimrehurek.com/gensim/ 4 | # 5 | # Rewrite by Franziska Horn 6 | 7 | from __future__ import unicode_literals, division, print_function, absolute_import 8 | from builtins import object, range, str 9 | import time 10 | import logging 11 | import heapq 12 | from copy import deepcopy 13 | from math import sqrt 14 | import numpy as np 15 | 16 | logger = logging.getLogger("word2vec") 17 | 18 | 19 | class Vocab(object): 20 | """ 21 | A single vocabulary item, used internally e.g. for constructing binary trees 22 | (incl. both word leaves and inner nodes). 23 | 24 | Possible Fields: 25 | - count: how often the word occurred in the training sentences 26 | - index: the word's index in the embedding 27 | """ 28 | 29 | def __init__(self, **kwargs): 30 | self.count = 0 31 | self.__dict__.update(kwargs) 32 | 33 | def __lt__(self, other): # used for sorting in a priority queue 34 | return self.count < other.count 35 | 36 | def __str__(self): 37 | vals = ['%s:%r' % (key, self.__dict__[key]) for key in sorted(self.__dict__) if not key.startswith('_')] 38 | return "%s(%s)" % (self.__class__.__name__, ', '.join(vals)) 39 | 40 | 41 | class Word2VecEmbeddings(object): 42 | """ 43 | Word2Vec embeddings only - can't be trained further, but enough for all calculations 44 | """ 45 | def __init__(self, vector_size=100): 46 | """ 47 | Initialize Word2Vec embeddings 48 | 49 | Inputs: 50 | - vector_size: (default 100) dimensionality of embedding 51 | """ 52 | self.vector_size = vector_size 53 | self.vectors = np.zeros((0, vector_size)) 54 | self.vectors_norm = None 55 | self.vocab = {} # mapping from a word (string) to a Vocab object 56 | self.index2word = [] # map from a word's matrix index (int) to the word (string) 57 | 58 | def __str__(self): 59 | return "Word2VecEmbeddings(vocab=%s, size=%s)" % (len(self.index2word), self.vector_size) 60 | 61 | def __getitem__(self, word): 62 | """ 63 | Return a word's representations in vector space, as a 1D numpy array. 64 | 65 | Example: 66 | >>> trained_model['woman'] 67 | array([ -1.40128313e-02, ...] 68 | """ 69 | return self.vectors[self.vocab[word].index] 70 | 71 | def __contains__(self, word): 72 | return word in self.vocab 73 | 74 | def build_vocab(self, sentences, min_count=5, thr=0): 75 | """ 76 | Build vocabulary from a sequence of sentences (can be a once-only generator stream). 77 | Each sentence must be a list of strings. 
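Words occurring fewer than `min_count` times are discarded from the vocabulary; if `thr` > 0, each remaining word additionally gets a probability used for sub-sampling frequent words during training.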
78 | 79 | Inputs: 80 | - sentences: List or generator object supplying lists of (preprocessed) words 81 | used to train the model (otherwise train manually with model.train(sentences)) 82 | - min_count: (default 5) how often a word has to occur at least to be taken into the vocab 83 | - thr: (default 0) threshold for computing probabilities for sub-sampling words in training 84 | """ 85 | logger.info("collecting all words and their counts") 86 | sentence_no, vocab = -1, {} 87 | total_words = 0 88 | for sentence_no, sentence in enumerate(sentences): 89 | if not sentence_no % 10000: 90 | logger.info("PROGRESS: at sentence #%i, processed %i words and %i unique words" % 91 | (sentence_no, total_words, len(vocab))) 92 | for word in sentence: 93 | total_words += 1 94 | try: 95 | vocab[word].count += 1 96 | except KeyError: 97 | vocab[word] = Vocab(count=1) 98 | logger.info("collected %i unique words from a corpus of %i words and %i sentences" % 99 | (len(vocab), total_words, sentence_no + 1)) 100 | # assign a unique index to each word 101 | self.vocab, self.index2word = {}, [] 102 | for word, v in vocab.items(): 103 | if v.count >= min_count: 104 | v.index = len(self.vocab) 105 | self.index2word.append(word) 106 | self.vocab[word] = v 107 | logger.info("total of %i unique words after removing those with count < %s" % (len(self.vocab), min_count)) 108 | # add probabilities for sub-sampling (if thr > 0) 109 | if thr > 0: 110 | total_words = float(sum(v.count for v in self.vocab.values())) 111 | for word in self.vocab: 112 | # formula from paper 113 | # self.vocab[word].prob = max(0.,1.-sqrt(thr*total_words/self.vocab[word].count)) 114 | # formula from code 115 | self.vocab[word].prob = (sqrt(self.vocab[word].count / (thr * total_words) 116 | ) + 1.) * (thr * total_words) / self.vocab[word].count 117 | else: 118 | # if prob is 0, word wont get discarded 119 | for word in self.vocab: 120 | self.vocab[word].prob = 0. 121 | 122 | def init_sims(self): 123 | # for convenience (for later similarity computations, etc.), store all 124 | # embeddings additionally as unit length vectors 125 | self.vectors_norm = self.vectors / np.array([np.linalg.norm(self.vectors, axis=1)]).T 126 | 127 | def similarity(self, w1, w2): 128 | """ 129 | Compute cosine similarity between two words. 130 | 131 | Example:: 132 | >>> trained_model.similarity('woman', 'man') 133 | 0.73723527 134 | """ 135 | if self.vectors_norm is None: 136 | self.init_sims() 137 | return np.inner(self.vectors_norm[self.vocab[w1].index], self.vectors_norm[self.vocab[w2].index]) 138 | 139 | def most_similar(self, positive=[], negative=[], topn=10): 140 | """ 141 | Find the top-N most similar words. Positive words contribute positively towards the 142 | similarity, negative words negatively. 143 | 144 | This method computes cosine similarity between a simple mean of the projection 145 | weight vectors of the given words, and corresponds to the `word-analogy` and 146 | `distance` scripts in the original word2vec implementation. 147 | 148 | Example:: 149 | >>> trained_model.most_similar(positive=['woman', 'king'], negative=['man']) 150 | [('queen', 0.50882536), ...] 151 | """ 152 | if self.vectors_norm is None: 153 | self.init_sims() 154 | if isinstance(positive, str) and not negative: 155 | # allow calls like most_similar('dog'), as a shorthand for most_similar(['dog']) 156 | positive = [positive] 157 | 158 | # add weights for each word, if not already present; default to 1.0 for positive and -1.0 for negative words 159 | positive = [(word, 1.) 
if isinstance(word, str) else word for word in positive] 160 | negative = [(word, -1.) if isinstance(word, str) else word for word in negative] 161 | 162 | # compute the weighted average of all words 163 | all_words = set() 164 | mean = np.zeros(self.vector_size) 165 | for word, weight in positive + negative: 166 | try: 167 | mean += weight * self.vectors_norm[self.vocab[word].index] 168 | all_words.add(self.vocab[word].index) 169 | except KeyError: 170 | print("word '%s' not in vocabulary" % word) 171 | if not all_words: 172 | raise ValueError("cannot compute similarity with no input") 173 | dists = np.dot(self.vectors_norm, mean / np.linalg.norm(mean)) 174 | if not topn: 175 | return dists 176 | best = np.argsort(dists)[::-1][:topn + len(all_words)] 177 | # ignore (don't return) words from the input 178 | result = [(self.index2word[sim], dists[sim]) for sim in best if sim not in all_words] 179 | return result[:topn] 180 | 181 | def doesnt_match(self, words): 182 | """ 183 | Which word from the given list doesn't go with the others? 184 | 185 | Example:: 186 | >>> trained_model.doesnt_match("breakfast cereal dinner lunch".split()) 187 | 'cereal' 188 | """ 189 | if self.vectors_norm is None: 190 | self.init_sims() 191 | words = [word for word in words if word in self.vocab] # filter out OOV words 192 | logger.debug("using words %s" % words) 193 | if not words: 194 | raise ValueError("cannot select a word from an empty list") 195 | # which word vector representation is furthest away from the mean? 196 | selection = self.vectors_norm[[self.vocab[word].index for word in words]] 197 | mean = np.mean(selection, axis=0) 198 | sim = np.dot(selection, mean / np.linalg.norm(mean)) 199 | return words[np.argmin(sim)] 200 | 201 | 202 | class Word2Vec(object): 203 | """ 204 | Word2Vec Model, which can be trained and then contains word embedding that can be used for all kinds of cool stuff. 
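Both skip-gram ('sg') and continuous bag-of-words ('cbow') architectures are supported, trained with hierarchical softmax and/or negative sampling; the learned embeddings themselves are stored in the `wv` attribute (a Word2VecEmbeddings instance).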
205 | """ 206 | 207 | def __init__(self, sentences=None, vector_size=100, mtype='sg', hs=1, neg=0, window=5, 208 | thr=0, min_count=5, alpha=0.025, min_alpha=0.0001, seed=1): 209 | """ 210 | Initialize Word2Vec model 211 | 212 | Inputs: 213 | - sentences: (default None) List or generator object supplying lists of (preprocessed) words 214 | used to train the model (otherwise train manually with model.train(sentences)) 215 | - vector_size: (default 100) dimensionality of embedding 216 | - mtype: (default 'sg') type of model: either 'sg' (skipgram) or 'cbow' (bag of words) 217 | - hs: (default 1) if != 0, hierarchical softmax will be used for training the model 218 | - neg: (default 0) if > 0, negative sampling will be used for training the model; 219 | neg specifies the # of noise words 220 | - window: (default 5) max distance of context words from target word in training 221 | - thr: (default 0) threshold for computing probabilities for sub-sampling words in training 222 | - min_count: (default 5) how often a word has to occur at least to be taken into the vocab 223 | - alpha: (default 0.025) initial learning rate 224 | - min_alpha: (default 0.0001) if < alpha, the learning rate will be decreased to min_alpha 225 | - seed: (default 1) random seed (for initializing the embeddings) 226 | """ 227 | assert mtype.lower() in ('sg', 'cbow'), "unknown model, use 'sg' or 'cbow'" 228 | self.wv = Word2VecEmbeddings(vector_size) # stores the actual word2vec embeddings 229 | self.mtype = mtype.lower() 230 | self.hs = hs 231 | self.neg = neg 232 | self.window = window 233 | self.thr = thr 234 | self.min_count = min_count 235 | self.alpha = alpha 236 | self.min_alpha = min_alpha 237 | self.seed = seed 238 | # possibly train model 239 | if sentences: 240 | self.train_setup(sentences) 241 | self.train(sentences) 242 | 243 | def __str__(self): 244 | return "Word2Vec(vocab=%s, size=%s, mtype=%s, hs=%i, neg=%i)" % (len(self.wv.index2word), self.wv.vector_size, self.mtype, self.hs, self.neg) 245 | 246 | def reset_weights(self): 247 | """ 248 | Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary. 249 | """ 250 | np.random.seed(self.seed) 251 | # weights 252 | self.syn1 = np.asarray( 253 | np.random.uniform( 254 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 255 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 256 | size=(len(self.wv.vocab), self.wv.vector_size) 257 | ), 258 | dtype=float 259 | ) 260 | self.syn1neg = np.asarray( 261 | np.random.uniform( 262 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 263 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 264 | size=(len(self.wv.vocab), self.wv.vector_size) 265 | ), 266 | dtype=float 267 | ) 268 | # embedding 269 | self.wv.vectors = np.asarray( 270 | np.random.uniform( 271 | low=-4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 272 | high=4 * np.sqrt(6. / (len(self.wv.vocab) + self.wv.vector_size)), 273 | size=(len(self.wv.vocab), self.wv.vector_size) 274 | ), 275 | dtype=float 276 | ) 277 | 278 | def _make_table(self, table_size=100000000., power=0.75): 279 | """ 280 | Create a table using stored vocabulary word counts for drawing random words in the negative 281 | sampling training routines. 
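Word indices are entered into the table proportionally to count**power (the smoothed unigram distribution, power=0.75 by default), so that frequent words are drawn more often as noise words.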
282 |         """
283 |         vocab_size = len(self.wv.vocab)
284 |         logger.info("constructing a table with noise distribution from %i words" % vocab_size)
285 |         # table (= list of words) of noise distribution for negative sampling
286 |         self.table = np.zeros(int(table_size), dtype=int)
287 |         # compute sum of all power (Z in paper)
288 |         train_words_pow = float(sum([self.wv.vocab[word].count**power for word in self.wv.vocab]))
289 |         # go through the whole table and fill it up with the word indexes proportional to a word's count**power
290 |         widx = 0
291 |         # normalize count^0.75 by Z
292 |         d1 = self.wv.vocab[self.wv.index2word[widx]].count**power / train_words_pow
293 |         for tidx in range(int(table_size)):
294 |             self.table[tidx] = widx
295 |             if tidx / table_size > d1:
296 |                 widx += 1
297 |                 d1 += self.wv.vocab[self.wv.index2word[widx]].count**power / train_words_pow
298 |             if widx >= vocab_size:
299 |                 widx = vocab_size - 1
300 | 
301 |     def _create_binary_tree(self):
302 |         """
303 |         Create a binary Huffman tree for the hs model using stored vocabulary word counts.
304 |         Frequent words will have shorter binary codes.
305 |         """
306 |         vocab_size = len(self.wv.vocab)
307 |         logger.info("constructing a huffman tree from %i words" % vocab_size)
308 |         # build the huffman tree
309 |         heap = list(self.wv.vocab.values())
310 |         heapq.heapify(heap)
311 |         for i in range(vocab_size - 1):
312 |             min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
313 |             heapq.heappush(heap, Vocab(count=min1.count + min2.count, index=i + vocab_size, left=min1, right=min2))
314 |         # traverse the tree, assigning a binary code to each vocabulary word
315 |         if heap:
316 |             max_depth, stack = 0, [(heap[0], [], [])]
317 |             while stack:
318 |                 node, codes, points = stack.pop()
319 |                 if node.index < vocab_size:
320 |                     # leaf node => store its path from the root
321 |                     node.code, node.point = codes, points
322 |                     max_depth = max(len(codes), max_depth)
323 |                 else:
324 |                     # inner node => continue the traversal
325 |                     points = np.array(list(points) + [node.index - vocab_size], dtype=int)
326 |                     stack.append((node.left, np.array(list(codes) + [0], dtype=int), points))
327 |                     stack.append((node.right, np.array(list(codes) + [1], dtype=int), points))
328 |             logger.info("built huffman tree with maximum node depth %i" % max_depth)
329 | 
330 |     def train_setup(self, sentences):
331 |         """
332 |         Build the vocabulary and, depending on the settings, the Huffman tree and negative sampling table, then initialize the weights before training starts.
333 |         """
334 |         self.wv.build_vocab(sentences, self.min_count, self.thr)
335 |         # add info about each word's Huffman encoding
336 |         if self.hs:
337 |             self._create_binary_tree()
338 |         # build the table for drawing random words (for negative sampling)
339 |         if self.neg:
340 |             self._make_table()
341 |         # initialize layers
342 |         self.reset_weights()
343 | 
344 |     def train_sentence_sg(self, sentence, alpha):
345 |         """
346 |         Update a skip-gram model by training on a single sentence (batch mode!)
347 |         using hierarchical softmax and/or negative sampling.
348 | 
349 |         The sentence is a list of Vocab objects (or None, where the corresponding
350 |         word is not in the vocabulary). Called internally from `Word2Vec.train()`.
351 |         """
352 |         if self.neg:
353 |             # precompute neg noise labels
354 |             labels = np.zeros(self.neg + 1)
355 |             labels[0] = 1.
356 |         for pos, word in enumerate(sentence):
357 |             if not word or (word.prob and word.prob < np.random.rand()):
358 |                 continue  # OOV word in the input sentence or subsampling => skip
359 |             reduced_window = np.random.randint(self.window - 1)
360 |             # now go over all words from the (reduced) window (at once), predicting each one in turn
361 |             start = max(0, pos - self.window + reduced_window)
362 |             word2_indices = [word2.index for pos2, word2 in enumerate(
363 |                 sentence[start:pos + self.window + 1 - reduced_window], start) if (word2 and not (pos2 == pos))]
364 |             if not word2_indices:
365 |                 continue
366 |             l1 = deepcopy(self.wv.vectors[word2_indices])  # len(word2_indices) x layer1_size
367 |             if self.hs:
368 |                 # work on the entire tree at once --> 2d matrix, codelen x layer1_size
369 |                 l2 = deepcopy(self.syn1[word.point])
370 |                 # propagate hidden -> output (len(word2_indices) x codelen)
371 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
372 |                 # vector of error gradients multiplied by the learning rate
373 |                 g = (1. - np.tile(word.code, (len(word2_indices), 1)) - f) * alpha
374 |                 # learn hidden -> output (codelen x layer1_size) batch update
375 |                 self.syn1[word.point] += np.dot(g.T, l1)
376 |                 # learn input -> hidden
377 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
378 |             if self.neg:
379 |                 # use this word (label = 1) + k other random noise words not from the current context (label = 0)
380 |                 word_indices = [word.index]
381 |                 while len(word_indices) < self.neg + 1:
382 |                     w = self.table[np.random.randint(self.table.shape[0])]
383 |                     if not (w == word.index or w in word2_indices):
384 |                         word_indices.append(w)
385 |                 # 2d matrix, k+1 x layer1_size
386 |                 l2 = deepcopy(self.syn1neg[word_indices])
387 |                 # propagate hidden -> output
388 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
389 |                 # vector of error gradients multiplied by the learning rate
390 |                 g = (np.tile(labels, (len(word2_indices), 1)) - f) * alpha
391 |                 # learn hidden -> output (batch update)
392 |                 self.syn1neg[word_indices] += np.dot(g.T, l1)
393 |                 # learn input -> hidden
394 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
395 |         return len([word for word in sentence if word])
396 | 
397 |     def train_sentence_cbow(self, sentence, alpha):
398 |         """
399 |         Update a CBOW model by training on a single sentence
400 |         using hierarchical softmax and/or negative sampling.
401 | 
402 |         The sentence is a list of Vocab objects (or None, where the corresponding
403 |         word is not in the vocabulary). Called internally from `Word2Vec.train()`.
404 |         """
405 |         if self.neg:
406 |             # precompute neg noise labels
407 |             labels = np.zeros(self.neg + 1)
408 |             labels[0] = 1.
409 |         for pos, word in enumerate(sentence):
410 |             if not word or (word.prob and word.prob < np.random.rand()):
411 |                 continue  # OOV word in the input sentence or subsampling => skip
412 |             reduced_window = np.random.randint(self.window - 1)  # how much is SUBTRACTED from the original window
413 |             # get sum of representation from all words in the (reduced) window (if in vocab and not the `word` itself)
414 |             start = max(0, pos - self.window + reduced_window)
415 |             word2_indices = [word2.index for pos2, word2 in enumerate(
416 |                 sentence[start:pos + self.window + 1 - reduced_window], start) if (word2 and not (pos2 == pos))]
417 |             if not word2_indices:
418 |                 # the sum would be all zeros (and the mean NaNs), so there is nothing to train on
419 |                 continue
420 |             l1 = np.sum(self.wv.vectors[word2_indices], axis=0)  # 1 x layer1_size
421 |             if self.hs:
422 |                 # work on the entire tree at once --> 2d matrix, codelen x layer1_size
423 |                 l2 = deepcopy(self.syn1[word.point])
424 |                 # propagate hidden -> output
425 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
426 |                 # vector of error gradients multiplied by the learning rate
427 |                 g = (1. - word.code - f) * alpha
428 |                 # learn hidden -> output
429 |                 self.syn1[word.point] += np.outer(g, l1)
430 |                 # learn input -> hidden, here for all words in the window separately
431 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
432 |             if self.neg:
433 |                 # use this word (label = 1) + k other random noise words not from the current context (label = 0)
434 |                 word_indices = [word.index]
435 |                 while len(word_indices) < self.neg + 1:
436 |                     w = self.table[np.random.randint(self.table.shape[0])]
437 |                     if not (w == word.index or w in word2_indices):
438 |                         word_indices.append(w)
439 |                 # 2d matrix, k+1 x layer1_size
440 |                 l2 = deepcopy(self.syn1neg[word_indices])
441 |                 # propagate hidden -> output
442 |                 f = 1. / (1. + np.exp(-np.dot(l1, l2.T)))
443 |                 # vector of error gradients multiplied by the learning rate
444 |                 g = (labels - f) * alpha
445 |                 # learn hidden -> output
446 |                 self.syn1neg[word_indices] += np.outer(g, l1)
447 |                 # learn input -> hidden, here for all words in the window separately
448 |                 self.wv.vectors[word2_indices] += np.dot(g, l2)
449 |         return len([word for word in sentence if word])
450 | 
451 |     def train(self, sentences, alpha=False, min_alpha=False):
452 |         """
453 |         Update the model's embeddings and weights from a sequence of sentences (can be a once-only generator stream).
454 |         Each sentence must be a list of strings.
455 |         """
456 |         logger.info("training model on a vocabulary of %i words and %i features" % (len(self.wv.vocab), self.wv.vector_size))
457 |         if not self.wv.vocab:
458 |             self.train_setup(sentences)
459 |         if alpha:
460 |             self.alpha = alpha
461 |         if min_alpha:
462 |             self.min_alpha = min_alpha
463 |         # build the table for drawing random words (for negative sampling)
464 |         # (it is usually deleted before saving)
465 |         if self.neg and self.table is None:
466 |             self._make_table()
467 |         start, next_report = time.time(), 20.
468 |         total_words = sum(v.count for v in self.wv.vocab.values())
469 |         word_count = 0
470 |         for sentence_no, sentence in enumerate(sentences):
471 |             # convert input string lists to Vocab objects (or None for OOV words)
472 |             no_oov = [self.wv.vocab.get(word, None) for word in sentence]
473 |             # update the learning rate before every iteration
474 |             alpha = self.min_alpha + (self.alpha - self.min_alpha) * (1. - word_count / total_words)
475 |             # train on the sentence and count how many words we actually trained on
476 |             # (out-of-vocabulary (unknown) words do not count)
477 |             if self.mtype == 'sg':
478 |                 word_count += self.train_sentence_sg(no_oov, alpha)
479 |             elif self.mtype == 'cbow':
480 |                 word_count += self.train_sentence_cbow(no_oov, alpha)
481 |             else:
482 |                 raise RuntimeError("model type %s not known!" % self.mtype)
483 |             # report progress
484 |             elapsed = time.time() - start
485 |             if elapsed >= next_report:
486 |                 logger.info("PROGRESS: at %.2f%% words, alpha %.05f, %.0f words/s" %
487 |                             (100.0 * word_count / total_words, alpha, word_count / elapsed if elapsed else 0.0))
488 |                 next_report = elapsed + 20.  # don't flood the log, wait at least 20 seconds between progress reports
489 |         elapsed = time.time() - start
490 |         logger.info("training on %i words took %.1fs, %.0f words/s" %
491 |                     (word_count, elapsed, word_count / elapsed if elapsed else 0.0))
492 |         # compute the normalized embeddings for later use
493 |         self.wv.init_sims()
494 | 
--------------------------------------------------------------------------------
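
The following is a minimal usage sketch (not part of the repository) for the `Word2Vec` class defined in `conec/word2vec.py` above. It assumes a small, already tokenized toy corpus, and it assumes that the similarity method whose body is shown above is exposed as `most_similar(positive=..., negative=..., topn=...)`, following the gensim convention this code was adapted from; for real experiments, train on a large corpus such as text8 as described in the README.

```python
import logging
from conec import word2vec

# show the logger.info progress messages emitted during training
logging.basicConfig(level=logging.INFO)

# toy corpus: train() expects an iterable of sentences, each a list of strings;
# the repetition ensures every word passes the default min_count=5 threshold
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]] * 500

# train a small skip-gram model with hierarchical softmax (the defaults: hs=1, neg=0)
model = word2vec.Word2Vec(sentences, vector_size=20, mtype='sg', seed=1)

# nearest neighbors by cosine similarity of the length-normalized embeddings
print(model.wv.most_similar(positive=["cat"], topn=3))
# pick the word that doesn't fit with the others
print(model.wv.doesnt_match(["cat", "dog", "on"]))
```

With embeddings trained on a real corpus, the same `model.wv` object can then be combined with the context matrix from `context2vec.py` to obtain the ConEc embeddings, as outlined in the README.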