├── README.md ├── __init__.py ├── dataset.py ├── files ├── cbow_hs.png ├── cbow_ns.png ├── huffman.png ├── sent.png ├── sg_hs.png └── sg_ns.png ├── run_training.py ├── tf2.x ├── README.md ├── dataset.py ├── demo_word_similarity.py ├── model.py ├── run_training.py ├── sample_corpus.txt ├── utils.py └── word_vectors.py └── word2vec.py /README.md: -------------------------------------------------------------------------------- 1 | # Word2Vec: Learning distributed word representation from unlabeled text. 2 | 3 | **Update**: [TensorFlow 2.x](tf2.x) 4 | 5 | Word2Vec is a classic model for learning distributed word representation from a large unlabeled dataset. There have been many implementations out there since its introduction (e.g. the original C implementation and the gensim implementation). This is an attempt to reimplement word2vec in TensorFlow using the `tf.data.Dataset` APIs, a recommended way to streamline data preprocessing for TensorFlow models. 6 | 7 | ### Usage 8 | 1. Clone the repository. 9 | ``` 10 | git clone git@github.com:chao-ji/tf-word2vec.git 11 | ``` 12 | 2. Prepare your data. 13 | Your data should be a number of text files where each line contains a sentence, and words are delimited by spaces. 14 | 15 | 3. Parameter settings. 16 | This implementation allows you to train the model under the *skip gram* or *continuous bag-of-words* architecture (`--arch`), and perform training using *negative sampling* or *hierarchical softmax* (`--algm`). To see a full list of parameters, run `python run_training.py --help`. 17 | 18 | 4. Run. 19 | Example: 20 | ``` 21 | python run_training.py \ 22 | --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \ 23 | --out_dir=/PATH/TO/OUT_DIR/ \ 24 | --epochs=5 \ 25 | --batch_size=64 \ 26 | --window_size=5 27 | ``` 28 | The vocabulary words and word embeddings will be saved to `vocab.txt` and `embed.npy` (which can be loaded using `np.load`). 29 | 30 | ### Sample results 31 | 32 | The model was trained on the IMDB movie review dataset using the following parameters: 33 | 34 | ``` 35 | --arch=skip_gram --algm=negative_sampling --batch_size=256 --max_vocab_size=0 --min_count=10 --sample=1e-3 --window_size=10 --embed_size=300 --negatives=5 --power=0.75 --alpha=0.025 --min_alpha=0.0001 --epochs=5 36 | ``` 37 | 38 | Below is a sample list of queries with their most similar words. 39 | ``` 40 | query: actor 41 | [('actors', 0.5314413), 42 | ('actress', 0.52641004), 43 | ('performer', 0.43144277), 44 | ('role', 0.40702546), 45 | ('comedian', 0.3910208), 46 | ('performance', 0.37695402), 47 | ('versatile', 0.35130078), 48 | ('actresses', 0.32896513), 49 | ('cast', 0.3219274), 50 | ('performers', 0.31659046)] 51 | ``` 52 | 53 | ``` 54 | query: .
55 | [('!', 0.6234603), 56 | ('?', 0.39236775), 57 | ('and', 0.36783764), 58 | (',', 0.3090561), 59 | ('but', 0.28012913), 60 | ('which', 0.23897173), 61 | (';', 0.22881404), 62 | ('cornerstone', 0.20761433), 63 | ('although', 0.20554386), 64 | ('...', 0.19846405)] 65 | 66 | ``` 67 | 68 | ``` 69 | query: ask 70 | [('asked', 0.54287535), 71 | ('asking', 0.5349437), 72 | ('asks', 0.5262491), 73 | ('question', 0.4397335), 74 | ('answer', 0.3868001), 75 | ('questions', 0.37007764), 76 | ('begs', 0.35407144), 77 | ('wonder', 0.3537388), 78 | ('answers', 0.3410588), 79 | ('wondering', 0.32832426)] 80 | ``` 81 | 82 | ``` 83 | query: you 84 | [('yourself', 0.51918006), 85 | ('u', 0.48620683), 86 | ('your', 0.47644556), 87 | ("'ll", 0.38544628), 88 | ('ya', 0.35932386), 89 | ('we', 0.35398778), 90 | ('i', 0.34099358), 91 | ('unless', 0.3306447), 92 | ('if', 0.3237356), 93 | ("'re", 0.32068467)] 94 | ``` 95 | 96 | ``` 97 | query: amazing 98 | [('incredible', 0.6467944), 99 | ('fantastic', 0.5760295), 100 | ('excellent', 0.56906724), 101 | ('awesome', 0.5625062), 102 | ('wonderful', 0.52154255), 103 | ('extraordinary', 0.519134), 104 | ('remarkable', 0.50572175), 105 | ('outstanding', 0.5042475), 106 | ('superb', 0.5008434), 107 | ('brilliant', 0.47915617)] 108 | ``` 109 | ### Building dataset pipeline 110 | 111 | Here is a concrete example of converting a raw sentence into matrices holding the data to train Word2Vec model with either `skip_gram` or `cbow` architecture. 112 | 113 | Suppose we have a sentence in the corpus: `the quick brown fox jumps over the lazy dog`, with the window sizes (max num of words to the left or right of target word) below the words. Assume that the sentence has already been subsampled and words mapped to indices. 114 | 115 | We call each of the word in the sentence **target word**, and those words within the window centered at target word **context words**. For example, `quick` and `brown` are context words of target word `the`, and `the`, `brown`, `fox` are context words of target word `quick`. 116 | 117 |

![Example sentence with per-word window sizes](files/sent.png)
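To make the target/context relationship concrete, here is a minimal plain-Python sketch (not part of the repository) that lists the context words of each target word in the example sentence, assuming a fixed window size of 2 and ignoring subsampling and the random window shrinking used during training:

```
sentence = "the quick brown fox jumps over the lazy dog".split()
window_size = 2  # max num of words on the left or right of the target word

for i, target in enumerate(sentence):
    # context words are those within the window centered at the target word
    left = sentence[max(i - window_size, 0):i]
    right = sentence[i + 1:i + 1 + window_size]
    print(target, '->', left + right)
```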

For `skip_gram`, the task is to predict the context words given the target word. The index of each target word is simply replicated to match the number of its context words. This becomes our **input matrix**.

![Skip gram, negative sampling](files/sg_ns.png)
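The sketch below (NumPy, with toy word indices and a fixed window size; the actual `generate_instances` in `dataset.py` additionally shrinks the window by a random amount per target word) shows how such a skip-gram instance matrix could be assembled, with the target index in the first column and the context index in the second:

```
import numpy as np

indices = [0, 1, 2, 3, 4]  # a toy subsampled sentence, words already mapped to indices
window_size = 2

rows = []
for i, target in enumerate(indices):
    context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
    # replicate the target index once per context word
    rows.extend([target, c] for c in context)

instances = np.array(rows)  # shape [N, 2]
inputs, labels = instances[:, 0], instances[:, 1]  # inputs and labels for negative sampling
```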

For `cbow`, the task is to predict the target word given its context words. Because each target word may have a variable number of context words, we pad the list of context words to the maximum possible size (`2*window_size`) and append the true number of context words.

![Continuous bag of words, negative sampling](files/cbow_ns.png)
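Here is an analogous NumPy sketch of the cbow layout for the same toy word indices (again with a fixed window size); each row holds the context indices zero-padded to `2*window_size`, followed by the true number of context words and the target word:

```
import numpy as np

indices = [0, 1, 2, 3, 4]  # a toy subsampled sentence, words already mapped to indices
window_size = 2

rows = []
for i, target in enumerate(indices):
    context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
    true_size = len(context)
    padded = context + [0] * (2 * window_size - true_size)  # pad context to 2*window_size
    rows.append(padded + [true_size, target])  # one row per target word

instances = np.array(rows)  # shape [N, 2*window_size + 2]
inputs, labels = instances[:, :-1], instances[:, -1]  # inputs and labels for negative sampling
```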

If the training algorithm is `negative_sampling`, we simply populate the **label matrix** with the indices of the words to be predicted: context words for `skip_gram`, or target words for `cbow`.

If the training algorithm is `hierarchical_softmax`, a Huffman tree is built for the collection of vocabulary words. Each vocabulary word is associated with exactly one leaf node, and the words to be predicted in the case of `negative_sampling` are replaced by a sequence of `codes` and `points` determined by the internal nodes along the root-to-leaf path. For example, `E`'s `codes` and `points` would be `1`, `0`, `1`, `0` and `3782`, `8435`, `590`, `7103`, respectively. We populate the **label matrix** with the `codes` and `points` (each padded up to `max_depth`), along with the true length of `codes`/`points`. A toy sketch of this construction follows the figures below.

![Huffman tree](files/huffman.png)

![Skip gram, hierarchical softmax](files/sg_hs.png)

![Continuous bag of words, hierarchical softmax](files/cbow_hs.png)
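To illustrate where `codes` and `points` come from, below is a simplified, self-contained sketch of the Huffman-tree construction over a toy vocabulary with made-up counts. It mirrors the idea behind `_build_binary_tree` in `dataset.py`, but omits the padding of `codes` and `points` to `max_depth`:

```
import heapq

# toy vocabulary: word indices 0..4 with made-up counts, sorted by descending frequency
unigram_counts = [50, 30, 20, 15, 5]
vocab_size = len(unigram_counts)

# build the Huffman tree by repeatedly merging the two least frequent nodes
heap = [[count, index] for index, count in enumerate(unigram_counts)]
heapq.heapify(heap)
for i in range(vocab_size - 1):
    lo, hi = heapq.heappop(heap), heapq.heappop(heap)
    heapq.heappush(heap, [lo[0] + hi[0], i + vocab_size, lo, hi])

# traverse the tree: `code` collects the 0/1 branch labels and `point` the
# internal-node indices along the path from the root down to each leaf (vocab word)
codes, points = {}, {}
stack = [(heap[0], [], [])]
while stack:
    node, code, point = stack.pop()
    if node[1] < vocab_size:  # leaf node, i.e. a vocabulary word
        codes[node[1]], points[node[1]] = code, point
    else:                     # internal node
        point = point + [node[1] - vocab_size]
        stack.append((node[2], code + [0], point))
        stack.append((node[3], code + [1], point))

for word_index in range(vocab_size):
    print(word_index, 'codes:', codes[word_index], 'points:', points[word_index])
```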

159 | 160 | In summary, an **input matrix** and a **label matrix** is created from a raw input sentence that provides the input and label information for the prediction task. 161 | 162 | 163 | 164 | ### Reference 165 | 1. T Mikolov, K Chen, G Corrado, J Dean - Efficient Estimation of Word Representations in Vector Space, ICLR 2013 166 | 2. T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean - Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013 167 | 3. Original implementation by Mikolov, https://code.google.com/archive/p/word2vec/ 168 | 4. Gensim implementation by Radim Řehůřek, https://radimrehurek.com/gensim/models/word2vec.html 169 | 5. IMDB Movie Review dataset, http://ai.stanford.edu/~amaas/data/sentiment/ 170 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/__init__.py -------------------------------------------------------------------------------- /dataset.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import itertools 3 | import collections 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from functools import partial 9 | 10 | OOV_ID = -1 11 | 12 | 13 | class Word2VecDataset(object): 14 | """Dataset for generating matrices holding word indices to train Word2Vec 15 | models. 16 | """ 17 | def __init__(self, 18 | arch='skip_gram', 19 | algm='negative_sampling', 20 | epochs=5, 21 | batch_size=100, 22 | max_vocab_size=0, 23 | min_count=2, 24 | sample=1e-3, 25 | window_size=5): 26 | """Constructor. 27 | 28 | Args: 29 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 30 | algm: string scalar: training algorithm ('negative_sampling' or 31 | 'hierarchical_softmax'). 32 | epochs: int scalar, num times the dataset is iterated. 33 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have 34 | shapes [batch_size, :]. 35 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, the top 36 | `max_vocab_size` most frequent words are kept in vocabulary. 37 | min_count: int scalar, words whose counts < `min_count` are not included 38 | in the vocabulary. 39 | sample: float scalar, subsampling rate. 40 | window_size: int scalar, num of words on the left or right side of 41 | target word within a window. 42 | """ 43 | self._arch = arch 44 | self._algm = algm 45 | self._epochs = epochs 46 | self._batch_size = batch_size 47 | self._max_vocab_size = max_vocab_size 48 | self._min_count = min_count 49 | self._sample = sample 50 | self._window_size = window_size 51 | 52 | self._iterator_initializer = None 53 | self._table_words = None 54 | self._unigram_counts = None 55 | self._keep_probs = None 56 | self._corpus_size = None 57 | self._max_depth = None 58 | 59 | @property 60 | def iterator_initializer(self): 61 | return self._iterator_initializer 62 | 63 | @property 64 | def table_words(self): 65 | return self._table_words 66 | 67 | @property 68 | def unigram_counts(self): 69 | return self._unigram_counts 70 | 71 | def _build_raw_vocab(self, filenames): 72 | """Builds raw vocabulary. 73 | 74 | Args: 75 | filenames: list of strings, holding names of text files. 76 | 77 | Returns: 78 | raw_vocab: a list of 2-tuples holding the word (string) and count (int), 79 | sorted in descending order of word count. 
80 | """ 81 | map_open = partial(open, encoding="utf-8") 82 | lines = itertools.chain(*map(map_open, filenames)) 83 | raw_vocab = collections.Counter() 84 | for line in lines: 85 | raw_vocab.update(line.strip().split()) 86 | raw_vocab = raw_vocab.most_common() 87 | if self._max_vocab_size > 0: 88 | raw_vocab = raw_vocab[:self._max_vocab_size] 89 | return raw_vocab 90 | 91 | def build_vocab(self, filenames): 92 | """Builds vocabulary. 93 | 94 | Has the side effect of setting the following attributes: 95 | - table_words: list of string, holding the list of vocabulary words. Index 96 | of each entry is the same as the word index into the vocabulary. 97 | - unigram_counts: list of int, holding word counts. Index of each entry 98 | is the same as the word index into the vocabulary. 99 | - keep_probs: list of float, holding words' keep prob for subsampling. 100 | Index of each entry is the same as the word index into the vocabulary. 101 | - corpus_size: int scalar, effective corpus size. 102 | 103 | Args: 104 | filenames: list of strings, holding names of text files. 105 | """ 106 | raw_vocab = self._build_raw_vocab(filenames) 107 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count] 108 | self._corpus_size = sum(list(zip(*raw_vocab))[1]) 109 | 110 | self._table_words = [] 111 | self._unigram_counts = [] 112 | self._keep_probs = [] 113 | for word, count in raw_vocab: 114 | frac = count / float(self._corpus_size) 115 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac) 116 | keep_prob = np.minimum(keep_prob, 1.0) 117 | self._table_words.append(word) 118 | self._unigram_counts.append(count) 119 | self._keep_probs.append(keep_prob) 120 | 121 | def _build_binary_tree(self, unigram_counts): 122 | """Builds a Huffman tree for hierarchical softmax. Has the side effect 123 | of setting `max_depth`. 124 | 125 | Args: 126 | unigram_counts: list of int, holding word counts. Index of each entry 127 | is the same as the word index into the vocabulary. 128 | 129 | Returns: 130 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1] 131 | where each row holds the codes (0-1 binary values) padded to 132 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`, 133 | of each vocabulary word. The last entry is the true length of code 134 | and point (<= `max_depth`). 
135 | """ 136 | vocab_size = len(unigram_counts) 137 | heap = [[unigram_counts[i], i] for i in range(vocab_size)] 138 | heapq.heapify(heap) 139 | for i in range(vocab_size - 1): 140 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap) 141 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2]) 142 | 143 | node_list = [] 144 | max_depth, stack = 0, [[heap[0], [], []]] 145 | while stack: 146 | node, code, point = stack.pop() 147 | if node[1] < vocab_size: 148 | node.extend([code, point, len(point)]) 149 | max_depth = np.maximum(len(code), max_depth) 150 | node_list.append(node) 151 | else: 152 | point = np.array(list(point) + [node[1]-vocab_size]) 153 | stack.append([node[2], np.array(list(code)+[0]), point]) 154 | stack.append([node[3], np.array(list(code)+[1]), point]) 155 | 156 | node_list = sorted(node_list, key=lambda items: items[1]) 157 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int32) 158 | for i in range(len(node_list)): 159 | length = node_list[i][4] # length of code or point 160 | codes_points[i, -1] = length 161 | codes_points[i, :length] = node_list[i][2] # code 162 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point 163 | self._max_depth = max_depth 164 | return codes_points 165 | 166 | def _prepare_inputs_labels(self, tensor): 167 | """Set shape of `tensor` according to architecture and training algorithm, 168 | and split `tensor` into `inputs` and `labels`. 169 | 170 | Args: 171 | tensor: rank-2 int tensor, holding word indices for prediction inputs 172 | and prediction labels, returned by `generate_instances`. 173 | 174 | Returns: 175 | inputs: rank-2 int tensor, holding word indices for prediction inputs. 176 | labels: rank-2 int tensor, holding word indices for prediction labels. 177 | """ 178 | if self._arch == 'skip_gram': 179 | if self._algm == 'negative_sampling': 180 | tensor.set_shape([self._batch_size, 2]) 181 | else: 182 | tensor.set_shape([self._batch_size, 2*self._max_depth+2]) 183 | inputs = tensor[:, :1] 184 | labels = tensor[:, 1:] 185 | else: 186 | if self._algm == 'negative_sampling': 187 | tensor.set_shape([self._batch_size, 2*self._window_size+2]) 188 | else: 189 | tensor.set_shape([self._batch_size, 190 | 2*self._window_size+2*self._max_depth+2]) 191 | inputs = tensor[:, :2*self._window_size+1] 192 | labels = tensor[:, 2*self._window_size+1:] 193 | return inputs, labels 194 | 195 | def get_tensor_dict(self, filenames): 196 | """Generates tensor dict mapping from tensor names to tensors. 197 | 198 | Args: 199 | filenames: list of strings, holding names of text files. 200 | 201 | Returns: 202 | tensor_dict: a dict mapping from tensor names to tensors with shape being: 203 | when arch=='skip_gram', algm=='negative_sampling' 204 | inputs: [N], labels: [N] 205 | when arch=='cbow', algm=='negative_sampling' 206 | inputs: [N, 2*window_size+1], labels: [N] 207 | when arch=='skip_gram', algm=='hierarchical_softmax' 208 | inputs: [N], labels: [N, 2*max_depth+1] 209 | when arch=='cbow', algm=='hierarchical_softmax' 210 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1] 211 | progress: [N], the percentage of sentences covered so far. Used to 212 | compute learning rate. 
213 | """ 214 | table_words = self._table_words 215 | unigram_counts = self._unigram_counts 216 | keep_probs = self._keep_probs 217 | if not table_words or not unigram_counts or not keep_probs: 218 | raise ValueError('`table_words`, `unigram_counts`, and `keep_probs` must', 219 | 'be set by calling `build_vocab()`') 220 | 221 | if self._algm == 'hierarchical_softmax': 222 | codes_points = tf.constant(self._build_binary_tree(unigram_counts)) 223 | elif self._algm == 'negative_sampling': 224 | codes_points = None 225 | else: 226 | raise ValueError('algm must be hierarchical_softmax or negative_sampling') 227 | 228 | table_words = tf.contrib.lookup.index_table_from_tensor( 229 | tf.constant(table_words), default_value=OOV_ID) 230 | keep_probs = tf.constant(keep_probs) 231 | 232 | num_sents = sum([len(list(open(fn, encoding="utf-8") 233 | )) for fn in filenames]) 234 | num_sents = self._epochs * num_sents 235 | 236 | # include epoch number, like progress 237 | a_zip = tf.data.TextLineDataset(filenames).repeat(self._epochs) 238 | b_zip = tf.range(1, 1+num_sents) / num_sents 239 | c_zip = tf.repeat(tf.range(1, 1+self._epochs), int(num_sents / self._epochs)) 240 | 241 | dataset = tf.data.Dataset.zip((a_zip, 242 | tf.data.Dataset.from_tensor_slices(b_zip), 243 | tf.data.Dataset.from_tensor_slices(c_zip))) 244 | 245 | dataset = dataset.map(lambda sent, progress, epoch: 246 | (get_word_indices(sent, table_words), progress, epoch)) 247 | dataset = dataset.map(lambda indices, progress, epoch: 248 | (subsample(indices, keep_probs), progress, epoch)) 249 | dataset = dataset.filter(lambda indices, progress, epoch: 250 | tf.greater(tf.size(indices), 1)) 251 | 252 | dataset = dataset.map(lambda indices, progress, epoch: ( 253 | generate_instances( 254 | indices, self._arch, self._window_size, codes_points), progress, epoch)) 255 | 256 | dataset = dataset.map(lambda instances, progress, epoch: ( 257 | instances, tf.fill(tf.shape(instances)[:1], progress), 258 | tf.fill(tf.shape(instances)[:1], epoch))) 259 | 260 | dataset = dataset.flat_map(lambda instances, progress, epoch: 261 | tf.data.Dataset.from_tensor_slices((instances, progress, epoch))) 262 | dataset = dataset.batch(self._batch_size, drop_remainder=True) 263 | 264 | iterator = tf.compat.v1.data.make_initializable_iterator(dataset) 265 | self._iterator_initializer = iterator.initializer 266 | tensor, progress, epoch = iterator.get_next() 267 | progress.set_shape([self._batch_size]) 268 | epoch.set_shape([self._batch_size]) 269 | 270 | inputs, labels = self._prepare_inputs_labels(tensor) 271 | if self._arch == 'skip_gram': 272 | inputs = tf.squeeze(inputs, axis=1) 273 | if self._algm == 'negative_sampling': 274 | labels = tf.squeeze(labels, axis=1) 275 | 276 | return {'inputs': inputs, 'labels': labels, 'progress': progress, 'epoch': epoch} 277 | 278 | 279 | def get_word_indices(sent, table_words): 280 | """Converts a sentence into a list of word indices. 281 | 282 | Args: 283 | sent: a scalar string tensor, a sentence where words are space-delimited. 284 | table_words: a `HashTable` mapping from words (string tensor) to word 285 | indices (int tensor). 286 | 287 | Returns: 288 | indices: rank-1 int tensor, the word indices within a sentence. 289 | """ 290 | words = tf.string_split([sent]).values 291 | indices = tf.to_int32(table_words.lookup(words)) 292 | return indices 293 | 294 | 295 | def subsample(indices, keep_probs): 296 | """Filters out-of-vocabulary words and then applies subsampling on words in a 297 | sentence. 
Words with high frequencies have lower keep probs. 298 | 299 | Args: 300 | indices: rank-1 int tensor, the word indices within a sentence. 301 | keep_probs: rank-1 float tensor, the prob to drop the each vocabulary word. 302 | 303 | Returns: 304 | indices: rank-1 int tensor, the word indices within a sentence after 305 | subsampling. 306 | """ 307 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID)) 308 | keep_probs = tf.gather(keep_probs, indices) 309 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1) 310 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs)) 311 | return indices 312 | 313 | 314 | def generate_instances(indices, arch, window_size, codes_points=None): 315 | """Generates matrices holding word indices to be passed to Word2Vec models 316 | for each sentence. The shape and contents of output matrices depends on the 317 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling' 318 | , 'hierarchical_softmax'). 319 | 320 | It takes as input a list of word indices in a subsampled-sentence, where each 321 | word is a target word, and their context words are those within the window 322 | centered at a target word. For skip gram architecture, `num_context_words` 323 | instances are generated for a target word, and for cbow architecture, a single 324 | instance is generated for a target word. 325 | 326 | If `codes_points` is not None ('hierarchical softmax'), the word to be 327 | predicted (context word for 'skip_gram', and target word for 'cbow') are 328 | represented by their 'codes' and 'points' in the Huffman tree (See 329 | `_build_binary_tree`). 330 | 331 | Args: 332 | indices: rank-1 int tensor, the word indices within a sentence after 333 | subsampling. 334 | arch: scalar string, architecture ('skip_gram' or 'cbow'). 335 | window_size: int scalar, num of words on the left or right side of 336 | target word within a window. 337 | codes_points: None, or an int tensor of shape [vocab_size, 2*max_depth+1] 338 | where each row holds the codes (0-1 binary values) padded to `max_depth`, 339 | and points (non-leaf node indices) padded to `max_depth`, of each 340 | vocabulary word. The last entry is the true length of code and point 341 | (<= `max_depth`). 
342 | 343 | Returns: 344 | instances: an int tensor holding word indices, with shape being 345 | when arch=='skip_gram', algm=='negative_sampling' 346 | shape: [N, 2] 347 | when arch=='cbow', algm=='negative_sampling' 348 | shape: [N, 2*window_size+2] 349 | when arch=='skip_gram', algm=='hierarchical_softmax' 350 | shape: [N, 2*max_depth+2] 351 | when arch=='cbow', algm='hierarchical_softmax' 352 | shape: [N, 2*window_size+2*max_depth+2] 353 | """ 354 | def per_target_fn(index, init_array): 355 | reduced_size = tf.random.uniform([], maxval=window_size, dtype=tf.int32) 356 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index) 357 | right = tf.range(index + 1, 358 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices))) 359 | context = tf.concat([left, right], axis=0) 360 | context = tf.gather(indices, context) 361 | 362 | if arch == 'skip_gram': 363 | window = tf.stack([tf.fill(tf.shape(context), indices[index]), 364 | context], axis=1) 365 | elif arch == 'cbow': 366 | true_size = tf.size(context) 367 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]), 368 | [true_size, indices[index]]], axis=0) 369 | window = tf.expand_dims(window, axis=0) 370 | else: 371 | raise ValueError('architecture must be skip_gram or cbow.') 372 | 373 | if codes_points is not None: 374 | window = tf.concat([window[:, :-1], 375 | tf.gather(codes_points, window[:, -1])], axis=1) 376 | return index + 1, init_array.write(index, window) 377 | 378 | size = tf.size(indices) 379 | init_array = tf.TensorArray(tf.int32, size=size, infer_shape=False) 380 | _, result_array = tf.while_loop(lambda i, ta: i < size, 381 | per_target_fn, 382 | [0, init_array], 383 | back_prop=False) 384 | instances = tf.cast(result_array.concat(), tf.int64) 385 | return instances 386 | 387 | -------------------------------------------------------------------------------- /files/cbow_hs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_hs.png -------------------------------------------------------------------------------- /files/cbow_ns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_ns.png -------------------------------------------------------------------------------- /files/huffman.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/huffman.png -------------------------------------------------------------------------------- /files/sent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sent.png -------------------------------------------------------------------------------- /files/sg_hs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_hs.png -------------------------------------------------------------------------------- /files/sg_ns.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_ns.png -------------------------------------------------------------------------------- /run_training.py: -------------------------------------------------------------------------------- 1 | r"""Executable for training Word2Vec models. 2 | 3 | Example: 4 | python run_training.py \ 5 | --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \ 6 | --out_dir=/PATH/TO/OUT_DIR/ \ 7 | --batch_size=64 \ 8 | --window_size=5 \ 9 | 10 | Learned word embeddings will be saved to /PATH/TO/OUT_DIR/embed.npy, and 11 | vocabulary saved to /PATH/TO/OUT_DIR/vocab.txt 12 | """ 13 | import os 14 | import time 15 | 16 | import tensorflow as tf 17 | import numpy as np 18 | 19 | # import project files 20 | from dataset import Word2VecDataset 21 | from word2vec import Word2VecModel 22 | 23 | flags = tf.app.flags 24 | 25 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).') 26 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm ' 27 | '(negative_sampling or hierarchical_softmax).') 28 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate training data.') 29 | flags.DEFINE_integer('batch_size', 256, 'Batch size.') 30 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. If > 0, ' 31 | 'the top `max_vocab_size` most frequent words are kept in vocabulary.') 32 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` are not' 33 | ' included in the vocabulary.') 34 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.') 35 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side' 36 | ' of target word within a window.') 37 | 38 | flags.DEFINE_integer('embed_size', 300, 'Length of word vector.') 39 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.') 40 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.') 41 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.') 42 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.') 43 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct ' 44 | 'between syn0 and syn1 vectors.') 45 | 46 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to ' 47 | ' output logs.') 48 | flags.DEFINE_list('filenames', None, 'Names of comma-separated input text files.') 49 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.') 50 | 51 | FLAGS = flags.FLAGS 52 | 53 | 54 | def main(_): 55 | dataset = Word2VecDataset(arch=FLAGS.arch, 56 | algm=FLAGS.algm, 57 | epochs=FLAGS.epochs, 58 | batch_size=FLAGS.batch_size, 59 | max_vocab_size=FLAGS.max_vocab_size, 60 | min_count=FLAGS.min_count, 61 | sample=FLAGS.sample, 62 | window_size=FLAGS.window_size) 63 | dataset.build_vocab(FLAGS.filenames) 64 | 65 | word2vec = Word2VecModel(arch=FLAGS.arch, 66 | algm=FLAGS.algm, 67 | embed_size=FLAGS.embed_size, 68 | batch_size=FLAGS.batch_size, 69 | negatives=FLAGS.negatives, 70 | power=FLAGS.power, 71 | alpha=FLAGS.alpha, 72 | min_alpha=FLAGS.min_alpha, 73 | add_bias=FLAGS.add_bias, 74 | random_seed=0) 75 | to_be_run_dict = word2vec.train(dataset, FLAGS.filenames) 76 | 77 | with tf.Session() as sess: 78 | sess.run(dataset.iterator_initializer) 79 | sess.run(tf.tables_initializer()) 80 | sess.run(tf.global_variables_initializer()) 81 | 82 | average_loss = 0. 
83 | step = 0 84 | while True: 85 | try: 86 | result_dict = sess.run(to_be_run_dict) 87 | except tf.errors.OutOfRangeError: 88 | break 89 | 90 | average_loss += result_dict['loss'].mean() 91 | if step % FLAGS.log_per_steps == 0: 92 | if step > 0: 93 | average_loss /= FLAGS.log_per_steps 94 | print('step:', step, 'average_loss:', average_loss, 95 | 'learning_rate:', result_dict['learning_rate']) 96 | average_loss = 0. 97 | 98 | step += 1 99 | 100 | syn0_final = sess.run(word2vec.syn0) 101 | 102 | np.save(os.path.join(FLAGS.out_dir, 'embed'), syn0_final) 103 | with open(os.path.join(FLAGS.out_dir, 'vocab.txt'), 'w', encoding="utf-8") as fid: 104 | for w in dataset.table_words: 105 | fid.write(w + '\n') 106 | 107 | print('Word embeddings saved to', os.path.join(FLAGS.out_dir, 'embed.npy')) 108 | print('Vocabulary saved to', os.path.join(FLAGS.out_dir, 'vocab.txt')) 109 | 110 | if __name__ == '__main__': 111 | tf.flags.mark_flag_as_required('filenames') 112 | 113 | tf.app.run() 114 | -------------------------------------------------------------------------------- /tf2.x/README.md: -------------------------------------------------------------------------------- 1 | This is the same model implemented in TensorFlow 2.x. Detailed usage information can be found in the [original README](../README.md). 2 | -------------------------------------------------------------------------------- /tf2.x/dataset.py: -------------------------------------------------------------------------------- 1 | """Defines word tokenizer and word2vec dataset builder. 2 | """ 3 | import heapq 4 | import itertools 5 | import collections 6 | 7 | import numpy as np 8 | import tensorflow as tf 9 | 10 | OOV_ID = -1 11 | 12 | 13 | class WordTokenizer(object): 14 | """Vanilla word tokenizer that spits out space-separated tokens from raw text 15 | string. Note for non-space separated languages, the corpus must be 16 | pre-tokenized such that tokens are space-delimited. 17 | """ 18 | def __init__(self, max_vocab_size=0, min_count=10, sample=1e-3): 19 | """Constructor. 20 | 21 | Args: 22 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, only the top 23 | `max_vocab_size` most frequent words will be kept in vocabulary. 24 | min_count: int scalar, words whose counts < `min_count` will not be 25 | included in the vocabulary. 26 | sample: float scalar, subsampling rate. 27 | """ 28 | self._max_vocab_size = max_vocab_size 29 | self._min_count = min_count 30 | self._sample = sample 31 | 32 | self._vocab = None 33 | self._table_words = None 34 | self._unigram_counts = None 35 | self._keep_probs = None 36 | 37 | @property 38 | def unigram_counts(self): 39 | return self._unigram_counts 40 | 41 | @property 42 | def table_words(self): 43 | return self._table_words 44 | 45 | def _build_raw_vocab(self, filenames): 46 | """Builds raw vocabulary by iterate through the corpus once and count the 47 | unique words. 48 | 49 | Args: 50 | filenames: list of strings, holding names of text files. 51 | 52 | Returns: 53 | raw_vocab: a list of 2-tuples holding the word (string) and count (int), 54 | sorted in descending order of word count. 
55 | """ 56 | lines = [] 57 | for fn in filenames: 58 | with tf.io.gfile.GFile(fn) as f: 59 | lines.append(f) 60 | lines = itertools.chain(*lines) 61 | 62 | raw_vocab = collections.Counter() 63 | for line in lines: 64 | raw_vocab.update(line.strip().split()) 65 | raw_vocab = raw_vocab.most_common() 66 | # truncate to have at most `max_vocab_size` vocab words 67 | if self._max_vocab_size > 0: 68 | raw_vocab = raw_vocab[:self._max_vocab_size] 69 | return raw_vocab 70 | 71 | def build_vocab(self, filenames): 72 | """Builds the vocabulary. 73 | 74 | Has the side effect of setting the following attributes: for each word 75 | `word` we have 76 | 77 | vocab[word] = index 78 | table_words[index] = word `word` 79 | unigram_counts[index] = count of `word` in vocab 80 | keep_probs[index] = keep prob of `word` for subsampling 81 | 82 | Args: 83 | filenames: list of strings, holding names of text files. 84 | """ 85 | raw_vocab = self._build_raw_vocab(filenames) 86 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count] 87 | self._corpus_size = sum(list(zip(*raw_vocab))[1]) 88 | 89 | self._vocab = {} 90 | self._table_words = [] 91 | self._unigram_counts = [] 92 | self._keep_probs = [] 93 | for index, (word, count) in enumerate(raw_vocab): 94 | frac = count / float(self._corpus_size) 95 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac) 96 | keep_prob = np.minimum(keep_prob, 1.0) 97 | self._vocab[word] = index 98 | self._table_words.append(word) 99 | self._unigram_counts.append(count) 100 | self._keep_probs.append(keep_prob) 101 | 102 | def encode(self, string): 103 | """Split raw text string into tokens (space-separated) and tranlate to token 104 | ids. 105 | 106 | Args: 107 | string: string scalar, the raw text string to be tokenized. 108 | 109 | Returns: 110 | ids: a list of ints, the token ids of the tokenized string. 111 | """ 112 | tokens = string.strip().split() 113 | ids = [self._vocab[token] if token in self._vocab else OOV_ID 114 | for token in tokens] 115 | return ids 116 | 117 | 118 | class Word2VecDatasetBuilder(object): 119 | """Builds a tf.data.Dataset instance that generates matrices holding word 120 | indices for training Word2Vec models. 121 | """ 122 | def __init__(self, 123 | tokenizer, 124 | arch='skip_gram', 125 | algm='negative_sampling', 126 | epochs=1, 127 | batch_size=32, 128 | window_size=5): 129 | """Constructor. 130 | 131 | Args: 132 | epochs: int scalar, num times the dataset is iterated. 133 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have 134 | shapes [batch_size, :]. 135 | window_size: int scalar, num of words on the left or right side of 136 | target word within a window. 137 | """ 138 | self._tokenizer = tokenizer 139 | self._arch = arch 140 | self._algm = algm 141 | self._epochs = epochs 142 | self._batch_size = batch_size 143 | self._window_size = window_size 144 | 145 | self._max_depth = None 146 | 147 | def _build_binary_tree(self, unigram_counts): 148 | """Builds a Huffman tree for hierarchical softmax. Has the side effect 149 | of setting `max_depth`. 150 | 151 | Args: 152 | unigram_counts: list of int, holding word counts. Index of each entry 153 | is the same as the word index into the vocabulary. 154 | 155 | Returns: 156 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1] 157 | where each row holds the codes (0-1 binary values) padded to 158 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`, 159 | of each vocabulary word. 
The last entry is the true length of code 160 | and point (<= `max_depth`). 161 | """ 162 | vocab_size = len(unigram_counts) 163 | heap = [[unigram_counts[i], i] for i in range(vocab_size)] 164 | # initialize the min-priority queue, which has length `vocab_size` 165 | heapq.heapify(heap) 166 | 167 | # insert `vocab_size` - 1 internal nodes, with vocab words as leaf nodes. 168 | for i in range(vocab_size - 1): 169 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap) 170 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2]) 171 | # At this point we have a len-1 heap, and `heap[0]` will be the root of 172 | # the binary tree; where internal nodes store 173 | # 1. key (frequency) 174 | # 2. vocab index 175 | # 3. left child 176 | # 4. right child 177 | # and leaf nodes store 178 | # 1. key (frequencey) 179 | # 2. vocab index 180 | 181 | # Traverse the Huffman tree rooted at `heap[0]` in the order of 182 | # Depth-First-Search. Each stack item stores the 183 | # 1. `node` 184 | # 2. code of the `node` (list) 185 | # 3. point of the `node` (list) 186 | # 187 | # `point` is the list of vocab IDs of the internal nodes along the path from 188 | # the root up to `node` (not included) 189 | # `code` is the list of labels (0 or 1) of the edges along the path from the 190 | # root up to `node` 191 | # they are empty lists for the root node `heap[0]` 192 | node_list = [] 193 | max_depth, stack = 0, [[heap[0], [], []]] # stack: [root, codde, point] 194 | while stack: 195 | node, code, point = stack.pop() 196 | if node[1] < vocab_size: 197 | # leaf node: len(node) == 2 198 | node.extend([code, point, len(point)]) 199 | max_depth = np.maximum(len(code), max_depth) 200 | node_list.append(node) 201 | else: 202 | # internal node: len(node) == 4 203 | point = np.array(list(point) + [node[1]-vocab_size]) 204 | stack.append([node[2], np.array(list(code)+[0]), point]) 205 | stack.append([node[3], np.array(list(code)+[1]), point]) 206 | 207 | # `len(node_list[i]) = 5` 208 | node_list = sorted(node_list, key=lambda items: items[1]) 209 | # Stores the padded codes and points for each vocab word 210 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int64) 211 | for i in range(len(node_list)): 212 | length = node_list[i][4] # length of code or point 213 | codes_points[i, -1] = length 214 | codes_points[i, :length] = node_list[i][2] # code 215 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point 216 | self._max_depth = max_depth 217 | return codes_points 218 | 219 | def build_dataset(self, filenames): 220 | """Generates tensor dict mapping from tensor names to tensors. 221 | 222 | Args: 223 | filenames: list of strings, holding names of text files. 224 | 225 | Returns: 226 | dataset: a tf.data.Dataset instance, holding the a tuple of tensors 227 | (inputs, labels, progress) 228 | when arch=='skip_gram', algm=='negative_sampling' 229 | inputs: [N], labels: [N] 230 | when arch=='cbow', algm=='negative_sampling' 231 | inputs: [N, 2*window_size+1], labels: [N] 232 | when arch=='skip_gram', algm=='hierarchical_softmax' 233 | inputs: [N], labels: [N, 2*max_depth+1] 234 | when arch=='cbow', algm=='hierarchical_softmax' 235 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1] 236 | progress: [N], the percentage of sentences covered so far. Used to 237 | compute learning rate. 
238 | """ 239 | unigram_counts = self._tokenizer._unigram_counts 240 | keep_probs = self._tokenizer._keep_probs 241 | 242 | if self._algm == 'hierarchical_softmax': 243 | codes_points = tf.constant(self._build_binary_tree(unigram_counts)) 244 | elif self._algm == 'negative_sampling': 245 | codes_points = None 246 | else: 247 | raise ValueError('algm must be hierarchical_softmax or negative_sampling') 248 | 249 | keep_probs = tf.cast(tf.constant(keep_probs), 'float32') 250 | 251 | # total num of sentences (lines) across text files times num of epochs 252 | num_sents = sum([len(list(tf.io.gfile.GFile(fn))) 253 | for fn in filenames]) * self._epochs 254 | 255 | def generator_fn(): 256 | for _ in range(self._epochs): 257 | for fn in filenames: 258 | with tf.io.gfile.GFile(fn) as f: 259 | for line in f: 260 | yield self._tokenizer.encode(line) 261 | 262 | # dataset: [([int], float)] 263 | dataset = tf.data.Dataset.zip(( 264 | tf.data.Dataset.from_generator(generator_fn, tf.int64, [None]), 265 | tf.data.Dataset.from_tensor_slices(tf.range(num_sents) / num_sents))) 266 | # dataset: [([int], float)] 267 | dataset = dataset.map(lambda indices, progress: 268 | (subsample(indices, keep_probs), progress)) 269 | # dataset: [([int], float)] 270 | dataset = dataset.filter(lambda indices, progress: 271 | tf.greater(tf.size(indices), 1)) # sentence must have at least 2 tokens 272 | # dataset: [((None, None), float)] 273 | dataset = dataset.map(lambda indices, progress: (generate_instances( 274 | indices, self._arch, self._window_size, self._max_depth, codes_points), 275 | progress)) 276 | # dataset: [((None, None)), (None,)] 277 | dataset = dataset.map(lambda instances, progress: ( 278 | # replicate `progress` to size `tf.shape(instances)[:1]` 279 | instances, tf.fill(tf.shape(instances)[:1], progress))) 280 | dataset = dataset.flat_map(lambda instances, progress: 281 | # form a dataset by unstacking `instances` in the first dimension, 282 | tf.data.Dataset.from_tensor_slices((instances, progress))) 283 | # batch the dataset 284 | dataset = dataset.batch(self._batch_size, drop_remainder=True) 285 | 286 | def prepare_inputs_labels(tensor, progress): 287 | if self._arch == 'skip_gram': 288 | if self._algm == 'negative_sampling': 289 | tensor.set_shape([self._batch_size, 2]) 290 | else: 291 | tensor.set_shape([self._batch_size, 2*self._max_depth+2]) 292 | inputs = tensor[:, :1] 293 | labels = tensor[:, 1:] 294 | 295 | else: 296 | if self._algm == 'negative_sampling': 297 | tensor.set_shape([self._batch_size, 2*self._window_size+2]) 298 | else: 299 | tensor.set_shape([self._batch_size, 300 | 2*self._window_size+2*self._max_depth+2]) 301 | inputs = tensor[:, :2*self._window_size+1] 302 | labels = tensor[:, 2*self._window_size+1:] 303 | 304 | if self._arch == 'skip_gram': 305 | inputs = tf.squeeze(inputs, axis=1) 306 | if self._algm == 'negative_sampling': 307 | labels = tf.squeeze(labels, axis=1) 308 | progress = tf.cast(progress, 'float32') 309 | return inputs, labels, progress 310 | 311 | dataset = dataset.map(lambda tensor, progress: 312 | prepare_inputs_labels(tensor, progress)) 313 | 314 | return dataset 315 | 316 | 317 | def subsample(indices, keep_probs): 318 | """Filters out-of-vocabulary words and then applies subsampling on words in a 319 | sentence. Words with high frequencies have lower keep probs. 320 | 321 | Args: 322 | indices: rank-1 int tensor, the word indices within a sentence. 323 | keep_probs: rank-1 float tensor, the prob to drop the each vocabulary word. 
324 | 325 | Returns: 326 | indices: rank-1 int tensor, the word indices within a sentence after 327 | subsampling. 328 | """ 329 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID)) 330 | keep_probs = tf.gather(keep_probs, indices) 331 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1) 332 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs)) 333 | return indices 334 | 335 | 336 | def generate_instances( 337 | indices, arch, window_size, max_depth=None, codes_points=None): 338 | """Generates matrices holding word indices to be passed to Word2Vec models 339 | for each sentence. The shape and contents of output matrices depends on the 340 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling' 341 | , 'hierarchical_softmax'). 342 | 343 | It takes as input a list of word indices in a subsampled-sentence, where each 344 | word is a target word, and their context words are those within the window 345 | centered at a target word. For skip gram architecture, `num_context_words` 346 | instances are generated for a target word, and for cbow architecture, a single 347 | instance is generated for a target word. 348 | 349 | If `codes_points` is not None ('hierarchical softmax'), the word to be 350 | predicted (context word for 'skip_gram', and target word for 'cbow') are 351 | represented by their 'codes' and 'points' in the Huffman tree (See 352 | `_build_binary_tree`). 353 | 354 | Args: 355 | indices: rank-1 int tensor, the word indices within a sentence after 356 | subsampling. 357 | arch: scalar string, architecture ('skip_gram' or 'cbow'). 358 | window_size: int scalar, num of words on the left or right side of 359 | target word within a window. 360 | max_depth: (Optional) int scalar, the max depth of the Huffman tree. 361 | codes_points: (Optional) an int tensor of shape [vocab_size, 2*max_depth+1] 362 | where each row holds the codes (0-1 binary values) padded to `max_depth`, 363 | and points (non-leaf node indices) padded to `max_depth`, of each 364 | vocabulary word. The last entry is the true length of code and point 365 | (<= `max_depth`). 366 | 367 | Returns: 368 | instances: an int tensor holding word indices, with shape being 369 | when arch=='skip_gram', algm=='negative_sampling' 370 | shape: [N, 2] 371 | when arch=='cbow', algm=='negative_sampling' 372 | shape: [N, 2*window_size+2] 373 | when arch=='skip_gram', algm=='hierarchical_softmax' 374 | shape: [N, 2*max_depth+2] 375 | when arch=='cbow', algm='hierarchical_softmax' 376 | shape: [N, 2*window_size+2*max_depth+2] 377 | """ 378 | def per_target_fn(index, init_array): 379 | """Generate inputs and labels for each target word. 380 | 381 | `index` is the index of the target word in `indices`. 
382 | """ 383 | reduced_size = tf.random.uniform([], maxval=window_size, dtype='int32') 384 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index) 385 | right = tf.range(index + 1, 386 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices))) 387 | context = tf.concat([left, right], axis=0) 388 | context = tf.gather(indices, context) 389 | 390 | if arch == 'skip_gram': 391 | # replicate `indices[index]` to match the size of `context` 392 | # [N, 2] 393 | window = tf.stack([tf.fill(tf.shape(context), indices[index]), 394 | context], axis=1) 395 | elif arch == 'cbow': 396 | true_size = tf.size(context) 397 | # pad `context` to length `2 * window_size` 398 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]), 399 | [true_size, indices[index]]], axis=0) 400 | # [1, 2*window_size + 2] 401 | window = tf.expand_dims(window, axis=0) 402 | else: 403 | raise ValueError('architecture must be skip_gram or cbow.') 404 | 405 | if codes_points is not None: 406 | # [N, 2*max_depth + 2] or [1, 2*window_size+2*max_depth+2] 407 | window = tf.concat([window[:, :-1], 408 | tf.gather(codes_points, window[:, -1])], axis=1) 409 | return index + 1, init_array.write(index, window) 410 | 411 | size = tf.size(indices) 412 | # initialize a tensor array of length `tf.size(indices)` 413 | init_array = tf.TensorArray('int64', size=size, infer_shape=False) 414 | _, result_array = tf.while_loop(lambda i, ta: i < size, 415 | per_target_fn, 416 | [0, init_array], 417 | back_prop=False) 418 | instances = tf.cast(result_array.concat(), 'int64') 419 | if arch == 'skip_gram': 420 | if max_depth is None: 421 | instances.set_shape([None, 2]) 422 | else: 423 | instances.set_shape([None, 2*max_depth+2]) 424 | else: 425 | if max_depth is None: 426 | instances.set_shape([None, 2*window_size+2]) 427 | else: 428 | instances.set_shape([None, 2*window_size+2*max_depth+2]) 429 | 430 | return instances 431 | -------------------------------------------------------------------------------- /tf2.x/demo_word_similarity.py: -------------------------------------------------------------------------------- 1 | from word_vectors import WordVectors 2 | import numpy as np 3 | 4 | # syn_final.npy: storing word embeddings, numpy array of shape [vocab_size, hidden_size] 5 | # 'vocab.txt': text file storing words in vocabulary, one word per line 6 | 7 | query = ',' 8 | num_similar_words = 10 9 | syn0_final = np.load('syn0_final.npy') 10 | vocab_words = [] 11 | with open('vocab.txt') as f: 12 | vocab_words = [l.strip() for l in f] 13 | 14 | wv = WordVectors(syn0_final, vocab_words) 15 | print(wv.most_similar(query, num_similar_words)) 16 | -------------------------------------------------------------------------------- /tf2.x/model.py: -------------------------------------------------------------------------------- 1 | """Defines word2vec model using tf.keras API. 2 | """ 3 | import tensorflow as tf 4 | 5 | from dataset import WordTokenizer 6 | from dataset import Word2VecDatasetBuilder 7 | 8 | 9 | class Word2VecModel(tf.keras.Model): 10 | """Word2Vec model.""" 11 | def __init__(self, 12 | unigram_counts, 13 | arch='skip_gram', 14 | algm='negative_sampling', 15 | hidden_size=300, 16 | batch_size=256, 17 | negatives=5, 18 | power=0.75, 19 | alpha=0.025, 20 | min_alpha=0.0001, 21 | add_bias=True, 22 | random_seed=0): 23 | """Constructor. 24 | 25 | Args: 26 | unigram_counts: a list of ints, the counts of word tokens in the corpus. 27 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 
28 | algm: string scalar, training algorithm ('negative_sampling' or 29 | 'hierarchical_softmax'). 30 | hidden_size: int scalar, length of word vector. 31 | batch_size: int scalar, batch size. 32 | negatives: int scalar, num of negative words to sample. 33 | power: float scalar, distortion for negative sampling. 34 | alpha: float scalar, initial learning rate. 35 | min_alpha: float scalar, final learning rate. 36 | add_bias: bool scalar, whether to add bias term to dotproduct 37 | between syn0 and syn1 vectors. 38 | random_seed: int scalar, random_seed. 39 | """ 40 | super(Word2VecModel, self).__init__() 41 | self._unigram_counts = unigram_counts 42 | self._arch = arch 43 | self._algm = algm 44 | self._hidden_size = hidden_size 45 | self._vocab_size = len(unigram_counts) 46 | self._batch_size = batch_size 47 | self._negatives = negatives 48 | self._power = power 49 | self._alpha = alpha 50 | self._min_alpha = min_alpha 51 | self._add_bias = add_bias 52 | self._random_seed = random_seed 53 | 54 | self._input_size = (self._vocab_size if self._algm == 'negative_sampling' 55 | else self._vocab_size - 1) 56 | 57 | self.add_weight('syn0', 58 | shape=[self._vocab_size, self._hidden_size], 59 | initializer=tf.keras.initializers.RandomUniform( 60 | minval=-0.5/self._hidden_size, 61 | maxval=0.5/self._hidden_size)) 62 | 63 | self.add_weight('syn1', 64 | shape=[self._input_size, self._hidden_size], 65 | initializer=tf.keras.initializers.RandomUniform( 66 | minval=-0.1, maxval=0.1)) 67 | 68 | self.add_weight('biases', 69 | shape=[self._input_size], 70 | initializer=tf.keras.initializers.Zeros()) 71 | 72 | def call(self, inputs, labels): 73 | """Runs the forward pass to compute loss. 74 | 75 | Args: 76 | inputs: int tensor of shape [batch_size] (skip_gram) or 77 | [batch_size, 2*window_size+1] (cbow) 78 | labels: int tensor of shape [batch_size] (negative_sampling) or 79 | [batch_size, 2*max_depth+1] (hierarchical_softmax) 80 | 81 | Returns: 82 | loss: float tensor, cross entropy loss. 83 | """ 84 | if self._algm == 'negative_sampling': 85 | loss = self._negative_sampling_loss(inputs, labels) 86 | elif self._algm == 'hierarchical_softmax': 87 | loss = self._hierarchical_softmax_loss(inputs, labels) 88 | return loss 89 | 90 | def _negative_sampling_loss(self, inputs, labels): 91 | """Builds the loss for negative sampling. 92 | 93 | Args: 94 | inputs: int tensor of shape [batch_size] (skip_gram) or 95 | [batch_size, 2*window_size+1] (cbow) 96 | labels: int tensor of shape [batch_size] 97 | 98 | Returns: 99 | loss: float tensor of shape [batch_size, negatives + 1]. 
100 | """ 101 | _, syn1, biases = self.weights 102 | 103 | sampled_values = tf.random.fixed_unigram_candidate_sampler( 104 | true_classes=tf.expand_dims(labels, 1), 105 | num_true=1, 106 | num_sampled=self._batch_size*self._negatives, 107 | unique=True, 108 | range_max=len(self._unigram_counts), 109 | distortion=self._power, 110 | unigrams=self._unigram_counts) 111 | 112 | sampled = sampled_values.sampled_candidates 113 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives]) 114 | inputs_syn0 = self._get_inputs_syn0(inputs) # [batch_size, hidden_size] 115 | true_syn1 = tf.gather(syn1, labels) # [batch_size, hidden_size] 116 | # [batch_size, negatives, hidden_size] 117 | sampled_syn1 = tf.gather(syn1, sampled_mat) 118 | # [batch_size] 119 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1) 120 | # [batch_size, negatives] 121 | sampled_logits = tf.einsum('ijk,ikl->il', tf.expand_dims(inputs_syn0, 1), 122 | tf.transpose(sampled_syn1, (0, 2, 1))) 123 | 124 | if self._add_bias: 125 | # [batch_size] 126 | true_logits += tf.gather(biases, labels) 127 | # [batch_size, negatives] 128 | sampled_logits += tf.gather(biases, sampled_mat) 129 | 130 | # [batch_size] 131 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 132 | labels=tf.ones_like(true_logits), logits=true_logits) 133 | # [batch_size, negatives] 134 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 135 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits) 136 | 137 | loss = tf.concat( 138 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1) 139 | return loss 140 | 141 | def _hierarchical_softmax_loss(self, inputs, labels): 142 | """Builds the loss for hierarchical softmax. 143 | 144 | Args: 145 | inputs: int tensor of shape [batch_size] (skip_gram) or 146 | [batch_size, 2*window_size+1] (cbow) 147 | labels: int tensor of shape [batch_size, 2*max_depth+1] 148 | 149 | Returns: 150 | loss: float tensor of shape [sum_of_code_len] 151 | """ 152 | _, syn1, biases = self.weights 153 | 154 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(inputs)) 155 | codes_points_list = tf.unstack(labels) 156 | max_depth = (labels.shape.as_list()[1] - 1) // 2 157 | loss = [] 158 | for i in range(self._batch_size): 159 | inputs_syn0 = inputs_syn0_list[i] # [hidden_size] 160 | codes_points = codes_points_list[i] # [2*max_depth+1] 161 | true_size = codes_points[-1] 162 | 163 | codes = codes_points[:true_size] 164 | points = codes_points[max_depth:max_depth+true_size] 165 | logits = tf.reduce_sum( 166 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1) 167 | if self._add_bias: 168 | logits += tf.gather(biases, points) 169 | 170 | # [true_size] 171 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits( 172 | labels=tf.cast(codes, 'float32'), logits=logits)) 173 | loss = tf.concat(loss, axis=0) 174 | return loss 175 | 176 | def _get_inputs_syn0(self, inputs): 177 | """Builds the activations of hidden layer given input words embeddings 178 | `syn0` and input word indices. 
179 | 180 | Args: 181 | inputs: int tensor of shape [batch_size] (skip_gram) or 182 | [batch_size, 2*window_size+1] (cbow) 183 | 184 | Returns: 185 | inputs_syn0: [batch_size, hidden_size] 186 | """ 187 | # syn0: [vocab_size, hidden_size] 188 | syn0, _, _ = self.weights 189 | if self._arch == 'skip_gram': 190 | inputs_syn0 = tf.gather(syn0, inputs) # [batch_size, hidden_size] 191 | else: 192 | inputs_syn0 = [] 193 | contexts_list = tf.unstack(inputs) 194 | for i in range(self._batch_size): 195 | contexts = contexts_list[i] 196 | context_words = contexts[:-1] 197 | true_size = contexts[-1] 198 | inputs_syn0.append( 199 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0)) 200 | inputs_syn0 = tf.stack(inputs_syn0) 201 | 202 | return inputs_syn0 203 | -------------------------------------------------------------------------------- /tf2.x/run_training.py: -------------------------------------------------------------------------------- 1 | """Train a word2vec model to obtain word embedding vectors. 2 | 3 | There are a total of four combination of architectures and training algorithms 4 | that the model can be trained with: 5 | 6 | architecture: 7 | - skip_gram 8 | - cbow (continuous bag-of-words) 9 | 10 | training algorithm 11 | - negative_sampling 12 | - hierarchical_softmax 13 | """ 14 | import os 15 | 16 | import tensorflow as tf 17 | import numpy as np 18 | from absl import app 19 | from absl import flags 20 | 21 | from dataset import WordTokenizer 22 | from dataset import Word2VecDatasetBuilder 23 | from model import Word2VecModel 24 | from word_vectors import WordVectors 25 | 26 | import utils 27 | 28 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).') 29 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm ' 30 | '(negative_sampling or hierarchical_softmax).') 31 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate thru corpus.') 32 | flags.DEFINE_integer('batch_size', 256, 'Batch size.') 33 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. 
If > 0, ' 34 | 'the top `max_vocab_size` most frequent words will be kept in vocabulary.') 35 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` will ' 36 | 'not be included in the vocabulary.') 37 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.') 38 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side' 39 | ' of target word within a window.') 40 | 41 | flags.DEFINE_integer('hidden_size', 300, 'Length of word vector.') 42 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.') 43 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.') 44 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.') 45 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.') 46 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct ' 47 | 'between syn0 and syn1 vectors.') 48 | 49 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to ' 50 | ' log the value of loss to be minimized.') 51 | flags.DEFINE_list( 52 | 'filenames', None, 'Names of comma-separated input text files.') 53 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.') 54 | 55 | FLAGS = flags.FLAGS 56 | 57 | 58 | def main(_): 59 | arch = FLAGS.arch 60 | algm = FLAGS.algm 61 | epochs = FLAGS.epochs 62 | batch_size = FLAGS.batch_size 63 | max_vocab_size = FLAGS.max_vocab_size 64 | min_count = FLAGS.min_count 65 | sample = FLAGS.sample 66 | window_size = FLAGS.window_size 67 | hidden_size = FLAGS.hidden_size 68 | negatives = FLAGS.negatives 69 | power = FLAGS.power 70 | alpha = FLAGS.alpha 71 | min_alpha = FLAGS.min_alpha 72 | add_bias = FLAGS.add_bias 73 | log_per_steps = FLAGS.log_per_steps 74 | filenames = FLAGS.filenames 75 | out_dir = FLAGS.out_dir 76 | 77 | tokenizer = WordTokenizer( 78 | max_vocab_size=max_vocab_size, min_count=min_count, sample=sample) 79 | tokenizer.build_vocab(filenames) 80 | 81 | builder = Word2VecDatasetBuilder(tokenizer, 82 | arch=arch, 83 | algm=algm, 84 | epochs=epochs, 85 | batch_size=batch_size, 86 | window_size=window_size) 87 | dataset = builder.build_dataset(filenames) 88 | word2vec = Word2VecModel(tokenizer.unigram_counts, 89 | arch=arch, 90 | algm=algm, 91 | hidden_size=hidden_size, 92 | batch_size=batch_size, 93 | negatives=negatives, 94 | power=power, 95 | alpha=alpha, 96 | min_alpha=min_alpha, 97 | add_bias=add_bias) 98 | 99 | train_step_signature = utils.get_train_step_signature( 100 | arch, algm, batch_size, window_size, builder._max_depth) 101 | optimizer = tf.keras.optimizers.SGD(1.0) 102 | 103 | @tf.function(input_signature=train_step_signature) 104 | def train_step(inputs, labels, progress): 105 | loss = word2vec(inputs, labels) 106 | gradients = tf.gradients(loss, word2vec.trainable_variables) 107 | 108 | learning_rate = tf.maximum(alpha * (1 - progress[0]) + 109 | min_alpha * progress[0], min_alpha) 110 | 111 | if hasattr(gradients[0], '_values'): 112 | gradients[0]._values *= learning_rate 113 | else: 114 | gradients[0] *= learning_rate 115 | 116 | if hasattr(gradients[1], '_values'): 117 | gradients[1]._values *= learning_rate 118 | else: 119 | gradients[1] *= learning_rate 120 | 121 | if hasattr(gradients[2], '_values'): 122 | gradients[2]._values *= learning_rate 123 | else: 124 | gradients[2] *= learning_rate 125 | 126 | optimizer.apply_gradients( 127 | zip(gradients, word2vec.trainable_variables)) 128 | 129 | return loss, learning_rate 130 | 131 | average_loss = 0. 
132 | for step, (inputs, labels, progress) in enumerate(dataset): 133 | loss, learning_rate = train_step(inputs, labels, progress) 134 | average_loss += loss.numpy().mean() 135 | if step % log_per_steps == 0: 136 | if step > 0: 137 | average_loss /= log_per_steps 138 | print('step:', step, 'average_loss:', average_loss, 139 | 'learning_rate:', learning_rate.numpy()) 140 | average_loss = 0. 141 | 142 | syn0_final = word2vec.weights[0].numpy() 143 | np.save(os.path.join(FLAGS.out_dir, 'syn0_final'), syn0_final) 144 | with tf.io.gfile.GFile(os.path.join(FLAGS.out_dir, 'vocab.txt'), 'w') as f: 145 | for w in tokenizer.table_words: 146 | f.write(w + '\n') 147 | print('Word embeddings saved to', 148 | os.path.join(FLAGS.out_dir, 'syn0_final.npy')) 149 | print('Vocabulary saved to', os.path.join(FLAGS.out_dir, 'vocab.txt')) 150 | 151 | 152 | if __name__ == '__main__': 153 | flags.mark_flag_as_required('filenames') 154 | app.run(main) 155 | -------------------------------------------------------------------------------- /tf2.x/sample_corpus.txt: -------------------------------------------------------------------------------- 1 | # one sentence per line, with words (lower case) delimited by single space 2 | 3 | with all this stuff going down at the moment with mj i 've started listening to his music , watching the odd documentary here and there , watched the wiz and watched moonwalker again . 4 | maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent . 5 | moonwalker is part biography , part feature film which i remember going to see at the cinema when it was originally released . 6 | some of it has subtle messages about mj 's feeling towards the press and also the obvious message of drugs are bad m'kay . 7 | visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring . 8 | -------------------------------------------------------------------------------- /tf2.x/utils.py: -------------------------------------------------------------------------------- 1 | """Defines utility functions. 2 | """ 3 | import tensorflow as tf 4 | 5 | 6 | def get_train_step_signature( 7 | arch, algm, batch_size, window_size=None, max_depth=None): 8 | """Get the training step signatures for `inputs`, `labels` and `progress` 9 | tensor. 10 | 11 | Args: 12 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 13 | algm: string scalar, training algorithm ('negative_sampling' or 14 | 'hierarchical_softmax'). 15 | 16 | Returns: 17 | train_step_signature: a list of three tf.TensorSpec instances, 18 | specifying the tensor spec (shape and dtype) for `inputs`, `labels` and 19 | `progress`. 
20 | """ 21 | if arch=='skip_gram': 22 | inputs_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64') 23 | elif arch == 'cbow': 24 | inputs_spec = tf.TensorSpec( 25 | shape=(batch_size, 2*window_size+1), dtype='int64') 26 | else: 27 | raise ValueError('`arch` must be either "skip_gram" or "cbow".') 28 | 29 | if algm == 'negative_sampling': 30 | labels_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64') 31 | elif algm == 'hierarchical_softmax': 32 | labels_spec = tf.TensorSpec( 33 | shape=(batch_size, 2*max_depth+1), dtype='int64') 34 | else: 35 | raise ValueError('`algm` must be either "negative_sampling" or ' 36 | '"hierarchical_softmax".') 37 | 38 | progress_spec = tf.TensorSpec(shape=(batch_size,), dtype='float32') 39 | 40 | train_step_signature = [inputs_spec, labels_spec, progress_spec] 41 | return train_step_signature 42 | -------------------------------------------------------------------------------- /tf2.x/word_vectors.py: -------------------------------------------------------------------------------- 1 | """Defines wrapper class for final word vectors. 2 | """ 3 | import heapq 4 | import numpy as np 5 | 6 | 7 | class WordVectors(object): 8 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving 9 | word vector, and most similar words given a query word. 10 | """ 11 | def __init__(self, syn0_final, vocab): 12 | """Constructor. 13 | 14 | Args: 15 | syn0_final: numpy array of shape [vocab_size, embed_size], final word 16 | embeddings. 17 | vocab: a list of strings, holding vocabulary words. 18 | """ 19 | self._syn0_final = syn0_final 20 | self._vocab = vocab 21 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)]) 22 | 23 | def __contains__(self, word): 24 | return word in self._rev_vocab 25 | 26 | def __getitem__(self, word): 27 | return self._syn0_final[self._rev_vocab[word]] 28 | 29 | def most_similar(self, word, k): 30 | """Finds the top-k words with smallest cosine distances w.r.t `word`. 31 | 32 | Args: 33 | word: string scalar, the query word. 34 | k: int scalar, num of words most similar to `word`. 35 | 36 | Returns: 37 | a list of 2-tuples with word and cosine similarities. 38 | """ 39 | if word not in self._rev_vocab: 40 | raise ValueError("Word '%s' not found in the vocabulary" % word) 41 | if k >= self._syn0_final.shape[0]: 42 | raise ValueError("k = %d greater than vocabulary size" % k) 43 | 44 | v0 = self._syn0_final[self._rev_vocab[word]] 45 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) * 46 | np.linalg.norm(self._syn0_final, axis=1)) 47 | 48 | # maintain a sliding min-heap to keep track of k+1 largest elements 49 | min_pq = list(zip(sims[:k+1], range(k+1))) 50 | heapq.heapify(min_pq) 51 | for i in np.arange(k + 1, len(self._vocab)): 52 | if sims[i] > min_pq[0][0]: 53 | min_pq[0] = sims[i], i 54 | heapq.heapify(min_pq) 55 | min_pq = sorted(min_pq, key=lambda p: -p[0]) 56 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]] 57 | -------------------------------------------------------------------------------- /word2vec.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | 6 | 7 | class Word2VecModel(object): 8 | """Word2VecModel. 9 | """ 10 | 11 | def __init__(self, arch, algm, embed_size, batch_size, negatives, power, 12 | alpha, min_alpha, add_bias, random_seed): 13 | """Constructor. 14 | 15 | Args: 16 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 
17 | algm: string scalar, training algorithm ('negative_sampling' or 18 | 'hierarchical_softmax'). 19 | embed_size: int scalar, length of word vector. 20 | batch_size: int scalar, batch size. 21 | negatives: int scalar, num of negative words to sample. 22 | power: float scalar, distortion for negative sampling. 23 | alpha: float scalar, initial learning rate. 24 | min_alpha: float scalar, final learning rate. 25 | add_bias: bool scalar, whether to add bias term to dotproduct 26 | between syn0 and syn1 vectors. 27 | random_seed: int scalar, random_seed. 28 | """ 29 | self._arch = arch 30 | self._algm = algm 31 | self._embed_size = embed_size 32 | self._batch_size = batch_size 33 | self._negatives = negatives 34 | self._power = power 35 | self._alpha = alpha 36 | self._min_alpha = min_alpha 37 | self._add_bias = add_bias 38 | self._random_seed = random_seed 39 | 40 | self._syn0 = None 41 | 42 | @property 43 | def syn0(self): 44 | return self._syn0 45 | 46 | def _build_loss(self, inputs, labels, unigram_counts, scope=None): 47 | """Builds the graph that leads from data tensors (`inputs`, `labels`) 48 | to loss. Has the side effect of setting attribute `syn0`. 49 | 50 | Args: 51 | inputs: int tensor of shape [batch_size] (skip_gram) or 52 | [batch_size, 2*window_size+1] (cbow) 53 | labels: int tensor of shape [batch_size] (negative_sampling) or 54 | [batch_size, 2*max_depth+1] (hierarchical_softmax) 55 | unigram_count: list of int, holding word counts. Index of each entry 56 | is the same as the word index into the vocabulary. 57 | scope: string scalar, scope name. 58 | 59 | Returns: 60 | loss: float tensor, cross entropy loss. 61 | """ 62 | syn0, syn1, biases = self._create_embeddings(len(unigram_counts)) 63 | self._syn0 = syn0 64 | with tf.variable_scope(scope, 'Loss', [inputs, labels, syn0, syn1, biases]): 65 | if self._algm == 'negative_sampling': 66 | loss = self._negative_sampling_loss( 67 | unigram_counts, inputs, labels, syn0, syn1, biases) 68 | elif self._algm == 'hierarchical_softmax': 69 | loss = self._hierarchical_softmax_loss( 70 | inputs, labels, syn0, syn1, biases) 71 | return loss 72 | 73 | def train(self, dataset, filenames): 74 | """Adds training related ops to the graph. 75 | 76 | Args: 77 | dataset: a `Word2VecDataset` instance. 78 | filenames: a list of strings, holding names of text files. 79 | 80 | Returns: 81 | to_be_run_dict: dict mapping from names to tensors/operations, holding 82 | the following entries: 83 | { 'grad_update_op': optimization ops, 84 | 'loss': cross entropy loss, 85 | 'learning_rate': float-scalar learning rate} 86 | """ 87 | tensor_dict = dataset.get_tensor_dict(filenames) 88 | inputs, labels = tensor_dict['inputs'], tensor_dict['labels'] 89 | global_step = tf.train.get_or_create_global_step() 90 | learning_rate = tf.maximum(self._alpha * (1 - tensor_dict['progress'][0]) + 91 | self._min_alpha * tensor_dict['progress'][0], self._min_alpha) 92 | 93 | loss = self._build_loss(inputs, labels, dataset.unigram_counts) 94 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 95 | grad_update_op = optimizer.minimize(loss, global_step=global_step) 96 | 97 | to_be_run_dict = {'grad_update_op': grad_update_op, 98 | 'loss': loss, 99 | 'learning_rate': learning_rate} 100 | return to_be_run_dict 101 | 102 | def _create_embeddings(self, vocab_size, scope=None): 103 | """Creates initial word embedding variables. 104 | 105 | Args: 106 | vocab_size: int scalar, num of words in vocabulary. 107 | scope: string scalar, scope name. 
108 | 109 | Returns: 110 | syn0: float tensor of shape [vocab_size, embed_size], input word 111 | embeddings (i.e. weights of hidden layer). 112 | syn1: float tensor of shape [syn1_rows, embed_size], output word 113 | embeddings (i.e. weights of output layer). 114 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 115 | """ 116 | syn1_rows = (vocab_size if self._algm == 'negative_sampling' 117 | else vocab_size - 1) 118 | with tf.variable_scope(scope, 'Embedding'): 119 | syn0 = tf.get_variable('syn0', initializer=tf.random_uniform([vocab_size, 120 | self._embed_size], -0.5/self._embed_size, 0.5/self._embed_size, 121 | seed=self._random_seed)) 122 | syn1 = tf.get_variable('syn1', initializer=tf.random_uniform([syn1_rows, 123 | self._embed_size], -0.1, 0.1)) 124 | biases = tf.get_variable('biases', initializer=tf.zeros([syn1_rows])) 125 | return syn0, syn1, biases 126 | 127 | def _negative_sampling_loss( 128 | self, unigram_counts, inputs, labels, syn0, syn1, biases): 129 | """Builds the loss for negative sampling. 130 | 131 | Args: 132 | unigram_counts: list of int, holding word counts. Index of each entry 133 | is the same as the word index into the vocabulary. 134 | inputs: int tensor of shape [batch_size] (skip_gram) or 135 | [batch_size, 2*window_size+1] (cbow) 136 | labels: int tensor of shape [batch_size] 137 | syn0: float tensor of shape [vocab_size, embed_size], input word 138 | embeddings (i.e. weights of hidden layer). 139 | syn1: float tensor of shape [syn1_rows, embed_size], output word 140 | embeddings (i.e. weights of output layer). 141 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 142 | 143 | Returns: 144 | loss: float tensor of shape [batch_size, sample_size + 1]. 145 | """ 146 | sampled_values = tf.nn.fixed_unigram_candidate_sampler( 147 | true_classes=tf.expand_dims(labels, 1), 148 | num_true=1, 149 | num_sampled=self._batch_size*self._negatives, 150 | unique=True, 151 | range_max=len(unigram_counts), 152 | distortion=self._power, 153 | unigrams=unigram_counts) 154 | 155 | sampled = sampled_values.sampled_candidates 156 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives]) 157 | inputs_syn0 = self._get_inputs_syn0(syn0, inputs) # [N, D] 158 | true_syn1 = tf.gather(syn1, labels) # [N, D] 159 | sampled_syn1 = tf.gather(syn1, sampled_mat) # [N, K, D] 160 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1) # [N] 161 | sampled_logits = tf.reduce_sum( 162 | tf.multiply(tf.expand_dims(inputs_syn0, 1), sampled_syn1), 2) # [N, K] 163 | 164 | if self._add_bias: 165 | true_logits += tf.gather(biases, labels) # [N] 166 | sampled_logits += tf.gather(biases, sampled_mat) # [N, K] 167 | 168 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 169 | labels=tf.ones_like(true_logits), logits=true_logits) 170 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 171 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits) 172 | loss = tf.concat( 173 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1) 174 | return loss 175 | 176 | def _hierarchical_softmax_loss(self, inputs, labels, syn0, syn1, biases): 177 | """Builds the loss for hierarchical softmax. 178 | 179 | Args: 180 | inputs: int tensor of shape [batch_size] (skip_gram) or 181 | [batch_size, 2*window_size+1] (cbow) 182 | labels: int tensor of shape [batch_size, 2*max_depth+1] 183 | syn0: float tensor of shape [vocab_size, embed_size], input word 184 | embeddings (i.e. weights of hidden layer). 
185 | syn1: float tensor of shape [syn1_rows, embed_size], output word 186 | embeddings (i.e. weights of output layer). 187 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 188 | 189 | Returns: 190 | loss: float tensor of shape [sum_of_code_len] 191 | """ 192 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(syn0, inputs)) 193 | codes_points_list = tf.unstack(labels) 194 | max_depth = (labels.shape.as_list()[1] - 1) // 2 195 | loss = [] 196 | for inputs_syn0, codes_points in zip(inputs_syn0_list, codes_points_list): 197 | true_size = codes_points[-1] 198 | codes = codes_points[:true_size] 199 | points = codes_points[max_depth:max_depth+true_size] 200 | 201 | logits = tf.reduce_sum( 202 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1) 203 | if self._add_bias: 204 | logits += tf.gather(biases, points) 205 | 206 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits( 207 | labels=tf.to_float(codes), logits=logits)) 208 | loss = tf.concat(loss, axis=0) 209 | return loss 210 | 211 | def _get_inputs_syn0(self, syn0, inputs): 212 | """Builds the activations of hidden layer given input words embeddings 213 | `syn0` and input word indices. 214 | 215 | Args: 216 | syn0: float tensor of shape [vocab_size, embed_size] 217 | inputs: int tensor of shape [batch_size] (skip_gram) or 218 | [batch_size, 2*window_size+1] (cbow) 219 | 220 | Returns: 221 | inputs_syn0: [batch_size, embed_size] 222 | """ 223 | if self._arch == 'skip_gram': 224 | inputs_syn0 = tf.gather(syn0, inputs) 225 | else: 226 | inputs_syn0 = [] 227 | contexts_list = tf.unstack(inputs) 228 | for contexts in contexts_list: 229 | context_words = contexts[:-1] 230 | true_size = contexts[-1] 231 | inputs_syn0.append( 232 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0)) 233 | inputs_syn0 = tf.stack(inputs_syn0) 234 | return inputs_syn0 235 | 236 | 237 | class WordVectors(object): 238 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving 239 | word vector, and most similar words given a query word. 240 | """ 241 | def __init__(self, syn0_final, vocab): 242 | """Constructor. 243 | 244 | Args: 245 | syn0_final: numpy array of shape [vocab_size, embed_size], final word 246 | embeddings. 247 | vocab_words: a list of strings, holding vocabulary words. 248 | """ 249 | self._syn0_final = syn0_final 250 | self._vocab = vocab 251 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)]) 252 | 253 | def __contains__(self, word): 254 | return word in self._rev_vocab 255 | 256 | def __getitem__(self, word): 257 | return self._syn0_final[self._rev_vocab[word]] 258 | 259 | def most_similar(self, word, k): 260 | """Finds the top-k words with smallest cosine distances w.r.t `word`. 261 | 262 | Args: 263 | word: string scalar, the query word. 264 | k: int scalar, num of words most similar to `word`. 265 | 266 | Returns: 267 | a list of 2-tuples with word and cosine similarities. 
268 | """ 269 | if word not in self._rev_vocab: 270 | raise ValueError("Word '%s' not found in the vocabulary" % word) 271 | if k >= self._syn0_final.shape[0]: 272 | raise ValueError("k = %d greater than vocabulary size" % k) 273 | 274 | v0 = self._syn0_final[self._rev_vocab[word]] 275 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) * 276 | np.linalg.norm(self._syn0_final, axis=1)) 277 | 278 | # maintain a sliding min-heap to keep track of k+1 largest elements 279 | min_pq = list(zip(sims[:k+1], range(k+1))) 280 | heapq.heapify(min_pq) 281 | for i in np.arange(k + 1, len(self._vocab)): 282 | if sims[i] > min_pq[0][0]: 283 | min_pq[0] = sims[i], i 284 | heapq.heapify(min_pq) 285 | min_pq = sorted(min_pq, key=lambda p: -p[0]) 286 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]] 287 | 288 | --------------------------------------------------------------------------------
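
The `cbow` branch of `_get_inputs_syn0` (both in `word2vec.py` above and in the tf2.x model) expects each input row to pack the context-word indices followed by one final entry holding the number of valid contexts, and it averages only that many embedding vectors. Below is a small NumPy illustration of that layout; the indices and the toy embedding matrix are made up for illustration and are not part of the repository.

```python
import numpy as np

# Toy stand-in for syn0: [vocab_size, embed_size] = [7, 4], random values.
np.random.seed(0)
syn0 = np.random.uniform(-0.01, 0.01, size=(7, 4))

# One cbow input row for window_size=2: 2*window_size context slots plus a
# final entry giving the number of valid context words (here 3; the remaining
# slot is padding and is ignored).
row = np.array([5, 1, 3, 0, 3])
context_words, true_size = row[:-1], row[-1]

# Mirror of the gather/reduce_mean in `_get_inputs_syn0`: average only the
# embeddings of the valid context words.
hidden = syn0[context_words[:true_size]].mean(axis=0)
print(hidden.shape)  # (4,) -- one averaged vector per target word
```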
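The tensor shapes fed to `train_step` in `tf2.x/run_training.py` are fixed by `get_train_step_signature` in `tf2.x/utils.py`. The sketch below (assumed to be run from inside `tf2.x/`) shows the specs for two of the four architecture/algorithm combinations; `max_depth=16` is an arbitrary illustrative value, since the real value comes from the Huffman tree built by `Word2VecDatasetBuilder`.

```python
from utils import get_train_step_signature

# skip_gram + negative_sampling: flat int64 inputs/labels of shape [batch_size].
sig = get_train_step_signature('skip_gram', 'negative_sampling', batch_size=256)
# -> [TensorSpec((256,), int64), TensorSpec((256,), int64),
#     TensorSpec((256,), float32)]

# cbow + hierarchical_softmax: inputs are [batch_size, 2*window_size+1] and
# labels are [batch_size, 2*max_depth+1].
sig = get_train_step_signature('cbow', 'hierarchical_softmax', batch_size=256,
                               window_size=10, max_depth=16)
# -> [TensorSpec((256, 21), int64), TensorSpec((256, 33), int64),
#     TensorSpec((256,), float32)]
```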
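Inside `train_step`, the gradients are scaled by a learning rate that decays linearly with the `progress` tensor (the fraction of the training corpus consumed so far) and is clipped at `--min_alpha`. A plain-Python sketch of that schedule, using the default flag values:

```python
alpha, min_alpha = 0.025, 0.0001  # defaults of --alpha and --min_alpha

def decayed_lr(progress):
    """Same formula as the tf.maximum(...) expression in train_step."""
    return max(alpha * (1.0 - progress) + min_alpha * progress, min_alpha)

print(decayed_lr(0.0))  # 0.025   at the start of training
print(decayed_lr(0.5))  # 0.01255 halfway through the corpus
print(decayed_lr(1.0))  # 0.0001  at the end
```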
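Finally, once training finishes, `tf2.x/run_training.py` writes `syn0_final.npy` and `vocab.txt` to `--out_dir`, and `tf2.x/word_vectors.py` provides the query API. Here is a minimal loading sketch (assumed to be run from inside `tf2.x/`); the output directory matches the default flag value, and the query word is just an example that assumes the word occurs in your training corpus.

```python
import os

import numpy as np
from word_vectors import WordVectors

out_dir = '/tmp/word2vec'  # default of --out_dir; adjust to your run
syn0_final = np.load(os.path.join(out_dir, 'syn0_final.npy'))
with open(os.path.join(out_dir, 'vocab.txt')) as f:
    vocab = [line.strip() for line in f]

wv = WordVectors(syn0_final, vocab)
if 'actor' in wv:                         # __contains__ checks the vocabulary
    print(wv['actor'].shape)              # (hidden_size,), e.g. (300,)
    print(wv.most_similar('actor', 10))   # top-10 (word, cosine similarity) pairs
```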