├── README.md
├── __init__.py
├── dataset.py
├── files
│   ├── cbow_hs.png
│   ├── cbow_ns.png
│   ├── huffman.png
│   ├── sent.png
│   ├── sg_hs.png
│   └── sg_ns.png
├── run_training.py
├── tf2.x
│   ├── README.md
│   ├── dataset.py
│   ├── demo_word_similarity.py
│   ├── model.py
│   ├── run_training.py
│   ├── sample_corpus.txt
│   ├── utils.py
│   └── word_vectors.py
└── word2vec.py
/README.md:
--------------------------------------------------------------------------------
1 | # Word2Vec: Learning distributed word representation from unlabeled text.
2 |
3 | **Update**: [TensorFlow 2.x](tf2.x)
4 |
5 | Word2Vec is a classic model for learning distributed word representations from large unlabeled datasets. Many implementations have appeared since its introduction (e.g. the original C implementation and the gensim implementation). This is an attempt to reimplement word2vec in TensorFlow using the `tf.data.Dataset` APIs, a recommended way to streamline data preprocessing for TensorFlow models.
6 |
7 | ### Usage
8 | 1. Clone the repository.
9 | ```
10 | git clone git@github.com:chao-ji/tf-word2vec.git
11 | ```
12 | 2. Prepare your data.
13 | Your data should be one or more text files, where each line contains a single sentence and words are delimited by spaces.
14 |
15 | 3. Parameter settings.
16 | This implementation allows you to train the model with either the *skip gram* or the *continuous bag-of-words* architecture (`--arch`), and to perform training using *negative sampling* or *hierarchical softmax* (`--algm`). To see the full list of parameters, run `python run_training.py --help`.
17 |
18 | 4. Run.
19 | Example:
20 | ```
21 | python run_training.py \
22 |   --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \
23 |   --out_dir=/PATH/TO/OUT_DIR/ \
24 |   --epochs=5 \
25 |   --batch_size=64 \
26 |   --window_size=5
27 | ```
28 | The vocabulary words and word embeddings will be saved to `vocab.txt` and `embed.npy` (can be loaded using `np.load`).
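
To sanity-check the training output, here is a minimal sketch (not part of this repository) that loads `vocab.txt` and `embed.npy` and ranks words by cosine similarity; the `most_similar` helper below is a hypothetical name used only for illustration:

```
import numpy as np

# Load the embeddings and vocabulary saved by run_training.py.
embed = np.load('embed.npy')                       # [vocab_size, embed_size]
with open('vocab.txt', encoding='utf-8') as f:
  vocab = [line.strip() for line in f]
word2id = {w: i for i, w in enumerate(vocab)}

def most_similar(query, k=10):
  # Rank all vocabulary words by cosine similarity with the query word.
  normed = embed / np.linalg.norm(embed, axis=1, keepdims=True)
  sims = normed.dot(normed[word2id[query]])
  top = np.argsort(-sims)[1:k + 1]                 # skip the query word itself
  return [(vocab[i], float(sims[i])) for i in top]

print(most_similar('actor'))
```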
29 |
30 | ### Sample results
31 |
32 | The model was trained on the IMDB movie review dataset using the following parameters:
33 |
34 | ```
35 | --arch=skip_gram --algm=negative_sampling --batch_size=256 --max_vocab_size=0 --min_count=10 --sample=1e-3 --window_size=10 --embed_size=300 --negatives=5 --power=0.75 --alpha=0.025 --min_alpha=0.0001 --epochs=5
36 | ```
37 |
38 | Below is a sample list of queries with their most similar words.
39 | ```
40 | query: actor
41 | [('actors', 0.5314413),
42 | ('actress', 0.52641004),
43 | ('performer', 0.43144277),
44 | ('role', 0.40702546),
45 | ('comedian', 0.3910208),
46 | ('performance', 0.37695402),
47 | ('versatile', 0.35130078),
48 | ('actresses', 0.32896513),
49 | ('cast', 0.3219274),
50 | ('performers', 0.31659046)]
51 | ```
52 |
53 | ```
54 | query: .
55 | [('!', 0.6234603),
56 | ('?', 0.39236775),
57 | ('and', 0.36783764),
58 | (',', 0.3090561),
59 | ('but', 0.28012913),
60 | ('which', 0.23897173),
61 | (';', 0.22881404),
62 | ('cornerstone', 0.20761433),
63 | ('although', 0.20554386),
64 | ('...', 0.19846405)]
65 |
66 | ```
67 |
68 | ```
69 | query: ask
70 | [('asked', 0.54287535),
71 | ('asking', 0.5349437),
72 | ('asks', 0.5262491),
73 | ('question', 0.4397335),
74 | ('answer', 0.3868001),
75 | ('questions', 0.37007764),
76 | ('begs', 0.35407144),
77 | ('wonder', 0.3537388),
78 | ('answers', 0.3410588),
79 | ('wondering', 0.32832426)]
80 | ```
81 |
82 | ```
83 | query: you
84 | [('yourself', 0.51918006),
85 | ('u', 0.48620683),
86 | ('your', 0.47644556),
87 | ("'ll", 0.38544628),
88 | ('ya', 0.35932386),
89 | ('we', 0.35398778),
90 | ('i', 0.34099358),
91 | ('unless', 0.3306447),
92 | ('if', 0.3237356),
93 | ("'re", 0.32068467)]
94 | ```
95 |
96 | ```
97 | query: amazing
98 | [('incredible', 0.6467944),
99 | ('fantastic', 0.5760295),
100 | ('excellent', 0.56906724),
101 | ('awesome', 0.5625062),
102 | ('wonderful', 0.52154255),
103 | ('extraordinary', 0.519134),
104 | ('remarkable', 0.50572175),
105 | ('outstanding', 0.5042475),
106 | ('superb', 0.5008434),
107 | ('brilliant', 0.47915617)]
108 | ```
109 | ### Building the dataset pipeline
110 |
111 | Here is a concrete example of converting a raw sentence into the matrices that hold the data for training a Word2Vec model with either the `skip_gram` or the `cbow` architecture.
112 |
113 | Suppose the corpus contains the sentence `the quick brown fox jumps over the lazy dog`, with the window sizes (the maximum number of words to the left or right of a target word) shown below the words. Assume that the sentence has already been subsampled and its words mapped to indices.
114 |
115 | We call each word in the sentence a **target word**, and the words within the window centered at a target word its **context words**. For example, `quick` and `brown` are context words of the target word `the`, and `the`, `brown`, and `fox` are context words of the target word `quick`.
116 |
117 |
118 |
119 |
120 |
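As a plain-Python illustration of this pairing (a sketch only, using a fixed window of 2; the actual pipeline additionally shrinks each window by a random amount per target word, as done in `generate_instances` in `dataset.py`):

```
sentence = 'the quick brown fox jumps over the lazy dog'.split()
window_size = 2

# For each target word, collect the context words within its window.
for i, target in enumerate(sentence):
  left = sentence[max(i - window_size, 0):i]
  right = sentence[i + 1:i + 1 + window_size]
  print(target, '->', left + right)
# the -> ['quick', 'brown']
# quick -> ['the', 'brown', 'fox']
# ...
```
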
121 | For `skip_gram`, the task is to predict context words given the target word. The index of each target word is simply replicated to match the number of its context words. This will be our **input matrix**.
122 |
123 |
124 |
125 |
126 | Skip gram, negative sampling
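
Continuing the toy example, a hedged NumPy sketch of how the skip-gram inputs and labels could be assembled under `negative_sampling` (the real pipeline builds the same `[N, 2]` matrix with TensorFlow ops inside `generate_instances`):

```
import numpy as np

indices = list(range(9))   # toy word indices of the subsampled sentence
window_size = 2

inputs, labels = [], []
for i, target in enumerate(indices):
  context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
  inputs.extend([target] * len(context))   # replicate the target index per context word
  labels.extend(context)                   # the context words are the labels

instances = np.stack([inputs, labels], axis=1)   # input column, label column
print(instances.shape)                           # (N, 2)
```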
127 |
128 |
129 | For `cbow`, the task is to predict the target word given its context words. Because each target word may have a variable number of context words, we pad the list of context words to the maximum possible size (`2*window_size`) and append the true number of context words.
130 |
131 |
132 |
133 |
134 | Continuous bag of words, negative sampling
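
A corresponding sketch for `cbow` under `negative_sampling`: each row holds the context indices padded to `2*window_size`, then the true number of context words, then the target index used as the label (again plain NumPy, illustration only):

```
import numpy as np

indices = list(range(9))   # toy word indices of the subsampled sentence
window_size = 2

rows = []
for i, target in enumerate(indices):
  context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
  padded = context + [0] * (2 * window_size - len(context))   # pad to 2*window_size
  rows.append(padded + [len(context), target])                # true size, then the label

instances = np.array(rows)   # shape [N, 2*window_size + 2]
print(instances.shape)       # (9, 6)
```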
135 |
136 |
137 | If the training algorithm is `negative_sampling`, we simply populate the **label matrix** with the indices of the words to be predicted: context words for `skip_gram`, or target words for `cbow`.
138 |
139 | If the training algorithm is `hierarchical_softmax`, a Huffman tree is built over the collection of vocabulary words. Each vocabulary word is associated with exactly one leaf node, and the word to be predicted (which would simply be a word index in the case of `negative_sampling`) is instead represented by a sequence of `codes` and `points` determined by the internal nodes along the root-to-leaf path. For example, `E`'s `codes` and `points` would be `1`, `0`, `1`, `0` and `3782`, `8435`, `590`, `7103`, respectively. We populate the **label matrix** with the `codes` and `points`, each padded to `max_depth`, along with the true length of the `codes`/`points`.
140 |
141 |
142 |
143 |
144 | Huffman tree
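
For illustration, here is a toy sketch that mirrors the logic of `_build_binary_tree` in `dataset.py`, using made-up unigram counts: it builds a Huffman tree with `heapq` and collects each leaf's `codes` (0/1 edge labels) and `points` (internal-node indices on the root-to-leaf path):

```
import heapq

counts = [45, 13, 12, 16, 9, 5]   # made-up unigram counts for 6 vocabulary words
vocab_size = len(counts)

# Build the Huffman tree by repeatedly merging the two least frequent nodes.
heap = [[counts[i], i] for i in range(vocab_size)]
heapq.heapify(heap)
for i in range(vocab_size - 1):
  min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
  heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2])

# Walk the tree, recording the edge labels (codes) and internal-node ids (points)
# along the path from the root to each leaf, i.e. to each vocabulary word.
codes, points = {}, {}
stack = [(heap[0], [], [])]
while stack:
  node, code, point = stack.pop()
  if node[1] < vocab_size:                # leaf node: store its codes and points
    codes[node[1]], points[node[1]] = code, point
  else:                                   # internal node: descend into both children
    point = point + [node[1] - vocab_size]
    stack.append((node[2], code + [0], point))
    stack.append((node[3], code + [1], point))

print(codes[4], points[4])   # codes and points of vocabulary word 4
```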
145 |
146 |
147 |
148 |
149 |
150 |
151 | Skip gram, hierarchical softmax
152 |
153 |
154 |
155 |
156 |
157 | Continuous bag of words, hierarchical softmax
158 |
159 |
160 | In summary, an **input matrix** and a **label matrix** are created from each raw input sentence; together they provide the input and label information for the prediction task.
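
For reference, a minimal sketch of how the classes in [tf2.x/dataset.py](tf2.x/dataset.py) wire these pieces together (the corpus path is a placeholder; see `tf2.x/run_training.py` for the full training loop):

```
from dataset import WordTokenizer, Word2VecDatasetBuilder

# Build the vocabulary, then a tf.data.Dataset of (inputs, labels, progress) batches.
tokenizer = WordTokenizer(max_vocab_size=0, min_count=10, sample=1e-3)
tokenizer.build_vocab(['/PATH/TO/FILE/file1.txt'])

builder = Word2VecDatasetBuilder(tokenizer,
                                 arch='skip_gram',
                                 algm='negative_sampling',
                                 epochs=1,
                                 batch_size=32,
                                 window_size=5)
dataset = builder.build_dataset(['/PATH/TO/FILE/file1.txt'])

for inputs, labels, progress in dataset.take(1):
  print(inputs.shape, labels.shape)   # (32,) and (32,) for skip_gram + negative_sampling
```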
161 |
162 |
163 |
164 | ### Reference
165 | 1. T Mikolov, K Chen, G Corrado, J Dean - Efficient Estimation of Word Representations in Vector Space, ICLR 2013
166 | 2. T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean - Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013
167 | 3. Original implementation by Mikolov, https://code.google.com/archive/p/word2vec/
168 | 4. Gensim implementation by Radim Řehůřek, https://radimrehurek.com/gensim/models/word2vec.html
169 | 5. IMDB Movie Review dataset, http://ai.stanford.edu/~amaas/data/sentiment/
170 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/__init__.py
--------------------------------------------------------------------------------
/dataset.py:
--------------------------------------------------------------------------------
1 | import heapq
2 | import itertools
3 | import collections
4 |
5 | import numpy as np
6 | import tensorflow as tf
7 |
8 | from functools import partial
9 |
10 | OOV_ID = -1
11 |
12 |
13 | class Word2VecDataset(object):
14 | """Dataset for generating matrices holding word indices to train Word2Vec
15 | models.
16 | """
17 | def __init__(self,
18 | arch='skip_gram',
19 | algm='negative_sampling',
20 | epochs=5,
21 | batch_size=100,
22 | max_vocab_size=0,
23 | min_count=2,
24 | sample=1e-3,
25 | window_size=5):
26 | """Constructor.
27 |
28 | Args:
29 | arch: string scalar, architecture ('skip_gram' or 'cbow').
30 |       algm: string scalar, training algorithm ('negative_sampling' or
31 | 'hierarchical_softmax').
32 | epochs: int scalar, num times the dataset is iterated.
33 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have
34 | shapes [batch_size, :].
35 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, the top
36 | `max_vocab_size` most frequent words are kept in vocabulary.
37 | min_count: int scalar, words whose counts < `min_count` are not included
38 | in the vocabulary.
39 | sample: float scalar, subsampling rate.
40 | window_size: int scalar, num of words on the left or right side of
41 | target word within a window.
42 | """
43 | self._arch = arch
44 | self._algm = algm
45 | self._epochs = epochs
46 | self._batch_size = batch_size
47 | self._max_vocab_size = max_vocab_size
48 | self._min_count = min_count
49 | self._sample = sample
50 | self._window_size = window_size
51 |
52 | self._iterator_initializer = None
53 | self._table_words = None
54 | self._unigram_counts = None
55 | self._keep_probs = None
56 | self._corpus_size = None
57 | self._max_depth = None
58 |
59 | @property
60 | def iterator_initializer(self):
61 | return self._iterator_initializer
62 |
63 | @property
64 | def table_words(self):
65 | return self._table_words
66 |
67 | @property
68 | def unigram_counts(self):
69 | return self._unigram_counts
70 |
71 | def _build_raw_vocab(self, filenames):
72 | """Builds raw vocabulary.
73 |
74 | Args:
75 | filenames: list of strings, holding names of text files.
76 |
77 | Returns:
78 | raw_vocab: a list of 2-tuples holding the word (string) and count (int),
79 | sorted in descending order of word count.
80 | """
81 | map_open = partial(open, encoding="utf-8")
82 | lines = itertools.chain(*map(map_open, filenames))
83 | raw_vocab = collections.Counter()
84 | for line in lines:
85 | raw_vocab.update(line.strip().split())
86 | raw_vocab = raw_vocab.most_common()
87 | if self._max_vocab_size > 0:
88 | raw_vocab = raw_vocab[:self._max_vocab_size]
89 | return raw_vocab
90 |
91 | def build_vocab(self, filenames):
92 | """Builds vocabulary.
93 |
94 | Has the side effect of setting the following attributes:
95 | - table_words: list of string, holding the list of vocabulary words. Index
96 | of each entry is the same as the word index into the vocabulary.
97 | - unigram_counts: list of int, holding word counts. Index of each entry
98 | is the same as the word index into the vocabulary.
99 | - keep_probs: list of float, holding words' keep prob for subsampling.
100 | Index of each entry is the same as the word index into the vocabulary.
101 | - corpus_size: int scalar, effective corpus size.
102 |
103 | Args:
104 | filenames: list of strings, holding names of text files.
105 | """
106 | raw_vocab = self._build_raw_vocab(filenames)
107 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count]
108 | self._corpus_size = sum(list(zip(*raw_vocab))[1])
109 |
110 | self._table_words = []
111 | self._unigram_counts = []
112 | self._keep_probs = []
113 | for word, count in raw_vocab:
114 | frac = count / float(self._corpus_size)
115 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac)
116 | keep_prob = np.minimum(keep_prob, 1.0)
117 | self._table_words.append(word)
118 | self._unigram_counts.append(count)
119 | self._keep_probs.append(keep_prob)
120 |
121 | def _build_binary_tree(self, unigram_counts):
122 | """Builds a Huffman tree for hierarchical softmax. Has the side effect
123 | of setting `max_depth`.
124 |
125 | Args:
126 | unigram_counts: list of int, holding word counts. Index of each entry
127 | is the same as the word index into the vocabulary.
128 |
129 | Returns:
130 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1]
131 | where each row holds the codes (0-1 binary values) padded to
132 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`,
133 | of each vocabulary word. The last entry is the true length of code
134 | and point (<= `max_depth`).
135 | """
136 | vocab_size = len(unigram_counts)
137 | heap = [[unigram_counts[i], i] for i in range(vocab_size)]
138 | heapq.heapify(heap)
139 | for i in range(vocab_size - 1):
140 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
141 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2])
142 |
143 | node_list = []
144 | max_depth, stack = 0, [[heap[0], [], []]]
145 | while stack:
146 | node, code, point = stack.pop()
147 | if node[1] < vocab_size:
148 | node.extend([code, point, len(point)])
149 | max_depth = np.maximum(len(code), max_depth)
150 | node_list.append(node)
151 | else:
152 | point = np.array(list(point) + [node[1]-vocab_size])
153 | stack.append([node[2], np.array(list(code)+[0]), point])
154 | stack.append([node[3], np.array(list(code)+[1]), point])
155 |
156 | node_list = sorted(node_list, key=lambda items: items[1])
157 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int32)
158 | for i in range(len(node_list)):
159 | length = node_list[i][4] # length of code or point
160 | codes_points[i, -1] = length
161 | codes_points[i, :length] = node_list[i][2] # code
162 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point
163 | self._max_depth = max_depth
164 | return codes_points
165 |
166 | def _prepare_inputs_labels(self, tensor):
167 | """Set shape of `tensor` according to architecture and training algorithm,
168 | and split `tensor` into `inputs` and `labels`.
169 |
170 | Args:
171 | tensor: rank-2 int tensor, holding word indices for prediction inputs
172 | and prediction labels, returned by `generate_instances`.
173 |
174 | Returns:
175 | inputs: rank-2 int tensor, holding word indices for prediction inputs.
176 | labels: rank-2 int tensor, holding word indices for prediction labels.
177 | """
178 | if self._arch == 'skip_gram':
179 | if self._algm == 'negative_sampling':
180 | tensor.set_shape([self._batch_size, 2])
181 | else:
182 | tensor.set_shape([self._batch_size, 2*self._max_depth+2])
183 | inputs = tensor[:, :1]
184 | labels = tensor[:, 1:]
185 | else:
186 | if self._algm == 'negative_sampling':
187 | tensor.set_shape([self._batch_size, 2*self._window_size+2])
188 | else:
189 | tensor.set_shape([self._batch_size,
190 | 2*self._window_size+2*self._max_depth+2])
191 | inputs = tensor[:, :2*self._window_size+1]
192 | labels = tensor[:, 2*self._window_size+1:]
193 | return inputs, labels
194 |
195 | def get_tensor_dict(self, filenames):
196 | """Generates tensor dict mapping from tensor names to tensors.
197 |
198 | Args:
199 | filenames: list of strings, holding names of text files.
200 |
201 | Returns:
202 | tensor_dict: a dict mapping from tensor names to tensors with shape being:
203 | when arch=='skip_gram', algm=='negative_sampling'
204 | inputs: [N], labels: [N]
205 | when arch=='cbow', algm=='negative_sampling'
206 | inputs: [N, 2*window_size+1], labels: [N]
207 | when arch=='skip_gram', algm=='hierarchical_softmax'
208 | inputs: [N], labels: [N, 2*max_depth+1]
209 | when arch=='cbow', algm=='hierarchical_softmax'
210 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1]
211 | progress: [N], the percentage of sentences covered so far. Used to
212 | compute learning rate.
213 | """
214 | table_words = self._table_words
215 | unigram_counts = self._unigram_counts
216 | keep_probs = self._keep_probs
217 | if not table_words or not unigram_counts or not keep_probs:
218 |       raise ValueError('`table_words`, `unigram_counts`, and `keep_probs` must '
219 | 'be set by calling `build_vocab()`')
220 |
221 | if self._algm == 'hierarchical_softmax':
222 | codes_points = tf.constant(self._build_binary_tree(unigram_counts))
223 | elif self._algm == 'negative_sampling':
224 | codes_points = None
225 | else:
226 | raise ValueError('algm must be hierarchical_softmax or negative_sampling')
227 |
228 | table_words = tf.contrib.lookup.index_table_from_tensor(
229 | tf.constant(table_words), default_value=OOV_ID)
230 | keep_probs = tf.constant(keep_probs)
231 |
232 | num_sents = sum([len(list(open(fn, encoding="utf-8")
233 | )) for fn in filenames])
234 | num_sents = self._epochs * num_sents
235 |
236 |     # pair each sentence with its overall progress (fraction of sentences seen) and epoch number
237 | a_zip = tf.data.TextLineDataset(filenames).repeat(self._epochs)
238 | b_zip = tf.range(1, 1+num_sents) / num_sents
239 | c_zip = tf.repeat(tf.range(1, 1+self._epochs), int(num_sents / self._epochs))
240 |
241 | dataset = tf.data.Dataset.zip((a_zip,
242 | tf.data.Dataset.from_tensor_slices(b_zip),
243 | tf.data.Dataset.from_tensor_slices(c_zip)))
244 |
245 | dataset = dataset.map(lambda sent, progress, epoch:
246 | (get_word_indices(sent, table_words), progress, epoch))
247 | dataset = dataset.map(lambda indices, progress, epoch:
248 | (subsample(indices, keep_probs), progress, epoch))
249 | dataset = dataset.filter(lambda indices, progress, epoch:
250 | tf.greater(tf.size(indices), 1))
251 |
252 | dataset = dataset.map(lambda indices, progress, epoch: (
253 | generate_instances(
254 | indices, self._arch, self._window_size, codes_points), progress, epoch))
255 |
256 | dataset = dataset.map(lambda instances, progress, epoch: (
257 | instances, tf.fill(tf.shape(instances)[:1], progress),
258 | tf.fill(tf.shape(instances)[:1], epoch)))
259 |
260 | dataset = dataset.flat_map(lambda instances, progress, epoch:
261 | tf.data.Dataset.from_tensor_slices((instances, progress, epoch)))
262 | dataset = dataset.batch(self._batch_size, drop_remainder=True)
263 |
264 | iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
265 | self._iterator_initializer = iterator.initializer
266 | tensor, progress, epoch = iterator.get_next()
267 | progress.set_shape([self._batch_size])
268 | epoch.set_shape([self._batch_size])
269 |
270 | inputs, labels = self._prepare_inputs_labels(tensor)
271 | if self._arch == 'skip_gram':
272 | inputs = tf.squeeze(inputs, axis=1)
273 | if self._algm == 'negative_sampling':
274 | labels = tf.squeeze(labels, axis=1)
275 |
276 | return {'inputs': inputs, 'labels': labels, 'progress': progress, 'epoch': epoch}
277 |
278 |
279 | def get_word_indices(sent, table_words):
280 | """Converts a sentence into a list of word indices.
281 |
282 | Args:
283 | sent: a scalar string tensor, a sentence where words are space-delimited.
284 | table_words: a `HashTable` mapping from words (string tensor) to word
285 | indices (int tensor).
286 |
287 | Returns:
288 | indices: rank-1 int tensor, the word indices within a sentence.
289 | """
290 | words = tf.string_split([sent]).values
291 | indices = tf.to_int32(table_words.lookup(words))
292 | return indices
293 |
294 |
295 | def subsample(indices, keep_probs):
296 | """Filters out-of-vocabulary words and then applies subsampling on words in a
297 | sentence. Words with high frequencies have lower keep probs.
298 |
299 | Args:
300 | indices: rank-1 int tensor, the word indices within a sentence.
301 |     keep_probs: rank-1 float tensor, the probability of keeping each vocabulary word.
302 |
303 | Returns:
304 | indices: rank-1 int tensor, the word indices within a sentence after
305 | subsampling.
306 | """
307 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID))
308 | keep_probs = tf.gather(keep_probs, indices)
309 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1)
310 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs))
311 | return indices
312 |
313 |
314 | def generate_instances(indices, arch, window_size, codes_points=None):
315 | """Generates matrices holding word indices to be passed to Word2Vec models
316 |   for each sentence. The shape and contents of output matrices depend on the
317 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling'
318 | , 'hierarchical_softmax').
319 |
320 | It takes as input a list of word indices in a subsampled-sentence, where each
321 | word is a target word, and their context words are those within the window
322 | centered at a target word. For skip gram architecture, `num_context_words`
323 | instances are generated for a target word, and for cbow architecture, a single
324 | instance is generated for a target word.
325 |
326 | If `codes_points` is not None ('hierarchical softmax'), the word to be
327 | predicted (context word for 'skip_gram', and target word for 'cbow') are
328 | represented by their 'codes' and 'points' in the Huffman tree (See
329 | `_build_binary_tree`).
330 |
331 | Args:
332 | indices: rank-1 int tensor, the word indices within a sentence after
333 | subsampling.
334 | arch: scalar string, architecture ('skip_gram' or 'cbow').
335 | window_size: int scalar, num of words on the left or right side of
336 | target word within a window.
337 | codes_points: None, or an int tensor of shape [vocab_size, 2*max_depth+1]
338 | where each row holds the codes (0-1 binary values) padded to `max_depth`,
339 | and points (non-leaf node indices) padded to `max_depth`, of each
340 | vocabulary word. The last entry is the true length of code and point
341 | (<= `max_depth`).
342 |
343 | Returns:
344 | instances: an int tensor holding word indices, with shape being
345 | when arch=='skip_gram', algm=='negative_sampling'
346 | shape: [N, 2]
347 | when arch=='cbow', algm=='negative_sampling'
348 | shape: [N, 2*window_size+2]
349 | when arch=='skip_gram', algm=='hierarchical_softmax'
350 | shape: [N, 2*max_depth+2]
351 | when arch=='cbow', algm='hierarchical_softmax'
352 | shape: [N, 2*window_size+2*max_depth+2]
353 | """
354 | def per_target_fn(index, init_array):
355 | reduced_size = tf.random.uniform([], maxval=window_size, dtype=tf.int32)
356 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index)
357 | right = tf.range(index + 1,
358 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices)))
359 | context = tf.concat([left, right], axis=0)
360 | context = tf.gather(indices, context)
361 |
362 | if arch == 'skip_gram':
363 | window = tf.stack([tf.fill(tf.shape(context), indices[index]),
364 | context], axis=1)
365 | elif arch == 'cbow':
366 | true_size = tf.size(context)
367 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]),
368 | [true_size, indices[index]]], axis=0)
369 | window = tf.expand_dims(window, axis=0)
370 | else:
371 | raise ValueError('architecture must be skip_gram or cbow.')
372 |
373 | if codes_points is not None:
374 | window = tf.concat([window[:, :-1],
375 | tf.gather(codes_points, window[:, -1])], axis=1)
376 | return index + 1, init_array.write(index, window)
377 |
378 | size = tf.size(indices)
379 | init_array = tf.TensorArray(tf.int32, size=size, infer_shape=False)
380 | _, result_array = tf.while_loop(lambda i, ta: i < size,
381 | per_target_fn,
382 | [0, init_array],
383 | back_prop=False)
384 | instances = tf.cast(result_array.concat(), tf.int64)
385 | return instances
386 |
387 |
--------------------------------------------------------------------------------
/files/cbow_hs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_hs.png
--------------------------------------------------------------------------------
/files/cbow_ns.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_ns.png
--------------------------------------------------------------------------------
/files/huffman.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/huffman.png
--------------------------------------------------------------------------------
/files/sent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sent.png
--------------------------------------------------------------------------------
/files/sg_hs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_hs.png
--------------------------------------------------------------------------------
/files/sg_ns.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_ns.png
--------------------------------------------------------------------------------
/run_training.py:
--------------------------------------------------------------------------------
1 | r"""Executable for training Word2Vec models.
2 |
3 | Example:
4 | python run_training.py \
5 | --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \
6 | --out_dir=/PATH/TO/OUT_DIR/ \
7 | --batch_size=64 \
8 |     --window_size=5
9 |
10 | Learned word embeddings will be saved to /PATH/TO/OUT_DIR/embed.npy, and
11 | vocabulary saved to /PATH/TO/OUT_DIR/vocab.txt
12 | """
13 | import os
14 | import time
15 |
16 | import tensorflow as tf
17 | import numpy as np
18 |
19 | # import project files
20 | from dataset import Word2VecDataset
21 | from word2vec import Word2VecModel
22 |
23 | flags = tf.app.flags
24 |
25 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).')
26 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm '
27 | '(negative_sampling or hierarchical_softmax).')
28 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate training data.')
29 | flags.DEFINE_integer('batch_size', 256, 'Batch size.')
30 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. If > 0, '
31 | 'the top `max_vocab_size` most frequent words are kept in vocabulary.')
32 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` are not'
33 | ' included in the vocabulary.')
34 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.')
35 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side'
36 | ' of target word within a window.')
37 |
38 | flags.DEFINE_integer('embed_size', 300, 'Length of word vector.')
39 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.')
40 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.')
41 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.')
42 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.')
43 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct '
44 | 'between syn0 and syn1 vectors.')
45 |
46 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to '
47 | ' output logs.')
48 | flags.DEFINE_list('filenames', None, 'Names of comma-separated input text files.')
49 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.')
50 |
51 | FLAGS = flags.FLAGS
52 |
53 |
54 | def main(_):
55 | dataset = Word2VecDataset(arch=FLAGS.arch,
56 | algm=FLAGS.algm,
57 | epochs=FLAGS.epochs,
58 | batch_size=FLAGS.batch_size,
59 | max_vocab_size=FLAGS.max_vocab_size,
60 | min_count=FLAGS.min_count,
61 | sample=FLAGS.sample,
62 | window_size=FLAGS.window_size)
63 | dataset.build_vocab(FLAGS.filenames)
64 |
65 | word2vec = Word2VecModel(arch=FLAGS.arch,
66 | algm=FLAGS.algm,
67 | embed_size=FLAGS.embed_size,
68 | batch_size=FLAGS.batch_size,
69 | negatives=FLAGS.negatives,
70 | power=FLAGS.power,
71 | alpha=FLAGS.alpha,
72 | min_alpha=FLAGS.min_alpha,
73 | add_bias=FLAGS.add_bias,
74 | random_seed=0)
75 | to_be_run_dict = word2vec.train(dataset, FLAGS.filenames)
76 |
77 | with tf.Session() as sess:
78 | sess.run(dataset.iterator_initializer)
79 | sess.run(tf.tables_initializer())
80 | sess.run(tf.global_variables_initializer())
81 |
82 | average_loss = 0.
83 | step = 0
84 | while True:
85 | try:
86 | result_dict = sess.run(to_be_run_dict)
87 | except tf.errors.OutOfRangeError:
88 | break
89 |
90 | average_loss += result_dict['loss'].mean()
91 | if step % FLAGS.log_per_steps == 0:
92 | if step > 0:
93 | average_loss /= FLAGS.log_per_steps
94 | print('step:', step, 'average_loss:', average_loss,
95 | 'learning_rate:', result_dict['learning_rate'])
96 | average_loss = 0.
97 |
98 | step += 1
99 |
100 | syn0_final = sess.run(word2vec.syn0)
101 |
102 | np.save(os.path.join(FLAGS.out_dir, 'embed'), syn0_final)
103 | with open(os.path.join(FLAGS.out_dir, 'vocab.txt'), 'w', encoding="utf-8") as fid:
104 | for w in dataset.table_words:
105 | fid.write(w + '\n')
106 |
107 | print('Word embeddings saved to', os.path.join(FLAGS.out_dir, 'embed.npy'))
108 | print('Vocabulary saved to', os.path.join(FLAGS.out_dir, 'vocab.txt'))
109 |
110 | if __name__ == '__main__':
111 | tf.flags.mark_flag_as_required('filenames')
112 |
113 | tf.app.run()
114 |
--------------------------------------------------------------------------------
/tf2.x/README.md:
--------------------------------------------------------------------------------
1 | This is the same model implemented in TensorFlow 2.x. Detailed usage information can be found in the [original README](../README.md).
2 |
--------------------------------------------------------------------------------
/tf2.x/dataset.py:
--------------------------------------------------------------------------------
1 | """Defines word tokenizer and word2vec dataset builder.
2 | """
3 | import heapq
4 | import itertools
5 | import collections
6 |
7 | import numpy as np
8 | import tensorflow as tf
9 |
10 | OOV_ID = -1
11 |
12 |
13 | class WordTokenizer(object):
14 | """Vanilla word tokenizer that spits out space-separated tokens from raw text
15 | string. Note for non-space separated languages, the corpus must be
16 | pre-tokenized such that tokens are space-delimited.
17 | """
18 | def __init__(self, max_vocab_size=0, min_count=10, sample=1e-3):
19 | """Constructor.
20 |
21 | Args:
22 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, only the top
23 | `max_vocab_size` most frequent words will be kept in vocabulary.
24 | min_count: int scalar, words whose counts < `min_count` will not be
25 | included in the vocabulary.
26 | sample: float scalar, subsampling rate.
27 | """
28 | self._max_vocab_size = max_vocab_size
29 | self._min_count = min_count
30 | self._sample = sample
31 |
32 | self._vocab = None
33 | self._table_words = None
34 | self._unigram_counts = None
35 | self._keep_probs = None
36 |
37 | @property
38 | def unigram_counts(self):
39 | return self._unigram_counts
40 |
41 | @property
42 | def table_words(self):
43 | return self._table_words
44 |
45 | def _build_raw_vocab(self, filenames):
46 |     """Builds raw vocabulary by iterating through the corpus once and counting
47 |     the unique words.
48 |
49 | Args:
50 | filenames: list of strings, holding names of text files.
51 |
52 | Returns:
53 | raw_vocab: a list of 2-tuples holding the word (string) and count (int),
54 | sorted in descending order of word count.
55 | """
56 | lines = []
57 | for fn in filenames:
58 | with tf.io.gfile.GFile(fn) as f:
59 | lines.append(f)
60 | lines = itertools.chain(*lines)
61 |
62 | raw_vocab = collections.Counter()
63 | for line in lines:
64 | raw_vocab.update(line.strip().split())
65 | raw_vocab = raw_vocab.most_common()
66 | # truncate to have at most `max_vocab_size` vocab words
67 | if self._max_vocab_size > 0:
68 | raw_vocab = raw_vocab[:self._max_vocab_size]
69 | return raw_vocab
70 |
71 | def build_vocab(self, filenames):
72 | """Builds the vocabulary.
73 |
74 | Has the side effect of setting the following attributes: for each word
75 | `word` we have
76 |
77 | vocab[word] = index
78 |       table_words[index] = `word`
79 | unigram_counts[index] = count of `word` in vocab
80 | keep_probs[index] = keep prob of `word` for subsampling
81 |
82 | Args:
83 | filenames: list of strings, holding names of text files.
84 | """
85 | raw_vocab = self._build_raw_vocab(filenames)
86 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count]
87 | self._corpus_size = sum(list(zip(*raw_vocab))[1])
88 |
89 | self._vocab = {}
90 | self._table_words = []
91 | self._unigram_counts = []
92 | self._keep_probs = []
93 | for index, (word, count) in enumerate(raw_vocab):
94 | frac = count / float(self._corpus_size)
95 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac)
96 | keep_prob = np.minimum(keep_prob, 1.0)
97 | self._vocab[word] = index
98 | self._table_words.append(word)
99 | self._unigram_counts.append(count)
100 | self._keep_probs.append(keep_prob)
101 |
102 | def encode(self, string):
103 |     """Splits a raw text string into tokens (space-separated) and translates them to token
104 | ids.
105 |
106 | Args:
107 | string: string scalar, the raw text string to be tokenized.
108 |
109 | Returns:
110 | ids: a list of ints, the token ids of the tokenized string.
111 | """
112 | tokens = string.strip().split()
113 | ids = [self._vocab[token] if token in self._vocab else OOV_ID
114 | for token in tokens]
115 | return ids
116 |
117 |
118 | class Word2VecDatasetBuilder(object):
119 | """Builds a tf.data.Dataset instance that generates matrices holding word
120 | indices for training Word2Vec models.
121 | """
122 | def __init__(self,
123 | tokenizer,
124 | arch='skip_gram',
125 | algm='negative_sampling',
126 | epochs=1,
127 | batch_size=32,
128 | window_size=5):
129 | """Constructor.
130 |
131 | Args:
132 | epochs: int scalar, num times the dataset is iterated.
133 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have
134 | shapes [batch_size, :].
135 | window_size: int scalar, num of words on the left or right side of
136 | target word within a window.
137 | """
138 | self._tokenizer = tokenizer
139 | self._arch = arch
140 | self._algm = algm
141 | self._epochs = epochs
142 | self._batch_size = batch_size
143 | self._window_size = window_size
144 |
145 | self._max_depth = None
146 |
147 | def _build_binary_tree(self, unigram_counts):
148 | """Builds a Huffman tree for hierarchical softmax. Has the side effect
149 | of setting `max_depth`.
150 |
151 | Args:
152 | unigram_counts: list of int, holding word counts. Index of each entry
153 | is the same as the word index into the vocabulary.
154 |
155 | Returns:
156 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1]
157 | where each row holds the codes (0-1 binary values) padded to
158 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`,
159 | of each vocabulary word. The last entry is the true length of code
160 | and point (<= `max_depth`).
161 | """
162 | vocab_size = len(unigram_counts)
163 | heap = [[unigram_counts[i], i] for i in range(vocab_size)]
164 | # initialize the min-priority queue, which has length `vocab_size`
165 | heapq.heapify(heap)
166 |
167 | # insert `vocab_size` - 1 internal nodes, with vocab words as leaf nodes.
168 | for i in range(vocab_size - 1):
169 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap)
170 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2])
171 | # At this point we have a len-1 heap, and `heap[0]` will be the root of
172 | # the binary tree; where internal nodes store
173 | # 1. key (frequency)
174 | # 2. vocab index
175 | # 3. left child
176 | # 4. right child
177 | # and leaf nodes store
178 |     # 1. key (frequency)
179 |     # 2. vocab index
180 |
181 | # Traverse the Huffman tree rooted at `heap[0]` in the order of
182 | # Depth-First-Search. Each stack item stores the
183 | # 1. `node`
184 | # 2. code of the `node` (list)
185 | # 3. point of the `node` (list)
186 | #
187 | # `point` is the list of vocab IDs of the internal nodes along the path from
188 | # the root up to `node` (not included)
189 | # `code` is the list of labels (0 or 1) of the edges along the path from the
190 | # root up to `node`
191 | # they are empty lists for the root node `heap[0]`
192 | node_list = []
193 |     max_depth, stack = 0, [[heap[0], [], []]] # stack: [root, code, point]
194 | while stack:
195 | node, code, point = stack.pop()
196 | if node[1] < vocab_size:
197 | # leaf node: len(node) == 2
198 | node.extend([code, point, len(point)])
199 | max_depth = np.maximum(len(code), max_depth)
200 | node_list.append(node)
201 | else:
202 | # internal node: len(node) == 4
203 | point = np.array(list(point) + [node[1]-vocab_size])
204 | stack.append([node[2], np.array(list(code)+[0]), point])
205 | stack.append([node[3], np.array(list(code)+[1]), point])
206 |
207 | # `len(node_list[i]) = 5`
208 | node_list = sorted(node_list, key=lambda items: items[1])
209 | # Stores the padded codes and points for each vocab word
210 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int64)
211 | for i in range(len(node_list)):
212 | length = node_list[i][4] # length of code or point
213 | codes_points[i, -1] = length
214 | codes_points[i, :length] = node_list[i][2] # code
215 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point
216 | self._max_depth = max_depth
217 | return codes_points
218 |
219 | def build_dataset(self, filenames):
220 |     """Builds a tf.data.Dataset instance that generates (inputs, labels, progress) tuples.
221 |
222 | Args:
223 | filenames: list of strings, holding names of text files.
224 |
225 | Returns:
226 |       dataset: a tf.data.Dataset instance, holding a tuple of tensors
227 | (inputs, labels, progress)
228 | when arch=='skip_gram', algm=='negative_sampling'
229 | inputs: [N], labels: [N]
230 | when arch=='cbow', algm=='negative_sampling'
231 | inputs: [N, 2*window_size+1], labels: [N]
232 | when arch=='skip_gram', algm=='hierarchical_softmax'
233 | inputs: [N], labels: [N, 2*max_depth+1]
234 | when arch=='cbow', algm=='hierarchical_softmax'
235 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1]
236 | progress: [N], the percentage of sentences covered so far. Used to
237 | compute learning rate.
238 | """
239 | unigram_counts = self._tokenizer._unigram_counts
240 | keep_probs = self._tokenizer._keep_probs
241 |
242 | if self._algm == 'hierarchical_softmax':
243 | codes_points = tf.constant(self._build_binary_tree(unigram_counts))
244 | elif self._algm == 'negative_sampling':
245 | codes_points = None
246 | else:
247 | raise ValueError('algm must be hierarchical_softmax or negative_sampling')
248 |
249 | keep_probs = tf.cast(tf.constant(keep_probs), 'float32')
250 |
251 | # total num of sentences (lines) across text files times num of epochs
252 | num_sents = sum([len(list(tf.io.gfile.GFile(fn)))
253 | for fn in filenames]) * self._epochs
254 |
255 | def generator_fn():
256 | for _ in range(self._epochs):
257 | for fn in filenames:
258 | with tf.io.gfile.GFile(fn) as f:
259 | for line in f:
260 | yield self._tokenizer.encode(line)
261 |
262 | # dataset: [([int], float)]
263 | dataset = tf.data.Dataset.zip((
264 | tf.data.Dataset.from_generator(generator_fn, tf.int64, [None]),
265 | tf.data.Dataset.from_tensor_slices(tf.range(num_sents) / num_sents)))
266 | # dataset: [([int], float)]
267 | dataset = dataset.map(lambda indices, progress:
268 | (subsample(indices, keep_probs), progress))
269 | # dataset: [([int], float)]
270 | dataset = dataset.filter(lambda indices, progress:
271 | tf.greater(tf.size(indices), 1)) # sentence must have at least 2 tokens
272 | # dataset: [((None, None), float)]
273 | dataset = dataset.map(lambda indices, progress: (generate_instances(
274 | indices, self._arch, self._window_size, self._max_depth, codes_points),
275 | progress))
276 |     # dataset: [((None, None), (None,))]
277 | dataset = dataset.map(lambda instances, progress: (
278 | # replicate `progress` to size `tf.shape(instances)[:1]`
279 | instances, tf.fill(tf.shape(instances)[:1], progress)))
280 | dataset = dataset.flat_map(lambda instances, progress:
281 | # form a dataset by unstacking `instances` in the first dimension,
282 | tf.data.Dataset.from_tensor_slices((instances, progress)))
283 | # batch the dataset
284 | dataset = dataset.batch(self._batch_size, drop_remainder=True)
285 |
286 | def prepare_inputs_labels(tensor, progress):
287 | if self._arch == 'skip_gram':
288 | if self._algm == 'negative_sampling':
289 | tensor.set_shape([self._batch_size, 2])
290 | else:
291 | tensor.set_shape([self._batch_size, 2*self._max_depth+2])
292 | inputs = tensor[:, :1]
293 | labels = tensor[:, 1:]
294 |
295 | else:
296 | if self._algm == 'negative_sampling':
297 | tensor.set_shape([self._batch_size, 2*self._window_size+2])
298 | else:
299 | tensor.set_shape([self._batch_size,
300 | 2*self._window_size+2*self._max_depth+2])
301 | inputs = tensor[:, :2*self._window_size+1]
302 | labels = tensor[:, 2*self._window_size+1:]
303 |
304 | if self._arch == 'skip_gram':
305 | inputs = tf.squeeze(inputs, axis=1)
306 | if self._algm == 'negative_sampling':
307 | labels = tf.squeeze(labels, axis=1)
308 | progress = tf.cast(progress, 'float32')
309 | return inputs, labels, progress
310 |
311 | dataset = dataset.map(lambda tensor, progress:
312 | prepare_inputs_labels(tensor, progress))
313 |
314 | return dataset
315 |
316 |
317 | def subsample(indices, keep_probs):
318 | """Filters out-of-vocabulary words and then applies subsampling on words in a
319 | sentence. Words with high frequencies have lower keep probs.
320 |
321 | Args:
322 | indices: rank-1 int tensor, the word indices within a sentence.
323 |     keep_probs: rank-1 float tensor, the probability of keeping each vocabulary word.
324 |
325 | Returns:
326 | indices: rank-1 int tensor, the word indices within a sentence after
327 | subsampling.
328 | """
329 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID))
330 | keep_probs = tf.gather(keep_probs, indices)
331 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1)
332 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs))
333 | return indices
334 |
335 |
336 | def generate_instances(
337 | indices, arch, window_size, max_depth=None, codes_points=None):
338 | """Generates matrices holding word indices to be passed to Word2Vec models
339 |   for each sentence. The shape and contents of output matrices depend on the
340 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling'
341 | , 'hierarchical_softmax').
342 |
343 | It takes as input a list of word indices in a subsampled-sentence, where each
344 | word is a target word, and their context words are those within the window
345 | centered at a target word. For skip gram architecture, `num_context_words`
346 | instances are generated for a target word, and for cbow architecture, a single
347 | instance is generated for a target word.
348 |
349 | If `codes_points` is not None ('hierarchical softmax'), the word to be
350 | predicted (context word for 'skip_gram', and target word for 'cbow') are
351 | represented by their 'codes' and 'points' in the Huffman tree (See
352 | `_build_binary_tree`).
353 |
354 | Args:
355 | indices: rank-1 int tensor, the word indices within a sentence after
356 | subsampling.
357 | arch: scalar string, architecture ('skip_gram' or 'cbow').
358 | window_size: int scalar, num of words on the left or right side of
359 | target word within a window.
360 | max_depth: (Optional) int scalar, the max depth of the Huffman tree.
361 | codes_points: (Optional) an int tensor of shape [vocab_size, 2*max_depth+1]
362 | where each row holds the codes (0-1 binary values) padded to `max_depth`,
363 | and points (non-leaf node indices) padded to `max_depth`, of each
364 | vocabulary word. The last entry is the true length of code and point
365 | (<= `max_depth`).
366 |
367 | Returns:
368 | instances: an int tensor holding word indices, with shape being
369 | when arch=='skip_gram', algm=='negative_sampling'
370 | shape: [N, 2]
371 | when arch=='cbow', algm=='negative_sampling'
372 | shape: [N, 2*window_size+2]
373 | when arch=='skip_gram', algm=='hierarchical_softmax'
374 | shape: [N, 2*max_depth+2]
375 | when arch=='cbow', algm='hierarchical_softmax'
376 | shape: [N, 2*window_size+2*max_depth+2]
377 | """
378 | def per_target_fn(index, init_array):
379 | """Generate inputs and labels for each target word.
380 |
381 | `index` is the index of the target word in `indices`.
382 | """
383 | reduced_size = tf.random.uniform([], maxval=window_size, dtype='int32')
384 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index)
385 | right = tf.range(index + 1,
386 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices)))
387 | context = tf.concat([left, right], axis=0)
388 | context = tf.gather(indices, context)
389 |
390 | if arch == 'skip_gram':
391 | # replicate `indices[index]` to match the size of `context`
392 | # [N, 2]
393 | window = tf.stack([tf.fill(tf.shape(context), indices[index]),
394 | context], axis=1)
395 | elif arch == 'cbow':
396 | true_size = tf.size(context)
397 | # pad `context` to length `2 * window_size`
398 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]),
399 | [true_size, indices[index]]], axis=0)
400 | # [1, 2*window_size + 2]
401 | window = tf.expand_dims(window, axis=0)
402 | else:
403 | raise ValueError('architecture must be skip_gram or cbow.')
404 |
405 | if codes_points is not None:
406 | # [N, 2*max_depth + 2] or [1, 2*window_size+2*max_depth+2]
407 | window = tf.concat([window[:, :-1],
408 | tf.gather(codes_points, window[:, -1])], axis=1)
409 | return index + 1, init_array.write(index, window)
410 |
411 | size = tf.size(indices)
412 | # initialize a tensor array of length `tf.size(indices)`
413 | init_array = tf.TensorArray('int64', size=size, infer_shape=False)
414 | _, result_array = tf.while_loop(lambda i, ta: i < size,
415 | per_target_fn,
416 | [0, init_array],
417 | back_prop=False)
418 | instances = tf.cast(result_array.concat(), 'int64')
419 | if arch == 'skip_gram':
420 | if max_depth is None:
421 | instances.set_shape([None, 2])
422 | else:
423 | instances.set_shape([None, 2*max_depth+2])
424 | else:
425 | if max_depth is None:
426 | instances.set_shape([None, 2*window_size+2])
427 | else:
428 | instances.set_shape([None, 2*window_size+2*max_depth+2])
429 |
430 | return instances
431 |
--------------------------------------------------------------------------------
/tf2.x/demo_word_similarity.py:
--------------------------------------------------------------------------------
1 | from word_vectors import WordVectors
2 | import numpy as np
3 |
4 | # syn0_final.npy: stores the word embeddings, a numpy array of shape [vocab_size, hidden_size]
5 | # 'vocab.txt': text file storing words in vocabulary, one word per line
6 |
7 | query = ','
8 | num_similar_words = 10
9 | syn0_final = np.load('syn0_final.npy')
10 | vocab_words = []
11 | with open('vocab.txt', encoding='utf-8') as f:
12 | vocab_words = [l.strip() for l in f]
13 |
14 | wv = WordVectors(syn0_final, vocab_words)
15 | print(wv.most_similar(query, num_similar_words))
16 |
--------------------------------------------------------------------------------
/tf2.x/model.py:
--------------------------------------------------------------------------------
1 | """Defines word2vec model using tf.keras API.
2 | """
3 | import tensorflow as tf
4 |
5 | from dataset import WordTokenizer
6 | from dataset import Word2VecDatasetBuilder
7 |
8 |
9 | class Word2VecModel(tf.keras.Model):
10 | """Word2Vec model."""
11 | def __init__(self,
12 | unigram_counts,
13 | arch='skip_gram',
14 | algm='negative_sampling',
15 | hidden_size=300,
16 | batch_size=256,
17 | negatives=5,
18 | power=0.75,
19 | alpha=0.025,
20 | min_alpha=0.0001,
21 | add_bias=True,
22 | random_seed=0):
23 | """Constructor.
24 |
25 | Args:
26 | unigram_counts: a list of ints, the counts of word tokens in the corpus.
27 | arch: string scalar, architecture ('skip_gram' or 'cbow').
28 | algm: string scalar, training algorithm ('negative_sampling' or
29 | 'hierarchical_softmax').
30 | hidden_size: int scalar, length of word vector.
31 | batch_size: int scalar, batch size.
32 | negatives: int scalar, num of negative words to sample.
33 | power: float scalar, distortion for negative sampling.
34 | alpha: float scalar, initial learning rate.
35 | min_alpha: float scalar, final learning rate.
36 | add_bias: bool scalar, whether to add bias term to dotproduct
37 | between syn0 and syn1 vectors.
38 | random_seed: int scalar, random_seed.
39 | """
40 | super(Word2VecModel, self).__init__()
41 | self._unigram_counts = unigram_counts
42 | self._arch = arch
43 | self._algm = algm
44 | self._hidden_size = hidden_size
45 | self._vocab_size = len(unigram_counts)
46 | self._batch_size = batch_size
47 | self._negatives = negatives
48 | self._power = power
49 | self._alpha = alpha
50 | self._min_alpha = min_alpha
51 | self._add_bias = add_bias
52 | self._random_seed = random_seed
53 |
54 | self._input_size = (self._vocab_size if self._algm == 'negative_sampling'
55 | else self._vocab_size - 1)
56 |
57 | self.add_weight('syn0',
58 | shape=[self._vocab_size, self._hidden_size],
59 | initializer=tf.keras.initializers.RandomUniform(
60 | minval=-0.5/self._hidden_size,
61 | maxval=0.5/self._hidden_size))
62 |
63 | self.add_weight('syn1',
64 | shape=[self._input_size, self._hidden_size],
65 | initializer=tf.keras.initializers.RandomUniform(
66 | minval=-0.1, maxval=0.1))
67 |
68 | self.add_weight('biases',
69 | shape=[self._input_size],
70 | initializer=tf.keras.initializers.Zeros())
71 |
72 | def call(self, inputs, labels):
73 | """Runs the forward pass to compute loss.
74 |
75 | Args:
76 | inputs: int tensor of shape [batch_size] (skip_gram) or
77 | [batch_size, 2*window_size+1] (cbow)
78 | labels: int tensor of shape [batch_size] (negative_sampling) or
79 | [batch_size, 2*max_depth+1] (hierarchical_softmax)
80 |
81 | Returns:
82 | loss: float tensor, cross entropy loss.
83 | """
84 | if self._algm == 'negative_sampling':
85 | loss = self._negative_sampling_loss(inputs, labels)
86 | elif self._algm == 'hierarchical_softmax':
87 | loss = self._hierarchical_softmax_loss(inputs, labels)
88 | return loss
89 |
90 | def _negative_sampling_loss(self, inputs, labels):
91 | """Builds the loss for negative sampling.
92 |
93 | Args:
94 | inputs: int tensor of shape [batch_size] (skip_gram) or
95 | [batch_size, 2*window_size+1] (cbow)
96 | labels: int tensor of shape [batch_size]
97 |
98 | Returns:
99 | loss: float tensor of shape [batch_size, negatives + 1].
100 | """
101 | _, syn1, biases = self.weights
102 |
103 | sampled_values = tf.random.fixed_unigram_candidate_sampler(
104 | true_classes=tf.expand_dims(labels, 1),
105 | num_true=1,
106 | num_sampled=self._batch_size*self._negatives,
107 | unique=True,
108 | range_max=len(self._unigram_counts),
109 | distortion=self._power,
110 | unigrams=self._unigram_counts)
111 |
112 | sampled = sampled_values.sampled_candidates
113 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives])
114 | inputs_syn0 = self._get_inputs_syn0(inputs) # [batch_size, hidden_size]
115 | true_syn1 = tf.gather(syn1, labels) # [batch_size, hidden_size]
116 | # [batch_size, negatives, hidden_size]
117 | sampled_syn1 = tf.gather(syn1, sampled_mat)
118 | # [batch_size]
119 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1)
120 | # [batch_size, negatives]
121 | sampled_logits = tf.einsum('ijk,ikl->il', tf.expand_dims(inputs_syn0, 1),
122 | tf.transpose(sampled_syn1, (0, 2, 1)))
123 |
124 | if self._add_bias:
125 | # [batch_size]
126 | true_logits += tf.gather(biases, labels)
127 | # [batch_size, negatives]
128 | sampled_logits += tf.gather(biases, sampled_mat)
129 |
130 | # [batch_size]
131 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
132 | labels=tf.ones_like(true_logits), logits=true_logits)
133 | # [batch_size, negatives]
134 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
135 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits)
136 |
137 | loss = tf.concat(
138 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1)
139 | return loss
140 |
141 | def _hierarchical_softmax_loss(self, inputs, labels):
142 | """Builds the loss for hierarchical softmax.
143 |
144 | Args:
145 | inputs: int tensor of shape [batch_size] (skip_gram) or
146 | [batch_size, 2*window_size+1] (cbow)
147 | labels: int tensor of shape [batch_size, 2*max_depth+1]
148 |
149 | Returns:
150 | loss: float tensor of shape [sum_of_code_len]
151 | """
152 | _, syn1, biases = self.weights
153 |
154 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(inputs))
155 | codes_points_list = tf.unstack(labels)
156 | max_depth = (labels.shape.as_list()[1] - 1) // 2
157 | loss = []
158 | for i in range(self._batch_size):
159 | inputs_syn0 = inputs_syn0_list[i] # [hidden_size]
160 | codes_points = codes_points_list[i] # [2*max_depth+1]
161 | true_size = codes_points[-1]
162 |
163 | codes = codes_points[:true_size]
164 | points = codes_points[max_depth:max_depth+true_size]
165 | logits = tf.reduce_sum(
166 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1)
167 | if self._add_bias:
168 | logits += tf.gather(biases, points)
169 |
170 | # [true_size]
171 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits(
172 | labels=tf.cast(codes, 'float32'), logits=logits))
173 | loss = tf.concat(loss, axis=0)
174 | return loss
175 |
176 | def _get_inputs_syn0(self, inputs):
177 | """Builds the activations of hidden layer given input words embeddings
178 | `syn0` and input word indices.
179 |
180 | Args:
181 | inputs: int tensor of shape [batch_size] (skip_gram) or
182 | [batch_size, 2*window_size+1] (cbow)
183 |
184 | Returns:
185 | inputs_syn0: [batch_size, hidden_size]
186 | """
187 | # syn0: [vocab_size, hidden_size]
188 | syn0, _, _ = self.weights
189 | if self._arch == 'skip_gram':
190 | inputs_syn0 = tf.gather(syn0, inputs) # [batch_size, hidden_size]
191 | else:
192 | inputs_syn0 = []
193 | contexts_list = tf.unstack(inputs)
194 | for i in range(self._batch_size):
195 | contexts = contexts_list[i]
196 | context_words = contexts[:-1]
197 | true_size = contexts[-1]
198 | inputs_syn0.append(
199 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0))
200 | inputs_syn0 = tf.stack(inputs_syn0)
201 |
202 | return inputs_syn0
203 |
--------------------------------------------------------------------------------
/tf2.x/run_training.py:
--------------------------------------------------------------------------------
1 | """Train a word2vec model to obtain word embedding vectors.
2 |
3 | There are a total of four combination of architectures and training algorithms
4 | that the model can be trained with:
5 |
6 | architecture:
7 | - skip_gram
8 | - cbow (continuous bag-of-words)
9 |
10 | training algorithm
11 | - negative_sampling
12 | - hierarchical_softmax
13 | """
14 | import os
15 |
16 | import tensorflow as tf
17 | import numpy as np
18 | from absl import app
19 | from absl import flags
20 |
21 | from dataset import WordTokenizer
22 | from dataset import Word2VecDatasetBuilder
23 | from model import Word2VecModel
24 | from word_vectors import WordVectors
25 |
26 | import utils
27 |
28 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).')
29 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm '
30 | '(negative_sampling or hierarchical_softmax).')
31 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate thru corpus.')
32 | flags.DEFINE_integer('batch_size', 256, 'Batch size.')
33 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. If > 0, '
34 | 'the top `max_vocab_size` most frequent words will be kept in vocabulary.')
35 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` will '
36 | 'not be included in the vocabulary.')
37 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.')
38 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side'
39 | ' of target word within a window.')
40 |
41 | flags.DEFINE_integer('hidden_size', 300, 'Length of word vector.')
42 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.')
43 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.')
44 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.')
45 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.')
46 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct '
47 | 'between syn0 and syn1 vectors.')
48 |
49 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to '
50 | ' log the value of loss to be minimized.')
51 | flags.DEFINE_list(
52 | 'filenames', None, 'Names of comma-separated input text files.')
53 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.')
54 |
55 | FLAGS = flags.FLAGS
56 |
57 |
58 | def main(_):
59 | arch = FLAGS.arch
60 | algm = FLAGS.algm
61 | epochs = FLAGS.epochs
62 | batch_size = FLAGS.batch_size
63 | max_vocab_size = FLAGS.max_vocab_size
64 | min_count = FLAGS.min_count
65 | sample = FLAGS.sample
66 | window_size = FLAGS.window_size
67 | hidden_size = FLAGS.hidden_size
68 | negatives = FLAGS.negatives
69 | power = FLAGS.power
70 | alpha = FLAGS.alpha
71 | min_alpha = FLAGS.min_alpha
72 | add_bias = FLAGS.add_bias
73 | log_per_steps = FLAGS.log_per_steps
74 | filenames = FLAGS.filenames
75 | out_dir = FLAGS.out_dir
76 |
77 | tokenizer = WordTokenizer(
78 | max_vocab_size=max_vocab_size, min_count=min_count, sample=sample)
79 | tokenizer.build_vocab(filenames)
80 |
81 | builder = Word2VecDatasetBuilder(tokenizer,
82 | arch=arch,
83 | algm=algm,
84 | epochs=epochs,
85 | batch_size=batch_size,
86 | window_size=window_size)
87 | dataset = builder.build_dataset(filenames)
88 | word2vec = Word2VecModel(tokenizer.unigram_counts,
89 | arch=arch,
90 | algm=algm,
91 | hidden_size=hidden_size,
92 | batch_size=batch_size,
93 | negatives=negatives,
94 | power=power,
95 | alpha=alpha,
96 | min_alpha=min_alpha,
97 | add_bias=add_bias)
98 |
99 | train_step_signature = utils.get_train_step_signature(
100 | arch, algm, batch_size, window_size, builder._max_depth)
101 |   optimizer = tf.keras.optimizers.SGD(1.0)  # decayed lr applied to grads below
102 |
103 | @tf.function(input_signature=train_step_signature)
104 | def train_step(inputs, labels, progress):
105 | loss = word2vec(inputs, labels)
106 | gradients = tf.gradients(loss, word2vec.trainable_variables)
107 |
108 | learning_rate = tf.maximum(alpha * (1 - progress[0]) +
109 | min_alpha * progress[0], min_alpha)
110 |
111 |     if hasattr(gradients[0], '_values'):  # sparse IndexedSlices gradient
112 | gradients[0]._values *= learning_rate
113 | else:
114 | gradients[0] *= learning_rate
115 |
116 | if hasattr(gradients[1], '_values'):
117 | gradients[1]._values *= learning_rate
118 | else:
119 | gradients[1] *= learning_rate
120 |
121 | if hasattr(gradients[2], '_values'):
122 | gradients[2]._values *= learning_rate
123 | else:
124 | gradients[2] *= learning_rate
125 |
126 | optimizer.apply_gradients(
127 | zip(gradients, word2vec.trainable_variables))
128 |
129 | return loss, learning_rate
130 |
131 | average_loss = 0.
132 | for step, (inputs, labels, progress) in enumerate(dataset):
133 | loss, learning_rate = train_step(inputs, labels, progress)
134 | average_loss += loss.numpy().mean()
135 | if step % log_per_steps == 0:
136 | if step > 0:
137 | average_loss /= log_per_steps
138 | print('step:', step, 'average_loss:', average_loss,
139 | 'learning_rate:', learning_rate.numpy())
140 | average_loss = 0.
141 |
142 | syn0_final = word2vec.weights[0].numpy()
143 |   np.save(os.path.join(out_dir, 'syn0_final'), syn0_final)
144 |   with tf.io.gfile.GFile(os.path.join(out_dir, 'vocab.txt'), 'w') as f:
145 |     for w in tokenizer.table_words:
146 |       f.write(w + '\n')
147 |   print('Word embeddings saved to',
148 |       os.path.join(out_dir, 'syn0_final.npy'))
149 |   print('Vocabulary saved to', os.path.join(out_dir, 'vocab.txt'))
150 |
151 |
152 | if __name__ == '__main__':
153 | flags.mark_flag_as_required('filenames')
154 | app.run(main)
155 |
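A note on the learning rate handling in `train_step` above: the rate decays linearly from `--alpha` to `--min_alpha` as `progress` (the fraction of training consumed) goes from 0 to 1, and because the optimizer is `SGD(1.0)`, the decayed rate is multiplied directly into the gradients (including the `_values` of sparse `IndexedSlices` gradients). A small sketch of the schedule, assuming the default flag values:
```
# Sketch of the linear decay used in train_step; the defaults --alpha=0.025 and
# --min_alpha=0.0001 are assumed here.
def decayed_learning_rate(progress, alpha=0.025, min_alpha=0.0001):
  # interpolate from alpha (progress=0) to min_alpha (progress=1), clipped so
  # the rate never drops below min_alpha
  return max(alpha * (1.0 - progress) + min_alpha * progress, min_alpha)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
  print(p, decayed_learning_rate(p))  # 0.025 at the start, 0.0001 at the end
```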
--------------------------------------------------------------------------------
/tf2.x/sample_corpus.txt:
--------------------------------------------------------------------------------
1 | # one sentence per line, with words (lower case) delimited by single space
2 |
3 | with all this stuff going down at the moment with mj i 've started listening to his music , watching the odd documentary here and there , watched the wiz and watched moonwalker again .
4 | maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent .
5 | moonwalker is part biography , part feature film which i remember going to see at the cinema when it was originally released .
6 | some of it has subtle messages about mj 's feeling towards the press and also the obvious message of drugs are bad m'kay .
7 | visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring .
8 |
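The first line above describes the expected input format: one lower-cased sentence per line, tokens separated by a single space. As a purely illustrative sketch (not part of the repository, and using only a naive regex split where a real pipeline would use a proper tokenizer), raw text could be normalized into that format along these lines:
```
import re

# Illustrative conversion of raw text into one-sentence-per-line, space-delimited
# lower-cased tokens; the regexes here are simplistic placeholders.
def to_corpus_lines(raw_text):
  # naive sentence split on ., ! and ?, keeping the terminator with the sentence
  sentences = re.findall(r'[^.!?]+[.!?]?', raw_text)
  lines = []
  for sent in sentences:
    tokens = re.findall(r"[\w']+|[.,!?;]", sent.lower())
    if tokens:
      lines.append(' '.join(tokens))
  return lines

print('\n'.join(to_corpus_lines(
    "Visually impressive! But of course this is all about Michael Jackson.")))
```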
--------------------------------------------------------------------------------
/tf2.x/utils.py:
--------------------------------------------------------------------------------
1 | """Defines utility functions.
2 | """
3 | import tensorflow as tf
4 |
5 |
6 | def get_train_step_signature(
7 | arch, algm, batch_size, window_size=None, max_depth=None):
8 | """Get the training step signatures for `inputs`, `labels` and `progress`
9 | tensor.
10 |
11 | Args:
12 | arch: string scalar, architecture ('skip_gram' or 'cbow').
13 | algm: string scalar, training algorithm ('negative_sampling' or
14 | 'hierarchical_softmax').
15 |
16 | Returns:
17 | train_step_signature: a list of three tf.TensorSpec instances,
18 | specifying the tensor spec (shape and dtype) for `inputs`, `labels` and
19 | `progress`.
20 | """
21 |   if arch == 'skip_gram':
22 | inputs_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64')
23 | elif arch == 'cbow':
24 | inputs_spec = tf.TensorSpec(
25 | shape=(batch_size, 2*window_size+1), dtype='int64')
26 | else:
27 | raise ValueError('`arch` must be either "skip_gram" or "cbow".')
28 |
29 | if algm == 'negative_sampling':
30 | labels_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64')
31 | elif algm == 'hierarchical_softmax':
32 | labels_spec = tf.TensorSpec(
33 | shape=(batch_size, 2*max_depth+1), dtype='int64')
34 | else:
35 | raise ValueError('`algm` must be either "negative_sampling" or '
36 | '"hierarchical_softmax".')
37 |
38 | progress_spec = tf.TensorSpec(shape=(batch_size,), dtype='float32')
39 |
40 | train_step_signature = [inputs_spec, labels_spec, progress_spec]
41 | return train_step_signature
42 |
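For illustration, the specs produced for two configurations are sketched below. This assumes the `tf2.x` directory is on the Python path; `max_depth=20` is an arbitrary example value, where in `run_training.py` it comes from `Word2VecDatasetBuilder._max_depth`.
```
from utils import get_train_step_signature

# skip_gram + negative_sampling: flat int64 vectors plus float32 progress
specs = get_train_step_signature('skip_gram', 'negative_sampling', batch_size=256)
# inputs: (256,) int64, labels: (256,) int64, progress: (256,) float32

# cbow + hierarchical_softmax: padded context windows and padded codes/points
specs = get_train_step_signature(
    'cbow', 'hierarchical_softmax', batch_size=256, window_size=10, max_depth=20)
# inputs: (256, 2*10+1) int64, labels: (256, 2*20+1) int64, progress: (256,) float32
print(specs)
```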
--------------------------------------------------------------------------------
/tf2.x/word_vectors.py:
--------------------------------------------------------------------------------
1 | """Defines wrapper class for final word vectors.
2 | """
3 | import heapq
4 | import numpy as np
5 |
6 |
7 | class WordVectors(object):
8 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving
9 | word vector, and most similar words given a query word.
10 | """
11 | def __init__(self, syn0_final, vocab):
12 | """Constructor.
13 |
14 | Args:
15 | syn0_final: numpy array of shape [vocab_size, embed_size], final word
16 | embeddings.
17 | vocab: a list of strings, holding vocabulary words.
18 | """
19 | self._syn0_final = syn0_final
20 | self._vocab = vocab
21 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)])
22 |
23 | def __contains__(self, word):
24 | return word in self._rev_vocab
25 |
26 | def __getitem__(self, word):
27 | return self._syn0_final[self._rev_vocab[word]]
28 |
29 | def most_similar(self, word, k):
30 | """Finds the top-k words with smallest cosine distances w.r.t `word`.
31 |
32 | Args:
33 | word: string scalar, the query word.
34 | k: int scalar, num of words most similar to `word`.
35 |
36 | Returns:
37 | a list of 2-tuples with word and cosine similarities.
38 | """
39 | if word not in self._rev_vocab:
40 | raise ValueError("Word '%s' not found in the vocabulary" % word)
41 | if k >= self._syn0_final.shape[0]:
42 | raise ValueError("k = %d greater than vocabulary size" % k)
43 |
44 | v0 = self._syn0_final[self._rev_vocab[word]]
45 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) *
46 | np.linalg.norm(self._syn0_final, axis=1))
47 |
48 | # maintain a sliding min-heap to keep track of k+1 largest elements
49 | min_pq = list(zip(sims[:k+1], range(k+1)))
50 | heapq.heapify(min_pq)
51 | for i in np.arange(k + 1, len(self._vocab)):
52 | if sims[i] > min_pq[0][0]:
53 | min_pq[0] = sims[i], i
54 | heapq.heapify(min_pq)
55 | min_pq = sorted(min_pq, key=lambda p: -p[0])
56 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]]
57 |
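A minimal usage sketch of `WordVectors`, assuming `run_training.py` has already written `syn0_final.npy` and `vocab.txt` to the default `--out_dir=/tmp/word2vec`; 'actor' is just an example query word.
```
import numpy as np

from word_vectors import WordVectors

# Load the artifacts written by run_training.py (paths assumed, see --out_dir).
syn0_final = np.load('/tmp/word2vec/syn0_final.npy')
with open('/tmp/word2vec/vocab.txt') as f:
  vocab = [line.strip() for line in f]

wv = WordVectors(syn0_final, vocab)
if 'actor' in wv:
  print(wv['actor'].shape)             # (hidden_size,)
  print(wv.most_similar('actor', 10))  # [(word, cosine similarity), ...]
```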
--------------------------------------------------------------------------------
/word2vec.py:
--------------------------------------------------------------------------------
1 | import heapq
2 |
3 | import numpy as np
4 | import tensorflow as tf
5 |
6 |
7 | class Word2VecModel(object):
8 | """Word2VecModel.
9 | """
10 |
11 | def __init__(self, arch, algm, embed_size, batch_size, negatives, power,
12 | alpha, min_alpha, add_bias, random_seed):
13 | """Constructor.
14 |
15 | Args:
16 | arch: string scalar, architecture ('skip_gram' or 'cbow').
17 | algm: string scalar, training algorithm ('negative_sampling' or
18 | 'hierarchical_softmax').
19 | embed_size: int scalar, length of word vector.
20 | batch_size: int scalar, batch size.
21 | negatives: int scalar, num of negative words to sample.
22 | power: float scalar, distortion for negative sampling.
23 | alpha: float scalar, initial learning rate.
24 | min_alpha: float scalar, final learning rate.
25 |       add_bias: bool scalar, whether to add a bias term to the dot product
26 | between syn0 and syn1 vectors.
27 | random_seed: int scalar, random_seed.
28 | """
29 | self._arch = arch
30 | self._algm = algm
31 | self._embed_size = embed_size
32 | self._batch_size = batch_size
33 | self._negatives = negatives
34 | self._power = power
35 | self._alpha = alpha
36 | self._min_alpha = min_alpha
37 | self._add_bias = add_bias
38 | self._random_seed = random_seed
39 |
40 | self._syn0 = None
41 |
42 | @property
43 | def syn0(self):
44 | return self._syn0
45 |
46 | def _build_loss(self, inputs, labels, unigram_counts, scope=None):
47 | """Builds the graph that leads from data tensors (`inputs`, `labels`)
48 | to loss. Has the side effect of setting attribute `syn0`.
49 |
50 | Args:
51 | inputs: int tensor of shape [batch_size] (skip_gram) or
52 | [batch_size, 2*window_size+1] (cbow)
53 | labels: int tensor of shape [batch_size] (negative_sampling) or
54 | [batch_size, 2*max_depth+1] (hierarchical_softmax)
55 |       unigram_counts: list of int, holding word counts. Index of each entry
56 | is the same as the word index into the vocabulary.
57 | scope: string scalar, scope name.
58 |
59 | Returns:
60 | loss: float tensor, cross entropy loss.
61 | """
62 | syn0, syn1, biases = self._create_embeddings(len(unigram_counts))
63 | self._syn0 = syn0
64 | with tf.variable_scope(scope, 'Loss', [inputs, labels, syn0, syn1, biases]):
65 | if self._algm == 'negative_sampling':
66 | loss = self._negative_sampling_loss(
67 | unigram_counts, inputs, labels, syn0, syn1, biases)
68 | elif self._algm == 'hierarchical_softmax':
69 | loss = self._hierarchical_softmax_loss(
70 | inputs, labels, syn0, syn1, biases)
71 | return loss
72 |
73 | def train(self, dataset, filenames):
74 | """Adds training related ops to the graph.
75 |
76 | Args:
77 | dataset: a `Word2VecDataset` instance.
78 | filenames: a list of strings, holding names of text files.
79 |
80 | Returns:
81 | to_be_run_dict: dict mapping from names to tensors/operations, holding
82 | the following entries:
83 | { 'grad_update_op': optimization ops,
84 | 'loss': cross entropy loss,
85 | 'learning_rate': float-scalar learning rate}
86 | """
87 | tensor_dict = dataset.get_tensor_dict(filenames)
88 | inputs, labels = tensor_dict['inputs'], tensor_dict['labels']
89 | global_step = tf.train.get_or_create_global_step()
90 | learning_rate = tf.maximum(self._alpha * (1 - tensor_dict['progress'][0]) +
91 | self._min_alpha * tensor_dict['progress'][0], self._min_alpha)
92 |
93 | loss = self._build_loss(inputs, labels, dataset.unigram_counts)
94 | optimizer = tf.train.GradientDescentOptimizer(learning_rate)
95 | grad_update_op = optimizer.minimize(loss, global_step=global_step)
96 |
97 | to_be_run_dict = {'grad_update_op': grad_update_op,
98 | 'loss': loss,
99 | 'learning_rate': learning_rate}
100 | return to_be_run_dict
101 |
102 | def _create_embeddings(self, vocab_size, scope=None):
103 | """Creates initial word embedding variables.
104 |
105 | Args:
106 | vocab_size: int scalar, num of words in vocabulary.
107 | scope: string scalar, scope name.
108 |
109 | Returns:
110 | syn0: float tensor of shape [vocab_size, embed_size], input word
111 | embeddings (i.e. weights of hidden layer).
112 | syn1: float tensor of shape [syn1_rows, embed_size], output word
113 | embeddings (i.e. weights of output layer).
114 | biases: float tensor of shape [syn1_rows], biases added onto the logits.
115 | """
116 | syn1_rows = (vocab_size if self._algm == 'negative_sampling'
117 | else vocab_size - 1)
118 | with tf.variable_scope(scope, 'Embedding'):
119 | syn0 = tf.get_variable('syn0', initializer=tf.random_uniform([vocab_size,
120 | self._embed_size], -0.5/self._embed_size, 0.5/self._embed_size,
121 | seed=self._random_seed))
122 | syn1 = tf.get_variable('syn1', initializer=tf.random_uniform([syn1_rows,
123 | self._embed_size], -0.1, 0.1))
124 | biases = tf.get_variable('biases', initializer=tf.zeros([syn1_rows]))
125 | return syn0, syn1, biases
126 |
127 | def _negative_sampling_loss(
128 | self, unigram_counts, inputs, labels, syn0, syn1, biases):
129 | """Builds the loss for negative sampling.
130 |
131 | Args:
132 | unigram_counts: list of int, holding word counts. Index of each entry
133 | is the same as the word index into the vocabulary.
134 | inputs: int tensor of shape [batch_size] (skip_gram) or
135 | [batch_size, 2*window_size+1] (cbow)
136 | labels: int tensor of shape [batch_size]
137 | syn0: float tensor of shape [vocab_size, embed_size], input word
138 | embeddings (i.e. weights of hidden layer).
139 | syn1: float tensor of shape [syn1_rows, embed_size], output word
140 | embeddings (i.e. weights of output layer).
141 | biases: float tensor of shape [syn1_rows], biases added onto the logits.
142 |
143 | Returns:
144 |       loss: float tensor of shape [batch_size, negatives + 1].
145 | """
146 | sampled_values = tf.nn.fixed_unigram_candidate_sampler(
147 | true_classes=tf.expand_dims(labels, 1),
148 | num_true=1,
149 | num_sampled=self._batch_size*self._negatives,
150 | unique=True,
151 | range_max=len(unigram_counts),
152 | distortion=self._power,
153 | unigrams=unigram_counts)
154 |
155 | sampled = sampled_values.sampled_candidates
156 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives])
157 | inputs_syn0 = self._get_inputs_syn0(syn0, inputs) # [N, D]
158 | true_syn1 = tf.gather(syn1, labels) # [N, D]
159 | sampled_syn1 = tf.gather(syn1, sampled_mat) # [N, K, D]
160 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1) # [N]
161 | sampled_logits = tf.reduce_sum(
162 | tf.multiply(tf.expand_dims(inputs_syn0, 1), sampled_syn1), 2) # [N, K]
163 |
164 | if self._add_bias:
165 | true_logits += tf.gather(biases, labels) # [N]
166 | sampled_logits += tf.gather(biases, sampled_mat) # [N, K]
167 |
168 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
169 | labels=tf.ones_like(true_logits), logits=true_logits)
170 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
171 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits)
172 | loss = tf.concat(
173 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1)
174 | return loss
175 |
176 | def _hierarchical_softmax_loss(self, inputs, labels, syn0, syn1, biases):
177 | """Builds the loss for hierarchical softmax.
178 |
179 | Args:
180 | inputs: int tensor of shape [batch_size] (skip_gram) or
181 | [batch_size, 2*window_size+1] (cbow)
182 | labels: int tensor of shape [batch_size, 2*max_depth+1]
183 | syn0: float tensor of shape [vocab_size, embed_size], input word
184 | embeddings (i.e. weights of hidden layer).
185 | syn1: float tensor of shape [syn1_rows, embed_size], output word
186 | embeddings (i.e. weights of output layer).
187 | biases: float tensor of shape [syn1_rows], biases added onto the logits.
188 |
189 | Returns:
190 | loss: float tensor of shape [sum_of_code_len]
191 | """
192 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(syn0, inputs))
193 | codes_points_list = tf.unstack(labels)
194 | max_depth = (labels.shape.as_list()[1] - 1) // 2
195 | loss = []
196 | for inputs_syn0, codes_points in zip(inputs_syn0_list, codes_points_list):
197 | true_size = codes_points[-1]
198 | codes = codes_points[:true_size]
199 | points = codes_points[max_depth:max_depth+true_size]
200 |
201 | logits = tf.reduce_sum(
202 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1)
203 | if self._add_bias:
204 | logits += tf.gather(biases, points)
205 |
206 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits(
207 | labels=tf.to_float(codes), logits=logits))
208 | loss = tf.concat(loss, axis=0)
209 | return loss
210 |
211 | def _get_inputs_syn0(self, syn0, inputs):
212 | """Builds the activations of hidden layer given input words embeddings
213 | `syn0` and input word indices.
214 |
215 | Args:
216 | syn0: float tensor of shape [vocab_size, embed_size]
217 | inputs: int tensor of shape [batch_size] (skip_gram) or
218 | [batch_size, 2*window_size+1] (cbow)
219 |
220 | Returns:
221 | inputs_syn0: [batch_size, embed_size]
222 | """
223 | if self._arch == 'skip_gram':
224 | inputs_syn0 = tf.gather(syn0, inputs)
225 | else:
226 | inputs_syn0 = []
227 | contexts_list = tf.unstack(inputs)
228 | for contexts in contexts_list:
229 | context_words = contexts[:-1]
230 | true_size = contexts[-1]
231 | inputs_syn0.append(
232 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0))
233 | inputs_syn0 = tf.stack(inputs_syn0)
234 | return inputs_syn0
235 |
236 |
237 | class WordVectors(object):
238 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving
239 | word vector, and most similar words given a query word.
240 | """
241 | def __init__(self, syn0_final, vocab):
242 | """Constructor.
243 |
244 | Args:
245 | syn0_final: numpy array of shape [vocab_size, embed_size], final word
246 | embeddings.
247 |       vocab: a list of strings, holding vocabulary words.
248 | """
249 | self._syn0_final = syn0_final
250 | self._vocab = vocab
251 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)])
252 |
253 | def __contains__(self, word):
254 | return word in self._rev_vocab
255 |
256 | def __getitem__(self, word):
257 | return self._syn0_final[self._rev_vocab[word]]
258 |
259 | def most_similar(self, word, k):
260 | """Finds the top-k words with smallest cosine distances w.r.t `word`.
261 |
262 | Args:
263 | word: string scalar, the query word.
264 | k: int scalar, num of words most similar to `word`.
265 |
266 | Returns:
267 | a list of 2-tuples with word and cosine similarities.
268 | """
269 | if word not in self._rev_vocab:
270 | raise ValueError("Word '%s' not found in the vocabulary" % word)
271 | if k >= self._syn0_final.shape[0]:
272 | raise ValueError("k = %d greater than vocabulary size" % k)
273 |
274 | v0 = self._syn0_final[self._rev_vocab[word]]
275 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) *
276 | np.linalg.norm(self._syn0_final, axis=1))
277 |
278 | # maintain a sliding min-heap to keep track of k+1 largest elements
279 | min_pq = list(zip(sims[:k+1], range(k+1)))
280 | heapq.heapify(min_pq)
281 | for i in np.arange(k + 1, len(self._vocab)):
282 | if sims[i] > min_pq[0][0]:
283 | min_pq[0] = sims[i], i
284 | heapq.heapify(min_pq)
285 | min_pq = sorted(min_pq, key=lambda p: -p[0])
286 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]]
287 |
288 |
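To make the shape bookkeeping in `_negative_sampling_loss` concrete, here is a small NumPy sketch of the same computation with the candidate sampler replaced by a fixed set of negatives; all shapes and values are illustrative. Each target word yields one positive logit against its true context word and `negatives` logits against sampled words, each pushed through a sigmoid cross-entropy.
```
import numpy as np

def sigmoid_xent(labels, logits):
  # numerically stable sigmoid cross-entropy (same formula TensorFlow uses)
  return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# toy setup: batch of 2 skip-gram pairs, embed_size 4, 3 negatives per target
rng = np.random.default_rng(0)
syn0 = rng.normal(size=(10, 4))          # input word embeddings
syn1 = rng.normal(size=(10, 4))          # output word embeddings
inputs = np.array([3, 7])                # target word indices
labels = np.array([5, 1])                # true context word indices
sampled_mat = np.array([[0, 2, 9],       # fixed negatives standing in for
                        [4, 6, 8]])      # fixed_unigram_candidate_sampler

inputs_syn0 = syn0[inputs]                                 # [N, D]
true_logits = np.sum(inputs_syn0 * syn1[labels], axis=1)   # [N]
sampled_logits = np.einsum(
    'nd,nkd->nk', inputs_syn0, syn1[sampled_mat])          # [N, K]

loss = np.concatenate(
    [sigmoid_xent(1.0, true_logits)[:, None],     # positives labelled 1
     sigmoid_xent(0.0, sampled_logits)], axis=1)  # negatives labelled 0
print(loss.shape)  # (2, 4) == [batch_size, negatives + 1]
```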
--------------------------------------------------------------------------------