├── README.md ├── __init__.py ├── dataset.py ├── files ├── cbow_hs.png ├── cbow_ns.png ├── huffman.png ├── sent.png ├── sg_hs.png └── sg_ns.png ├── run_training.py ├── tf2.x ├── README.md ├── dataset.py ├── demo_word_similarity.py ├── model.py ├── run_training.py ├── sample_corpus.txt ├── utils.py └── word_vectors.py └── word2vec.py /README.md: -------------------------------------------------------------------------------- 1 | # Word2Vec: Learning distributed word representation from unlabeled text. 2 | 3 | **Update**: [TensorFlow 2.x](tf2.x) 4 | 5 | Word2Vec is a classic model for learning distributed word representation from a large unlabeled dataset. There have been many implementations out there since its introduction (e.g. the original C implementation and the gensim implementation). This is an attempt to reimplement word2vec in TensorFlow using the `tf.data.Dataset` APIs, a recommended way to streamline data preprocessing for TensorFlow models. 6 | 7 | ### Usage 8 | 1. Clone the repository. 9 | ``` 10 | git clone git@github.com:chao-ji/tf-word2vec.git 11 | ``` 12 | 2. Prepare your data. 13 | Your data should be a number of text files where each line contains a sentence, and words are delimited by spaces. 14 | 15 | 3. Parameter settings. 16 | This implementation allows you to train the model under the *skip gram* or *continuous bag-of-words* architecture (`--arch`), and perform training using *negative sampling* or *hierarchical softmax* (`--algm`). To see a full list of parameters, run `python run_training.py --help`. 17 | 18 | 4. Run. 19 | Example: 20 | ``` 21 | python run_training.py \ 22 | --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \ 23 | --out_dir=/PATH/TO/OUT_DIR/ \ 24 | --epochs=5 \ 25 | --batch_size=64 \ 26 | --window_size=5 27 | ``` 28 | The vocabulary words and word embeddings will be saved to `vocab.txt` and `embed.npy` (which can be loaded using `np.load`). 29 | 30 | ### Sample results 31 | 32 | The model was trained on the IMDB movie review dataset using the following parameters: 33 | 34 | ``` 35 | --arch=skip_gram --algm=negative_sampling --batch_size=256 --max_vocab_size=0 --min_count=10 --sample=1e-3 --window_size=10 --embed_size=300 --negatives=5 --power=0.75 --alpha=0.025 --min_alpha=0.0001 --epochs=5 36 | ``` 37 | 38 | Below is a sample list of queries with their most similar words. 39 | ``` 40 | query: actor 41 | [('actors', 0.5314413), 42 | ('actress', 0.52641004), 43 | ('performer', 0.43144277), 44 | ('role', 0.40702546), 45 | ('comedian', 0.3910208), 46 | ('performance', 0.37695402), 47 | ('versatile', 0.35130078), 48 | ('actresses', 0.32896513), 49 | ('cast', 0.3219274), 50 | ('performers', 0.31659046)] 51 | ``` 52 | 53 | ``` 54 | query: .
55 | [('!', 0.6234603), 56 | ('?', 0.39236775), 57 | ('and', 0.36783764), 58 | (',', 0.3090561), 59 | ('but', 0.28012913), 60 | ('which', 0.23897173), 61 | (';', 0.22881404), 62 | ('cornerstone', 0.20761433), 63 | ('although', 0.20554386), 64 | ('...', 0.19846405)] 65 | 66 | ``` 67 | 68 | ``` 69 | query: ask 70 | [('asked', 0.54287535), 71 | ('asking', 0.5349437), 72 | ('asks', 0.5262491), 73 | ('question', 0.4397335), 74 | ('answer', 0.3868001), 75 | ('questions', 0.37007764), 76 | ('begs', 0.35407144), 77 | ('wonder', 0.3537388), 78 | ('answers', 0.3410588), 79 | ('wondering', 0.32832426)] 80 | ``` 81 | 82 | ``` 83 | query: you 84 | [('yourself', 0.51918006), 85 | ('u', 0.48620683), 86 | ('your', 0.47644556), 87 | ("'ll", 0.38544628), 88 | ('ya', 0.35932386), 89 | ('we', 0.35398778), 90 | ('i', 0.34099358), 91 | ('unless', 0.3306447), 92 | ('if', 0.3237356), 93 | ("'re", 0.32068467)] 94 | ``` 95 | 96 | ``` 97 | query: amazing 98 | [('incredible', 0.6467944), 99 | ('fantastic', 0.5760295), 100 | ('excellent', 0.56906724), 101 | ('awesome', 0.5625062), 102 | ('wonderful', 0.52154255), 103 | ('extraordinary', 0.519134), 104 | ('remarkable', 0.50572175), 105 | ('outstanding', 0.5042475), 106 | ('superb', 0.5008434), 107 | ('brilliant', 0.47915617)] 108 | ``` 109 | ### Building dataset pipeline 110 | 111 | Here is a concrete example of converting a raw sentence into matrices holding the data to train Word2Vec model with either `skip_gram` or `cbow` architecture. 112 | 113 | Suppose we have a sentence in the corpus: `the quick brown fox jumps over the lazy dog`, with the window sizes (max num of words to the left or right of target word) below the words. Assume that the sentence has already been subsampled and words mapped to indices. 114 | 115 | We call each of the word in the sentence **target word**, and those words within the window centered at target word **context words**. For example, `quick` and `brown` are context words of target word `the`, and `the`, `brown`, `fox` are context words of target word `quick`. 116 | 117 |

![Example sentence with per-word window sizes](files/sent.png)
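To make the target/context relationship concrete, here is a minimal plain-Python sketch (not part of the repository) that lists the context words of each target word in the example sentence, assuming a fixed window size of 2 and ignoring subsampling and the random window shrinking used during training:

```
sentence = "the quick brown fox jumps over the lazy dog".split()
window_size = 2  # max num of words on the left or right of the target word

for i, target in enumerate(sentence):
    # context words are those within the window centered at the target word
    left = sentence[max(i - window_size, 0):i]
    right = sentence[i + 1:i + 1 + window_size]
    print(target, '->', left + right)
```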

For `skip_gram`, the task is to predict the context words given the target word. The index of each target word is simply replicated to match the number of its context words. This becomes our **input matrix**.

![Skip gram, negative sampling](files/sg_ns.png)
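The sketch below (NumPy, with toy word indices and a fixed window size; the actual `generate_instances` in `dataset.py` additionally shrinks the window by a random amount per target word) shows how such a skip-gram instance matrix could be assembled, with the target index in the first column and the context index in the second:

```
import numpy as np

indices = [0, 1, 2, 3, 4]  # a toy subsampled sentence, words already mapped to indices
window_size = 2

rows = []
for i, target in enumerate(indices):
    context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
    # replicate the target index once per context word
    rows.extend([target, c] for c in context)

instances = np.array(rows)  # shape [N, 2]
inputs, labels = instances[:, 0], instances[:, 1]  # inputs and labels for negative sampling
```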

For `cbow`, the task is to predict the target word given its context words. Because each target word may have a variable number of context words, we pad the list of context words to the maximum possible size (`2*window_size`) and append the true number of context words.

![Continuous bag of words, negative sampling](files/cbow_ns.png)
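Here is an analogous NumPy sketch of the cbow layout for the same toy word indices (again with a fixed window size); each row holds the context indices zero-padded to `2*window_size`, followed by the true number of context words and the target word:

```
import numpy as np

indices = [0, 1, 2, 3, 4]  # a toy subsampled sentence, words already mapped to indices
window_size = 2

rows = []
for i, target in enumerate(indices):
    context = indices[max(i - window_size, 0):i] + indices[i + 1:i + 1 + window_size]
    true_size = len(context)
    padded = context + [0] * (2 * window_size - true_size)  # pad context to 2*window_size
    rows.append(padded + [true_size, target])  # one row per target word

instances = np.array(rows)  # shape [N, 2*window_size + 2]
inputs, labels = instances[:, :-1], instances[:, -1]  # inputs and labels for negative sampling
```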

If the training algorithm is `negative_sampling`, we simply populate the **label matrix** with the indices of the words to be predicted: context words for `skip_gram`, or target words for `cbow`.

If the training algorithm is `hierarchical_softmax`, a Huffman tree is built for the collection of vocabulary words. Each vocabulary word is associated with exactly one leaf node, and the words to be predicted in the case of `negative_sampling` are replaced by a sequence of `codes` and `points` determined by the internal nodes along the root-to-leaf path. For example, `E`'s `codes` and `points` would be `1`, `0`, `1`, `0` and `3782`, `8435`, `590`, `7103`, respectively. We populate the **label matrix** with the `codes` and `points` (each padded up to `max_depth`), along with the true length of `codes`/`points`. A toy sketch of this construction follows the figures below.

![Huffman tree](files/huffman.png)

![Skip gram, hierarchical softmax](files/sg_hs.png)

![Continuous bag of words, hierarchical softmax](files/cbow_hs.png)
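To illustrate where `codes` and `points` come from, below is a simplified, self-contained sketch of the Huffman-tree construction over a toy vocabulary with made-up counts. It mirrors the idea behind `_build_binary_tree` in `dataset.py`, but omits the padding of `codes` and `points` to `max_depth`:

```
import heapq

# toy vocabulary: word indices 0..4 with made-up counts, sorted by descending frequency
unigram_counts = [50, 30, 20, 15, 5]
vocab_size = len(unigram_counts)

# build the Huffman tree by repeatedly merging the two least frequent nodes
heap = [[count, index] for index, count in enumerate(unigram_counts)]
heapq.heapify(heap)
for i in range(vocab_size - 1):
    lo, hi = heapq.heappop(heap), heapq.heappop(heap)
    heapq.heappush(heap, [lo[0] + hi[0], i + vocab_size, lo, hi])

# traverse the tree: `code` collects the 0/1 branch labels and `point` the
# internal-node indices along the path from the root down to each leaf (vocab word)
codes, points = {}, {}
stack = [(heap[0], [], [])]
while stack:
    node, code, point = stack.pop()
    if node[1] < vocab_size:  # leaf node, i.e. a vocabulary word
        codes[node[1]], points[node[1]] = code, point
    else:                     # internal node
        point = point + [node[1] - vocab_size]
        stack.append((node[2], code + [0], point))
        stack.append((node[3], code + [1], point))

for word_index in range(vocab_size):
    print(word_index, 'codes:', codes[word_index], 'points:', points[word_index])
```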

159 | 160 | In summary, an **input matrix** and a **label matrix** is created from a raw input sentence that provides the input and label information for the prediction task. 161 | 162 | 163 | 164 | ### Reference 165 | 1. T Mikolov, K Chen, G Corrado, J Dean - Efficient Estimation of Word Representations in Vector Space, ICLR 2013 166 | 2. T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean - Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013 167 | 3. Original implementation by Mikolov, https://code.google.com/archive/p/word2vec/ 168 | 4. Gensim implementation by Radim Řehůřek, https://radimrehurek.com/gensim/models/word2vec.html 169 | 5. IMDB Movie Review dataset, http://ai.stanford.edu/~amaas/data/sentiment/ 170 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/__init__.py -------------------------------------------------------------------------------- /dataset.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import itertools 3 | import collections 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from functools import partial 9 | 10 | OOV_ID = -1 11 | 12 | 13 | class Word2VecDataset(object): 14 | """Dataset for generating matrices holding word indices to train Word2Vec 15 | models. 16 | """ 17 | def __init__(self, 18 | arch='skip_gram', 19 | algm='negative_sampling', 20 | epochs=5, 21 | batch_size=100, 22 | max_vocab_size=0, 23 | min_count=2, 24 | sample=1e-3, 25 | window_size=5): 26 | """Constructor. 27 | 28 | Args: 29 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 30 | algm: string scalar: training algorithm ('negative_sampling' or 31 | 'hierarchical_softmax'). 32 | epochs: int scalar, num times the dataset is iterated. 33 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have 34 | shapes [batch_size, :]. 35 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, the top 36 | `max_vocab_size` most frequent words are kept in vocabulary. 37 | min_count: int scalar, words whose counts < `min_count` are not included 38 | in the vocabulary. 39 | sample: float scalar, subsampling rate. 40 | window_size: int scalar, num of words on the left or right side of 41 | target word within a window. 42 | """ 43 | self._arch = arch 44 | self._algm = algm 45 | self._epochs = epochs 46 | self._batch_size = batch_size 47 | self._max_vocab_size = max_vocab_size 48 | self._min_count = min_count 49 | self._sample = sample 50 | self._window_size = window_size 51 | 52 | self._iterator_initializer = None 53 | self._table_words = None 54 | self._unigram_counts = None 55 | self._keep_probs = None 56 | self._corpus_size = None 57 | self._max_depth = None 58 | 59 | @property 60 | def iterator_initializer(self): 61 | return self._iterator_initializer 62 | 63 | @property 64 | def table_words(self): 65 | return self._table_words 66 | 67 | @property 68 | def unigram_counts(self): 69 | return self._unigram_counts 70 | 71 | def _build_raw_vocab(self, filenames): 72 | """Builds raw vocabulary. 73 | 74 | Args: 75 | filenames: list of strings, holding names of text files. 76 | 77 | Returns: 78 | raw_vocab: a list of 2-tuples holding the word (string) and count (int), 79 | sorted in descending order of word count. 
80 | """ 81 | map_open = partial(open, encoding="utf-8") 82 | lines = itertools.chain(*map(map_open, filenames)) 83 | raw_vocab = collections.Counter() 84 | for line in lines: 85 | raw_vocab.update(line.strip().split()) 86 | raw_vocab = raw_vocab.most_common() 87 | if self._max_vocab_size > 0: 88 | raw_vocab = raw_vocab[:self._max_vocab_size] 89 | return raw_vocab 90 | 91 | def build_vocab(self, filenames): 92 | """Builds vocabulary. 93 | 94 | Has the side effect of setting the following attributes: 95 | - table_words: list of string, holding the list of vocabulary words. Index 96 | of each entry is the same as the word index into the vocabulary. 97 | - unigram_counts: list of int, holding word counts. Index of each entry 98 | is the same as the word index into the vocabulary. 99 | - keep_probs: list of float, holding words' keep prob for subsampling. 100 | Index of each entry is the same as the word index into the vocabulary. 101 | - corpus_size: int scalar, effective corpus size. 102 | 103 | Args: 104 | filenames: list of strings, holding names of text files. 105 | """ 106 | raw_vocab = self._build_raw_vocab(filenames) 107 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count] 108 | self._corpus_size = sum(list(zip(*raw_vocab))[1]) 109 | 110 | self._table_words = [] 111 | self._unigram_counts = [] 112 | self._keep_probs = [] 113 | for word, count in raw_vocab: 114 | frac = count / float(self._corpus_size) 115 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac) 116 | keep_prob = np.minimum(keep_prob, 1.0) 117 | self._table_words.append(word) 118 | self._unigram_counts.append(count) 119 | self._keep_probs.append(keep_prob) 120 | 121 | def _build_binary_tree(self, unigram_counts): 122 | """Builds a Huffman tree for hierarchical softmax. Has the side effect 123 | of setting `max_depth`. 124 | 125 | Args: 126 | unigram_counts: list of int, holding word counts. Index of each entry 127 | is the same as the word index into the vocabulary. 128 | 129 | Returns: 130 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1] 131 | where each row holds the codes (0-1 binary values) padded to 132 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`, 133 | of each vocabulary word. The last entry is the true length of code 134 | and point (<= `max_depth`). 
135 | """ 136 | vocab_size = len(unigram_counts) 137 | heap = [[unigram_counts[i], i] for i in range(vocab_size)] 138 | heapq.heapify(heap) 139 | for i in range(vocab_size - 1): 140 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap) 141 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2]) 142 | 143 | node_list = [] 144 | max_depth, stack = 0, [[heap[0], [], []]] 145 | while stack: 146 | node, code, point = stack.pop() 147 | if node[1] < vocab_size: 148 | node.extend([code, point, len(point)]) 149 | max_depth = np.maximum(len(code), max_depth) 150 | node_list.append(node) 151 | else: 152 | point = np.array(list(point) + [node[1]-vocab_size]) 153 | stack.append([node[2], np.array(list(code)+[0]), point]) 154 | stack.append([node[3], np.array(list(code)+[1]), point]) 155 | 156 | node_list = sorted(node_list, key=lambda items: items[1]) 157 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int32) 158 | for i in range(len(node_list)): 159 | length = node_list[i][4] # length of code or point 160 | codes_points[i, -1] = length 161 | codes_points[i, :length] = node_list[i][2] # code 162 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point 163 | self._max_depth = max_depth 164 | return codes_points 165 | 166 | def _prepare_inputs_labels(self, tensor): 167 | """Set shape of `tensor` according to architecture and training algorithm, 168 | and split `tensor` into `inputs` and `labels`. 169 | 170 | Args: 171 | tensor: rank-2 int tensor, holding word indices for prediction inputs 172 | and prediction labels, returned by `generate_instances`. 173 | 174 | Returns: 175 | inputs: rank-2 int tensor, holding word indices for prediction inputs. 176 | labels: rank-2 int tensor, holding word indices for prediction labels. 177 | """ 178 | if self._arch == 'skip_gram': 179 | if self._algm == 'negative_sampling': 180 | tensor.set_shape([self._batch_size, 2]) 181 | else: 182 | tensor.set_shape([self._batch_size, 2*self._max_depth+2]) 183 | inputs = tensor[:, :1] 184 | labels = tensor[:, 1:] 185 | else: 186 | if self._algm == 'negative_sampling': 187 | tensor.set_shape([self._batch_size, 2*self._window_size+2]) 188 | else: 189 | tensor.set_shape([self._batch_size, 190 | 2*self._window_size+2*self._max_depth+2]) 191 | inputs = tensor[:, :2*self._window_size+1] 192 | labels = tensor[:, 2*self._window_size+1:] 193 | return inputs, labels 194 | 195 | def get_tensor_dict(self, filenames): 196 | """Generates tensor dict mapping from tensor names to tensors. 197 | 198 | Args: 199 | filenames: list of strings, holding names of text files. 200 | 201 | Returns: 202 | tensor_dict: a dict mapping from tensor names to tensors with shape being: 203 | when arch=='skip_gram', algm=='negative_sampling' 204 | inputs: [N], labels: [N] 205 | when arch=='cbow', algm=='negative_sampling' 206 | inputs: [N, 2*window_size+1], labels: [N] 207 | when arch=='skip_gram', algm=='hierarchical_softmax' 208 | inputs: [N], labels: [N, 2*max_depth+1] 209 | when arch=='cbow', algm=='hierarchical_softmax' 210 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1] 211 | progress: [N], the percentage of sentences covered so far. Used to 212 | compute learning rate. 
213 | """ 214 | table_words = self._table_words 215 | unigram_counts = self._unigram_counts 216 | keep_probs = self._keep_probs 217 | if not table_words or not unigram_counts or not keep_probs: 218 | raise ValueError('`table_words`, `unigram_counts`, and `keep_probs` must', 219 | 'be set by calling `build_vocab()`') 220 | 221 | if self._algm == 'hierarchical_softmax': 222 | codes_points = tf.constant(self._build_binary_tree(unigram_counts)) 223 | elif self._algm == 'negative_sampling': 224 | codes_points = None 225 | else: 226 | raise ValueError('algm must be hierarchical_softmax or negative_sampling') 227 | 228 | table_words = tf.contrib.lookup.index_table_from_tensor( 229 | tf.constant(table_words), default_value=OOV_ID) 230 | keep_probs = tf.constant(keep_probs) 231 | 232 | num_sents = sum([len(list(open(fn, encoding="utf-8") 233 | )) for fn in filenames]) 234 | num_sents = self._epochs * num_sents 235 | 236 | # include epoch number, like progress 237 | a_zip = tf.data.TextLineDataset(filenames).repeat(self._epochs) 238 | b_zip = tf.range(1, 1+num_sents) / num_sents 239 | c_zip = tf.repeat(tf.range(1, 1+self._epochs), int(num_sents / self._epochs)) 240 | 241 | dataset = tf.data.Dataset.zip((a_zip, 242 | tf.data.Dataset.from_tensor_slices(b_zip), 243 | tf.data.Dataset.from_tensor_slices(c_zip))) 244 | 245 | dataset = dataset.map(lambda sent, progress, epoch: 246 | (get_word_indices(sent, table_words), progress, epoch)) 247 | dataset = dataset.map(lambda indices, progress, epoch: 248 | (subsample(indices, keep_probs), progress, epoch)) 249 | dataset = dataset.filter(lambda indices, progress, epoch: 250 | tf.greater(tf.size(indices), 1)) 251 | 252 | dataset = dataset.map(lambda indices, progress, epoch: ( 253 | generate_instances( 254 | indices, self._arch, self._window_size, codes_points), progress, epoch)) 255 | 256 | dataset = dataset.map(lambda instances, progress, epoch: ( 257 | instances, tf.fill(tf.shape(instances)[:1], progress), 258 | tf.fill(tf.shape(instances)[:1], epoch))) 259 | 260 | dataset = dataset.flat_map(lambda instances, progress, epoch: 261 | tf.data.Dataset.from_tensor_slices((instances, progress, epoch))) 262 | dataset = dataset.batch(self._batch_size, drop_remainder=True) 263 | 264 | iterator = tf.compat.v1.data.make_initializable_iterator(dataset) 265 | self._iterator_initializer = iterator.initializer 266 | tensor, progress, epoch = iterator.get_next() 267 | progress.set_shape([self._batch_size]) 268 | epoch.set_shape([self._batch_size]) 269 | 270 | inputs, labels = self._prepare_inputs_labels(tensor) 271 | if self._arch == 'skip_gram': 272 | inputs = tf.squeeze(inputs, axis=1) 273 | if self._algm == 'negative_sampling': 274 | labels = tf.squeeze(labels, axis=1) 275 | 276 | return {'inputs': inputs, 'labels': labels, 'progress': progress, 'epoch': epoch} 277 | 278 | 279 | def get_word_indices(sent, table_words): 280 | """Converts a sentence into a list of word indices. 281 | 282 | Args: 283 | sent: a scalar string tensor, a sentence where words are space-delimited. 284 | table_words: a `HashTable` mapping from words (string tensor) to word 285 | indices (int tensor). 286 | 287 | Returns: 288 | indices: rank-1 int tensor, the word indices within a sentence. 289 | """ 290 | words = tf.string_split([sent]).values 291 | indices = tf.to_int32(table_words.lookup(words)) 292 | return indices 293 | 294 | 295 | def subsample(indices, keep_probs): 296 | """Filters out-of-vocabulary words and then applies subsampling on words in a 297 | sentence. 
Words with high frequencies have lower keep probs. 298 | 299 | Args: 300 | indices: rank-1 int tensor, the word indices within a sentence. 301 | keep_probs: rank-1 float tensor, the prob to drop the each vocabulary word. 302 | 303 | Returns: 304 | indices: rank-1 int tensor, the word indices within a sentence after 305 | subsampling. 306 | """ 307 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID)) 308 | keep_probs = tf.gather(keep_probs, indices) 309 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1) 310 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs)) 311 | return indices 312 | 313 | 314 | def generate_instances(indices, arch, window_size, codes_points=None): 315 | """Generates matrices holding word indices to be passed to Word2Vec models 316 | for each sentence. The shape and contents of output matrices depends on the 317 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling' 318 | , 'hierarchical_softmax'). 319 | 320 | It takes as input a list of word indices in a subsampled-sentence, where each 321 | word is a target word, and their context words are those within the window 322 | centered at a target word. For skip gram architecture, `num_context_words` 323 | instances are generated for a target word, and for cbow architecture, a single 324 | instance is generated for a target word. 325 | 326 | If `codes_points` is not None ('hierarchical softmax'), the word to be 327 | predicted (context word for 'skip_gram', and target word for 'cbow') are 328 | represented by their 'codes' and 'points' in the Huffman tree (See 329 | `_build_binary_tree`). 330 | 331 | Args: 332 | indices: rank-1 int tensor, the word indices within a sentence after 333 | subsampling. 334 | arch: scalar string, architecture ('skip_gram' or 'cbow'). 335 | window_size: int scalar, num of words on the left or right side of 336 | target word within a window. 337 | codes_points: None, or an int tensor of shape [vocab_size, 2*max_depth+1] 338 | where each row holds the codes (0-1 binary values) padded to `max_depth`, 339 | and points (non-leaf node indices) padded to `max_depth`, of each 340 | vocabulary word. The last entry is the true length of code and point 341 | (<= `max_depth`). 
342 | 343 | Returns: 344 | instances: an int tensor holding word indices, with shape being 345 | when arch=='skip_gram', algm=='negative_sampling' 346 | shape: [N, 2] 347 | when arch=='cbow', algm=='negative_sampling' 348 | shape: [N, 2*window_size+2] 349 | when arch=='skip_gram', algm=='hierarchical_softmax' 350 | shape: [N, 2*max_depth+2] 351 | when arch=='cbow', algm='hierarchical_softmax' 352 | shape: [N, 2*window_size+2*max_depth+2] 353 | """ 354 | def per_target_fn(index, init_array): 355 | reduced_size = tf.random.uniform([], maxval=window_size, dtype=tf.int32) 356 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index) 357 | right = tf.range(index + 1, 358 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices))) 359 | context = tf.concat([left, right], axis=0) 360 | context = tf.gather(indices, context) 361 | 362 | if arch == 'skip_gram': 363 | window = tf.stack([tf.fill(tf.shape(context), indices[index]), 364 | context], axis=1) 365 | elif arch == 'cbow': 366 | true_size = tf.size(context) 367 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]), 368 | [true_size, indices[index]]], axis=0) 369 | window = tf.expand_dims(window, axis=0) 370 | else: 371 | raise ValueError('architecture must be skip_gram or cbow.') 372 | 373 | if codes_points is not None: 374 | window = tf.concat([window[:, :-1], 375 | tf.gather(codes_points, window[:, -1])], axis=1) 376 | return index + 1, init_array.write(index, window) 377 | 378 | size = tf.size(indices) 379 | init_array = tf.TensorArray(tf.int32, size=size, infer_shape=False) 380 | _, result_array = tf.while_loop(lambda i, ta: i < size, 381 | per_target_fn, 382 | [0, init_array], 383 | back_prop=False) 384 | instances = tf.cast(result_array.concat(), tf.int64) 385 | return instances 386 | 387 | -------------------------------------------------------------------------------- /files/cbow_hs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_hs.png -------------------------------------------------------------------------------- /files/cbow_ns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/cbow_ns.png -------------------------------------------------------------------------------- /files/huffman.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/huffman.png -------------------------------------------------------------------------------- /files/sent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sent.png -------------------------------------------------------------------------------- /files/sg_hs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_hs.png -------------------------------------------------------------------------------- /files/sg_ns.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chao-ji/tf-word2vec/0834de527ea03e8c7350dc019d0d8a17b499e0a3/files/sg_ns.png -------------------------------------------------------------------------------- /run_training.py: -------------------------------------------------------------------------------- 1 | r"""Executable for training Word2Vec models. 2 | 3 | Example: 4 | python run_training.py \ 5 | --filenames=/PATH/TO/FILE/file1.txt,/PATH/TO/FILE/file2.txt \ 6 | --out_dir=/PATH/TO/OUT_DIR/ \ 7 | --batch_size=64 \ 8 | --window_size=5 \ 9 | 10 | Learned word embeddings will be saved to /PATH/TO/OUT_DIR/embed.npy, and 11 | vocabulary saved to /PATH/TO/OUT_DIR/vocab.txt 12 | """ 13 | import os 14 | import time 15 | 16 | import tensorflow as tf 17 | import numpy as np 18 | 19 | # import project files 20 | from dataset import Word2VecDataset 21 | from word2vec import Word2VecModel 22 | 23 | flags = tf.app.flags 24 | 25 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).') 26 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm ' 27 | '(negative_sampling or hierarchical_softmax).') 28 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate training data.') 29 | flags.DEFINE_integer('batch_size', 256, 'Batch size.') 30 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. If > 0, ' 31 | 'the top `max_vocab_size` most frequent words are kept in vocabulary.') 32 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` are not' 33 | ' included in the vocabulary.') 34 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.') 35 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side' 36 | ' of target word within a window.') 37 | 38 | flags.DEFINE_integer('embed_size', 300, 'Length of word vector.') 39 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.') 40 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.') 41 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.') 42 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.') 43 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct ' 44 | 'between syn0 and syn1 vectors.') 45 | 46 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to ' 47 | ' output logs.') 48 | flags.DEFINE_list('filenames', None, 'Names of comma-separated input text files.') 49 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.') 50 | 51 | FLAGS = flags.FLAGS 52 | 53 | 54 | def main(_): 55 | dataset = Word2VecDataset(arch=FLAGS.arch, 56 | algm=FLAGS.algm, 57 | epochs=FLAGS.epochs, 58 | batch_size=FLAGS.batch_size, 59 | max_vocab_size=FLAGS.max_vocab_size, 60 | min_count=FLAGS.min_count, 61 | sample=FLAGS.sample, 62 | window_size=FLAGS.window_size) 63 | dataset.build_vocab(FLAGS.filenames) 64 | 65 | word2vec = Word2VecModel(arch=FLAGS.arch, 66 | algm=FLAGS.algm, 67 | embed_size=FLAGS.embed_size, 68 | batch_size=FLAGS.batch_size, 69 | negatives=FLAGS.negatives, 70 | power=FLAGS.power, 71 | alpha=FLAGS.alpha, 72 | min_alpha=FLAGS.min_alpha, 73 | add_bias=FLAGS.add_bias, 74 | random_seed=0) 75 | to_be_run_dict = word2vec.train(dataset, FLAGS.filenames) 76 | 77 | with tf.Session() as sess: 78 | sess.run(dataset.iterator_initializer) 79 | sess.run(tf.tables_initializer()) 80 | sess.run(tf.global_variables_initializer()) 81 | 82 | average_loss = 0. 
83 | step = 0 84 | while True: 85 | try: 86 | result_dict = sess.run(to_be_run_dict) 87 | except tf.errors.OutOfRangeError: 88 | break 89 | 90 | average_loss += result_dict['loss'].mean() 91 | if step % FLAGS.log_per_steps == 0: 92 | if step > 0: 93 | average_loss /= FLAGS.log_per_steps 94 | print('step:', step, 'average_loss:', average_loss, 95 | 'learning_rate:', result_dict['learning_rate']) 96 | average_loss = 0. 97 | 98 | step += 1 99 | 100 | syn0_final = sess.run(word2vec.syn0) 101 | 102 | np.save(os.path.join(FLAGS.out_dir, 'embed'), syn0_final) 103 | with open(os.path.join(FLAGS.out_dir, 'vocab.txt'), 'w', encoding="utf-8") as fid: 104 | for w in dataset.table_words: 105 | fid.write(w + '\n') 106 | 107 | print('Word embeddings saved to', os.path.join(FLAGS.out_dir, 'embed.npy')) 108 | print('Vocabulary saved to', os.path.join(FLAGS.out_dir, 'vocab.txt')) 109 | 110 | if __name__ == '__main__': 111 | tf.flags.mark_flag_as_required('filenames') 112 | 113 | tf.app.run() 114 | -------------------------------------------------------------------------------- /tf2.x/README.md: -------------------------------------------------------------------------------- 1 | This is the same model implemented in TensorFlow 2.x. Detailed usage information can be found in the [original README](../README.md). 2 | -------------------------------------------------------------------------------- /tf2.x/dataset.py: -------------------------------------------------------------------------------- 1 | """Defines word tokenizer and word2vec dataset builder. 2 | """ 3 | import heapq 4 | import itertools 5 | import collections 6 | 7 | import numpy as np 8 | import tensorflow as tf 9 | 10 | OOV_ID = -1 11 | 12 | 13 | class WordTokenizer(object): 14 | """Vanilla word tokenizer that spits out space-separated tokens from raw text 15 | string. Note for non-space separated languages, the corpus must be 16 | pre-tokenized such that tokens are space-delimited. 17 | """ 18 | def __init__(self, max_vocab_size=0, min_count=10, sample=1e-3): 19 | """Constructor. 20 | 21 | Args: 22 | max_vocab_size: int scalar, maximum vocabulary size. If > 0, only the top 23 | `max_vocab_size` most frequent words will be kept in vocabulary. 24 | min_count: int scalar, words whose counts < `min_count` will not be 25 | included in the vocabulary. 26 | sample: float scalar, subsampling rate. 27 | """ 28 | self._max_vocab_size = max_vocab_size 29 | self._min_count = min_count 30 | self._sample = sample 31 | 32 | self._vocab = None 33 | self._table_words = None 34 | self._unigram_counts = None 35 | self._keep_probs = None 36 | 37 | @property 38 | def unigram_counts(self): 39 | return self._unigram_counts 40 | 41 | @property 42 | def table_words(self): 43 | return self._table_words 44 | 45 | def _build_raw_vocab(self, filenames): 46 | """Builds raw vocabulary by iterate through the corpus once and count the 47 | unique words. 48 | 49 | Args: 50 | filenames: list of strings, holding names of text files. 51 | 52 | Returns: 53 | raw_vocab: a list of 2-tuples holding the word (string) and count (int), 54 | sorted in descending order of word count. 
55 | """ 56 | lines = [] 57 | for fn in filenames: 58 | with tf.io.gfile.GFile(fn) as f: 59 | lines.append(f) 60 | lines = itertools.chain(*lines) 61 | 62 | raw_vocab = collections.Counter() 63 | for line in lines: 64 | raw_vocab.update(line.strip().split()) 65 | raw_vocab = raw_vocab.most_common() 66 | # truncate to have at most `max_vocab_size` vocab words 67 | if self._max_vocab_size > 0: 68 | raw_vocab = raw_vocab[:self._max_vocab_size] 69 | return raw_vocab 70 | 71 | def build_vocab(self, filenames): 72 | """Builds the vocabulary. 73 | 74 | Has the side effect of setting the following attributes: for each word 75 | `word` we have 76 | 77 | vocab[word] = index 78 | table_words[index] = word `word` 79 | unigram_counts[index] = count of `word` in vocab 80 | keep_probs[index] = keep prob of `word` for subsampling 81 | 82 | Args: 83 | filenames: list of strings, holding names of text files. 84 | """ 85 | raw_vocab = self._build_raw_vocab(filenames) 86 | raw_vocab = [(w, c) for w, c in raw_vocab if c >= self._min_count] 87 | self._corpus_size = sum(list(zip(*raw_vocab))[1]) 88 | 89 | self._vocab = {} 90 | self._table_words = [] 91 | self._unigram_counts = [] 92 | self._keep_probs = [] 93 | for index, (word, count) in enumerate(raw_vocab): 94 | frac = count / float(self._corpus_size) 95 | keep_prob = (np.sqrt(frac / self._sample) + 1) * (self._sample / frac) 96 | keep_prob = np.minimum(keep_prob, 1.0) 97 | self._vocab[word] = index 98 | self._table_words.append(word) 99 | self._unigram_counts.append(count) 100 | self._keep_probs.append(keep_prob) 101 | 102 | def encode(self, string): 103 | """Split raw text string into tokens (space-separated) and tranlate to token 104 | ids. 105 | 106 | Args: 107 | string: string scalar, the raw text string to be tokenized. 108 | 109 | Returns: 110 | ids: a list of ints, the token ids of the tokenized string. 111 | """ 112 | tokens = string.strip().split() 113 | ids = [self._vocab[token] if token in self._vocab else OOV_ID 114 | for token in tokens] 115 | return ids 116 | 117 | 118 | class Word2VecDatasetBuilder(object): 119 | """Builds a tf.data.Dataset instance that generates matrices holding word 120 | indices for training Word2Vec models. 121 | """ 122 | def __init__(self, 123 | tokenizer, 124 | arch='skip_gram', 125 | algm='negative_sampling', 126 | epochs=1, 127 | batch_size=32, 128 | window_size=5): 129 | """Constructor. 130 | 131 | Args: 132 | epochs: int scalar, num times the dataset is iterated. 133 | batch_size: int scalar, the returned tensors in `get_tensor_dict` have 134 | shapes [batch_size, :]. 135 | window_size: int scalar, num of words on the left or right side of 136 | target word within a window. 137 | """ 138 | self._tokenizer = tokenizer 139 | self._arch = arch 140 | self._algm = algm 141 | self._epochs = epochs 142 | self._batch_size = batch_size 143 | self._window_size = window_size 144 | 145 | self._max_depth = None 146 | 147 | def _build_binary_tree(self, unigram_counts): 148 | """Builds a Huffman tree for hierarchical softmax. Has the side effect 149 | of setting `max_depth`. 150 | 151 | Args: 152 | unigram_counts: list of int, holding word counts. Index of each entry 153 | is the same as the word index into the vocabulary. 154 | 155 | Returns: 156 | codes_points: an int numpy array of shape [vocab_size, 2*max_depth+1] 157 | where each row holds the codes (0-1 binary values) padded to 158 | `max_depth`, and points (non-leaf node indices) padded to `max_depth`, 159 | of each vocabulary word. 
The last entry is the true length of code 160 | and point (<= `max_depth`). 161 | """ 162 | vocab_size = len(unigram_counts) 163 | heap = [[unigram_counts[i], i] for i in range(vocab_size)] 164 | # initialize the min-priority queue, which has length `vocab_size` 165 | heapq.heapify(heap) 166 | 167 | # insert `vocab_size` - 1 internal nodes, with vocab words as leaf nodes. 168 | for i in range(vocab_size - 1): 169 | min1, min2 = heapq.heappop(heap), heapq.heappop(heap) 170 | heapq.heappush(heap, [min1[0] + min2[0], i + vocab_size, min1, min2]) 171 | # At this point we have a len-1 heap, and `heap[0]` will be the root of 172 | # the binary tree; where internal nodes store 173 | # 1. key (frequency) 174 | # 2. vocab index 175 | # 3. left child 176 | # 4. right child 177 | # and leaf nodes store 178 | # 1. key (frequencey) 179 | # 2. vocab index 180 | 181 | # Traverse the Huffman tree rooted at `heap[0]` in the order of 182 | # Depth-First-Search. Each stack item stores the 183 | # 1. `node` 184 | # 2. code of the `node` (list) 185 | # 3. point of the `node` (list) 186 | # 187 | # `point` is the list of vocab IDs of the internal nodes along the path from 188 | # the root up to `node` (not included) 189 | # `code` is the list of labels (0 or 1) of the edges along the path from the 190 | # root up to `node` 191 | # they are empty lists for the root node `heap[0]` 192 | node_list = [] 193 | max_depth, stack = 0, [[heap[0], [], []]] # stack: [root, codde, point] 194 | while stack: 195 | node, code, point = stack.pop() 196 | if node[1] < vocab_size: 197 | # leaf node: len(node) == 2 198 | node.extend([code, point, len(point)]) 199 | max_depth = np.maximum(len(code), max_depth) 200 | node_list.append(node) 201 | else: 202 | # internal node: len(node) == 4 203 | point = np.array(list(point) + [node[1]-vocab_size]) 204 | stack.append([node[2], np.array(list(code)+[0]), point]) 205 | stack.append([node[3], np.array(list(code)+[1]), point]) 206 | 207 | # `len(node_list[i]) = 5` 208 | node_list = sorted(node_list, key=lambda items: items[1]) 209 | # Stores the padded codes and points for each vocab word 210 | codes_points = np.zeros([vocab_size, max_depth*2+1], dtype=np.int64) 211 | for i in range(len(node_list)): 212 | length = node_list[i][4] # length of code or point 213 | codes_points[i, -1] = length 214 | codes_points[i, :length] = node_list[i][2] # code 215 | codes_points[i, max_depth:max_depth+length] = node_list[i][3] # point 216 | self._max_depth = max_depth 217 | return codes_points 218 | 219 | def build_dataset(self, filenames): 220 | """Generates tensor dict mapping from tensor names to tensors. 221 | 222 | Args: 223 | filenames: list of strings, holding names of text files. 224 | 225 | Returns: 226 | dataset: a tf.data.Dataset instance, holding the a tuple of tensors 227 | (inputs, labels, progress) 228 | when arch=='skip_gram', algm=='negative_sampling' 229 | inputs: [N], labels: [N] 230 | when arch=='cbow', algm=='negative_sampling' 231 | inputs: [N, 2*window_size+1], labels: [N] 232 | when arch=='skip_gram', algm=='hierarchical_softmax' 233 | inputs: [N], labels: [N, 2*max_depth+1] 234 | when arch=='cbow', algm=='hierarchical_softmax' 235 | inputs: [N, 2*window_size+1], labels: [N, 2*max_depth+1] 236 | progress: [N], the percentage of sentences covered so far. Used to 237 | compute learning rate. 
238 | """ 239 | unigram_counts = self._tokenizer._unigram_counts 240 | keep_probs = self._tokenizer._keep_probs 241 | 242 | if self._algm == 'hierarchical_softmax': 243 | codes_points = tf.constant(self._build_binary_tree(unigram_counts)) 244 | elif self._algm == 'negative_sampling': 245 | codes_points = None 246 | else: 247 | raise ValueError('algm must be hierarchical_softmax or negative_sampling') 248 | 249 | keep_probs = tf.cast(tf.constant(keep_probs), 'float32') 250 | 251 | # total num of sentences (lines) across text files times num of epochs 252 | num_sents = sum([len(list(tf.io.gfile.GFile(fn))) 253 | for fn in filenames]) * self._epochs 254 | 255 | def generator_fn(): 256 | for _ in range(self._epochs): 257 | for fn in filenames: 258 | with tf.io.gfile.GFile(fn) as f: 259 | for line in f: 260 | yield self._tokenizer.encode(line) 261 | 262 | # dataset: [([int], float)] 263 | dataset = tf.data.Dataset.zip(( 264 | tf.data.Dataset.from_generator(generator_fn, tf.int64, [None]), 265 | tf.data.Dataset.from_tensor_slices(tf.range(num_sents) / num_sents))) 266 | # dataset: [([int], float)] 267 | dataset = dataset.map(lambda indices, progress: 268 | (subsample(indices, keep_probs), progress)) 269 | # dataset: [([int], float)] 270 | dataset = dataset.filter(lambda indices, progress: 271 | tf.greater(tf.size(indices), 1)) # sentence must have at least 2 tokens 272 | # dataset: [((None, None), float)] 273 | dataset = dataset.map(lambda indices, progress: (generate_instances( 274 | indices, self._arch, self._window_size, self._max_depth, codes_points), 275 | progress)) 276 | # dataset: [((None, None)), (None,)] 277 | dataset = dataset.map(lambda instances, progress: ( 278 | # replicate `progress` to size `tf.shape(instances)[:1]` 279 | instances, tf.fill(tf.shape(instances)[:1], progress))) 280 | dataset = dataset.flat_map(lambda instances, progress: 281 | # form a dataset by unstacking `instances` in the first dimension, 282 | tf.data.Dataset.from_tensor_slices((instances, progress))) 283 | # batch the dataset 284 | dataset = dataset.batch(self._batch_size, drop_remainder=True) 285 | 286 | def prepare_inputs_labels(tensor, progress): 287 | if self._arch == 'skip_gram': 288 | if self._algm == 'negative_sampling': 289 | tensor.set_shape([self._batch_size, 2]) 290 | else: 291 | tensor.set_shape([self._batch_size, 2*self._max_depth+2]) 292 | inputs = tensor[:, :1] 293 | labels = tensor[:, 1:] 294 | 295 | else: 296 | if self._algm == 'negative_sampling': 297 | tensor.set_shape([self._batch_size, 2*self._window_size+2]) 298 | else: 299 | tensor.set_shape([self._batch_size, 300 | 2*self._window_size+2*self._max_depth+2]) 301 | inputs = tensor[:, :2*self._window_size+1] 302 | labels = tensor[:, 2*self._window_size+1:] 303 | 304 | if self._arch == 'skip_gram': 305 | inputs = tf.squeeze(inputs, axis=1) 306 | if self._algm == 'negative_sampling': 307 | labels = tf.squeeze(labels, axis=1) 308 | progress = tf.cast(progress, 'float32') 309 | return inputs, labels, progress 310 | 311 | dataset = dataset.map(lambda tensor, progress: 312 | prepare_inputs_labels(tensor, progress)) 313 | 314 | return dataset 315 | 316 | 317 | def subsample(indices, keep_probs): 318 | """Filters out-of-vocabulary words and then applies subsampling on words in a 319 | sentence. Words with high frequencies have lower keep probs. 320 | 321 | Args: 322 | indices: rank-1 int tensor, the word indices within a sentence. 323 | keep_probs: rank-1 float tensor, the prob to drop the each vocabulary word. 
324 | 325 | Returns: 326 | indices: rank-1 int tensor, the word indices within a sentence after 327 | subsampling. 328 | """ 329 | indices = tf.boolean_mask(indices, tf.not_equal(indices, OOV_ID)) 330 | keep_probs = tf.gather(keep_probs, indices) 331 | randvars = tf.random.uniform(tf.shape(keep_probs), 0, 1) 332 | indices = tf.boolean_mask(indices, tf.less(randvars, keep_probs)) 333 | return indices 334 | 335 | 336 | def generate_instances( 337 | indices, arch, window_size, max_depth=None, codes_points=None): 338 | """Generates matrices holding word indices to be passed to Word2Vec models 339 | for each sentence. The shape and contents of output matrices depends on the 340 | architecture ('skip_gram', 'cbow') and training algorithm ('negative_sampling' 341 | , 'hierarchical_softmax'). 342 | 343 | It takes as input a list of word indices in a subsampled-sentence, where each 344 | word is a target word, and their context words are those within the window 345 | centered at a target word. For skip gram architecture, `num_context_words` 346 | instances are generated for a target word, and for cbow architecture, a single 347 | instance is generated for a target word. 348 | 349 | If `codes_points` is not None ('hierarchical softmax'), the word to be 350 | predicted (context word for 'skip_gram', and target word for 'cbow') are 351 | represented by their 'codes' and 'points' in the Huffman tree (See 352 | `_build_binary_tree`). 353 | 354 | Args: 355 | indices: rank-1 int tensor, the word indices within a sentence after 356 | subsampling. 357 | arch: scalar string, architecture ('skip_gram' or 'cbow'). 358 | window_size: int scalar, num of words on the left or right side of 359 | target word within a window. 360 | max_depth: (Optional) int scalar, the max depth of the Huffman tree. 361 | codes_points: (Optional) an int tensor of shape [vocab_size, 2*max_depth+1] 362 | where each row holds the codes (0-1 binary values) padded to `max_depth`, 363 | and points (non-leaf node indices) padded to `max_depth`, of each 364 | vocabulary word. The last entry is the true length of code and point 365 | (<= `max_depth`). 366 | 367 | Returns: 368 | instances: an int tensor holding word indices, with shape being 369 | when arch=='skip_gram', algm=='negative_sampling' 370 | shape: [N, 2] 371 | when arch=='cbow', algm=='negative_sampling' 372 | shape: [N, 2*window_size+2] 373 | when arch=='skip_gram', algm=='hierarchical_softmax' 374 | shape: [N, 2*max_depth+2] 375 | when arch=='cbow', algm='hierarchical_softmax' 376 | shape: [N, 2*window_size+2*max_depth+2] 377 | """ 378 | def per_target_fn(index, init_array): 379 | """Generate inputs and labels for each target word. 380 | 381 | `index` is the index of the target word in `indices`. 
382 | """ 383 | reduced_size = tf.random.uniform([], maxval=window_size, dtype='int32') 384 | left = tf.range(tf.maximum(index - window_size + reduced_size, 0), index) 385 | right = tf.range(index + 1, 386 | tf.minimum(index + 1 + window_size - reduced_size, tf.size(indices))) 387 | context = tf.concat([left, right], axis=0) 388 | context = tf.gather(indices, context) 389 | 390 | if arch == 'skip_gram': 391 | # replicate `indices[index]` to match the size of `context` 392 | # [N, 2] 393 | window = tf.stack([tf.fill(tf.shape(context), indices[index]), 394 | context], axis=1) 395 | elif arch == 'cbow': 396 | true_size = tf.size(context) 397 | # pad `context` to length `2 * window_size` 398 | window = tf.concat([tf.pad(context, [[0, 2*window_size-true_size]]), 399 | [true_size, indices[index]]], axis=0) 400 | # [1, 2*window_size + 2] 401 | window = tf.expand_dims(window, axis=0) 402 | else: 403 | raise ValueError('architecture must be skip_gram or cbow.') 404 | 405 | if codes_points is not None: 406 | # [N, 2*max_depth + 2] or [1, 2*window_size+2*max_depth+2] 407 | window = tf.concat([window[:, :-1], 408 | tf.gather(codes_points, window[:, -1])], axis=1) 409 | return index + 1, init_array.write(index, window) 410 | 411 | size = tf.size(indices) 412 | # initialize a tensor array of length `tf.size(indices)` 413 | init_array = tf.TensorArray('int64', size=size, infer_shape=False) 414 | _, result_array = tf.while_loop(lambda i, ta: i < size, 415 | per_target_fn, 416 | [0, init_array], 417 | back_prop=False) 418 | instances = tf.cast(result_array.concat(), 'int64') 419 | if arch == 'skip_gram': 420 | if max_depth is None: 421 | instances.set_shape([None, 2]) 422 | else: 423 | instances.set_shape([None, 2*max_depth+2]) 424 | else: 425 | if max_depth is None: 426 | instances.set_shape([None, 2*window_size+2]) 427 | else: 428 | instances.set_shape([None, 2*window_size+2*max_depth+2]) 429 | 430 | return instances 431 | -------------------------------------------------------------------------------- /tf2.x/demo_word_similarity.py: -------------------------------------------------------------------------------- 1 | from word_vectors import WordVectors 2 | import numpy as np 3 | 4 | # syn_final.npy: storing word embeddings, numpy array of shape [vocab_size, hidden_size] 5 | # 'vocab.txt': text file storing words in vocabulary, one word per line 6 | 7 | query = ',' 8 | num_similar_words = 10 9 | syn0_final = np.load('syn0_final.npy') 10 | vocab_words = [] 11 | with open('vocab.txt') as f: 12 | vocab_words = [l.strip() for l in f] 13 | 14 | wv = WordVectors(syn0_final, vocab_words) 15 | print(wv.most_similar(query, num_similar_words)) 16 | -------------------------------------------------------------------------------- /tf2.x/model.py: -------------------------------------------------------------------------------- 1 | """Defines word2vec model using tf.keras API. 2 | """ 3 | import tensorflow as tf 4 | 5 | from dataset import WordTokenizer 6 | from dataset import Word2VecDatasetBuilder 7 | 8 | 9 | class Word2VecModel(tf.keras.Model): 10 | """Word2Vec model.""" 11 | def __init__(self, 12 | unigram_counts, 13 | arch='skip_gram', 14 | algm='negative_sampling', 15 | hidden_size=300, 16 | batch_size=256, 17 | negatives=5, 18 | power=0.75, 19 | alpha=0.025, 20 | min_alpha=0.0001, 21 | add_bias=True, 22 | random_seed=0): 23 | """Constructor. 24 | 25 | Args: 26 | unigram_counts: a list of ints, the counts of word tokens in the corpus. 27 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 
28 | algm: string scalar, training algorithm ('negative_sampling' or 29 | 'hierarchical_softmax'). 30 | hidden_size: int scalar, length of word vector. 31 | batch_size: int scalar, batch size. 32 | negatives: int scalar, num of negative words to sample. 33 | power: float scalar, distortion for negative sampling. 34 | alpha: float scalar, initial learning rate. 35 | min_alpha: float scalar, final learning rate. 36 | add_bias: bool scalar, whether to add bias term to dotproduct 37 | between syn0 and syn1 vectors. 38 | random_seed: int scalar, random_seed. 39 | """ 40 | super(Word2VecModel, self).__init__() 41 | self._unigram_counts = unigram_counts 42 | self._arch = arch 43 | self._algm = algm 44 | self._hidden_size = hidden_size 45 | self._vocab_size = len(unigram_counts) 46 | self._batch_size = batch_size 47 | self._negatives = negatives 48 | self._power = power 49 | self._alpha = alpha 50 | self._min_alpha = min_alpha 51 | self._add_bias = add_bias 52 | self._random_seed = random_seed 53 | 54 | self._input_size = (self._vocab_size if self._algm == 'negative_sampling' 55 | else self._vocab_size - 1) 56 | 57 | self.add_weight('syn0', 58 | shape=[self._vocab_size, self._hidden_size], 59 | initializer=tf.keras.initializers.RandomUniform( 60 | minval=-0.5/self._hidden_size, 61 | maxval=0.5/self._hidden_size)) 62 | 63 | self.add_weight('syn1', 64 | shape=[self._input_size, self._hidden_size], 65 | initializer=tf.keras.initializers.RandomUniform( 66 | minval=-0.1, maxval=0.1)) 67 | 68 | self.add_weight('biases', 69 | shape=[self._input_size], 70 | initializer=tf.keras.initializers.Zeros()) 71 | 72 | def call(self, inputs, labels): 73 | """Runs the forward pass to compute loss. 74 | 75 | Args: 76 | inputs: int tensor of shape [batch_size] (skip_gram) or 77 | [batch_size, 2*window_size+1] (cbow) 78 | labels: int tensor of shape [batch_size] (negative_sampling) or 79 | [batch_size, 2*max_depth+1] (hierarchical_softmax) 80 | 81 | Returns: 82 | loss: float tensor, cross entropy loss. 83 | """ 84 | if self._algm == 'negative_sampling': 85 | loss = self._negative_sampling_loss(inputs, labels) 86 | elif self._algm == 'hierarchical_softmax': 87 | loss = self._hierarchical_softmax_loss(inputs, labels) 88 | return loss 89 | 90 | def _negative_sampling_loss(self, inputs, labels): 91 | """Builds the loss for negative sampling. 92 | 93 | Args: 94 | inputs: int tensor of shape [batch_size] (skip_gram) or 95 | [batch_size, 2*window_size+1] (cbow) 96 | labels: int tensor of shape [batch_size] 97 | 98 | Returns: 99 | loss: float tensor of shape [batch_size, negatives + 1]. 
100 | """ 101 | _, syn1, biases = self.weights 102 | 103 | sampled_values = tf.random.fixed_unigram_candidate_sampler( 104 | true_classes=tf.expand_dims(labels, 1), 105 | num_true=1, 106 | num_sampled=self._batch_size*self._negatives, 107 | unique=True, 108 | range_max=len(self._unigram_counts), 109 | distortion=self._power, 110 | unigrams=self._unigram_counts) 111 | 112 | sampled = sampled_values.sampled_candidates 113 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives]) 114 | inputs_syn0 = self._get_inputs_syn0(inputs) # [batch_size, hidden_size] 115 | true_syn1 = tf.gather(syn1, labels) # [batch_size, hidden_size] 116 | # [batch_size, negatives, hidden_size] 117 | sampled_syn1 = tf.gather(syn1, sampled_mat) 118 | # [batch_size] 119 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1) 120 | # [batch_size, negatives] 121 | sampled_logits = tf.einsum('ijk,ikl->il', tf.expand_dims(inputs_syn0, 1), 122 | tf.transpose(sampled_syn1, (0, 2, 1))) 123 | 124 | if self._add_bias: 125 | # [batch_size] 126 | true_logits += tf.gather(biases, labels) 127 | # [batch_size, negatives] 128 | sampled_logits += tf.gather(biases, sampled_mat) 129 | 130 | # [batch_size] 131 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 132 | labels=tf.ones_like(true_logits), logits=true_logits) 133 | # [batch_size, negatives] 134 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 135 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits) 136 | 137 | loss = tf.concat( 138 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1) 139 | return loss 140 | 141 | def _hierarchical_softmax_loss(self, inputs, labels): 142 | """Builds the loss for hierarchical softmax. 143 | 144 | Args: 145 | inputs: int tensor of shape [batch_size] (skip_gram) or 146 | [batch_size, 2*window_size+1] (cbow) 147 | labels: int tensor of shape [batch_size, 2*max_depth+1] 148 | 149 | Returns: 150 | loss: float tensor of shape [sum_of_code_len] 151 | """ 152 | _, syn1, biases = self.weights 153 | 154 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(inputs)) 155 | codes_points_list = tf.unstack(labels) 156 | max_depth = (labels.shape.as_list()[1] - 1) // 2 157 | loss = [] 158 | for i in range(self._batch_size): 159 | inputs_syn0 = inputs_syn0_list[i] # [hidden_size] 160 | codes_points = codes_points_list[i] # [2*max_depth+1] 161 | true_size = codes_points[-1] 162 | 163 | codes = codes_points[:true_size] 164 | points = codes_points[max_depth:max_depth+true_size] 165 | logits = tf.reduce_sum( 166 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1) 167 | if self._add_bias: 168 | logits += tf.gather(biases, points) 169 | 170 | # [true_size] 171 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits( 172 | labels=tf.cast(codes, 'float32'), logits=logits)) 173 | loss = tf.concat(loss, axis=0) 174 | return loss 175 | 176 | def _get_inputs_syn0(self, inputs): 177 | """Builds the activations of hidden layer given input words embeddings 178 | `syn0` and input word indices. 
179 | 180 | Args: 181 | inputs: int tensor of shape [batch_size] (skip_gram) or 182 | [batch_size, 2*window_size+1] (cbow) 183 | 184 | Returns: 185 | inputs_syn0: [batch_size, hidden_size] 186 | """ 187 | # syn0: [vocab_size, hidden_size] 188 | syn0, _, _ = self.weights 189 | if self._arch == 'skip_gram': 190 | inputs_syn0 = tf.gather(syn0, inputs) # [batch_size, hidden_size] 191 | else: 192 | inputs_syn0 = [] 193 | contexts_list = tf.unstack(inputs) 194 | for i in range(self._batch_size): 195 | contexts = contexts_list[i] 196 | context_words = contexts[:-1] 197 | true_size = contexts[-1] 198 | inputs_syn0.append( 199 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0)) 200 | inputs_syn0 = tf.stack(inputs_syn0) 201 | 202 | return inputs_syn0 203 | -------------------------------------------------------------------------------- /tf2.x/run_training.py: -------------------------------------------------------------------------------- 1 | """Train a word2vec model to obtain word embedding vectors. 2 | 3 | There are a total of four combination of architectures and training algorithms 4 | that the model can be trained with: 5 | 6 | architecture: 7 | - skip_gram 8 | - cbow (continuous bag-of-words) 9 | 10 | training algorithm 11 | - negative_sampling 12 | - hierarchical_softmax 13 | """ 14 | import os 15 | 16 | import tensorflow as tf 17 | import numpy as np 18 | from absl import app 19 | from absl import flags 20 | 21 | from dataset import WordTokenizer 22 | from dataset import Word2VecDatasetBuilder 23 | from model import Word2VecModel 24 | from word_vectors import WordVectors 25 | 26 | import utils 27 | 28 | flags.DEFINE_string('arch', 'skip_gram', 'Architecture (skip_gram or cbow).') 29 | flags.DEFINE_string('algm', 'negative_sampling', 'Training algorithm ' 30 | '(negative_sampling or hierarchical_softmax).') 31 | flags.DEFINE_integer('epochs', 1, 'Num of epochs to iterate thru corpus.') 32 | flags.DEFINE_integer('batch_size', 256, 'Batch size.') 33 | flags.DEFINE_integer('max_vocab_size', 0, 'Maximum vocabulary size. 
If > 0, ' 34 | 'the top `max_vocab_size` most frequent words will be kept in vocabulary.') 35 | flags.DEFINE_integer('min_count', 10, 'Words whose counts < `min_count` will ' 36 | 'not be included in the vocabulary.') 37 | flags.DEFINE_float('sample', 1e-3, 'Subsampling rate.') 38 | flags.DEFINE_integer('window_size', 10, 'Num of words on the left or right side' 39 | ' of target word within a window.') 40 | 41 | flags.DEFINE_integer('hidden_size', 300, 'Length of word vector.') 42 | flags.DEFINE_integer('negatives', 5, 'Num of negative words to sample.') 43 | flags.DEFINE_float('power', 0.75, 'Distortion for negative sampling.') 44 | flags.DEFINE_float('alpha', 0.025, 'Initial learning rate.') 45 | flags.DEFINE_float('min_alpha', 0.0001, 'Final learning rate.') 46 | flags.DEFINE_boolean('add_bias', True, 'Whether to add bias term to dotproduct ' 47 | 'between syn0 and syn1 vectors.') 48 | 49 | flags.DEFINE_integer('log_per_steps', 10000, 'Every `log_per_steps` steps to ' 50 | ' log the value of loss to be minimized.') 51 | flags.DEFINE_list( 52 | 'filenames', None, 'Names of comma-separated input text files.') 53 | flags.DEFINE_string('out_dir', '/tmp/word2vec', 'Output directory.') 54 | 55 | FLAGS = flags.FLAGS 56 | 57 | 58 | def main(_): 59 | arch = FLAGS.arch 60 | algm = FLAGS.algm 61 | epochs = FLAGS.epochs 62 | batch_size = FLAGS.batch_size 63 | max_vocab_size = FLAGS.max_vocab_size 64 | min_count = FLAGS.min_count 65 | sample = FLAGS.sample 66 | window_size = FLAGS.window_size 67 | hidden_size = FLAGS.hidden_size 68 | negatives = FLAGS.negatives 69 | power = FLAGS.power 70 | alpha = FLAGS.alpha 71 | min_alpha = FLAGS.min_alpha 72 | add_bias = FLAGS.add_bias 73 | log_per_steps = FLAGS.log_per_steps 74 | filenames = FLAGS.filenames 75 | out_dir = FLAGS.out_dir 76 | 77 | tokenizer = WordTokenizer( 78 | max_vocab_size=max_vocab_size, min_count=min_count, sample=sample) 79 | tokenizer.build_vocab(filenames) 80 | 81 | builder = Word2VecDatasetBuilder(tokenizer, 82 | arch=arch, 83 | algm=algm, 84 | epochs=epochs, 85 | batch_size=batch_size, 86 | window_size=window_size) 87 | dataset = builder.build_dataset(filenames) 88 | word2vec = Word2VecModel(tokenizer.unigram_counts, 89 | arch=arch, 90 | algm=algm, 91 | hidden_size=hidden_size, 92 | batch_size=batch_size, 93 | negatives=negatives, 94 | power=power, 95 | alpha=alpha, 96 | min_alpha=min_alpha, 97 | add_bias=add_bias) 98 | 99 | train_step_signature = utils.get_train_step_signature( 100 | arch, algm, batch_size, window_size, builder._max_depth) 101 | optimizer = tf.keras.optimizers.SGD(1.0) 102 | 103 | @tf.function(input_signature=train_step_signature) 104 | def train_step(inputs, labels, progress): 105 | loss = word2vec(inputs, labels) 106 | gradients = tf.gradients(loss, word2vec.trainable_variables) 107 | 108 | learning_rate = tf.maximum(alpha * (1 - progress[0]) + 109 | min_alpha * progress[0], min_alpha) 110 | 111 | if hasattr(gradients[0], '_values'): 112 | gradients[0]._values *= learning_rate 113 | else: 114 | gradients[0] *= learning_rate 115 | 116 | if hasattr(gradients[1], '_values'): 117 | gradients[1]._values *= learning_rate 118 | else: 119 | gradients[1] *= learning_rate 120 | 121 | if hasattr(gradients[2], '_values'): 122 | gradients[2]._values *= learning_rate 123 | else: 124 | gradients[2] *= learning_rate 125 | 126 | optimizer.apply_gradients( 127 | zip(gradients, word2vec.trainable_variables)) 128 | 129 | return loss, learning_rate 130 | 131 | average_loss = 0. 
132 | for step, (inputs, labels, progress) in enumerate(dataset): 133 | loss, learning_rate = train_step(inputs, labels, progress) 134 | average_loss += loss.numpy().mean() 135 | if step % log_per_steps == 0: 136 | if step > 0: 137 | average_loss /= log_per_steps 138 | print('step:', step, 'average_loss:', average_loss, 139 | 'learning_rate:', learning_rate.numpy()) 140 | average_loss = 0. 141 | 142 | syn0_final = word2vec.weights[0].numpy() 143 | np.save(os.path.join(FLAGS.out_dir, 'syn0_final'), syn0_final) 144 | with tf.io.gfile.GFile(os.path.join(FLAGS.out_dir, 'vocab.txt'), 'w') as f: 145 | for w in tokenizer.table_words: 146 | f.write(w + '\n') 147 | print('Word embeddings saved to', 148 | os.path.join(FLAGS.out_dir, 'syn0_final.npy')) 149 | print('Vocabulary saved to', os.path.join(FLAGS.out_dir, 'vocab.txt')) 150 | 151 | 152 | if __name__ == '__main__': 153 | flags.mark_flag_as_required('filenames') 154 | app.run(main) 155 | -------------------------------------------------------------------------------- /tf2.x/sample_corpus.txt: -------------------------------------------------------------------------------- 1 | # one sentence per line, with words (lower case) delimited by single space 2 | 3 | with all this stuff going down at the moment with mj i 've started listening to his music , watching the odd documentary here and there , watched the wiz and watched moonwalker again . 4 | maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent . 5 | moonwalker is part biography , part feature film which i remember going to see at the cinema when it was originally released . 6 | some of it has subtle messages about mj 's feeling towards the press and also the obvious message of drugs are bad m'kay . 7 | visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring . 8 | -------------------------------------------------------------------------------- /tf2.x/utils.py: -------------------------------------------------------------------------------- 1 | """Defines utility functions. 2 | """ 3 | import tensorflow as tf 4 | 5 | 6 | def get_train_step_signature( 7 | arch, algm, batch_size, window_size=None, max_depth=None): 8 | """Get the training step signatures for `inputs`, `labels` and `progress` 9 | tensor. 10 | 11 | Args: 12 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 13 | algm: string scalar, training algorithm ('negative_sampling' or 14 | 'hierarchical_softmax'). 15 | 16 | Returns: 17 | train_step_signature: a list of three tf.TensorSpec instances, 18 | specifying the tensor spec (shape and dtype) for `inputs`, `labels` and 19 | `progress`. 
20 | """ 21 | if arch=='skip_gram': 22 | inputs_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64') 23 | elif arch == 'cbow': 24 | inputs_spec = tf.TensorSpec( 25 | shape=(batch_size, 2*window_size+1), dtype='int64') 26 | else: 27 | raise ValueError('`arch` must be either "skip_gram" or "cbow".') 28 | 29 | if algm == 'negative_sampling': 30 | labels_spec = tf.TensorSpec(shape=(batch_size,), dtype='int64') 31 | elif algm == 'hierarchical_softmax': 32 | labels_spec = tf.TensorSpec( 33 | shape=(batch_size, 2*max_depth+1), dtype='int64') 34 | else: 35 | raise ValueError('`algm` must be either "negative_sampling" or ' 36 | '"hierarchical_softmax".') 37 | 38 | progress_spec = tf.TensorSpec(shape=(batch_size,), dtype='float32') 39 | 40 | train_step_signature = [inputs_spec, labels_spec, progress_spec] 41 | return train_step_signature 42 | -------------------------------------------------------------------------------- /tf2.x/word_vectors.py: -------------------------------------------------------------------------------- 1 | """Defines wrapper class for final word vectors. 2 | """ 3 | import heapq 4 | import numpy as np 5 | 6 | 7 | class WordVectors(object): 8 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving 9 | word vector, and most similar words given a query word. 10 | """ 11 | def __init__(self, syn0_final, vocab): 12 | """Constructor. 13 | 14 | Args: 15 | syn0_final: numpy array of shape [vocab_size, embed_size], final word 16 | embeddings. 17 | vocab: a list of strings, holding vocabulary words. 18 | """ 19 | self._syn0_final = syn0_final 20 | self._vocab = vocab 21 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)]) 22 | 23 | def __contains__(self, word): 24 | return word in self._rev_vocab 25 | 26 | def __getitem__(self, word): 27 | return self._syn0_final[self._rev_vocab[word]] 28 | 29 | def most_similar(self, word, k): 30 | """Finds the top-k words with smallest cosine distances w.r.t `word`. 31 | 32 | Args: 33 | word: string scalar, the query word. 34 | k: int scalar, num of words most similar to `word`. 35 | 36 | Returns: 37 | a list of 2-tuples with word and cosine similarities. 38 | """ 39 | if word not in self._rev_vocab: 40 | raise ValueError("Word '%s' not found in the vocabulary" % word) 41 | if k >= self._syn0_final.shape[0]: 42 | raise ValueError("k = %d greater than vocabulary size" % k) 43 | 44 | v0 = self._syn0_final[self._rev_vocab[word]] 45 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) * 46 | np.linalg.norm(self._syn0_final, axis=1)) 47 | 48 | # maintain a sliding min-heap to keep track of k+1 largest elements 49 | min_pq = list(zip(sims[:k+1], range(k+1))) 50 | heapq.heapify(min_pq) 51 | for i in np.arange(k + 1, len(self._vocab)): 52 | if sims[i] > min_pq[0][0]: 53 | min_pq[0] = sims[i], i 54 | heapq.heapify(min_pq) 55 | min_pq = sorted(min_pq, key=lambda p: -p[0]) 56 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]] 57 | -------------------------------------------------------------------------------- /word2vec.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | 6 | 7 | class Word2VecModel(object): 8 | """Word2VecModel. 9 | """ 10 | 11 | def __init__(self, arch, algm, embed_size, batch_size, negatives, power, 12 | alpha, min_alpha, add_bias, random_seed): 13 | """Constructor. 14 | 15 | Args: 16 | arch: string scalar, architecture ('skip_gram' or 'cbow'). 
17 | algm: string scalar, training algorithm ('negative_sampling' or 18 | 'hierarchical_softmax'). 19 | embed_size: int scalar, length of word vector. 20 | batch_size: int scalar, batch size. 21 | negatives: int scalar, num of negative words to sample. 22 | power: float scalar, distortion for negative sampling. 23 | alpha: float scalar, initial learning rate. 24 | min_alpha: float scalar, final learning rate. 25 | add_bias: bool scalar, whether to add bias term to dotproduct 26 | between syn0 and syn1 vectors. 27 | random_seed: int scalar, random_seed. 28 | """ 29 | self._arch = arch 30 | self._algm = algm 31 | self._embed_size = embed_size 32 | self._batch_size = batch_size 33 | self._negatives = negatives 34 | self._power = power 35 | self._alpha = alpha 36 | self._min_alpha = min_alpha 37 | self._add_bias = add_bias 38 | self._random_seed = random_seed 39 | 40 | self._syn0 = None 41 | 42 | @property 43 | def syn0(self): 44 | return self._syn0 45 | 46 | def _build_loss(self, inputs, labels, unigram_counts, scope=None): 47 | """Builds the graph that leads from data tensors (`inputs`, `labels`) 48 | to loss. Has the side effect of setting attribute `syn0`. 49 | 50 | Args: 51 | inputs: int tensor of shape [batch_size] (skip_gram) or 52 | [batch_size, 2*window_size+1] (cbow) 53 | labels: int tensor of shape [batch_size] (negative_sampling) or 54 | [batch_size, 2*max_depth+1] (hierarchical_softmax) 55 | unigram_count: list of int, holding word counts. Index of each entry 56 | is the same as the word index into the vocabulary. 57 | scope: string scalar, scope name. 58 | 59 | Returns: 60 | loss: float tensor, cross entropy loss. 61 | """ 62 | syn0, syn1, biases = self._create_embeddings(len(unigram_counts)) 63 | self._syn0 = syn0 64 | with tf.variable_scope(scope, 'Loss', [inputs, labels, syn0, syn1, biases]): 65 | if self._algm == 'negative_sampling': 66 | loss = self._negative_sampling_loss( 67 | unigram_counts, inputs, labels, syn0, syn1, biases) 68 | elif self._algm == 'hierarchical_softmax': 69 | loss = self._hierarchical_softmax_loss( 70 | inputs, labels, syn0, syn1, biases) 71 | return loss 72 | 73 | def train(self, dataset, filenames): 74 | """Adds training related ops to the graph. 75 | 76 | Args: 77 | dataset: a `Word2VecDataset` instance. 78 | filenames: a list of strings, holding names of text files. 79 | 80 | Returns: 81 | to_be_run_dict: dict mapping from names to tensors/operations, holding 82 | the following entries: 83 | { 'grad_update_op': optimization ops, 84 | 'loss': cross entropy loss, 85 | 'learning_rate': float-scalar learning rate} 86 | """ 87 | tensor_dict = dataset.get_tensor_dict(filenames) 88 | inputs, labels = tensor_dict['inputs'], tensor_dict['labels'] 89 | global_step = tf.train.get_or_create_global_step() 90 | learning_rate = tf.maximum(self._alpha * (1 - tensor_dict['progress'][0]) + 91 | self._min_alpha * tensor_dict['progress'][0], self._min_alpha) 92 | 93 | loss = self._build_loss(inputs, labels, dataset.unigram_counts) 94 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 95 | grad_update_op = optimizer.minimize(loss, global_step=global_step) 96 | 97 | to_be_run_dict = {'grad_update_op': grad_update_op, 98 | 'loss': loss, 99 | 'learning_rate': learning_rate} 100 | return to_be_run_dict 101 | 102 | def _create_embeddings(self, vocab_size, scope=None): 103 | """Creates initial word embedding variables. 104 | 105 | Args: 106 | vocab_size: int scalar, num of words in vocabulary. 107 | scope: string scalar, scope name. 
108 | 109 | Returns: 110 | syn0: float tensor of shape [vocab_size, embed_size], input word 111 | embeddings (i.e. weights of hidden layer). 112 | syn1: float tensor of shape [syn1_rows, embed_size], output word 113 | embeddings (i.e. weights of output layer). 114 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 115 | """ 116 | syn1_rows = (vocab_size if self._algm == 'negative_sampling' 117 | else vocab_size - 1) 118 | with tf.variable_scope(scope, 'Embedding'): 119 | syn0 = tf.get_variable('syn0', initializer=tf.random_uniform([vocab_size, 120 | self._embed_size], -0.5/self._embed_size, 0.5/self._embed_size, 121 | seed=self._random_seed)) 122 | syn1 = tf.get_variable('syn1', initializer=tf.random_uniform([syn1_rows, 123 | self._embed_size], -0.1, 0.1)) 124 | biases = tf.get_variable('biases', initializer=tf.zeros([syn1_rows])) 125 | return syn0, syn1, biases 126 | 127 | def _negative_sampling_loss( 128 | self, unigram_counts, inputs, labels, syn0, syn1, biases): 129 | """Builds the loss for negative sampling. 130 | 131 | Args: 132 | unigram_counts: list of int, holding word counts. Index of each entry 133 | is the same as the word index into the vocabulary. 134 | inputs: int tensor of shape [batch_size] (skip_gram) or 135 | [batch_size, 2*window_size+1] (cbow) 136 | labels: int tensor of shape [batch_size] 137 | syn0: float tensor of shape [vocab_size, embed_size], input word 138 | embeddings (i.e. weights of hidden layer). 139 | syn1: float tensor of shape [syn1_rows, embed_size], output word 140 | embeddings (i.e. weights of output layer). 141 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 142 | 143 | Returns: 144 | loss: float tensor of shape [batch_size, sample_size + 1]. 145 | """ 146 | sampled_values = tf.nn.fixed_unigram_candidate_sampler( 147 | true_classes=tf.expand_dims(labels, 1), 148 | num_true=1, 149 | num_sampled=self._batch_size*self._negatives, 150 | unique=True, 151 | range_max=len(unigram_counts), 152 | distortion=self._power, 153 | unigrams=unigram_counts) 154 | 155 | sampled = sampled_values.sampled_candidates 156 | sampled_mat = tf.reshape(sampled, [self._batch_size, self._negatives]) 157 | inputs_syn0 = self._get_inputs_syn0(syn0, inputs) # [N, D] 158 | true_syn1 = tf.gather(syn1, labels) # [N, D] 159 | sampled_syn1 = tf.gather(syn1, sampled_mat) # [N, K, D] 160 | true_logits = tf.reduce_sum(tf.multiply(inputs_syn0, true_syn1), 1) # [N] 161 | sampled_logits = tf.reduce_sum( 162 | tf.multiply(tf.expand_dims(inputs_syn0, 1), sampled_syn1), 2) # [N, K] 163 | 164 | if self._add_bias: 165 | true_logits += tf.gather(biases, labels) # [N] 166 | sampled_logits += tf.gather(biases, sampled_mat) # [N, K] 167 | 168 | true_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 169 | labels=tf.ones_like(true_logits), logits=true_logits) 170 | sampled_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits( 171 | labels=tf.zeros_like(sampled_logits), logits=sampled_logits) 172 | loss = tf.concat( 173 | [tf.expand_dims(true_cross_entropy, 1), sampled_cross_entropy], 1) 174 | return loss 175 | 176 | def _hierarchical_softmax_loss(self, inputs, labels, syn0, syn1, biases): 177 | """Builds the loss for hierarchical softmax. 178 | 179 | Args: 180 | inputs: int tensor of shape [batch_size] (skip_gram) or 181 | [batch_size, 2*window_size+1] (cbow) 182 | labels: int tensor of shape [batch_size, 2*max_depth+1] 183 | syn0: float tensor of shape [vocab_size, embed_size], input word 184 | embeddings (i.e. weights of hidden layer). 
185 | syn1: float tensor of shape [syn1_rows, embed_size], output word 186 | embeddings (i.e. weights of output layer). 187 | biases: float tensor of shape [syn1_rows], biases added onto the logits. 188 | 189 | Returns: 190 | loss: float tensor of shape [sum_of_code_len] 191 | """ 192 | inputs_syn0_list = tf.unstack(self._get_inputs_syn0(syn0, inputs)) 193 | codes_points_list = tf.unstack(labels) 194 | max_depth = (labels.shape.as_list()[1] - 1) // 2 195 | loss = [] 196 | for inputs_syn0, codes_points in zip(inputs_syn0_list, codes_points_list): 197 | true_size = codes_points[-1] 198 | codes = codes_points[:true_size] 199 | points = codes_points[max_depth:max_depth+true_size] 200 | 201 | logits = tf.reduce_sum( 202 | tf.multiply(inputs_syn0, tf.gather(syn1, points)), 1) 203 | if self._add_bias: 204 | logits += tf.gather(biases, points) 205 | 206 | loss.append(tf.nn.sigmoid_cross_entropy_with_logits( 207 | labels=tf.to_float(codes), logits=logits)) 208 | loss = tf.concat(loss, axis=0) 209 | return loss 210 | 211 | def _get_inputs_syn0(self, syn0, inputs): 212 | """Builds the activations of hidden layer given input words embeddings 213 | `syn0` and input word indices. 214 | 215 | Args: 216 | syn0: float tensor of shape [vocab_size, embed_size] 217 | inputs: int tensor of shape [batch_size] (skip_gram) or 218 | [batch_size, 2*window_size+1] (cbow) 219 | 220 | Returns: 221 | inputs_syn0: [batch_size, embed_size] 222 | """ 223 | if self._arch == 'skip_gram': 224 | inputs_syn0 = tf.gather(syn0, inputs) 225 | else: 226 | inputs_syn0 = [] 227 | contexts_list = tf.unstack(inputs) 228 | for contexts in contexts_list: 229 | context_words = contexts[:-1] 230 | true_size = contexts[-1] 231 | inputs_syn0.append( 232 | tf.reduce_mean(tf.gather(syn0, context_words[:true_size]), axis=0)) 233 | inputs_syn0 = tf.stack(inputs_syn0) 234 | return inputs_syn0 235 | 236 | 237 | class WordVectors(object): 238 | """Word vectors of trained Word2Vec model. Provides APIs for retrieving 239 | word vector, and most similar words given a query word. 240 | """ 241 | def __init__(self, syn0_final, vocab): 242 | """Constructor. 243 | 244 | Args: 245 | syn0_final: numpy array of shape [vocab_size, embed_size], final word 246 | embeddings. 247 | vocab_words: a list of strings, holding vocabulary words. 248 | """ 249 | self._syn0_final = syn0_final 250 | self._vocab = vocab 251 | self._rev_vocab = dict([(w, i) for i, w in enumerate(vocab)]) 252 | 253 | def __contains__(self, word): 254 | return word in self._rev_vocab 255 | 256 | def __getitem__(self, word): 257 | return self._syn0_final[self._rev_vocab[word]] 258 | 259 | def most_similar(self, word, k): 260 | """Finds the top-k words with smallest cosine distances w.r.t `word`. 261 | 262 | Args: 263 | word: string scalar, the query word. 264 | k: int scalar, num of words most similar to `word`. 265 | 266 | Returns: 267 | a list of 2-tuples with word and cosine similarities. 
268 | """ 269 | if word not in self._rev_vocab: 270 | raise ValueError("Word '%s' not found in the vocabulary" % word) 271 | if k >= self._syn0_final.shape[0]: 272 | raise ValueError("k = %d greater than vocabulary size" % k) 273 | 274 | v0 = self._syn0_final[self._rev_vocab[word]] 275 | sims = np.sum(v0 * self._syn0_final, 1) / (np.linalg.norm(v0) * 276 | np.linalg.norm(self._syn0_final, axis=1)) 277 | 278 | # maintain a sliding min-heap to keep track of k+1 largest elements 279 | min_pq = list(zip(sims[:k+1], range(k+1))) 280 | heapq.heapify(min_pq) 281 | for i in np.arange(k + 1, len(self._vocab)): 282 | if sims[i] > min_pq[0][0]: 283 | min_pq[0] = sims[i], i 284 | heapq.heapify(min_pq) 285 | min_pq = sorted(min_pq, key=lambda p: -p[0]) 286 | return [(self._vocab[i], sim) for sim, i in min_pq[1:]] 287 | 288 | --------------------------------------------------------------------------------
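
The `cbow` branch of `_get_inputs_syn0` (both in `word2vec.py` above and in the tf2.x model) expects each input row to pack the context-word indices followed by one final entry holding the number of valid contexts, and it averages only that many embedding vectors. Below is a small NumPy illustration of that layout; the indices and the toy embedding matrix are made up for illustration and are not part of the repository.

```python
import numpy as np

# Toy stand-in for syn0: [vocab_size, embed_size] = [7, 4], random values.
np.random.seed(0)
syn0 = np.random.uniform(-0.01, 0.01, size=(7, 4))

# One cbow input row for window_size=2: 2*window_size context slots plus a
# final entry giving the number of valid context words (here 3; the remaining
# slot is padding and is ignored).
row = np.array([5, 1, 3, 0, 3])
context_words, true_size = row[:-1], row[-1]

# Mirror of the gather/reduce_mean in `_get_inputs_syn0`: average only the
# embeddings of the valid context words.
hidden = syn0[context_words[:true_size]].mean(axis=0)
print(hidden.shape)  # (4,) -- one averaged vector per target word
```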
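The tensor shapes fed to `train_step` in `tf2.x/run_training.py` are fixed by `get_train_step_signature` in `tf2.x/utils.py`. The sketch below (assumed to be run from inside `tf2.x/`) shows the specs for two of the four architecture/algorithm combinations; `max_depth=16` is an arbitrary illustrative value, since the real value comes from the Huffman tree built by `Word2VecDatasetBuilder`.

```python
from utils import get_train_step_signature

# skip_gram + negative_sampling: flat int64 inputs/labels of shape [batch_size].
sig = get_train_step_signature('skip_gram', 'negative_sampling', batch_size=256)
# -> [TensorSpec((256,), int64), TensorSpec((256,), int64),
#     TensorSpec((256,), float32)]

# cbow + hierarchical_softmax: inputs are [batch_size, 2*window_size+1] and
# labels are [batch_size, 2*max_depth+1].
sig = get_train_step_signature('cbow', 'hierarchical_softmax', batch_size=256,
                               window_size=10, max_depth=16)
# -> [TensorSpec((256, 21), int64), TensorSpec((256, 33), int64),
#     TensorSpec((256,), float32)]
```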
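Inside `train_step`, the gradients are scaled by a learning rate that decays linearly with the `progress` tensor (the fraction of the training corpus consumed so far) and is clipped at `--min_alpha`. A plain-Python sketch of that schedule, using the default flag values:

```python
alpha, min_alpha = 0.025, 0.0001  # defaults of --alpha and --min_alpha

def decayed_lr(progress):
    """Same formula as the tf.maximum(...) expression in train_step."""
    return max(alpha * (1.0 - progress) + min_alpha * progress, min_alpha)

print(decayed_lr(0.0))  # 0.025   at the start of training
print(decayed_lr(0.5))  # 0.01255 halfway through the corpus
print(decayed_lr(1.0))  # 0.0001  at the end
```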
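Finally, once training finishes, `tf2.x/run_training.py` writes `syn0_final.npy` and `vocab.txt` to `--out_dir`, and `tf2.x/word_vectors.py` provides the query API. Here is a minimal loading sketch (assumed to be run from inside `tf2.x/`); the output directory matches the default flag value, and the query word is just an example that assumes the word occurs in your training corpus.

```python
import os

import numpy as np
from word_vectors import WordVectors

out_dir = '/tmp/word2vec'  # default of --out_dir; adjust to your run
syn0_final = np.load(os.path.join(out_dir, 'syn0_final.npy'))
with open(os.path.join(out_dir, 'vocab.txt')) as f:
    vocab = [line.strip() for line in f]

wv = WordVectors(syn0_final, vocab)
if 'actor' in wv:                         # __contains__ checks the vocabulary
    print(wv['actor'].shape)              # (hidden_size,), e.g. (300,)
    print(wv.most_similar('actor', 10))   # top-10 (word, cosine similarity) pairs
```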