├── LICENSE ├── README.md ├── appendix ├── keras_cnn.ipynb ├── seq2seq_nmt.ipynb └── tensorboard_word_embeddings.ipynb ├── ch10 ├── bleu_score_example.ipynb ├── neural_machine_translation.ipynb ├── neural_machine_translation_attention.ipynb ├── nmt_with_pretrained_wordvecs.ipynb └── word2vec.py ├── ch11 └── tv_embeddings.ipynb ├── ch2 ├── tensorflow_introduction.ipynb ├── test1.txt ├── test2.txt └── test3.txt ├── ch3 ├── ch3_word2vec.ipynb └── ch3_wordnet.ipynb ├── ch4 ├── ch4_document_embedding.ipynb ├── ch4_glove.ipynb ├── ch4_word2vec_extended.ipynb └── ch4_word2vec_improvements.ipynb ├── ch5 ├── cnn_sentence_classification.ipynb └── image_classification_mnist.ipynb ├── ch6 ├── rnn_language_bigram.ipynb └── rnn_language_bigram_multilayer.ipynb ├── ch8 ├── embeddings.npy ├── lstm_extensions.ipynb ├── lstm_word2vec.ipynb ├── lstm_word2vec_rnn_api.ipynb ├── lstms_for_text_generation.ipynb ├── plot_perplexity_over_time.ipynb └── word2vec.py └── ch9 ├── correct_spellings.py ├── image_caption_data └── class_names.txt ├── lstm_image_caption.ipynb ├── lstm_image_caption_pretrained_wordvecs_rnn_api.ipynb └── word2vec.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Natural Language Processing with TensorFlow 5 | This is the code repository for [Natural Language Processing with TensorFlow](https://www.packtpub.com/application-development/natural-language-processing-tensorflow?utm_source=github&utm_medium=repository&utm_campaign=9781788478311), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the book from start to finish. 6 | ## About the Book 7 | Natural language processing (NLP) supplies the majority of data available to deep learning applications, while TensorFlow is the most important deep learning framework currently available. Natural Language Processing with TensorFlow brings TensorFlow and NLP together to give you invaluable tools to work with the immense volume of unstructured data in today’s data streams, and apply these tools to specific NLP tasks. 
8 | 9 | Thushan Ganegedara starts by giving you a grounding in NLP and TensorFlow basics. You'll then learn how to use Word2vec, including advanced extensions, to create word embeddings that turn sequences of words into vectors accessible to deep learning algorithms. Chapters on classical deep learning algorithms, like convolutional neural networks (CNN) and recurrent neural networks (RNN), demonstrate important NLP tasks such as sentence classification and language generation. You will learn how to apply high-performance RNN models, like long short-term memory (LSTM) cells, to NLP tasks. You will also explore neural machine translation and implement a neural machine translator. 10 | 11 | After reading this book, you will have an understanding of NLP and the skills to apply TensorFlow in deep learning NLP applications and to perform specific NLP tasks. 12 | 13 | ## Instructions and Navigations 14 | All of the code is organized into folders named by chapter, for example, ch2, ch3, and so on, plus an appendix folder. 15 | 16 | 17 | 18 | The code will look like the following (a runnable, slightly expanded sketch of this snippet appears at the end of this README): 19 | ``` 20 | graph = tf.Graph() # Creates a graph 21 | session = tf.InteractiveSession(graph=graph) # Creates a session 22 | ``` 23 | 24 | To get the most out of this book, we assume the following from the reader: 25 | * A solid will and an ambition to learn the modern ways of NLP 26 | * Familiarity with basic Python syntax and data structures (for example, lists and dictionaries) 27 | * A good understanding of basic mathematics (for example, matrix/vector multiplication) 28 | * (Optional) Advanced mathematics knowledge (for example, derivative calculation) to understand a handful of subsections that cover the details of how certain learning models overcome potential practical issues faced during training 29 | * (Optional) Willingness to read research papers for advances/details in systems beyond what the book covers 30 | 31 | ## Related Products 32 | * [Hands-On Deep Learning with TensorFlow](https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-tensorflow?utm_source=github&utm_medium=repository&utm_campaign=9781787282773) 33 | 34 | * [Deep Learning with TensorFlow - Second Edition](https://www.packtpub.com/big-data-and-business-intelligence/deep-learning-tensorflow-second-edition?utm_source=github&utm_medium=repository&utm_campaign=9781788831109) 35 | 36 | * [Beginning Application Development with TensorFlow and Keras](https://www.packtpub.com/application-development/beginning-application-development-tensorflow-and-keras?utm_source=github&utm_medium=repository&utm_campaign=9781789537291) 37 | ### Download a free PDF 38 | 39 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
40 | https://packt.link/free-ebook/9781788478311
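### Expanded example of the code snippet

The two-line snippet under *Instructions and Navigations* only creates a graph and a session. The following is a minimal, runnable sketch (not part of the original repository) of how that pair is typically used; it assumes TensorFlow 1.x, which the book's notebooks target, and the names `x`, `W`, and `h` are illustrative:

```
import tensorflow as tf  # TensorFlow 1.x

graph = tf.Graph()                            # Creates a graph
session = tf.InteractiveSession(graph=graph)  # Creates a session (and makes `graph` the default graph)

# Ops defined from here on are added to `graph`
x = tf.placeholder(tf.float32, shape=[1, 3], name='x')   # input placeholder
W = tf.Variable(tf.ones(shape=[3, 1]), name='W')         # a trainable variable
h = tf.matmul(x, W)                                      # a simple operation on the graph

tf.global_variables_initializer().run()                  # initialize W in this session
print(session.run(h, feed_dict={x: [[1.0, 2.0, 3.0]]}))  # prints [[6.]]
session.close()
```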
-------------------------------------------------------------------------------- /appendix/keras_cnn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "c:\\users\\thushan\\documents\\python_virtualenvs\\tensorflow_venv\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 13 | " from ._conv import register_converters as _register_converters\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import tensorflow as tf\n", 19 | "import numpy as np\n", 20 | "from tensorflow.python.keras.models import Sequential\n", 21 | "from tensorflow.python.keras.layers import Conv2D,Dense,MaxPool2D,BatchNormalization,Flatten\n", 22 | "from tensorflow.python.keras import backend as K\n", 23 | "\n", 24 | "# Required for Data downaload and preparation\n", 25 | "import struct\n", 26 | "import gzip\n", 27 | "import os\n", 28 | "from six.moves.urllib.request import urlretrieve" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## Lolading Data\n", 36 | "\n", 37 | "Here we download (if needed) the MNIST dataset and, perform reshaping and normalization. Also we conver the labels to one hot encoded vectors." 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "Found and verified train-images-idx3-ubyte.gz\n", 50 | "Found and verified train-labels-idx1-ubyte.gz\n", 51 | "Found and verified t10k-images-idx3-ubyte.gz\n", 52 | "Found and verified t10k-labels-idx1-ubyte.gz\n", 53 | "\n", 54 | "Reading files train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz\n", 55 | "60000 28 28\n", 56 | "(Images) Returned a tensor of shape (60000, 28, 28, 1)\n", 57 | "(Labels) Returned a tensor of shape: 60000\n", 58 | "Sample labels: [5 0 4 1 9 2 1 3 1 4]\n", 59 | "\n", 60 | "Reading files t10k-images-idx3-ubyte.gz and t10k-labels-idx1-ubyte.gz\n", 61 | "10000 28 28\n", 62 | "(Images) Returned a tensor of shape (10000, 28, 28, 1)\n", 63 | "(Labels) Returned a tensor of shape: 10000\n", 64 | "Sample labels: [7 2 1 0 4 1 4 9 5 9]\n", 65 | "\n", 66 | "Train size: 60000\n", 67 | "\n", 68 | "Test size: 10000\n" 69 | ] 70 | } 71 | ], 72 | "source": [ 73 | "def maybe_download(url, filename, expected_bytes, force=False):\n", 74 | " \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", 75 | " if force or not os.path.exists(filename):\n", 76 | " print('Attempting to download:', filename) \n", 77 | " filename, _ = urlretrieve(url + filename, filename)\n", 78 | " print('\\nDownload Complete!')\n", 79 | " statinfo = os.stat(filename)\n", 80 | " if statinfo.st_size == expected_bytes:\n", 81 | " print('Found and verified', filename)\n", 82 | " else:\n", 83 | " raise Exception(\n", 84 | " 'Failed to verify ' + filename + '. 
Can you get to it with a browser?')\n", 85 | " return filename\n", 86 | "\n", 87 | "\n", 88 | "def read_mnist(fname_img, fname_lbl, one_hot=False):\n", 89 | " print('\\nReading files %s and %s'%(fname_img, fname_lbl))\n", 90 | " \n", 91 | " # Processing images\n", 92 | " with gzip.open(fname_img) as fimg: \n", 93 | " magic, num, rows, cols = struct.unpack(\">IIII\", fimg.read(16))\n", 94 | " print(num,rows,cols)\n", 95 | " img = (np.frombuffer(fimg.read(num*rows*cols), dtype=np.uint8).reshape(num, rows, cols,1)).astype(np.float32)\n", 96 | " print('(Images) Returned a tensor of shape ',img.shape)\n", 97 | " \n", 98 | " #img = (img - np.mean(img)) /np.std(img)\n", 99 | " img *= 1.0 / 255.0\n", 100 | " \n", 101 | " # Processing labels\n", 102 | " with gzip.open(fname_lbl) as flbl:\n", 103 | " # flbl.read(8) reads upto 8 bytes\n", 104 | " magic, num = struct.unpack(\">II\", flbl.read(8)) \n", 105 | " lbl = np.frombuffer(flbl.read(num), dtype=np.int8)\n", 106 | " if one_hot:\n", 107 | " one_hot_lbl = np.zeros(shape=(num,10),dtype=np.float32)\n", 108 | " one_hot_lbl[np.arange(num),lbl] = 1.0\n", 109 | " print('(Labels) Returned a tensor of shape: %s'%lbl.shape)\n", 110 | " print('Sample labels: ',lbl[:10])\n", 111 | " \n", 112 | " if not one_hot:\n", 113 | " return img, lbl\n", 114 | " else:\n", 115 | " return img, one_hot_lbl\n", 116 | " \n", 117 | " \n", 118 | "# Download data if needed\n", 119 | "url = 'http://yann.lecun.com/exdb/mnist/'\n", 120 | "# training data\n", 121 | "maybe_download(url,'train-images-idx3-ubyte.gz',9912422)\n", 122 | "maybe_download(url,'train-labels-idx1-ubyte.gz',28881)\n", 123 | "# testing data\n", 124 | "maybe_download(url,'t10k-images-idx3-ubyte.gz',1648877)\n", 125 | "maybe_download(url,'t10k-labels-idx1-ubyte.gz',4542)\n", 126 | "\n", 127 | "# Read the training and testing data \n", 128 | "train_inputs, train_labels = read_mnist('train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz',True)\n", 129 | "test_inputs, test_labels = read_mnist('t10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz',True)\n", 130 | "\n", 131 | "\n", 132 | "print('\\nTrain size: ', train_inputs.shape[0])\n", 133 | "print('\\nTest size: ', test_inputs.shape[0])" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## Data Generators for MNIST\n", 141 | "\n", 142 | "Here we have the logic to iterate through each training, validation and testing datasets, in `batch_size` size strides." 
143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 3, 148 | "metadata": { 149 | "collapsed": true 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "train_index, test_index = 0,0\n", 154 | "\n", 155 | "def get_train_batch(images, labels, batch_size):\n", 156 | " global train_index\n", 157 | " batch = images[train_index:train_index+batch_size,:,:,:], labels[train_index:train_index+batch_size,:]\n", 158 | " train_index = (train_index + batch_size)%(images.shape[0] - batch_size)\n", 159 | " return batch\n", 160 | "\n", 161 | "\n", 162 | "def get_test_batch(images, labels, batch_size):\n", 163 | " global test_index\n", 164 | " batch = images[test_index:test_index+batch_size,:,:,:], labels[test_index:test_index+batch_size,:]\n", 165 | " test_index = (test_index + batch_size)%(images.shape[0] - batch_size)\n", 166 | " return batch" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 4, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "config = tf.ConfigProto(allow_soft_placement=True)\n", 178 | "# Good practice to use this to avoid any surprising errors thrown by TensorFlow\n", 179 | "config.gpu_options.allow_growth = True \n", 180 | "config.gpu_options.per_process_gpu_memory_fraction = 0.9 # Making sure Tensorflow doesn't overflow the GPU\n", 181 | "sess = tf.Session(config=config)\n", 182 | "K.set_session(sess)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "# Define a sequential model\n", 194 | "model = Sequential()\n", 195 | "\n", 196 | "# Added a convolution layer\n", 197 | "model.add(Conv2D(32, 3, activation='relu', input_shape=[28, 28, 1]))\n", 198 | "\n", 199 | "# Add a max pool lyer\n", 200 | "model.add(MaxPool2D())\n", 201 | "\n", 202 | "# Add a batch norm layer\n", 203 | "model.add(BatchNormalization())\n", 204 | "\n", 205 | "# Convolution layer\n", 206 | "model.add(Conv2D(64, 3, activation='relu'))\n", 207 | "# Max pool layer\n", 208 | "model.add(MaxPool2D())\n", 209 | "# Add a batch norm layer\n", 210 | "model.add(BatchNormalization())\n", 211 | "\n", 212 | "# More convolution, max pool, batch norm\n", 213 | "model.add(Conv2D(128, 3, activation='relu'))\n", 214 | "model.add(MaxPool2D())\n", 215 | "model.add(BatchNormalization())\n", 216 | "\n", 217 | "model.add(Flatten())\n", 218 | "\n", 219 | "model.add(Dense(256, activation='relu'))\n", 220 | "model.add(Dense(10, activation='softmax'))\n", 221 | "\n", 222 | "model.compile(\n", 223 | " optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']\n", 224 | ")" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 6, 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "Training for epoch: 0\n", 237 | "Epoch 1/1\n", 238 | "60000/60000 [==============================] - 13s 210us/step - loss: 0.1245 - acc: 0.9628\n", 239 | "10000/10000 [==============================] - 1s 67us/step\n", 240 | "\tEpoch ( 0 ) Test accuracy: 0.9757000070810318\n", 241 | "Training for epoch: 1\n", 242 | "Epoch 1/1\n", 243 | "60000/60000 [==============================] - 11s 182us/step - loss: 0.0452 - acc: 0.9859\n", 244 | "10000/10000 [==============================] - 1s 55us/step\n", 245 | "\tEpoch ( 1 ) Test accuracy: 0.9827000075578689\n", 246 | "Training for epoch: 2\n", 247 | "Epoch 1/1\n", 248 | "60000/60000 
[==============================] - 11s 186us/step - loss: 0.0294 - acc: 0.9910\n", 249 | "10000/10000 [==============================] - 1s 56us/step\n", 250 | "\tEpoch ( 2 ) Test accuracy: 0.9847000110149383\n", 251 | "Training for epoch: 3\n", 252 | "Epoch 1/1\n", 253 | "60000/60000 [==============================] - 11s 184us/step - loss: 0.0255 - acc: 0.9918\n", 254 | "10000/10000 [==============================] - 1s 56us/step\n", 255 | "\tEpoch ( 3 ) Test accuracy: 0.9859000080823899\n", 256 | "Training for epoch: 4\n", 257 | "Epoch 1/1\n", 258 | "60000/60000 [==============================] - 11s 182us/step - loss: 0.0197 - acc: 0.9937\n", 259 | "10000/10000 [==============================] - 1s 56us/step\n", 260 | "\tEpoch ( 4 ) Test accuracy: 0.9872000062465668\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "n_epochs = 5 # Number of epochs the training runs for\n", 266 | "\n", 267 | "n_train = 55000\n", 268 | "n_test = 10000\n", 269 | "\n", 270 | "batch_size = 100\n", 271 | "\n", 272 | "x_train, y_train = train_inputs, train_labels\n", 273 | " \n", 274 | "x_test, y_test = test_inputs, test_labels\n", 275 | " \n", 276 | "for epoch in range(n_epochs):\n", 277 | "\n", 278 | " print('Training for epoch: ',epoch) \n", 279 | " \n", 280 | " # Training for a single epoch\n", 281 | " model.fit(x_train, y_train, batch_size = batch_size)\n", 282 | " \n", 283 | " # Testing phase\n", 284 | " # Returns a list where first item is loss and second is accuracy\n", 285 | " test_acc = model.evaluate(x_test, y_test, batch_size=batch_size) \n", 286 | " print('\\tEpoch (', epoch ,') Test accuracy: ',test_acc[1])" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": { 293 | "collapsed": true 294 | }, 295 | "outputs": [], 296 | "source": [] 297 | } 298 | ], 299 | "metadata": { 300 | "kernelspec": { 301 | "display_name": "Python 3", 302 | "language": "python", 303 | "name": "python3" 304 | }, 305 | "language_info": { 306 | "codemirror_mode": { 307 | "name": "ipython", 308 | "version": 3 309 | }, 310 | "file_extension": ".py", 311 | "mimetype": "text/x-python", 312 | "name": "python", 313 | "nbconvert_exporter": "python", 314 | "pygments_lexer": "ipython3", 315 | "version": "3.5.2" 316 | } 317 | }, 318 | "nbformat": 4, 319 | "nbformat_minor": 2 320 | } 321 | -------------------------------------------------------------------------------- /appendix/tensorboard_word_embeddings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Visualizing Word Embeddings on the Tensorboard" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 17, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import tensorflow as tf\n", 20 | "import os\n", 21 | "import zipfile\n", 22 | "from tensorflow.contrib.tensorboard.plugins import projector\n", 23 | "import csv\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Read the GloVe file\n", 31 | "\n", 32 | "Here we first need to download the GloVe word embeddings (`glove.6B.zip`) found at this [website](https://nlp.stanford.edu/projects/glove/). Then we read the GloVe file to get the first 50000 words in the file. 
We will be using 50 dimensional word vectors" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 18, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | ".....\tDone\n" 45 | ] 46 | } 47 | ], 48 | "source": [ 49 | "vocabulary_size = 50000\n", 50 | "\n", 51 | "pret_embeddings = np.empty(shape=(vocabulary_size,50),dtype=np.float32)\n", 52 | "\n", 53 | "words = [] \n", 54 | "\n", 55 | "word_idx = 0\n", 56 | "# Open the zip file\n", 57 | "with zipfile.ZipFile('glove.6B.zip') as glovezip:\n", 58 | " # Read the file with 50 dimensional embeddings\n", 59 | " with glovezip.open('glove.6B.50d.txt') as glovefile:\n", 60 | " # Read line by line\n", 61 | " for li, line in enumerate(glovefile):\n", 62 | " # Print progress\n", 63 | " if (li+1)%10000==0: print('.',end='')\n", 64 | " \n", 65 | " # Get the word and the corresponding vector\n", 66 | " line_tokens = line.decode('utf-8').split(' ')\n", 67 | " word = line_tokens[0]\n", 68 | " vector = [float(v) for v in line_tokens[1:]]\n", 69 | " \n", 70 | " assert len(vector)==50\n", 71 | " words.append(word)\n", 72 | " # Update the embedding matrix\n", 73 | " pret_embeddings[word_idx,:] = np.array(vector)\n", 74 | " word_idx += 1\n", 75 | " # If the first 50000 words being read, finish\n", 76 | " if word_idx == vocabulary_size:\n", 77 | " break\n", 78 | " \n", 79 | "print('\\tDone')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Create TensorFlow Variable\n", 87 | "\n", 88 | "Here we create a TensorFlow variable to store the embeddings we read above and save it to the disk. This is necessary for the visualization." 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 19, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "# Create a directory to save our model\n", 100 | "log_dir = 'models'\n", 101 | "if not os.path.exists(log_dir):\n", 102 | " os.mkdir(log_dir)\n", 103 | "\n", 104 | "tf.reset_default_graph()\n", 105 | "\n", 106 | "# Create a Tensorflow variable initialized with the word embedings we just read in\n", 107 | "embeddings = tf.get_variable('embeddings',shape=[vocabulary_size, 50],\n", 108 | " initializer=tf.constant_initializer(pret_embeddings))\n", 109 | "\n", 110 | "session = tf.InteractiveSession()\n", 111 | "\n", 112 | "tf.global_variables_initializer().run()\n", 113 | "\n", 114 | "# Define a saver, that will save the Tensorflow variables to a given location\n", 115 | "saver = tf.train.Saver({'embeddings':embeddings})\n", 116 | "# Save the file\n", 117 | "saver.save(session, os.path.join(log_dir, \"model.ckpt\"), 0)\n", 118 | "\n", 119 | "# Define metadata for word embeddings\n", 120 | "with open(os.path.join(log_dir,'metadata.tsv'), 'w',encoding='utf-8') as csvfile:\n", 121 | " writer = csv.writer(csvfile, delimiter='\\t',\n", 122 | " quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", 123 | " writer.writerow(['Word','Word ID'])\n", 124 | " for wi,w in enumerate(words):\n", 125 | " writer.writerow([w,wi])" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "## Define the configuration to tell the Tensorboard where and what to look" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 20, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "config = projector.ProjectorConfig()\n", 144 | "\n", 145 | "# You can add multiple 
embeddings. Here we add only one.\n", 146 | "embedding_config = config.embeddings.add()\n", 147 | "embedding_config.tensor_name = embeddings.name\n", 148 | "# Link this tensor to its metadata file (e.g. labels).\n", 149 | "embedding_config.metadata_path = 'metadata.tsv'\n", 150 | "\n", 151 | "# Use the same LOG_DIR where you stored your checkpoint.\n", 152 | "summary_writer = tf.summary.FileWriter(log_dir)\n", 153 | "\n", 154 | "# The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will\n", 155 | "# read this file during startup.\n", 156 | "projector.visualize_embeddings(summary_writer, config)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [] 167 | } 168 | ], 169 | "metadata": { 170 | "kernelspec": { 171 | "display_name": "Python 3", 172 | "language": "python", 173 | "name": "python3" 174 | }, 175 | "language_info": { 176 | "codemirror_mode": { 177 | "name": "ipython", 178 | "version": 3 179 | }, 180 | "file_extension": ".py", 181 | "mimetype": "text/x-python", 182 | "name": "python", 183 | "nbconvert_exporter": "python", 184 | "pygments_lexer": "ipython3", 185 | "version": "3.5.2" 186 | } 187 | }, 188 | "nbformat": 4, 189 | "nbformat_minor": 2 190 | } 191 | -------------------------------------------------------------------------------- /ch10/bleu_score_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Performance Evaluation of MT\n", 8 | "### BLEU Score: Adequacy and Fluency of Translations\n", 9 | "\n", 10 | "1. Calculate the modified n-gram precision for the full corpus for all n=1...N\n", 11 | "2. Calculate the geometric mean (gm-precision) of all the precisions\n", 12 | "3. Calculate the brevity penalty (bp) for the full corpus\n", 13 | "3. 
Calculate the BLEU by bp * gm-precision\n", 14 | "\n", 15 | "Example Calculation:\n", 16 | "\n", 17 | "* Candidate1: the the the the the the the\n", 18 | "* Candidate2: the cat is on the mat\n", 19 | "* Ref1: the cat sat on the mat\n", 20 | "* Ref2: there is a cat on the mat\n", 21 | "\n", 22 | "#### Modified 1-gram Precision (Measures adequacy)\n", 23 | "* Candidate1: $\\frac{2}{7}$\n", 24 | "* Candidate2: $\\frac{5}{5}$\n", 25 | "\n", 26 | "#### Modified 2-gram Precision (Measures fluency)\n", 27 | "* Candidate1: $\\frac{0}{1}$\n", 28 | "* Candidate2: $\\frac{3}{5}$\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": { 35 | "collapsed": true 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "import numpy as np\n", 40 | "from collections import Counter" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 11, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "Calculating modified 1-gram precision\n", 53 | "\tReference sentence: the cat sat on the mat\n", 54 | "\t Candidate sentence: the the the\n", 55 | "\tReference sentence: there is a cat on the mat\n", 56 | "\t Candidate sentence: the a cat\n", 57 | "Calculating modified 2-gram precision\n", 58 | "\tReference sentence: the cat sat on the mat\n", 59 | "\t Candidate sentence: the the the\n", 60 | "\tReference sentence: there is a cat on the mat\n", 61 | "\t Candidate sentence: the a cat\n", 62 | "Calculating modified 3-gram precision\n", 63 | "\tReference sentence: the cat sat on the mat\n", 64 | "\t Candidate sentence: the the the\n", 65 | "\tReference sentence: there is a cat on the mat\n", 66 | "\t Candidate sentence: the a cat\n", 67 | "\n", 68 | "BLEU-3: 8.568589920310384e-35\n", 69 | "\n", 70 | "Calculating modified 1-gram precision\n", 71 | "\tReference sentence: the cat sat on the mat\n", 72 | "\t Candidate sentence: the dog on the mat\n", 73 | "\tReference sentence: there is a cat on the mat\n", 74 | "\t Candidate sentence: there is cat on the mat\n", 75 | "Calculating modified 2-gram precision\n", 76 | "\tReference sentence: the cat sat on the mat\n", 77 | "\t Candidate sentence: the dog on the mat\n", 78 | "\tReference sentence: there is a cat on the mat\n", 79 | "\t Candidate sentence: there is cat on the mat\n", 80 | "Calculating modified 3-gram precision\n", 81 | "\tReference sentence: the cat sat on the mat\n", 82 | "\t Candidate sentence: the dog on the mat\n", 83 | "\tReference sentence: there is a cat on the mat\n", 84 | "\t Candidate sentence: there is cat on the mat\n", 85 | "\n", 86 | "BLEU-3: 0.5319658954895262\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "def unique_n_gram_string(n_gram):\n", 92 | " string = ''\n", 93 | " for g in n_gram[:-1]:\n", 94 | " string += str(g)+'-'\n", 95 | " \n", 96 | " string += str(n_gram[-1])\n", 97 | " return string\n", 98 | "\n", 99 | "def calculate_mod_n_gram_precision(n_gram, refs, cands):\n", 100 | " \n", 101 | " denominator = 0.0\n", 102 | " \n", 103 | " tot_bleu = 0.0\n", 104 | " \n", 105 | " tot_ref_length, tot_cand_length = 0, 0\n", 106 | " for ref, cand in zip(refs, cands):\n", 107 | " \n", 108 | " print('\\tReference sentence: ',' '.join([reverse_test_dict[r] for r in ref]))\n", 109 | " print('\\t Candidate sentence: ',' '.join([reverse_test_dict[c] for c in cand]))\n", 110 | "\n", 111 | " denominator += max(cand.size + 1 - n_gram,1)\n", 112 | " tot_ref_length += ref.size\n", 113 | " tot_cand_length += cand.size\n", 114 | " \n", 115 | " # find 
unique n-grams in predicted\n", 116 | " cand_n_grams = [unique_n_gram_string(cand[w_i:w_i+n_gram]) for w_i in range(cand.size + 1 - n_gram)]\n", 117 | " cand_n_grams = list(set(cand_n_grams))\n", 118 | "\n", 119 | " occurences_for_unique_grams = dict(zip(cand_n_grams,[0 for _ in cand_n_grams]))\n", 120 | "\n", 121 | " ref_n_grams = [unique_n_gram_string(ref[w_i:w_i+n_gram]) for w_i in range(ref.size + 1 - n_gram)]\n", 122 | " ref_counts = Counter(ref_n_grams)\n", 123 | "\n", 124 | " # iterates through every n_gram in the predicted\n", 125 | " for w_i in range(cand.size + 1 - n_gram): \n", 126 | " c_gram = cand[w_i:w_i+n_gram]\n", 127 | " gram_string = unique_n_gram_string(c_gram)\n", 128 | " \n", 129 | " for ref_i in range(ref.size + 1 - n_gram):\n", 130 | "\n", 131 | " r_gram = ref[ref_i:ref_i+n_gram]\n", 132 | "\n", 133 | " found_gram_in_actual = int(np.prod(c_gram == r_gram))\n", 134 | "\n", 135 | " occurences_for_unique_grams[gram_string] += found_gram_in_actual\n", 136 | "\n", 137 | " \n", 138 | " for g, occ in occurences_for_unique_grams.items():\n", 139 | " g_bleu = float(occ)\n", 140 | " if g in ref_counts:\n", 141 | " g_bleu = min(g_bleu,ref_counts[g])\n", 142 | "\n", 143 | " tot_bleu += g_bleu\n", 144 | "\n", 145 | " mod_n_prec = tot_bleu/denominator\n", 146 | " \n", 147 | " \n", 148 | " return mod_n_prec, tot_ref_length, tot_cand_length\n", 149 | "\n", 150 | "\n", 151 | "def calculate_bleu(refs, cands, high_n):\n", 152 | " weight = 1.0/high_n # using the same weight for all mod n_gram precisions\n", 153 | " \n", 154 | " tot_precision = []\n", 155 | " for n in range(1,high_n+1): \n", 156 | " print('Calculating modified %d-gram precision'%n)\n", 157 | " prec, tot_ref_length, tot_cand_length = calculate_mod_n_gram_precision(n,refs,cands)\n", 158 | " tot_precision.append(weight*np.log(prec + 1e-100))\n", 159 | " \n", 160 | " brevity_penalty = 1.0\n", 161 | " \n", 162 | " if tot_cand_length <= tot_ref_length:\n", 163 | " brevity_penalty = np.exp(1.0-(tot_ref_length*1.0/max(tot_cand_length,1)))\n", 164 | " \n", 165 | " bleu = brevity_penalty * np.exp(np.sum(tot_precision))\n", 166 | "\n", 167 | " return bleu\n", 168 | "\n", 169 | "test_dict = {'the':10,'cat':11,'sat':12,'on':13,'mat':14,'is':15,'there':16,'a':17,'dog':18}\n", 170 | "reverse_test_dict = dict(zip(test_dict.values(),test_dict.keys()))\n", 171 | "\n", 172 | "sample_text_refs = [['the','cat','sat','on','the','mat'],['there','is','a','cat','on','the','mat']]\n", 173 | "sample_refs = []\n", 174 | "for r in sample_text_refs:\n", 175 | " sample_refs.append(np.asarray([test_dict[w] for w in r],dtype=np.int32))\n", 176 | "\n", 177 | "sample_text_cands_1 = [['the','the','the'],['the','a','cat']]\n", 178 | "sample_cands_1 = []\n", 179 | "for c1 in sample_text_cands_1:\n", 180 | " sample_cands_1.append(np.asarray([test_dict[w] for w in c1],dtype=np.int32))\n", 181 | "\n", 182 | "\n", 183 | "sample_text_cands_2 = [['the','dog','on','the','mat'],['there','is','cat','on','the','mat']]\n", 184 | "sample_cands_2 = []\n", 185 | "for c2 in sample_text_cands_2:\n", 186 | " sample_cands_2.append(np.asarray([test_dict[w] for w in c2],dtype=np.int32))\n", 187 | "\n", 188 | "\n", 189 | "b1 = calculate_bleu(sample_refs,sample_cands_1,3)\n", 190 | "print('\\nBLEU-3: ',b1)\n", 191 | "print()\n", 192 | "\n", 193 | "b2 = calculate_bleu(sample_refs,sample_cands_2,3)\n", 194 | "print('\\nBLEU-3: ',b2)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "collapsed": true 202 | }, 203 | 
"outputs": [], 204 | "source": [] 205 | } 206 | ], 207 | "metadata": { 208 | "kernelspec": { 209 | "display_name": "Python 3", 210 | "language": "python", 211 | "name": "python3" 212 | }, 213 | "language_info": { 214 | "codemirror_mode": { 215 | "name": "ipython", 216 | "version": 3 217 | }, 218 | "file_extension": ".py", 219 | "mimetype": "text/x-python", 220 | "name": "python", 221 | "nbconvert_exporter": "python", 222 | "pygments_lexer": "ipython3", 223 | "version": "3.5.2" 224 | } 225 | }, 226 | "nbformat": 4, 227 | "nbformat_minor": 2 228 | } 229 | -------------------------------------------------------------------------------- /ch10/word2vec.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import math 4 | sentence_cursors = None 5 | tot_sentences = None 6 | src_max_sent_length, tgt_max_sent_length = 0, 0 7 | src_dictionary, tgt_dictionary = {}, {} 8 | src_reverse_dictionary, tgt_reverse_dictionary = {},{} 9 | train_inputs, train_outputs = None, None 10 | embedding_size = None # Dimension of the embedding vector. 11 | vocabulary_size = None 12 | def define_data_and_hyperparameters( 13 | _tot_sentences, _src_max, _tgt_max, _src_dict, _tgt_dict, 14 | _src_rev_dict, _tgt_rev_dict, _tr_inp, _tr_out, _emb_size, _vocab_size): 15 | global tot_sentences, sentence_cursors 16 | global src_max_sent_length, tgt_max_sent_length 17 | global src_dictionary, tgt_dictionary 18 | global src_reverse_dictionary, tgt_reverse_dictionary 19 | global train_inputs, train_outputs 20 | global embedding_size, vocabulary_size 21 | 22 | embedding_size = _emb_size 23 | vocabulary_size = _vocab_size 24 | src_max_sent_length, tgt_max_sent_length = _src_max, _tgt_max 25 | 26 | src_dictionary = _src_dict 27 | tgt_dictionary = _tgt_dict 28 | 29 | src_reverse_dictionary = _src_rev_dict 30 | tgt_reverse_dictionary = _tgt_rev_dict 31 | 32 | train_inputs = _tr_inp 33 | train_outputs = _tr_out 34 | 35 | tot_sentences = _tot_sentences 36 | sentence_cursors = [0 for _ in range(tot_sentences)] 37 | 38 | 39 | def generate_batch_for_word2vec(batch_size, window_size, is_source): 40 | # window_size is the amount of words we're looking at from each side of a given word 41 | # creates a single batch 42 | global sentence_cursors 43 | global src_dictionary, tgt_dictionary 44 | global train_inputs, train_outputs 45 | span = 2 * window_size + 1 # [ skip_window target skip_window ] 46 | 47 | batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32) 48 | labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) 49 | # e.g if skip_window = 2 then span = 5 50 | # span is the length of the whole frame we are considering for a single word (left + word + right) 51 | # skip_window is the length of one side 52 | 53 | sentence_ids_for_batch = np.random.randint(0, tot_sentences, batch_size) 54 | 55 | for b_i in range(batch_size): 56 | sent_id = sentence_ids_for_batch[b_i] 57 | 58 | if is_source: 59 | buffer = train_inputs[sent_id, sentence_cursors[sent_id]:sentence_cursors[sent_id] + span] 60 | else: 61 | buffer = train_outputs[sent_id, sentence_cursors[sent_id]:sentence_cursors[sent_id] + span] 62 | assert buffer.size == span, 'Buffer length (%d), Current data index (%d), Span(%d)' % ( 63 | buffer.size, sentence_cursors[sent_id], span) 64 | # If we only have EOS tokesn in the sampled text, we sample a new one 65 | if is_source: 66 | while np.all(buffer == src_dictionary['']): 67 | # reset the sentence_cursors for that cap_id 68 | 
sentence_cursors[sent_id] = 0 69 | # sample a new cap_id 70 | sent_id = np.random.randint(0, tot_sentences) 71 | buffer = train_inputs[sent_id, sentence_cursors[sent_id]:sentence_cursors[sent_id] + span] 72 | else: 73 | while np.all(buffer == tgt_dictionary['']): 74 | # reset the sentence_cursors for that cap_id 75 | sentence_cursors[sent_id] = 0 76 | # sample a new cap_id 77 | sent_id = np.random.randint(0, tot_sentences) 78 | buffer = train_outputs[sent_id, sentence_cursors[sent_id]:sentence_cursors[sent_id] + span] 79 | 80 | # fill left and right sides of batch 81 | batch[b_i, :window_size] = buffer[:window_size] 82 | batch[b_i, window_size:] = buffer[window_size + 1:] 83 | 84 | labels[b_i, 0] = buffer[window_size] 85 | 86 | # increase the corresponding index 87 | if is_source: 88 | sentence_cursors[sent_id] = (sentence_cursors[sent_id] + 1) % (src_max_sent_length - span) 89 | else: 90 | sentence_cursors[sent_id] = (sentence_cursors[sent_id] + 1) % (tgt_max_sent_length - span) 91 | 92 | assert batch.shape[0] == batch_size and batch.shape[1] == span - 1 93 | return batch, labels 94 | 95 | 96 | def print_some_batches(): 97 | global sentence_cursors, tot_sentences 98 | global src_reverse_dictionary 99 | 100 | for window_size in [1, 2]: 101 | sentence_cursors = [0 for _ in range(tot_sentences)] 102 | batch, labels = generate_batch_for_word2vec(batch_size=8, window_size=window_size, is_source=True) 103 | print('\nwith window_size = %d:' % (window_size)) 104 | print(' batch:', [[src_reverse_dictionary[bii] for bii in bi] for bi in batch]) 105 | print(' labels:', [src_reverse_dictionary[li] for li in labels.reshape(8)]) 106 | 107 | sentence_cursors = [0 for _ in range(tot_sentences)] 108 | 109 | batch_size, window_size = None, None 110 | valid_size, valid_window, valid_examples = None, None, None 111 | num_sampled = None 112 | 113 | train_dataset, train_labels = None, None 114 | valid_dataset = None 115 | 116 | softmax_weights, softmax_biases = None, None 117 | 118 | loss, optimizer, similarity, normalized_embeddings = None, None, None, None 119 | 120 | def define_word2vec_tensorflow(batch_size): 121 | 122 | global embedding_size, window_size 123 | global valid_size, valid_window, valid_examples 124 | global num_sampled 125 | global train_dataset, train_labels 126 | global valid_dataset 127 | global softmax_weights, softmax_biases 128 | global loss, optimizer, similarity 129 | global vocabulary_size, embedding_size 130 | global normalized_embeddings 131 | 132 | 133 | window_size = 2 # How many words to consider left and right. 134 | # We pick a random validation set to sample nearest neighbors. here we limit the 135 | # validation samples to the words that have a low numeric ID, which by 136 | # construction are also the most frequent. 137 | valid_size = 20 # Random set of words to evaluate similarity on. 138 | valid_window = 100 # Only pick dev samples in the head of the distribution. 139 | # pick 16 samples from 100 140 | valid_examples = np.array(np.random.randint(0, valid_window, valid_size // 2)) 141 | valid_examples = np.append(valid_examples, np.random.randint(1000, 1000 + valid_window, valid_size // 2)) 142 | num_sampled = 32 # Number of negative examples to sample. 143 | 144 | tf.reset_default_graph() 145 | 146 | # Input data. 147 | train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size]) 148 | train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) 149 | valid_dataset = tf.constant(valid_examples, dtype=tf.int32) 150 | 151 | # Variables. 
152 | # embedding, vector for each word in the vocabulary 153 | embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0, dtype=tf.float32)) 154 | softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], 155 | stddev=1.0 / math.sqrt(embedding_size), dtype=tf.float32)) 156 | softmax_biases = tf.Variable(tf.zeros([vocabulary_size], dtype=tf.float32)) 157 | 158 | # Model. 159 | # Look up embeddings for inputs. 160 | # this might efficiently find the embeddings for given ids (traind dataset) 161 | # manually doing this might not be efficient given there are 50000 entries in embeddings 162 | stacked_embedings = None 163 | print('Defining %d embedding lookups representing each word in the context' % (2 * window_size)) 164 | for i in range(2 * window_size): 165 | embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:, i]) 166 | x_size, y_size = embedding_i.get_shape().as_list() 167 | if stacked_embedings is None: 168 | stacked_embedings = tf.reshape(embedding_i, [x_size, y_size, 1]) 169 | else: 170 | stacked_embedings = tf.concat(axis=2, 171 | values=[stacked_embedings, tf.reshape(embedding_i, [x_size, y_size, 1])]) 172 | 173 | assert stacked_embedings.get_shape().as_list()[2] == 2 * window_size 174 | print("Stacked embedding size: %s" % stacked_embedings.get_shape().as_list()) 175 | mean_embeddings = tf.reduce_mean(stacked_embedings, 2, keepdims=False) 176 | print("Reduced mean embedding size: %s" % mean_embeddings.get_shape().as_list()) 177 | 178 | # Compute the softmax loss, using a sample of the negative labels each time. 179 | # inputs are embeddings of the train words 180 | # with this loss we optimize weights, biases, embeddings 181 | 182 | loss = tf.reduce_mean( 183 | tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=mean_embeddings, 184 | labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)) 185 | 186 | # Optimizer. 187 | # Note: The optimizer will optimize the softmax_weights AND the embeddings. 188 | optimizer = tf.train.AdamOptimizer(0.001).minimize(loss) 189 | 190 | # Compute the similarity between minibatch examples and all embeddings. 
191 | # We use the cosine distance: 192 | norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True)) 193 | normalized_embeddings = embeddings / norm 194 | valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 195 | similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings)) 196 | 197 | 198 | def run_word2vec_source(batch_size): 199 | global embedding_size, window_size 200 | global valid_size, valid_window, valid_examples 201 | global num_sampled 202 | global train_dataset, train_labels 203 | global valid_dataset 204 | global softmax_weights, softmax_biases 205 | global loss, optimizer, similarity, normalized_embeddings 206 | global src_reverse_dictionary 207 | global vocabulary_size, embedding_size 208 | 209 | num_steps = 100001 210 | 211 | config=tf.ConfigProto(allow_soft_placement=True) 212 | config.gpu_options.allow_growth = True 213 | 214 | with tf.Session(config=config) as session: 215 | tf.global_variables_initializer().run() 216 | print('Initialized') 217 | average_loss = 0 218 | for step in range(num_steps): 219 | 220 | batch_data, batch_labels = generate_batch_for_word2vec(batch_size, window_size, is_source=True) 221 | feed_dict = {train_dataset: batch_data, train_labels: batch_labels} 222 | _, l = session.run([optimizer, loss], feed_dict=feed_dict) 223 | average_loss += l 224 | if (step + 1) % 2000 == 0: 225 | if step > 0: 226 | average_loss = average_loss / 2000 227 | # The average loss is an estimate of the loss over the last 2000 batches. 228 | print('Average loss at step %d: %f' % (step + 1, average_loss)) 229 | average_loss = 0 230 | # note that this is expensive (~20% slowdown if computed every 500 steps) 231 | if (step + 1) % 10000 == 0: 232 | sim = similarity.eval() 233 | for i in range(valid_size): 234 | valid_word = src_reverse_dictionary[valid_examples[i]] 235 | top_k = 8 # number of nearest neighbors 236 | nearest = (-sim[i, :]).argsort()[1:top_k + 1] 237 | log = 'Nearest to %s:' % valid_word 238 | for k in range(top_k): 239 | close_word = src_reverse_dictionary[nearest[k]] 240 | log = '%s %s,' % (log, close_word) 241 | print(log) 242 | cbow_final_embeddings = normalized_embeddings.eval() 243 | 244 | np.save('de-embeddings.npy', cbow_final_embeddings) 245 | 246 | def run_word2vec_target(batch_size): 247 | global embedding_size, window_size 248 | global valid_size, valid_window, valid_examples 249 | global num_sampled 250 | global train_dataset, train_labels 251 | global valid_dataset 252 | global softmax_weights, softmax_biases 253 | global loss, optimizer, similarity, normalized_embeddings 254 | global tgt_reverse_dictionary 255 | global vocabulary_size, embedding_size 256 | 257 | num_steps = 100001 258 | 259 | config=tf.ConfigProto(allow_soft_placement=True) 260 | config.gpu_options.allow_growth = True 261 | with tf.Session(config=config) as session: 262 | tf.global_variables_initializer().run() 263 | print('Initialized') 264 | average_loss = 0 265 | for step in range(num_steps): 266 | 267 | batch_data, batch_labels = generate_batch_for_word2vec(batch_size, window_size, is_source=False) 268 | feed_dict = {train_dataset: batch_data, train_labels: batch_labels} 269 | _, l = session.run([optimizer, loss], feed_dict=feed_dict) 270 | average_loss += l 271 | if (step + 1) % 2000 == 0: 272 | if step > 0: 273 | average_loss = average_loss / 2000 274 | # The average loss is an estimate of the loss over the last 2000 batches. 
275 | print('Average loss at step %d: %f' % (step + 1, average_loss)) 276 | average_loss = 0 277 | # note that this is expensive (~20% slowdown if computed every 500 steps) 278 | if (step + 1) % 10000 == 0: 279 | sim = similarity.eval() 280 | for i in range(valid_size): 281 | valid_word = tgt_reverse_dictionary[valid_examples[i]] 282 | top_k = 8 # number of nearest neighbors 283 | nearest = (-sim[i, :]).argsort()[1:top_k + 1] 284 | log = 'Nearest to %s:' % valid_word 285 | for k in range(top_k): 286 | close_word = tgt_reverse_dictionary[nearest[k]] 287 | log = '%s %s,' % (log, close_word) 288 | print(log) 289 | cbow_final_embeddings = normalized_embeddings.eval() 290 | 291 | np.save('en-embeddings.npy', cbow_final_embeddings) -------------------------------------------------------------------------------- /ch2/test1.txt: -------------------------------------------------------------------------------- 1 | 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 2 | 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 3 | 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 4 | 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 5 | 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 6 | -------------------------------------------------------------------------------- /ch2/test2.txt: -------------------------------------------------------------------------------- 1 | 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 2 | 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 3 | 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 4 | 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 5 | 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 6 | -------------------------------------------------------------------------------- /ch2/test3.txt: -------------------------------------------------------------------------------- 1 | 1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1 2 | 1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1 3 | 1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1 4 | 1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1 5 | 1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1 6 | -------------------------------------------------------------------------------- /ch3/ch3_wordnet.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "# You first need to download the wordnet following these commands \n", 12 | "# before importing it\n", 13 | "import nltk\n", 14 | "nltk.download('wordnet')" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# you will need to download the wordnet corpus from nltk using nltk.download()\n", 24 | "from nltk.corpus import wordnet as wn" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Various Synset Relationships\n", 32 | "Here we will look at what lemmas, hypernyms, hyponyms, meronyms and holonyms look like" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 3, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | "All the available Synsets for car\n", 45 | "\t [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] \n", 46 | "\n", 47 | "Example definitions of available Synsets ...\n", 48 | "\t car.n.01 : a motor vehicle with four wheels; usually propelled by an internal combustion engine\n", 49 | "\t car.n.02 : a wheeled vehicle adapted to the rails of railroad\n", 50 | 
"\t car.n.03 : the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant\n", 51 | "\n", 52 | "\n", 53 | "Example lemmas for the Synset car.n.03\n", 54 | "\t ['car', 'auto', 'automobile'] \n", 55 | "\n", 56 | "Hypernyms of the Synset car.n.01\n", 57 | "\t motor_vehicle.n.01 \n", 58 | "\n", 59 | "Hyponyms of the Synset car.n.01\n", 60 | "\t ['ambulance.n.01', 'beach_wagon.n.01', 'bus.n.04'] \n", 61 | "\n", 62 | "Holonyms (Part) of the Synset car.n.03\n", 63 | "\t ['airship.n.01'] \n", 64 | "\n", 65 | "Meronyms (Part) of the Synset car.n.01\n", 66 | "\t ['accelerator.n.01', 'air_bag.n.01', 'auto_accessory.n.01'] \n", 67 | "\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "# shows all the available synsets\n", 73 | "word = 'car'\n", 74 | "car_syns = wn.synsets(word)\n", 75 | "print('All the available Synsets for ',word)\n", 76 | "print('\\t',car_syns,'\\n')\n", 77 | "\n", 78 | "# The definition of the first two synsets\n", 79 | "syns_defs = [car_syns[i].definition() for i in range(len(car_syns))]\n", 80 | "print('Example definitions of available Synsets ...')\n", 81 | "for i in range(3):\n", 82 | " print('\\t',car_syns[i].name(),': ',syns_defs[i])\n", 83 | "print('\\n')\n", 84 | "\n", 85 | "# Get the lemmas for the first Synset\n", 86 | "print('Example lemmas for the Synset ',car_syns[i].name())\n", 87 | "car_lemmas = car_syns[0].lemmas()[:3]\n", 88 | "print('\\t',[lemma.name() for lemma in car_lemmas],'\\n')\n", 89 | "\n", 90 | "# Let us get hypernyms for a Synset (general superclass)\n", 91 | "syn = car_syns[0]\n", 92 | "print('Hypernyms of the Synset ',syn.name())\n", 93 | "print('\\t',syn.hypernyms()[0].name(),'\\n')\n", 94 | "\n", 95 | "# Let us get hyponyms for a Synset (specific subclass)\n", 96 | "syn = car_syns[0]\n", 97 | "print('Hyponyms of the Synset ',syn.name())\n", 98 | "print('\\t',[hypo.name() for hypo in syn.hyponyms()[:3]],'\\n')\n", 99 | "\n", 100 | "# Let us get part-holonyms for a Synset (specific subclass)\n", 101 | "# also there is another holonym category called \"substance-holonyms\"\n", 102 | "syn = car_syns[2]\n", 103 | "print('Holonyms (Part) of the Synset ',syn.name())\n", 104 | "print('\\t',[holo.name() for holo in syn.part_holonyms()],'\\n')\n", 105 | "\n", 106 | "# Let us get meronyms for a Synset (specific subclass)\n", 107 | "# also there is another meronym category called \"substance-meronyms\"\n", 108 | "syn = car_syns[0]\n", 109 | "print('Meronyms (Part) of the Synset ',syn.name())\n", 110 | "print('\\t',[mero.name() for mero in syn.part_meronyms()[:3]],'\\n')" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Similarity between Synsets" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "Word Similarity (car)<->(lorry): 0.6956521739130435\n", 130 | "Word Similarity (car)<->(tree): 0.38095238095238093\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "word1, word2, word3 = 'car','lorry','tree'\n", 136 | "w1_syns, w2_syns, w3_syns = wn.synsets(word1), wn.synsets(word2), wn.synsets(word3)\n", 137 | "\n", 138 | "print('Word Similarity (%s)<->(%s): '%(word1,word2),wn.wup_similarity(w1_syns[0], w2_syns[0]))\n", 139 | "print('Word Similarity (%s)<->(%s): '%(word1,word3),wn.wup_similarity(w1_syns[0], w3_syns[0]))" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | 
"metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [] 150 | } 151 | ], 152 | "metadata": { 153 | "kernelspec": { 154 | "display_name": "Python 3", 155 | "language": "python", 156 | "name": "python3" 157 | }, 158 | "language_info": { 159 | "codemirror_mode": { 160 | "name": "ipython", 161 | "version": 3 162 | }, 163 | "file_extension": ".py", 164 | "mimetype": "text/x-python", 165 | "name": "python", 166 | "nbconvert_exporter": "python", 167 | "pygments_lexer": "ipython3", 168 | "version": "3.5.2" 169 | } 170 | }, 171 | "nbformat": 4, 172 | "nbformat_minor": 2 173 | } 174 | -------------------------------------------------------------------------------- /ch4/ch4_glove.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# GloVe: Global Vectors for Word2Vec" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stderr", 17 | "output_type": "stream", 18 | "text": [ 19 | "c:\\users\\thushan\\documents\\python_virtualenvs\\tensorflow_venv\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 20 | " from ._conv import register_converters as _register_converters\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "# These are all the modules we'll be using later. Make sure you can import them\n", 26 | "# before proceeding further.\n", 27 | "%matplotlib inline\n", 28 | "from __future__ import print_function\n", 29 | "import collections\n", 30 | "import math\n", 31 | "import numpy as np\n", 32 | "import os\n", 33 | "import random\n", 34 | "import tensorflow as tf\n", 35 | "import bz2\n", 36 | "from matplotlib import pylab\n", 37 | "from six.moves import range\n", 38 | "from six.moves.urllib.request import urlretrieve\n", 39 | "from sklearn.manifold import TSNE\n", 40 | "from sklearn.cluster import KMeans\n", 41 | "from scipy.sparse import lil_matrix\n", 42 | "import nltk # standard preprocessing\n", 43 | "import operator # sorting items in dictionary by value\n", 44 | "#nltk.download() #tokenizers/punkt/PY3/english.pickle\n", 45 | "from math import ceil" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Dataset\n", 53 | "This code downloads a [dataset](http://www.evanjones.ca/software/wikipedia2text.html) consisting of several Wikipedia articles totaling up to roughly 61 megabytes. Additionally the code makes sure the file has the correct size after downloading it." 
54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "Found and verified wikipedia2text-extracted.txt.bz2\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "url = 'http://www.evanjones.ca/software/'\n", 71 | "\n", 72 | "def maybe_download(filename, expected_bytes):\n", 73 | " \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", 74 | " if not os.path.exists(filename):\n", 75 | " filename, _ = urlretrieve(url + filename, filename)\n", 76 | " statinfo = os.stat(filename)\n", 77 | " if statinfo.st_size == expected_bytes:\n", 78 | " print('Found and verified %s' % filename)\n", 79 | " else:\n", 80 | " print(statinfo.st_size)\n", 81 | " raise Exception(\n", 82 | " 'Failed to verify ' + filename + '. Can you get to it with a browser?')\n", 83 | " return filename\n", 84 | "\n", 85 | "filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## Read Data with Preprocessing with NLTK\n", 93 | "Reads data as it is to a string, convert to lower-case and tokenize it using the nltk library. This code reads data in 1MB portions as processing the full text at once slows down the task and returns a list of words" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "Reading data...\n", 106 | "Data size 3360286\n", 107 | "Example words (start): ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']\n", 108 | "Example words (end): ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']\n" 109 | ] 110 | } 111 | ], 112 | "source": [ 113 | "def read_data(filename):\n", 114 | " \"\"\"\n", 115 | " Extract the first file enclosed in a zip file as a list of words\n", 116 | " and pre-processes it using the nltk python library\n", 117 | " \"\"\"\n", 118 | "\n", 119 | " with bz2.BZ2File(filename) as f:\n", 120 | "\n", 121 | " data = []\n", 122 | " file_size = os.stat(filename).st_size\n", 123 | " chunk_size = 1024 * 1024 # reading 1 MB at a time as the dataset is moderately large\n", 124 | " print('Reading data...')\n", 125 | " for i in range(ceil(file_size//chunk_size)+1):\n", 126 | " bytes_to_read = min(chunk_size,file_size-(i*chunk_size))\n", 127 | " file_string = f.read(bytes_to_read).decode('utf-8')\n", 128 | " file_string = file_string.lower()\n", 129 | " # tokenizes a string to words residing in a list\n", 130 | " file_string = nltk.word_tokenize(file_string)\n", 131 | " data.extend(file_string)\n", 132 | " return data\n", 133 | "\n", 134 | "words = read_data(filename)\n", 135 | "print('Data size %d' % len(words))\n", 136 | "token_count = len(words)\n", 137 | "\n", 138 | "print('Example words (start): ',words[:10])\n", 139 | "print('Example words (end): ',words[-10:])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Building the Dictionaries\n", 147 | "Builds the following. To understand each of these elements, let us also assume the text \"I like to go to school\"\n", 148 | "\n", 149 | "* `dictionary`: maps a string word to an ID (e.g. {I:0, like:1, to:2, go:3, school:4})\n", 150 | "* `reverse_dictionary`: maps an ID to a string word (e.g. 
{0:I, 1:like, 2:to, 3:go, 4:school}\n", 151 | "* `count`: List of list of (word, frequency) elements (e.g. [(I,1),(like,1),(to,2),(go,1),(school,1)]\n", 152 | "* `data` : Contain the string of text we read, where string words are replaced with word IDs (e.g. [0, 1, 2, 3, 2, 4])\n", 153 | "\n", 154 | "It also introduces an additional special token `UNK` to denote rare words to are too rare to make use of." 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 4, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "Most common words (+UNK) [['UNK', 69215], ('the', 226881), (',', 184013), ('.', 120944), ('of', 116323)]\n", 167 | "Sample data [1730, 9, 8, 16741, 223, 4, 5169, 4509, 26, 11641]\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "# we restrict our vocabulary size to 50000\n", 173 | "vocabulary_size = 50000 \n", 174 | "\n", 175 | "def build_dataset(words):\n", 176 | " count = [['UNK', -1]]\n", 177 | " # Gets only the vocabulary_size most common words as the vocabulary\n", 178 | " # All the other words will be replaced with UNK token\n", 179 | " count.extend(collections.Counter(words).most_common(vocabulary_size - 1))\n", 180 | " dictionary = dict()\n", 181 | "\n", 182 | " # Create an ID for each word by giving the current length of the dictionary\n", 183 | " # And adding that item to the dictionary\n", 184 | " for word, _ in count:\n", 185 | " dictionary[word] = len(dictionary)\n", 186 | " \n", 187 | " data = list()\n", 188 | " unk_count = 0\n", 189 | " # Traverse through all the text we have and produce a list\n", 190 | " # where each element corresponds to the ID of the word found at that index\n", 191 | " for word in words:\n", 192 | " # If word is in the dictionary use the word ID,\n", 193 | " # else use the ID of the special token \"UNK\"\n", 194 | " if word in dictionary:\n", 195 | " index = dictionary[word]\n", 196 | " else:\n", 197 | " index = 0 # dictionary['UNK']\n", 198 | " unk_count = unk_count + 1\n", 199 | " data.append(index)\n", 200 | " \n", 201 | " # update the count variable with the number of UNK occurences\n", 202 | " count[0][1] = unk_count\n", 203 | " \n", 204 | " reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) \n", 205 | " # Make sure the dictionary is of size of the vocabulary\n", 206 | " assert len(dictionary) == vocabulary_size\n", 207 | " \n", 208 | " return data, count, dictionary, reverse_dictionary\n", 209 | "\n", 210 | "data, count, dictionary, reverse_dictionary = build_dataset(words)\n", 211 | "print('Most common words (+UNK)', count[:5])\n", 212 | "print('Sample data', data[:10])\n", 213 | "del words # Hint to reduce memory." 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "## Generating Batches of Data for GloVe\n", 221 | "Generates a batch or target words (`batch`) and a batch of corresponding context words (`labels`). It reads `2*window_size+1` words at a time (called a `span`) and create `2*window_size` datapoints in a single span. The function continue in this manner until `batch_size` datapoints are created. Everytime we reach the end of the word sequence, we start from beginning. 
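
To make the weighting scheme concrete: each context word receives weight 1/|distance to the target|, so immediate neighbours get 1.0 and the outermost words in the window get 1/window_size. The toy sketch below is plain Python, independent of the notebook's `data`/`dictionary` variables, and reproduces the `window_size = 2` numbers printed by the code cell that follows; the hard-coded token list is just the first span of the corpus.

```
# Toy illustration of the 1/distance context weights used by generate_batch.
toy_tokens = ['propaganda', 'is', 'a', 'concerted', 'set']  # one span for window_size = 2
window_size = 2
target_pos = window_size  # middle of the span -> 'a'

batch, labels, weights = [], [], []
for j in range(2 * window_size + 1):
    if j == target_pos:                 # skip the target word itself
        continue
    batch.append(toy_tokens[target_pos])
    labels.append(toy_tokens[j])
    weights.append(1.0 / abs(j - target_pos))

print(batch)    # ['a', 'a', 'a', 'a']
print(labels)   # ['propaganda', 'is', 'concerted', 'set']
print(weights)  # [0.5, 1.0, 1.0, 0.5]
```
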
" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 5, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']\n", 234 | "\n", 235 | "with window_size = 2:\n", 236 | " batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']\n", 237 | " labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']\n", 238 | " weights: [0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5]\n", 239 | "\n", 240 | "with window_size = 4:\n", 241 | " batch: ['set', 'set', 'set', 'set', 'set', 'set', 'set', 'set']\n", 242 | " labels: ['propaganda', 'is', 'a', 'concerted', 'of', 'messages', 'aimed', 'at']\n", 243 | " weights: [0.25, 0.33333334, 0.5, 1.0, 1.0, 0.5, 0.33333334, 0.25]\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "data_index = 0\n", 249 | "\n", 250 | "def generate_batch(batch_size, window_size):\n", 251 | " # data_index is updated by 1 everytime we read a data point\n", 252 | " global data_index \n", 253 | " \n", 254 | " # two numpy arras to hold target words (batch)\n", 255 | " # and context words (labels)\n", 256 | " batch = np.ndarray(shape=(batch_size), dtype=np.int32)\n", 257 | " labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)\n", 258 | " weights = np.ndarray(shape=(batch_size), dtype=np.float32)\n", 259 | "\n", 260 | " # span defines the total window size, where\n", 261 | " # data we consider at an instance looks as follows. \n", 262 | " # [ skip_window target skip_window ]\n", 263 | " span = 2 * window_size + 1 \n", 264 | " \n", 265 | " # The buffer holds the data contained within the span\n", 266 | " buffer = collections.deque(maxlen=span)\n", 267 | " \n", 268 | " # Fill the buffer and update the data_index\n", 269 | " for _ in range(span):\n", 270 | " buffer.append(data[data_index])\n", 271 | " data_index = (data_index + 1) % len(data)\n", 272 | " \n", 273 | " # This is the number of context words we sample for a single target word\n", 274 | " num_samples = 2*window_size \n", 275 | "\n", 276 | " # We break the batch reading into two for loops\n", 277 | " # The inner for loop fills in the batch and labels with \n", 278 | " # num_samples data points using data contained withing the span\n", 279 | " # The outper for loop repeat this for batch_size//num_samples times\n", 280 | " # to produce a full batch\n", 281 | " for i in range(batch_size // num_samples):\n", 282 | " k=0\n", 283 | " # avoid the target word itself as a prediction\n", 284 | " # fill in batch and label numpy arrays\n", 285 | " for j in list(range(window_size))+list(range(window_size+1,2*window_size+1)):\n", 286 | " batch[i * num_samples + k] = buffer[window_size]\n", 287 | " labels[i * num_samples + k, 0] = buffer[j]\n", 288 | " weights[i * num_samples + k] = abs(1.0/(j - window_size))\n", 289 | " k += 1 \n", 290 | " \n", 291 | " # Everytime we read num_samples data points,\n", 292 | " # we have created the maximum number of datapoints possible\n", 293 | " # withing a single span, so we need to move the span by 1\n", 294 | " # to create a fresh new span\n", 295 | " buffer.append(data[data_index])\n", 296 | " data_index = (data_index + 1) % len(data)\n", 297 | " return batch, labels, weights\n", 298 | "\n", 299 | "print('data:', [reverse_dictionary[di] for di in data[:8]])\n", 300 | "\n", 301 | "for window_size in [2, 4]:\n", 302 | " data_index = 0\n", 303 | " batch, labels, weights = generate_batch(batch_size=8, 
window_size=window_size)\n", 304 | " print('\\nwith window_size = %d:' %window_size)\n", 305 | " print(' batch:', [reverse_dictionary[bi] for bi in batch])\n", 306 | " print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])\n", 307 | " print(' weights:', [w for w in weights])" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "## Creating the Word Co-Occurance Matrix\n", 315 | "Why GloVe shine above context window based method is that it employs global statistics of the corpus in to the model (according to authors). This is done by using information from the word co-occurance matrix to optimize the word vectors. Basically, the X(i,j) entry of the co-occurance matrix says how frequent word i to appear near j. We also use a weighting mechanishm to give more weight to words close together than to ones further-apart (from experiments section of the paper)." 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 6, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "(50000, 50000)\n", 328 | "Running 420035 iterations to compute the co-occurance matrix\n", 329 | "\tFinished 100000 iterations\n", 330 | "\tFinished 200000 iterations\n", 331 | "\tFinished 300000 iterations\n", 332 | "\tFinished 400000 iterations\n", 333 | "Sample chunks of co-occurance matrix\n", 334 | "\n", 335 | "Target Word: \"UNK\"\n", 336 | "Context word:\",\"(id:2,count:3482.30), \"UNK\"(id:0,count:2164.01), \"the\"(id:1,count:2020.93), \"and\"(id:5,count:1454.50), \".\"(id:3,count:1310.58), \"of\"(id:4,count:1086.33), \"(\"(id:13,count:1047.17), \")\"(id:12,count:831.17), \"in\"(id:6,count:776.17), \"a\"(id:8,count:624.50), \n", 337 | "\n", 338 | "Target Word: \"imagery\"\n", 339 | "Context word:\"and\"(id:5,count:1.25), \"UNK\"(id:0,count:1.00), \"generated\"(id:3145,count:1.00), \"demonstrates\"(id:10422,count:1.00), \"explored\"(id:5276,count:1.00), \"horrific\"(id:16241,count:1.00), \"(\"(id:13,count:1.00), \",\"(id:2,count:0.58), \"computer\"(id:936,count:0.50), \"goya\"(id:22688,count:0.50), \n", 340 | "\n", 341 | "Target Word: \"defining\"\n", 342 | "Context word:\"the\"(id:1,count:5.00), \"of\"(id:4,count:2.83), \"and\"(id:5,count:1.50), \"feature\"(id:1397,count:1.00), \"other\"(id:42,count:1.00), \"influence\"(id:452,count:1.00), \"it\"(id:24,count:1.00), \"or\"(id:29,count:1.00), \"than\"(id:62,count:1.00), \"moments\"(id:7053,count:1.00), \n", 343 | "\n", 344 | "Target Word: \"liberalism\"\n", 345 | "Context word:\"of\"(id:4,count:5.75), \".\"(id:3,count:3.25), \"forms\"(id:423,count:1.50), \"the\"(id:1,count:1.08), \"western\"(id:216,count:1.00), \"within\"(id:152,count:1.00), \"are\"(id:22,count:1.00), \"may\"(id:73,count:1.00), \"on\"(id:18,count:1.00), \"<\"(id:1716,count:1.00), \n", 346 | "\n", 347 | "Target Word: \"rampant\"\n", 348 | "Context word:\",\"(id:2,count:2.08), \"and\"(id:5,count:1.25), \"UNK\"(id:0,count:1.00), \"government\"(id:84,count:1.00), \"also\"(id:37,count:1.00), \"were\"(id:31,count:1.00), \"throughout\"(id:308,count:1.00), \".\"(id:3,count:1.00), \"the\"(id:1,count:0.83), \"expenditures\"(id:10039,count:0.50), \n", 349 | "\n", 350 | "Target Word: \"and\"\n", 351 | "Context word:\",\"(id:2,count:3990.46), \"the\"(id:1,count:2566.00), \"UNK\"(id:0,count:1488.59), \"of\"(id:4,count:1001.66), \".\"(id:3,count:894.16), \"in\"(id:6,count:728.16), \"to\"(id:7,count:555.67), \"a\"(id:8,count:549.25), \")\"(id:12,count:412.92), 
\"and\"(id:5,count:318.50), \n", 352 | "\n", 353 | "Target Word: \"in\"\n", 354 | "Context word:\"the\"(id:1,count:3765.79), \",\"(id:2,count:1934.93), \".\"(id:3,count:1836.76), \"UNK\"(id:0,count:776.17), \"of\"(id:4,count:747.16), \"and\"(id:5,count:723.41), \"a\"(id:8,count:685.08), \"to\"(id:7,count:425.67), \"in\"(id:6,count:316.00), \"was\"(id:11,count:290.08), \n", 355 | "\n", 356 | "Target Word: \"to\"\n", 357 | "Context word:\"the\"(id:1,count:2449.92), \",\"(id:2,count:990.33), \".\"(id:3,count:687.00), \"a\"(id:8,count:613.00), \"be\"(id:30,count:573.75), \"and\"(id:5,count:527.33), \"UNK\"(id:0,count:470.42), \"of\"(id:4,count:457.09), \"in\"(id:6,count:403.67), \"is\"(id:9,count:282.67), \n", 358 | "\n", 359 | "Target Word: \"a\"\n", 360 | "Context word:\",\"(id:2,count:1496.51), \"of\"(id:4,count:1298.42), \".\"(id:3,count:907.00), \"in\"(id:6,count:713.08), \"the\"(id:1,count:640.42), \"to\"(id:7,count:625.92), \"as\"(id:10,count:614.67), \"UNK\"(id:0,count:602.92), \"and\"(id:5,count:583.08), \"is\"(id:9,count:558.25), \n", 361 | "\n", 362 | "Target Word: \"is\"\n", 363 | "Context word:\"the\"(id:1,count:1062.00), \",\"(id:2,count:651.92), \".\"(id:3,count:567.50), \"a\"(id:8,count:504.00), \"it\"(id:24,count:381.92), \"of\"(id:4,count:340.67), \"UNK\"(id:0,count:298.83), \"to\"(id:7,count:261.42), \"in\"(id:6,count:237.42), \"and\"(id:5,count:232.08), \n" 364 | ] 365 | } 366 | ], 367 | "source": [ 368 | "# We are creating the co-occurance matrix as a compressed sparse colum matrix from scipy. \n", 369 | "cooc_data_index = 0\n", 370 | "dataset_size = len(data) # We iterate through the full text\n", 371 | "skip_window = 4 # How many words to consider left and right.\n", 372 | "\n", 373 | "# The sparse matrix that stores the word co-occurences\n", 374 | "cooc_mat = lil_matrix((vocabulary_size, vocabulary_size), dtype=np.float32)\n", 375 | "\n", 376 | "print(cooc_mat.shape)\n", 377 | "def generate_cooc(batch_size,skip_window):\n", 378 | " '''\n", 379 | " Generate co-occurence matrix by processing batches of data\n", 380 | " '''\n", 381 | " data_index = 0\n", 382 | " print('Running %d iterations to compute the co-occurance matrix'%(dataset_size//batch_size))\n", 383 | " for i in range(dataset_size//batch_size):\n", 384 | " # Printing progress\n", 385 | " if i>0 and i%100000==0:\n", 386 | " print('\\tFinished %d iterations'%i)\n", 387 | " \n", 388 | " # Generating a single batch of data\n", 389 | " batch, labels, weights = generate_batch(batch_size, skip_window)\n", 390 | " labels = labels.reshape(-1)\n", 391 | " \n", 392 | " # Incrementing the sparse matrix entries accordingly\n", 393 | " for inp,lbl,w in zip(batch,labels,weights): \n", 394 | " cooc_mat[inp,lbl] += (1.0*w)\n", 395 | "\n", 396 | "# Generate the matrix\n", 397 | "generate_cooc(8,skip_window) \n", 398 | "\n", 399 | "# Just printing some parts of co-occurance matrix\n", 400 | "print('Sample chunks of co-occurance matrix')\n", 401 | "\n", 402 | "\n", 403 | "# Basically calculates the highest cooccurance of several chosen word\n", 404 | "for i in range(10):\n", 405 | " idx_target = i\n", 406 | " \n", 407 | " # get the ith row of the sparse matrix and make it dense\n", 408 | " ith_row = cooc_mat.getrow(idx_target) \n", 409 | " ith_row_dense = ith_row.toarray('C').reshape(-1) \n", 410 | " \n", 411 | " # select target words only with a reasonable words around it.\n", 412 | " while np.sum(ith_row_dense)<10 or np.sum(ith_row_dense)>50000:\n", 413 | " # Choose a random word\n", 414 | " idx_target = 
np.random.randint(0,vocabulary_size)\n", 415 | " \n", 416 | " # get the ith row of the sparse matrix and make it dense\n", 417 | " ith_row = cooc_mat.getrow(idx_target) \n", 418 | " ith_row_dense = ith_row.toarray('C').reshape(-1) \n", 419 | " \n", 420 | " print('\\nTarget Word: \"%s\"'%reverse_dictionary[idx_target])\n", 421 | " \n", 422 | " sort_indices = np.argsort(ith_row_dense).reshape(-1) # indices with highest count of ith_row_dense\n", 423 | " sort_indices = np.flip(sort_indices,axis=0) # reverse the array (to get max values to the start)\n", 424 | "\n", 425 | " # printing several context words to make sure cooc_mat is correct\n", 426 | " print('Context word:',end='')\n", 427 | " for j in range(10): \n", 428 | " idx_context = sort_indices[j] \n", 429 | " print('\"%s\"(id:%d,count:%.2f), '%(reverse_dictionary[idx_context],idx_context,ith_row_dense[idx_context]),end='')\n", 430 | " print()" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "## GloVe Algorithm" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": { 443 | "collapsed": true 444 | }, 445 | "source": [ 446 | "### Defining Hyperparameters\n", 447 | "\n", 448 | "Here we define several hyperparameters including `batch_size` (amount of samples in a single batch) `embedding_size` (size of embedding vectors) `window_size` (context window size)." 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 7, 454 | "metadata": { 455 | "collapsed": true 456 | }, 457 | "outputs": [], 458 | "source": [ 459 | "batch_size = 128 # Data points in a single batch\n", 460 | "embedding_size = 128 # Dimension of the embedding vector.\n", 461 | "window_size = 4 # How many words to consider left and right.\n", 462 | "\n", 463 | "# We pick a random validation set to sample nearest neighbors\n", 464 | "valid_size = 16 # Random set of words to evaluate similarity on.\n", 465 | "# We sample valid datapoints randomly from a large window without always being deterministic\n", 466 | "valid_window = 50\n", 467 | "\n", 468 | "# When selecting valid examples, we select some of the most frequent words as well as\n", 469 | "# some moderately rare words as well\n", 470 | "valid_examples = np.array(random.sample(range(valid_window), valid_size))\n", 471 | "valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)\n", 472 | "\n", 473 | "num_sampled = 32 # Number of negative examples to sample.\n", 474 | "\n", 475 | "epsilon = 1 # used for the stability of log in the loss function" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "### Defining Inputs and Outputs\n", 483 | "\n", 484 | "Here we define placeholders for feeding in training inputs and outputs (each of size `batch_size`) and a constant tensor to contain validation examples." 
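
For readers less familiar with the TF1-style placeholder workflow used from here on, the following minimal, self-contained example shows how a placeholder defined at graph-construction time is later fed through `feed_dict` at `session.run` time, exactly as the training loop does further below. The names and the trivial computation are illustrative only.

```
# Minimal reminder of the placeholder/feed_dict mechanics used by the cells below.
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
word_ids = tf.placeholder(tf.int32, shape=[4], name='word_ids')  # analogous to train_dataset
doubled = word_ids * 2                                           # stand-in for the real graph

with tf.Session() as session:
    out = session.run(doubled, feed_dict={word_ids: np.array([1, 2, 3, 4], dtype=np.int32)})
    print(out)  # -> [2 4 6 8]
```
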
485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 8, 490 | "metadata": { 491 | "collapsed": true 492 | }, 493 | "outputs": [], 494 | "source": [ 495 | "tf.reset_default_graph()\n", 496 | "\n", 497 | "# Training input data (target word IDs).\n", 498 | "train_dataset = tf.placeholder(tf.int32, shape=[batch_size])\n", 499 | "# Training input label data (context word IDs)\n", 500 | "train_labels = tf.placeholder(tf.int32, shape=[batch_size])\n", 501 | "# Validation input data, we don't need a placeholder\n", 502 | "# as we have already defined the IDs of the words selected\n", 503 | "# as validation data\n", 504 | "valid_dataset = tf.constant(valid_examples, dtype=tf.int32)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "### Defining Model Parameters and Other Variables\n", 512 | "We now define four TensorFlow variables which is composed of an embedding layer, a bias for each input and output words." 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 9, 518 | "metadata": { 519 | "collapsed": true 520 | }, 521 | "outputs": [], 522 | "source": [ 523 | "# Variables.\n", 524 | "in_embeddings = tf.Variable(\n", 525 | " tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),name='embeddings')\n", 526 | "in_bias_embeddings = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01,dtype=tf.float32),name='embeddings_bias')\n", 527 | "\n", 528 | "out_embeddings = tf.Variable(\n", 529 | " tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),name='embeddings')\n", 530 | "out_bias_embeddings = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01,dtype=tf.float32),name='embeddings_bias')" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "### Defining the Model Computations\n", 538 | "\n", 539 | "We first defing a lookup function to fetch the corresponding embedding vectors for a set of given inputs. Then we define a placeholder that takes in the weights for a given batch of data points (`weights_x`) and co-occurence matrix weights (`x_ij`). `weights_x` measures the importance of a data point with respect to how much those two words co-occur and `x_ij` denotes the co-occurence matrix value for the row and column denoted by the words in a datapoint. With these defined, we can define the loss as shown below. For exact details refer Chapter 4 text." 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 10, 545 | "metadata": { 546 | "collapsed": true 547 | }, 548 | "outputs": [], 549 | "source": [ 550 | "# Look up embeddings for inputs and outputs\n", 551 | "# Have two seperate embedding vector spaces for inputs and outputs\n", 552 | "embed_in = tf.nn.embedding_lookup(in_embeddings, train_dataset)\n", 553 | "embed_out = tf.nn.embedding_lookup(out_embeddings, train_labels)\n", 554 | "embed_bias_in = tf.nn.embedding_lookup(in_bias_embeddings,train_dataset)\n", 555 | "embed_bias_out = tf.nn.embedding_lookup(out_bias_embeddings,train_labels)\n", 556 | "\n", 557 | "# weights used in the cost function\n", 558 | "weights_x = tf.placeholder(tf.float32,shape=[batch_size],name='weights_x') \n", 559 | "# Cooccurence value for that position\n", 560 | "x_ij = tf.placeholder(tf.float32,shape=[batch_size],name='x_ij')\n", 561 | "\n", 562 | "# Compute the loss defined in the paper. 
Note that \n", 563 | "# I'm not following the exact equation given (which is computing a pair of words at a time)\n", 564 | "# I'm calculating the loss for a batch at one time, but the calculations are identical.\n", 565 | "# I also made an assumption about the bias, that it is a smaller type of embedding\n", 566 | "loss = tf.reduce_mean(\n", 567 | " weights_x * (tf.reduce_sum(embed_in*embed_out,axis=1) + embed_bias_in + embed_bias_out - tf.log(epsilon+x_ij))**2)\n" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "### Calculating Word Similarities \n", 575 | "We calculate the similarity between two given words in terms of the cosine distance. To do this efficiently we use matrix operations to do so, as shown below." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 11, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "# Compute the similarity between minibatch examples and all embeddings.\n", 587 | "# We use the cosine distance:\n", 588 | "embeddings = (in_embeddings + out_embeddings)/2.0\n", 589 | "norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))\n", 590 | "normalized_embeddings = embeddings / norm\n", 591 | "valid_embeddings = tf.nn.embedding_lookup(\n", 592 | "normalized_embeddings, valid_dataset)\n", 593 | "similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "### Model Parameter Optimizer\n", 601 | "\n", 602 | "We then define a constant learning rate and an optimizer which uses the Adagrad method. Feel free to experiment with other optimizers listed [here](https://www.tensorflow.org/api_guides/python/train)." 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 12, 608 | "metadata": { 609 | "collapsed": true 610 | }, 611 | "outputs": [], 612 | "source": [ 613 | "# Optimizer.\n", 614 | "optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "## Running the GloVe Algorithm\n", 622 | "\n", 623 | "Here we run the GloVe algorithm we defined above. Specifically, we first initialize variables, and then train the algorithm for many steps (`num_steps`). And every few steps we evaluate the algorithm on a fixed validation set and print out the words that appear to be closest for a given set of words." 
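
For reference, the loss built in the "Defining the Model Computations" cell above corresponds to the GloVe objective of Pennington et al. (2014), with the notebook's epsilon added inside the log for numerical stability. The weighting function matches the `point_weight` computation in the training loop below (cut-off 100, exponent 0.75); note the notebook averages these terms over a mini-batch rather than summing over all word pairs.

$$
J = \sum_{i,j} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log(\epsilon + X_{ij})\right)^2,
\qquad
f(x) = \begin{cases}(x/x_{\max})^{0.75} & x < x_{\max}\\ 1 & \text{otherwise}\end{cases}
$$

Here $w_i, b_i$ are the input embedding and bias, $\tilde{w}_j, \tilde{b}_j$ the output embedding and bias, $X_{ij}$ the co-occurrence count, $x_{\max}=100$ and $\epsilon=1$ in this notebook.
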
624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": 13, 629 | "metadata": {}, 630 | "outputs": [ 631 | { 632 | "name": "stdout", 633 | "output_type": "stream", 634 | "text": [ 635 | "Initialized\n", 636 | "Average loss at step 0: 9.578778\n", 637 | "Nearest to it: karol, burgh, destabilise, armchair, crook, roguery, one-sixth, swains,\n", 638 | "Nearest to that: wmap, partake, ahmadi, armstrong, memberships, forza, director-general, condo,\n", 639 | "Nearest to has: mentality, vastly, approaches, bulwark, enzymes, originally, privatize, reunify,\n", 640 | "Nearest to but: inhabited, potrero, trust, memory, curran, philips, p.m.s, pagoda,\n", 641 | "Nearest to city: seals, counter-revolution, tubular, kayaking, central, 1568, override, buckland,\n", 642 | "Nearest to this: dispersion, intermarriage, dialysis, moguls, aldermen, alcoholic, codes, farallon,\n", 643 | "Nearest to UNK: 40.3, tatsam, jupiter, verify, unequal, berliners, march, 1559,\n", 644 | "Nearest to by: functionalists, synthesised, palladius, chiapas, synaptic, sumner, raining, valued,\n", 645 | "Nearest to or: amherst, 'mother, epiglottis, wen, stanislaus, trafford, cuticle, reminded,\n", 646 | "Nearest to been: 640,961., depression-era, uniquely, mami, 375,000, stickiness, medium-sized, amor,\n", 647 | "Nearest to with: anti-statist, pitigliano, branches, reparations, acquittal, frowned, pishpek, left-leaning,\n", 648 | "Nearest to be: i-20, kevin, greased, rightly, conductors, hypercholesterolemia, pedro, douaumont,\n", 649 | "Nearest to as: gabon, horda, mead, protruding, soundtrack, algeria, 48, macon,\n", 650 | "Nearest to at: kambula, tisa, spelled, 130,000, 2008, organisers, |jul_rec_lo_°f, arrows,\n", 651 | "Nearest to ,: is, of, its, malton, martinů, retiree, reliant, uri,\n", 652 | "Nearest to its: of, ,, galleon, gitlow, rugby-playing, varanasi, fono, clusters,\n", 653 | "Average loss at step 2000: 0.739107\n", 654 | "Average loss at step 4000: 0.091107\n", 655 | "Average loss at step 6000: 0.068614\n", 656 | "Average loss at step 8000: 0.076040\n", 657 | "Average loss at step 10000: 0.058149\n", 658 | "Nearest to it: was, is, that, not, a, in, to, .,\n", 659 | "Nearest to that: is, was, the, a, ., ,, to, in,\n", 660 | "Nearest to has: is, it, that, a, been, was, to, mentality,\n", 661 | "Nearest to but: with, said, trust, mating, not, squamous, war—the, r101,\n", 662 | "Nearest to city: of, 's, counter-revolution, the, professed, ., equilibrium, seals,\n", 663 | "Nearest to this: is, ., for, in, was, the, a, that,\n", 664 | "Nearest to UNK: and, ,, (, in, the, ., ), a,\n", 665 | "Nearest to by: the, and, ,, ., in, was, of, a,\n", 666 | "Nearest to or: UNK, ,, and, a, cuticle, donnchad, ``, 'mother,\n", 667 | "Nearest to been: have, had, to, has, be, was, that, it,\n", 668 | "Nearest to with: ,, and, a, the, in, of, for, .,\n", 669 | "Nearest to be: to, have, that, a, for, not, can, been,\n", 670 | "Nearest to as: a, ,, for, and, UNK, ``, is, in,\n", 671 | "Nearest to at: the, of, ., in, ,, and, 's, UNK,\n", 672 | "Nearest to ,: and, UNK, in, the, ., a, of, for,\n", 673 | "Nearest to its: compacted, for, puzzling, buddha, bjorn, d'etat, tēōtl, encapsulated,\n", 674 | "Average loss at step 12000: 0.048867\n", 675 | "Average loss at step 14000: 0.102374\n", 676 | "Average loss at step 16000: 0.047017\n", 677 | "Average loss at step 18000: 0.041279\n", 678 | "Average loss at step 20000: 0.065086\n", 679 | "Nearest to it: is, was, not, that, a, to, he, has,\n", 680 | "Nearest to that: is, 
was, the, a, to, ,, ., and,\n", 681 | "Nearest to has: it, been, was, a, is, that, to, .,\n", 682 | "Nearest to but: with, not, which, that, ,, said, mating, trust,\n", 683 | "Nearest to city: of, 's, the, ., in, counter-revolution, for, at,\n", 684 | "Nearest to this: ., is, was, for, in, the, it, of,\n", 685 | "Nearest to UNK: and, ,, (, ), in, the, a, .,\n", 686 | "Nearest to by: the, in, ,, and, was, ., of, a,\n", 687 | "Nearest to or: UNK, ,, a, and, (, ``, ), with,\n", 688 | "Nearest to been: have, had, has, be, to, was, that, not,\n", 689 | "Nearest to with: and, ,, a, of, the, in, for, UNK,\n", 690 | "Nearest to be: to, have, that, a, not, from, is, been,\n", 691 | "Nearest to as: a, ,, for, and, is, UNK, the, of,\n", 692 | "Nearest to at: the, ., of, in, 's, and, ,, by,\n", 693 | "Nearest to ,: and, UNK, in, the, a, ., of, with,\n", 694 | "Nearest to its: for, with, and, compacted, of, his, tēōtl, ,,\n", 695 | "Average loss at step 22000: 0.036469\n", 696 | "Average loss at step 24000: 0.037744\n", 697 | "Average loss at step 26000: 0.035548\n", 698 | "Average loss at step 28000: 0.035010\n", 699 | "Average loss at step 30000: 0.038970\n", 700 | "Nearest to it: is, was, not, that, has, he, a, this,\n", 701 | "Nearest to that: is, was, the, to, ,, ., a, and,\n", 702 | "Nearest to has: it, is, was, been, a, that, have, to,\n", 703 | "Nearest to but: which, with, not, ,, was, that, is, it,\n", 704 | "Nearest to city: of, 's, the, ., in, at, from, world,\n", 705 | "Nearest to this: is, ., was, in, for, it, the, a,\n", 706 | "Nearest to UNK: and, ,, (, ), in, ., the, a,\n", 707 | "Nearest to by: the, was, in, ,, and, ., of, a,\n", 708 | "Nearest to or: UNK, (, ,, ``, and, a, ), with,\n", 709 | "Nearest to been: have, had, has, be, to, was, that, not,\n", 710 | "Nearest to with: ,, and, a, of, the, in, ., UNK,\n", 711 | "Nearest to be: to, have, from, not, that, a, is, can,\n", 712 | "Nearest to as: a, ,, for, an, and, is, UNK, with,\n", 713 | "Nearest to at: the, of, ., in, 's, ,, and, UNK,\n", 714 | "Nearest to ,: and, UNK, in, the, ., a, with, of,\n", 715 | "Nearest to its: for, with, his, and, to, of, the, ,,\n", 716 | "Average loss at step 32000: 0.033023\n", 717 | "Average loss at step 34000: 0.031445\n", 718 | "Average loss at step 36000: 0.030053\n", 719 | "Average loss at step 38000: 0.028875\n", 720 | "Average loss at step 40000: 0.028649\n", 721 | "Nearest to it: is, was, not, that, has, a, also, he,\n", 722 | "Nearest to that: is, was, to, a, the, it, ,, .,\n", 723 | "Nearest to has: it, was, is, been, a, that, had, also,\n", 724 | "Nearest to but: which, not, with, that, ,, was, it, is,\n", 725 | "Nearest to city: 's, of, the, ., in, at, world, from,\n", 726 | "Nearest to this: is, ., was, in, for, it, the, a,\n", 727 | "Nearest to UNK: and, ,, ), (, in, the, ., a,\n", 728 | "Nearest to by: was, ,, the, in, and, ., of, to,\n", 729 | "Nearest to or: UNK, (, ,, a, ), and, ``, with,\n", 730 | "Nearest to been: have, had, has, be, to, was, not, that,\n", 731 | "Nearest to with: and, ,, a, of, the, in, for, UNK,\n", 732 | "Nearest to be: to, have, from, that, not, a, can, is,\n", 733 | "Nearest to as: a, ,, an, for, and, is, such, to,\n", 734 | "Nearest to at: the, ., of, in, 's, ,, by, and,\n", 735 | "Nearest to ,: and, UNK, in, the, a, ., with, of,\n", 736 | "Nearest to its: for, and, with, to, his, of, the, ,,\n", 737 | "Average loss at step 42000: 0.037198\n", 738 | "Average loss at step 44000: 0.027172\n", 739 | "Average loss at step 46000: 0.027344\n", 740 | "Average loss 
at step 48000: 0.028739\n", 741 | "Average loss at step 50000: 0.105829\n", 742 | "Nearest to it: is, was, that, not, has, he, also, a,\n", 743 | "Nearest to that: is, was, a, to, it, by, the, .,\n", 744 | "Nearest to has: it, is, been, was, that, a, have, had,\n", 745 | "Nearest to but: which, not, with, ,, it, that, was, is,\n", 746 | "Nearest to city: of, the, 's, ., in, from, at, is,\n", 747 | "Nearest to this: is, ., in, for, was, it, the, a,\n", 748 | "Nearest to UNK: (, and, ,, from, ., at, by, a,\n", 749 | "Nearest to by: the, ,, and, ., a, in, was, of,\n", 750 | "Nearest to or: ``, ), (, ,, a, with, and, '',\n", 751 | "Nearest to been: have, has, had, be, to, that, was, also,\n", 752 | "Nearest to with: and, ,, a, for, the, in, of, .,\n", 753 | "Nearest to be: to, have, not, from, a, that, is, can,\n", 754 | "Nearest to as: a, an, ,, for, is, and, to, such,\n", 755 | "Nearest to at: the, of, ., in, by, 's, ,, and,\n", 756 | "Nearest to ,: and, in, the, ., a, of, with, for,\n", 757 | "Nearest to its: for, with, and, ,, his, the, of, in,\n", 758 | "Average loss at step 52000: 0.111760\n", 759 | "Average loss at step 54000: 0.031062\n", 760 | "Average loss at step 56000: 0.070919\n", 761 | "Average loss at step 58000: 0.027815\n", 762 | "Average loss at step 60000: 0.025161\n", 763 | "Nearest to it: is, was, that, not, has, also, he, a,\n", 764 | "Nearest to that: is, was, it, to, the, ., a, by,\n", 765 | "Nearest to has: it, is, was, been, a, that, had, also,\n", 766 | "Nearest to but: which, not, with, ,, it, was, that, is,\n", 767 | "Nearest to city: 's, of, the, ., in, from, at, world,\n", 768 | "Nearest to this: is, ., was, in, it, for, the, at,\n", 769 | "Nearest to UNK: (, ), and, ,, the, a, by, .,\n", 770 | "Nearest to by: the, was, in, ., ,, and, of, is,\n", 771 | "Nearest to or: (, ), ,, a, ``, and, with, '',\n", 772 | "Nearest to been: have, has, had, be, was, also, that, to,\n", 773 | "Nearest to with: and, ,, a, of, in, for, the, .,\n", 774 | "Nearest to be: to, have, can, not, from, a, is, that,\n", 775 | "Nearest to as: a, an, ,, for, such, and, is, to,\n", 776 | "Nearest to at: the, ., of, in, 's, by, ,, is,\n", 777 | "Nearest to ,: and, in, the, ., a, with, of, for,\n", 778 | "Nearest to its: for, with, and, to, ,, of, the, his,\n", 779 | "Average loss at step 62000: 0.024341\n", 780 | "Average loss at step 64000: 0.024122\n", 781 | "Average loss at step 66000: 0.023625\n", 782 | "Average loss at step 68000: 0.023307\n", 783 | "Average loss at step 70000: 0.023168\n", 784 | "Nearest to it: is, was, not, that, has, also, a, he,\n", 785 | "Nearest to that: is, was, to, the, it, ., a, ,,\n", 786 | "Nearest to has: it, was, been, had, a, is, that, have,\n", 787 | "Nearest to but: which, not, with, ,, it, was, that, is,\n", 788 | "Nearest to city: of, 's, the, ., in, at, world, from,\n", 789 | "Nearest to this: is, ., was, in, it, for, the, at,\n", 790 | "Nearest to UNK: (, ), and, ,, by, or, the, in,\n", 791 | "Nearest to by: the, ,, was, ., and, in, of, a,\n", 792 | "Nearest to or: (, ), a, ,, and, ``, UNK, with,\n", 793 | "Nearest to been: have, has, had, be, was, also, to, that,\n", 794 | "Nearest to with: and, ,, a, the, in, of, for, .,\n", 795 | "Nearest to be: to, have, can, not, a, that, is, from,\n", 796 | "Nearest to as: a, an, for, ,, such, and, is, to,\n", 797 | "Nearest to at: the, of, ., in, 's, by, ,, is,\n", 798 | "Nearest to ,: and, in, the, ., a, of, with, for,\n", 799 | "Nearest to its: for, with, and, their, his, ,, of, the,\n" 800 | ] 801 | }, 802 | { 
803 | "name": "stdout", 804 | "output_type": "stream", 805 | "text": [ 806 | "Average loss at step 72000: 0.022413\n", 807 | "Average loss at step 74000: 0.021599\n", 808 | "Average loss at step 76000: 0.021968\n", 809 | "Average loss at step 78000: 0.021922\n", 810 | "Average loss at step 80000: 0.021073\n", 811 | "Nearest to it: is, was, not, that, also, has, a, this,\n", 812 | "Nearest to that: is, was, it, to, ., the, ,, a,\n", 813 | "Nearest to has: it, been, was, is, also, that, had, a,\n", 814 | "Nearest to but: which, not, with, it, ,, was, that, and,\n", 815 | "Nearest to city: of, 's, the, ., in, at, world, from,\n", 816 | "Nearest to this: is, ., was, in, it, for, at, the,\n", 817 | "Nearest to UNK: (, ), and, ,, or, a, ., by,\n", 818 | "Nearest to by: the, ,, was, and, ., in, a, of,\n", 819 | "Nearest to or: (, UNK, a, ), ,, and, ``, with,\n", 820 | "Nearest to been: have, has, had, also, be, to, that, was,\n", 821 | "Nearest to with: ,, and, a, the, of, in, for, .,\n", 822 | "Nearest to be: to, have, can, not, from, that, is, a,\n", 823 | "Nearest to as: a, an, ,, such, for, and, ., is,\n", 824 | "Nearest to at: of, the, ., in, 's, ,, and, by,\n", 825 | "Nearest to ,: and, in, the, ., a, with, of, for,\n", 826 | "Nearest to its: for, and, with, their, ,, his, to, the,\n", 827 | "Average loss at step 82000: 0.021116\n", 828 | "Average loss at step 84000: 0.020798\n", 829 | "Average loss at step 86000: 0.020017\n", 830 | "Average loss at step 88000: 0.019837\n", 831 | "Average loss at step 90000: 0.019543\n", 832 | "Nearest to it: is, was, that, also, not, has, this, a,\n", 833 | "Nearest to that: was, is, to, the, ., it, a, ,,\n", 834 | "Nearest to has: it, been, was, is, a, had, also, that,\n", 835 | "Nearest to but: which, not, ,, with, it, was, and, that,\n", 836 | "Nearest to city: of, 's, the, ., in, new, world, at,\n", 837 | "Nearest to this: is, ., was, it, in, for, the, at,\n", 838 | "Nearest to UNK: (, and, ), ,, in, or, a, .,\n", 839 | "Nearest to by: the, was, ,, in, ., and, a, of,\n", 840 | "Nearest to or: (, UNK, ), ``, a, ,, and, with,\n", 841 | "Nearest to been: have, has, had, also, be, that, was, to,\n", 842 | "Nearest to with: and, ,, a, the, of, in, for, .,\n", 843 | "Nearest to be: to, have, can, not, that, from, is, would,\n", 844 | "Nearest to as: a, an, ,, such, for, and, is, the,\n", 845 | "Nearest to at: of, the, ., in, 's, ,, and, by,\n", 846 | "Nearest to ,: and, in, the, ., a, with, of, UNK,\n", 847 | "Nearest to its: for, and, their, with, his, ,, the, of,\n", 848 | "Average loss at step 92000: 0.019305\n", 849 | "Average loss at step 94000: 0.019555\n", 850 | "Average loss at step 96000: 0.019266\n", 851 | "Average loss at step 98000: 0.018803\n", 852 | "Average loss at step 100000: 0.018488\n", 853 | "Nearest to it: is, was, also, that, not, has, this, a,\n", 854 | "Nearest to that: was, is, to, it, the, a, ., ,,\n", 855 | "Nearest to has: it, been, was, had, also, is, that, a,\n", 856 | "Nearest to but: which, not, ,, it, with, was, and, a,\n", 857 | "Nearest to city: of, 's, the, ., in, is, new, world,\n", 858 | "Nearest to this: is, ., was, it, in, for, the, at,\n", 859 | "Nearest to UNK: (, and, ), ,, or, a, the, .,\n", 860 | "Nearest to by: the, ., was, ,, and, in, of, a,\n", 861 | "Nearest to or: UNK, (, ``, a, ), ,, and, with,\n", 862 | "Nearest to been: have, has, had, also, be, was, that, to,\n", 863 | "Nearest to with: and, ,, a, the, of, in, for, .,\n", 864 | "Nearest to be: to, have, can, not, would, from, that, a,\n", 865 | "Nearest 
to as: a, such, an, ,, for, is, and, to,\n", 866 | "Nearest to at: of, ., the, in, 's, by, ,, and,\n", 867 | "Nearest to ,: and, in, the, ., a, with, UNK, of,\n", 868 | "Nearest to its: for, their, and, with, his, ,, to, the,\n" 869 | ] 870 | } 871 | ], 872 | "source": [ 873 | "num_steps = 100001\n", 874 | "glove_loss = []\n", 875 | "\n", 876 | "average_loss = 0\n", 877 | "with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:\n", 878 | " \n", 879 | " tf.global_variables_initializer().run()\n", 880 | " print('Initialized')\n", 881 | " \n", 882 | " for step in range(num_steps):\n", 883 | " \n", 884 | " # generate a single batch (data,labels,co-occurance weights)\n", 885 | " batch_data, batch_labels, batch_weights = generate_batch(\n", 886 | " batch_size, skip_window) \n", 887 | " \n", 888 | " # Computing the weights required by the loss function\n", 889 | " batch_weights = [] # weighting used in the loss function\n", 890 | " batch_xij = [] # weighted frequency of finding i near j\n", 891 | " \n", 892 | " # Compute the weights for each datapoint in the batch\n", 893 | " for inp,lbl in zip(batch_data,batch_labels.reshape(-1)): \n", 894 | " point_weight = (cooc_mat[inp,lbl]/100.0)**0.75 if cooc_mat[inp,lbl]<100.0 else 1.0 \n", 895 | " batch_weights.append(point_weight)\n", 896 | " batch_xij.append(cooc_mat[inp,lbl])\n", 897 | " batch_weights = np.clip(batch_weights,-100,1)\n", 898 | " batch_xij = np.asarray(batch_xij)\n", 899 | " \n", 900 | " # Populate the feed_dict and run the optimizer (minimize loss)\n", 901 | " # and compute the loss. Specifically we provide\n", 902 | " # train_dataset/train_labels: training inputs and training labels\n", 903 | " # weights_x: measures the importance of a data point with respect to how much those two words co-occur\n", 904 | " # x_ij: co-occurence matrix value for the row and column denoted by the words in a datapoint\n", 905 | " feed_dict = {train_dataset : batch_data.reshape(-1), train_labels : batch_labels.reshape(-1),\n", 906 | " weights_x:batch_weights,x_ij:batch_xij}\n", 907 | " _, l = session.run([optimizer, loss], feed_dict=feed_dict)\n", 908 | " \n", 909 | " # Update the average loss variable\n", 910 | " average_loss += l\n", 911 | " if step % 2000 == 0:\n", 912 | " if step > 0:\n", 913 | " average_loss = average_loss / 2000\n", 914 | " # The average loss is an estimate of the loss over the last 2000 batches.\n", 915 | " print('Average loss at step %d: %f' % (step, average_loss))\n", 916 | " glove_loss.append(average_loss)\n", 917 | " average_loss = 0\n", 918 | " \n", 919 | " # Here we compute the top_k closest words for a given validation word\n", 920 | " # in terms of the cosine distance\n", 921 | " # We do this for all the words in the validation set\n", 922 | " # Note: This is an expensive step\n", 923 | " if step % 10000 == 0:\n", 924 | " sim = similarity.eval()\n", 925 | " for i in range(valid_size):\n", 926 | " valid_word = reverse_dictionary[valid_examples[i]]\n", 927 | " top_k = 8 # number of nearest neighbors\n", 928 | " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", 929 | " log = 'Nearest to %s:' % valid_word\n", 930 | " for k in range(top_k):\n", 931 | " close_word = reverse_dictionary[nearest[k]]\n", 932 | " log = '%s %s,' % (log, close_word)\n", 933 | " print(log)\n", 934 | " \n", 935 | " final_embeddings = normalized_embeddings.eval()\n" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": null, 941 | "metadata": { 942 | "collapsed": true 943 | }, 944 | "outputs": [], 945 | "source": 
[] 946 | } 947 | ], 948 | "metadata": { 949 | "kernelspec": { 950 | "display_name": "Python 3", 951 | "language": "python", 952 | "name": "python3" 953 | }, 954 | "language_info": { 955 | "codemirror_mode": { 956 | "name": "ipython", 957 | "version": 3 958 | }, 959 | "file_extension": ".py", 960 | "mimetype": "text/x-python", 961 | "name": "python", 962 | "nbconvert_exporter": "python", 963 | "pygments_lexer": "ipython3", 964 | "version": "3.5.2" 965 | } 966 | }, 967 | "nbformat": 4, 968 | "nbformat_minor": 2 969 | } 970 | -------------------------------------------------------------------------------- /ch4/ch4_word2vec_extended.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Extended Word2vec and GloVe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stderr", 17 | "output_type": "stream", 18 | "text": [ 19 | "c:\\users\\thushan\\documents\\python_virtualenvs\\tensorflow_venv\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 20 | " from ._conv import register_converters as _register_converters\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "# These are all the modules we'll be using later. Make sure you can import them\n", 26 | "# before proceeding further.\n", 27 | "%matplotlib inline\n", 28 | "from __future__ import print_function\n", 29 | "import collections\n", 30 | "import math\n", 31 | "import numpy as np\n", 32 | "import os\n", 33 | "import random\n", 34 | "import tensorflow as tf\n", 35 | "import bz2\n", 36 | "from matplotlib import pylab\n", 37 | "from six.moves import range\n", 38 | "from six.moves.urllib.request import urlretrieve\n", 39 | "from sklearn.manifold import TSNE\n", 40 | "from sklearn.cluster import KMeans\n", 41 | "import nltk # standard preprocessing\n", 42 | "import operator # sorting items in dictionary by value\n", 43 | "#nltk.download() #tokenizers/punkt/PY3/english.pickle\n", 44 | "from math import ceil\n", 45 | "import csv" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Dataset\n", 53 | "This code downloads a [dataset](http://www.evanjones.ca/software/wikipedia2text.html) consisting of several Wikipedia articles totaling up to roughly 61 megabytes. Additionally the code makes sure the file has the correct size after downloading it." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "Found and verified wikipedia2text-extracted.txt.bz2\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "url = 'http://www.evanjones.ca/software/'\n", 71 | "\n", 72 | "def maybe_download(filename, expected_bytes):\n", 73 | " \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", 74 | " if not os.path.exists(filename):\n", 75 | " filename, _ = urlretrieve(url + filename, filename)\n", 76 | " statinfo = os.stat(filename)\n", 77 | " if statinfo.st_size == expected_bytes:\n", 78 | " print('Found and verified %s' % filename)\n", 79 | " else:\n", 80 | " print(statinfo.st_size)\n", 81 | " raise Exception(\n", 82 | " 'Failed to verify ' + filename + '. 
Can you get to it with a browser?')\n", 83 | " return filename\n", 84 | "\n", 85 | "filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## Read Data with Preprocessing with NLTK\n", 93 | "Reads data as it is to a string, convert to lower-case and tokenize it using the nltk library. This code reads data in 1MB portions as processing the full text at once slows down the task and returns a list of words" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "Reading data...\n", 106 | "Data size 3360286\n", 107 | "Example words (start): ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']\n", 108 | "Example words (end): ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']\n" 109 | ] 110 | } 111 | ], 112 | "source": [ 113 | "def read_data(filename):\n", 114 | " \"\"\"\n", 115 | " Extract the first file enclosed in a zip file as a list of words\n", 116 | " and pre-processes it using the nltk python library\n", 117 | " \"\"\"\n", 118 | "\n", 119 | " with bz2.BZ2File(filename) as f:\n", 120 | "\n", 121 | " data = []\n", 122 | " file_size = os.stat(filename).st_size\n", 123 | " chunk_size = 1024 * 1024 # reading 1 MB at a time as the dataset is moderately large\n", 124 | " print('Reading data...')\n", 125 | " for i in range(ceil(file_size//chunk_size)+1):\n", 126 | " bytes_to_read = min(chunk_size,file_size-(i*chunk_size))\n", 127 | " file_string = f.read(bytes_to_read).decode('utf-8')\n", 128 | " file_string = file_string.lower()\n", 129 | " # tokenizes a string to words residing in a list\n", 130 | " file_string = nltk.word_tokenize(file_string)\n", 131 | " data.extend(file_string)\n", 132 | " return data\n", 133 | "\n", 134 | "words = read_data(filename)\n", 135 | "print('Data size %d' % len(words))\n", 136 | "token_count = len(words)\n", 137 | "\n", 138 | "print('Example words (start): ',words[:10])\n", 139 | "print('Example words (end): ',words[-10:])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## Building the Dictionaries\n", 147 | "Builds the following. To understand each of these elements, let us also assume the text \"I like to go to school\"\n", 148 | "\n", 149 | "* `dictionary`: maps a string word to an ID (e.g. {I:0, like:1, to:2, go:3, school:4})\n", 150 | "* `reverse_dictionary`: maps an ID to a string word (e.g. {0:I, 1:like, 2:to, 3:go, 4:school}\n", 151 | "* `count`: List of list of (word, frequency) elements (e.g. [(I,1),(like,1),(to,2),(go,1),(school,1)]\n", 152 | "* `data` : Contain the string of text we read, where string words are replaced with word IDs (e.g. [0, 1, 2, 3, 2, 4])\n", 153 | "\n", 154 | "It also introduces an additional special token `UNK` to denote rare words to are too rare to make use of." 
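
To make the four structures above concrete, here is a tiny self-contained run of the same logic on the toy sentence from the description. Only the toy input and the reduced vocabulary size are invented; the exact ordering of tied counts may vary, hence the "e.g." in the comments.

```
# Toy version of build_dataset on "I like to go to school".
import collections

words = ['i', 'like', 'to', 'go', 'to', 'school']
vocabulary_size = 6  # UNK plus the five distinct toy words

count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = {word: idx for idx, (word, _) in enumerate(count)}
reverse_dictionary = {idx: word for word, idx in dictionary.items()}
data = [dictionary.get(word, 0) for word in words]

print(dictionary)             # e.g. {'UNK': 0, 'to': 1, 'i': 2, 'like': 3, 'go': 4, 'school': 5}
print(data)                   # e.g. [2, 3, 1, 4, 1, 5]
print(reverse_dictionary[1])  # e.g. 'to'
```
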
155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 4, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "Most common words (+UNK) [['UNK', 69215], ('the', 226881), (',', 184013), ('.', 120944), ('of', 116323)]\n", 167 | "Sample data [1728, 9, 8, 17488, 223, 4, 5211, 4461, 26, 11637]\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "\n", 173 | "vocabulary_size = 50000\n", 174 | "\n", 175 | "def build_dataset(words):\n", 176 | " count = [['UNK', -1]]\n", 177 | " count.extend(collections.Counter(words).most_common(vocabulary_size - 1))\n", 178 | " dictionary = dict()\n", 179 | " for word, _ in count:\n", 180 | " dictionary[word] = len(dictionary)\n", 181 | " data = list()\n", 182 | " unk_count = 0\n", 183 | " for word in words:\n", 184 | " if word in dictionary:\n", 185 | " index = dictionary[word]\n", 186 | " else:\n", 187 | " index = 0 # dictionary['UNK']\n", 188 | " unk_count = unk_count + 1\n", 189 | " data.append(index)\n", 190 | " count[0][1] = unk_count\n", 191 | " reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) \n", 192 | " assert len(dictionary) == vocabulary_size\n", 193 | " return data, count, dictionary, reverse_dictionary\n", 194 | "\n", 195 | "data, count, dictionary, reverse_dictionary = build_dataset(words)\n", 196 | "print('Most common words (+UNK)', count[:5])\n", 197 | "print('Sample data', data[:10])\n", 198 | "del words # Hint to reduce memory." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 5, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']\n", 211 | "\n", 212 | "with window_size = 1:\n", 213 | " batch: ['is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']\n", 214 | " labels: [['propaganda', 'a'], ['is', 'concerted'], ['a', 'set'], ['concerted', 'of'], ['set', 'messages'], ['of', 'aimed'], ['messages', 'at'], ['aimed', 'influencing']]\n", 215 | "\n", 216 | "with window_size = 2:\n", 217 | " batch: ['a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']\n", 218 | " labels: [['propaganda', 'is', 'concerted', 'set'], ['is', 'a', 'set', 'of'], ['a', 'concerted', 'of', 'messages'], ['concerted', 'set', 'messages', 'aimed'], ['set', 'of', 'aimed', 'at'], ['of', 'messages', 'at', 'influencing'], ['messages', 'aimed', 'influencing', 'the'], ['aimed', 'at', 'the', 'opinions']]\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "data_index = 0\n", 224 | "\n", 225 | "def generate_batch(batch_size, window_size):\n", 226 | " global data_index\n", 227 | "\n", 228 | " # two numpy arras to hold target words (batch)\n", 229 | " # and context words (labels)\n", 230 | " # Note that the labels array has 2*window_size columns\n", 231 | " batch = np.ndarray(shape=(batch_size), dtype=np.int32)\n", 232 | " labels = np.ndarray(shape=(batch_size, 2*window_size), dtype=np.int32)\n", 233 | " \n", 234 | " # span defines the total window size, where\n", 235 | " # data we consider at an instance looks as follows. 
\n", 236 | " # [ skip_window target skip_window ]\n", 237 | " span = 2 * window_size + 1 # [ skip_window target skip_window ]\n", 238 | " \n", 239 | " buffer = collections.deque(maxlen=span)\n", 240 | " \n", 241 | " # Fill the buffer and update the data_index\n", 242 | " for _ in range(span):\n", 243 | " buffer.append(data[data_index])\n", 244 | " data_index = (data_index + 1) % len(data)\n", 245 | " \n", 246 | " # for a full length of batch size, we do the following\n", 247 | " # make the target word the i th input word (i th row of batch)\n", 248 | " # make all the context words the columns of labels array\n", 249 | " # Update the data index and the buffer \n", 250 | " for i in range(batch_size):\n", 251 | " batch[i] = buffer[window_size]\n", 252 | " labels[i, :] = [buffer[span_idx] for span_idx in list(range(0,window_size))+ list(range(window_size+1,span))]\n", 253 | " buffer.append(data[data_index])\n", 254 | " data_index = (data_index + 1) % len(data)\n", 255 | " \n", 256 | " return batch, labels\n", 257 | "\n", 258 | "print('data:', [reverse_dictionary[di] for di in data[:8]])\n", 259 | "\n", 260 | "for window_size in [1,2]:\n", 261 | " data_index = 0\n", 262 | " batch, labels = generate_batch(batch_size=8, window_size=window_size)\n", 263 | " print('\\nwith window_size = %d:' % window_size)\n", 264 | " print(' batch:', [reverse_dictionary[bi] for bi in batch])\n", 265 | " print(' labels:', [[reverse_dictionary[li] for li in lbls] for lbls in labels])" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "# Structured Skip-Gram Algorithm\n", 273 | "The basic idea behind the structured skip-gram algorithm is to pay attention to the position of the context words during learning. Giving the algorithm the power to distinguish between words falling very close to the target word and the ones that fall far away from the context words allow the structured skip-gram model to learn better word vectors ([Paper](http://www.cs.cmu.edu/~lingwang/papers/naacl2015.pdf)). You can learn about this algorithm in more detail in Chapter 4 text." 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "### Defining Hyperparameters\n", 281 | "\n", 282 | "Here we define several hyperparameters including `batch_size` (amount of samples in a single batch) `embedding_size` (size of embedding vectors) `window_size` (context window size)." 
283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 6, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "batch_size = 128 # Data points in a single batch\n", 294 | "embedding_size = 128 # Dimension of the embedding vector.\n", 295 | "window_size = 2 # How many words to consider left and right.\n", 296 | "\n", 297 | "# We pick a random validation set to sample nearest neighbors\n", 298 | "valid_size = 16 # Random set of words to evaluate similarity on.\n", 299 | "# We sample valid datapoints randomly from a large window without always being deterministic\n", 300 | "valid_window = 50\n", 301 | "\n", 302 | "# When selecting valid examples, we select some of the most frequent words as well as\n", 303 | "# some moderately rare words as well\n", 304 | "valid_examples = np.array(random.sample(range(valid_window), valid_size))\n", 305 | "valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)\n", 306 | "\n", 307 | "num_sampled = 32 # Number of negative examples to sample." 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "### Defining Inputs and Outputs\n", 315 | "\n", 316 | "Here we define placeholders for feeding in training inputs and outputs (each of size `batch_size`) and a constant tensor to contain validation examples." 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 7, 322 | "metadata": { 323 | "collapsed": true 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "tf.reset_default_graph()\n", 328 | "\n", 329 | "# Training input data (target word IDs).\n", 330 | "train_dataset = tf.placeholder(tf.int32, shape=[batch_size])\n", 331 | "# Training input label data (context word IDs)\n", 332 | "train_labels = [tf.placeholder(tf.int32, shape=[batch_size, 1]) for _ in range(2*window_size)]\n", 333 | "# Validation input data, we don't need a placeholder\n", 334 | "# as we have already defined the IDs of the words selected\n", 335 | "# as validation data\n", 336 | "valid_dataset = tf.constant(valid_examples, dtype=tf.int32)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "### Defining Model Parameters and Other Variables\n", 344 | "We now define several TensorFlow variables such as an embedding layer (`embeddings`) and neural network parameters (`softmax_weights` and `softmax_biases`). Note that the softmax weights is `2*window_size` larger than the original skip-gram algorithms's softmax weights." 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 8, 350 | "metadata": { 351 | "collapsed": true 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "embeddings = tf.Variable(\n", 356 | "tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))\n", 357 | "softmax_weights = [tf.Variable(\n", 358 | "tf.truncated_normal([vocabulary_size, embedding_size],\n", 359 | " stddev=0.5 / math.sqrt(embedding_size))) for _ in range(2*window_size)]\n", 360 | "softmax_biases = [tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01)) for _ in range(2*window_size)]\n" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "### Defining the Model Computations\n", 368 | "\n", 369 | "We first defing a lookup function to fetch the corresponding embedding vectors for a set of given inputs. 
With that, we define negative sampling loss function `tf.nn.sampled_softmax_loss` which takes in the embedding vectors and previously defined neural network parameters." 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 9, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "WARNING:tensorflow:From c:\\users\\thushan\\documents\\python_virtualenvs\\tensorflow_venv\\lib\\site-packages\\tensorflow\\python\\ops\\nn_impl.py:1344: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.\n", 382 | "Instructions for updating:\n", 383 | "\n", 384 | "Future major versions of TensorFlow will allow gradients to flow\n", 385 | "into the labels input on backprop by default.\n", 386 | "\n", 387 | "See @{tf.nn.softmax_cross_entropy_with_logits_v2}.\n", 388 | "\n" 389 | ] 390 | } 391 | ], 392 | "source": [ 393 | "\n", 394 | "# Model.\n", 395 | "# Look up embeddings for inputs.\n", 396 | "embed = tf.nn.embedding_lookup(embeddings, train_dataset)\n", 397 | "\n", 398 | "# You might see the warning when running the line below\n", 399 | "# WARNING:tensorflow:From c:\\...\\lib\\site-packages\\tensorflow\\python\\ops\\nn_impl.py:1346: \n", 400 | "#softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and \n", 401 | "# will be removed in a future version.\n", 402 | "# This is due to the sampled_softmax_loss function using a deprecated function internally\n", 403 | "# therefore, this is not an error in the code and you can ignore this error\n", 404 | "\n", 405 | "# Compute the softmax loss, using a sample of the negative labels each time.\n", 406 | "loss = tf.reduce_sum(\n", 407 | "[\n", 408 | " tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weights[wi], biases=softmax_biases[wi], inputs=embed,\n", 409 | " labels=train_labels[wi], num_sampled=num_sampled, num_classes=vocabulary_size))\n", 410 | " for wi in range(window_size*2)\n", 411 | "]\n", 412 | ")\n", 413 | "\n" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "### Calculating Word Similarities \n", 421 | "We calculate the similarity between two given words in terms of the cosine distance. To do this efficiently we use matrix operations to do so, as shown below." 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 10, 427 | "metadata": { 428 | "collapsed": true 429 | }, 430 | "outputs": [], 431 | "source": [ 432 | "# Compute the similarity between minibatch examples and all embeddings.\n", 433 | "# We use the cosine distance:\n", 434 | "norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))\n", 435 | "normalized_embeddings = embeddings / norm\n", 436 | "valid_embeddings = tf.nn.embedding_lookup(\n", 437 | "normalized_embeddings, valid_dataset)\n", 438 | "similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))\n" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "### Model Parameter Optimizer\n", 446 | "\n", 447 | "We then define a constant learning rate and an optimizer which uses the Adagrad method. Feel free to experiment with other optimizers listed [here](https://www.tensorflow.org/api_guides/python/train)." 
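
The note above invites experimenting with other optimizers. As one hedged illustration (not what the notebook uses), the Adagrad line in the next cell could be swapped for Adam or RMSProp with a smaller learning rate; the toy quadratic loss below is only there to keep the snippet self-contained.

```
# Illustration of swapping the optimizer; the notebook itself uses tf.train.AdagradOptimizer(1.0).
import tensorflow as tf

tf.reset_default_graph()
w = tf.Variable(5.0)
dummy_loss = tf.square(w - 1.0)  # stand-in for the notebook's sampled softmax loss

optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(dummy_loss)
# optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001).minimize(dummy_loss)
```
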
448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 11, 453 | "metadata": { 454 | "collapsed": true 455 | }, 456 | "outputs": [], 457 | "source": [ 458 | "\n", 459 | "# Optimizer.\n", 460 | "optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)\n" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "## Running the Structured Skip-gram Algorithm\n", 468 | "\n", 469 | "Here we run the structured skip-gram algorithm we defined above. Specifically, we first initialize variables, and then train the algorithm for many steps (`num_steps`). And every few steps we evaluate the algorithm on a fixed validation set and print out the words that appear to be closest for a given set of words." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 12, 475 | "metadata": { 476 | "scrolled": true 477 | }, 478 | "outputs": [ 479 | { 480 | "name": "stdout", 481 | "output_type": "stream", 482 | "text": [ 483 | "Initialized\n", 484 | "Average loss at step 2000: 14.825290\n", 485 | "Average loss at step 4000: 12.924444\n", 486 | "Average loss at step 6000: 12.492212\n", 487 | "Average loss at step 8000: 12.194010\n", 488 | "Average loss at step 10000: 11.985265\n", 489 | "Nearest to .: ;, ,, of, :, and, reclassify, '', in,\n", 490 | "Nearest to which: but, that, who, and, it, what, where, then,\n", 491 | "Nearest to an: a, the, chong, its, constrained, rockwell, spartan, cigars,\n", 492 | "Nearest to as: creating, by, 29.9, kunda, ravens, diracodon, including, attractive,\n", 493 | "Nearest to be: been, have, being, daly, was, spectroscopic, often, were,\n", 494 | "Nearest to first: last, second, next, only, same, main, original, late,\n", 495 | "Nearest to ,: ;, ., (, and, :, of, —, ''protecteur,\n", 496 | "Nearest to (: ;, ,, na+, 30.1, travis, :, per, dram,\n", 497 | "Nearest to from: in, into, across, by, at, resident, lcd, between,\n", 498 | "Nearest to for: soo, of, over, follower, bien, among, inequalities, introductions,\n", 499 | "Nearest to ;: ., ,, :, (, —, ..., magill, one-man,\n", 500 | "Nearest to have: had, has, were, are, be, 7-6, year.the, make,\n", 501 | "Nearest to UNK: artificially, postulated, disasters, cooling, tselinograd, indefinite, enthusiastic, mass-marketed,\n", 502 | "Nearest to or: and, centred, hematoma, allowing, preeminent, than, resident, tubulin,\n", 503 | "Nearest to :: ;, ., dexter, (, stallion, methodologies, ,, une,\n", 504 | "Nearest to the: a, its, their, his, any, this, an, another,\n", 505 | "Average loss at step 12000: 11.888148\n", 506 | "Average loss at step 14000: 11.728705\n", 507 | "Average loss at step 16000: 11.675454\n", 508 | "Average loss at step 18000: 11.627067\n", 509 | "Average loss at step 20000: 11.579372\n", 510 | "Nearest to .: ;, ,, :, —, and, of, (, pear-shaped,\n", 511 | "Nearest to which: that, but, who, where, and, can, this, ecstasy,\n", 512 | "Nearest to an: a, the, rockwell, irwin, spartan, reclassify, oyo, corrugated,\n", 513 | "Nearest to as: kunda, called, yorkville, ordinances, attractive, creating, diracodon, revitalized,\n", 514 | "Nearest to be: been, being, have, become, were, write, amnh, clearly,\n", 515 | "Nearest to first: last, second, next, only, final, original, drury, main,\n", 516 | "Nearest to ,: ;, ., —, -, :, (, and, turkish-cypriot,\n", 517 | "Nearest to (: ;, -, ,, —, or, per, travis, :,\n", 518 | "Nearest to from: across, into, in, by, between, towards, through, on,\n", 519 | "Nearest to for: during, bien, yamaha, with, 
within, after, soo, among,\n", 520 | "Nearest to ;: ., ,, :, —, (, consumes, -, censured,\n", 521 | "Nearest to have: had, has, were, be, having, are, martyred, preserve,\n", 522 | "Nearest to UNK: tselinograd, 10,000, 1835., 4, quilting, r, jewellery, deadline,\n", 523 | "Nearest to or: and, (, buddhist, hematoma, centred, preeminent, retain, angélil,\n", 524 | "Nearest to :: ., ;, ,, —, pan-slavic, consumes, resorts, reunify,\n", 525 | "Nearest to the: its, a, their, his, an, her, typing, paroled,\n", 526 | "Average loss at step 22000: 11.470342\n", 527 | "Average loss at step 24000: 11.460939\n", 528 | "Average loss at step 26000: 11.296547\n", 529 | "Average loss at step 28000: 10.891410\n", 530 | "Average loss at step 30000: 10.739479\n", 531 | "Nearest to .: ;, ,, alfa, —, --, outcroppings, conventions, shaved,\n", 532 | "Nearest to which: who, that, but, and, where, whom, shoppers, repeal,\n", 533 | "Nearest to an: spartan, rockwell, pablo, cheadle, novgorodians, irwin, resnick, corrugated,\n", 534 | "Nearest to as: creating, yorkville, kunda, sài, f-75, including, retains, rossini,\n", 535 | "Nearest to be: been, being, become, spectroscopic, clearly, remain, write, kieft,\n", 536 | "Nearest to first: last, next, second, final, best, only, third, highest,\n", 537 | "Nearest to ,: ;, ., —, -, hazardous, –, ''protecteur, inflicted,\n", 538 | "Nearest to (: -, na+, travis, =, quis, 4-d, preemptive, sarawak,\n", 539 | "Nearest to from: across, in, longevity, overcome, snoop, via, missile, panicked,\n", 540 | "Nearest to for: bien, water…and, nasty, hectare, of, goryeo, keller, −300,\n", 541 | "Nearest to ;: ., ,, —, -, --, one-man, :, mucous,\n", 542 | "Nearest to have: had, has, having, rely, are, western-style, year.the, were,\n", 543 | "Nearest to UNK: blue, tarnów, monkees, tselinograd, silent, 1.1, artificially, 300,\n", 544 | "Nearest to or: and, preeminent, formatted, landmass, langston, morton, erysipelas, tubulin,\n", 545 | "Nearest to :: dexter, freshest, une, -, cowdery, ;, include, aerostatic/aerodynamic,\n", 546 | "Nearest to the: its, a, his, their, harriet, debbie, our, tranquilizer,\n", 547 | "Average loss at step 32000: 10.743503\n", 548 | "Average loss at step 34000: 10.741862\n", 549 | "Average loss at step 36000: 10.708315\n", 550 | "Average loss at step 38000: 10.610916\n", 551 | "Average loss at step 40000: 10.676803\n", 552 | "Nearest to .: ;, ,, :, and, 1932–33, harvested, cornelius, kalmykova,\n", 553 | "Nearest to which: that, who, whom, but, what, where, repeal, ecstasy,\n", 554 | "Nearest to an: rockwell, sarai, constrained, irwin, resnick, open-spandrel, spartan, corrugated,\n", 555 | "Nearest to as: yorkville, self-esteem, kunda, attractive, |mar_lo_°c, ravens, f-75, disperse,\n", 556 | "Nearest to be: been, being, fully, remain, amnh, have, was, clearly,\n", 557 | "Nearest to first: last, second, next, earliest, only, final, original, best,\n", 558 | "Nearest to ,: ;, ., —, and, -, ''protecteur, shipboard, –,\n", 559 | "Nearest to (: -, —, dram, approximately, =, 30.1, –, na+,\n", 560 | "Nearest to from: into, across, collects, lcd, documenting, mastiff, deep-seated, accommodating,\n", 561 | "Nearest to for: nasty, soo, introductions, in, water…and, yamaha, among, tumwater,\n", 562 | "Nearest to ;: ,, ., —, superfluous, petitioner, pro-russian, complains, fund-raising,\n", 563 | "Nearest to have: had, has, are, having, contain, were, martyred, apply,\n", 564 | "Nearest to UNK: r, tselinograd, perch, zha, re-instated, eighth, 300, speculates,\n", 565 | 
"Nearest to or: and, somebody, formatted, nor, dat, preeminent, 4-5, landmass,\n", 566 | "Nearest to :: dexter, consumes, word, providers, stallion, differentiating, pan-slavic, .,\n", 567 | "Nearest to the: a, their, your, his, its, any, zaidi, generalfeldmarschall,\n", 568 | "Average loss at step 42000: 10.604452\n", 569 | "Average loss at step 44000: 10.669711\n", 570 | "Average loss at step 46000: 10.638800\n", 571 | "Average loss at step 48000: 10.602861\n", 572 | "Average loss at step 50000: 10.685731\n", 573 | "Nearest to .: ;, ,, :, —, photographing, interviewing, shias, in,\n", 574 | "Nearest to which: that, and, but, where, what, whom, who, ecstasy,\n", 575 | "Nearest to an: rockwell, corrugated, reclassify, boise, irwin, novgorodians, resnick, the,\n", 576 | "Nearest to as: self-esteem, kunda, triploid, attractive, ravens, |mar_lo_°c, racquet, yorkville,\n", 577 | "Nearest to be: been, being, easily, replace, surpass, remain, solve, readily,\n", 578 | "Nearest to first: last, next, second, earliest, final, fourth, third, only,\n", 579 | "Nearest to ,: —, ;, ., (, and, in, djurgårdens, shipboard,\n", 580 | "Nearest to (: —, -, dram, ,, –, ''hancock, approximately, ;,\n", 581 | "Nearest to from: into, in, through, lcd, across, sault, liaison, towards,\n", 582 | "Nearest to for: nasty, yamaha, introductions, soo, during, arbitrarily, bien, in,\n", 583 | "Nearest to ;: ., ,, —, :, -, consumes, (, than,\n", 584 | "Nearest to have: had, has, having, are, were, contain, martyred, rely,\n", 585 | "Nearest to UNK: hawkeye, silent, tselinograd, brown, non-living, aesthetics, d, here,\n", 586 | "Nearest to or: and, formosan, dat, desc, nor, preeminent, containing, formatted,\n", 587 | "Nearest to :: ., ;, differentiating, consumes, resorts, dexter, cowdery, methodologies,\n", 588 | "Nearest to the: a, its, their, this, horsetails, his, acelhuate, delagoa,\n", 589 | "Average loss at step 52000: 10.430200\n", 590 | "Average loss at step 54000: 10.324997\n", 591 | "Average loss at step 56000: 10.216399\n", 592 | "Average loss at step 58000: 10.217039\n", 593 | "Average loss at step 60000: 10.210400\n", 594 | "Nearest to .: ,, ;, albinus, of, :, ?, 'big, matsui,\n", 595 | "Nearest to which: that, who, whom, what, shoppers, but, sheikh, repeal,\n", 596 | "Nearest to an: rockwell, resnick, spartan, open-spandrel, kant, irwin, corrugated, takings,\n", 597 | "Nearest to as: ravens, self-esteem, blaming, sài, beginning, result, creating, attractive,\n", 598 | "Nearest to be: been, have, being, easily, clearly, fully, replace, grow,\n", 599 | "Nearest to first: last, second, earliest, next, only, final, best, original,\n", 600 | "Nearest to ,: ., ;, —, theobromine, -, ''protecteur, cabled, :,\n", 601 | "Nearest to (: [, -, 405, bernard, adventurers, dram, horace, 30.1,\n", 602 | "Nearest to from: jawbone, metrovick, overcome, lcd, replacing, across, into, in,\n", 603 | "Nearest to for: introductions, seeker, spion, reactor, nasty, smelting, bien, rehabilitated,\n", 604 | "Nearest to ;: ., ,, —, khaldun, prowess, -, avellaneda, :,\n", 605 | "Nearest to have: has, had, be, having, dumps, rely, apply, contain,\n", 606 | "Nearest to UNK: tselinograd, 1800, re-instated, -1, fostered, tarnów, r., cobo,\n", 607 | "Nearest to or: and, formatted, landmass, centred, meaning, preeminent, hematoma, reciting,\n", 608 | "Nearest to :: termed, dexter, providers, freshest, ., teufel, stallion, nickname,\n", 609 | "Nearest to the: a, glial, blackburn, our, appease, 'the, atheistic, various,\n" 610 | ] 611 | }, 612 | 
{ 613 | "name": "stdout", 614 | "output_type": "stream", 615 | "text": [ 616 | "Average loss at step 62000: 10.223157\n", 617 | "Average loss at step 64000: 10.105503\n", 618 | "Average loss at step 66000: 10.191790\n", 619 | "Average loss at step 68000: 10.157220\n", 620 | "Average loss at step 70000: 10.154481\n", 621 | "Nearest to .: ,, ;, of, and, in, that, verbiage, rostand,\n", 622 | "Nearest to which: that, who, and, ecstasy, whom, but, repeal, whose,\n", 623 | "Nearest to an: rockwell, resnick, spartan, irwin, sarai, cavitation, boise, novgorodians,\n", 624 | "Nearest to as: ravens, blaming, sài, kunda, rossini, medial, continuous-wave, result,\n", 625 | "Nearest to be: been, being, customisation, easily, replace, surpass, occur, were,\n", 626 | "Nearest to first: last, second, next, earliest, fourth, final, only, oldest,\n", 627 | "Nearest to ,: ., ;, and, —, of, langevin, recuperating, assaulted,\n", 628 | "Nearest to (: -, 1187., wander, ;, holmgard, —, eucharistic, mib,\n", 629 | "Nearest to from: in, towards, lcd, longevity, accommodating, accra, into, rampa,\n", 630 | "Nearest to for: yamaha, introductions, bien, nasty, during, fucking, gourmet, dislodge,\n", 631 | "Nearest to ;: ., ,, pro-russian, —, (, consumes, ?, :,\n", 632 | "Nearest to have: had, has, were, are, having, rely, brutish, be,\n", 633 | "Nearest to UNK: silent, hellene, weak, berger, hardiness, headingley, bone, 39,\n", 634 | "Nearest to or: and, formatted, preeminent, sax, pre-s2, reciting, baleen, buddhist,\n", 635 | "Nearest to :: reunify, termed, differentiating, replicates, consumes, liberalised, teufel, ;,\n", 636 | "Nearest to the: its, a, their, glial, these, 1937–1945, this, his,\n", 637 | "Average loss at step 72000: 10.217530\n", 638 | "Average loss at step 74000: 10.146726\n", 639 | "Average loss at step 76000: 10.247005\n", 640 | "Average loss at step 78000: 10.026597\n", 641 | "Average loss at step 80000: 9.882595\n", 642 | "Nearest to .: ;, ,, 1924., 10., 2003., 2006., 2004., 1983.,\n", 643 | "Nearest to which: that, whom, who, 35.6, where, shoppers, roney, but,\n", 644 | "Nearest to an: rockwell, resnick, irwin, corrugated, novgorodians, spartan, cheadle, sarai,\n", 645 | "Nearest to as: yorkville, quintessentially, |mar_lo_°c, sài, self-esteem, escapes, mississippians, thessaly,\n", 646 | "Nearest to be: been, being, surpass, easily, customisation, replace, deliberately, occur,\n", 647 | "Nearest to first: last, second, next, fourth, only, earliest, final, oldest,\n", 648 | "Nearest to ,: ., ;, and, melinda, refuelling, —, apostate, hunslet,\n", 649 | "Nearest to (: -, na+, dram, chanute, indented, lihue, 4-d, approximately,\n", 650 | "Nearest to from: in, lampboard, deep-seated, waterways, across, israeli-palestinian, cambodian, lcd,\n", 651 | "Nearest to for: water…and, nasty, concealing, γαλαξίας, yamaha, bien, keller, kopfstein,\n", 652 | "Nearest to ;: ., ,, deco, --, —, one-man, penned, mucous,\n", 653 | "Nearest to have: had, has, having, rely, contain, year.the, were, spend,\n", 654 | "Nearest to UNK: unsuited, 1.1, 99, tarnów, tselinograd, schlich, monkees, natasha,\n", 655 | "Nearest to or: formatted, and, preeminent, plus, landmass, pre-s2, semivowels, lukewarm,\n", 656 | "Nearest to :: differentiating, une, termed, freshest, aerostatic/aerodynamic, dexter, dragged, rhinos,\n", 657 | "Nearest to the: a, its, non-agricultural, forster, an, shawnee, tanzanian, paroled,\n", 658 | "Average loss at step 82000: 9.922878\n", 659 | "Average loss at step 84000: 9.897537\n", 660 | "Average 
loss at step 86000: 9.913045\n", 661 | "Average loss at step 88000: 9.824237\n", 662 | "Average loss at step 90000: 9.811843\n", 663 | "Nearest to .: ,, ;, jazzy, 2001., bethad, 1821., supertankers, align=,\n", 664 | "Nearest to which: that, who, whom, what, but, shoppers, where, and,\n", 665 | "Nearest to an: rockwell, resnick, sarai, corrugated, thane, open-spandrel, fended, entremeses,\n", 666 | "Nearest to as: self-esteem, |mar_lo_°c, thelma, triploid, kunda, ravens, yorkville, quintessentially,\n", 667 | "Nearest to be: been, being, surpass, become, fully, grow, regain, remain,\n", 668 | "Nearest to first: second, last, earliest, next, fourth, oldest, final, facultatively,\n", 669 | "Nearest to ,: ., ;, —, posit, and, -, 802.11b, of,\n", 670 | "Nearest to (: -, —, 30.1, teddy, investment-grade, 'scouse, <, 405,\n", 671 | "Nearest to from: lampboard, into, mastiff, in, deep-seated, fasi, lcd, alchemical,\n", 672 | "Nearest to for: yamaha, soo, introductions, nasty, water…and, electrical, fattened, reactor,\n", 673 | "Nearest to ;: ,, ., —, complains, superfluous, pro-russian, scorers, transliterations,\n", 674 | "Nearest to have: had, has, exist, contain, having, are, contribute, represent,\n", 675 | "Nearest to UNK: re-instated, reuters, loftus, -1, gaston, multi-instrumentalist, harvard, tarnów,\n", 676 | "Nearest to or: formatted, and, landmass, 4-5, lampooned, buddhist, preeminent, nor,\n", 677 | "Nearest to :: freshest, dexter, providers, differentiating, actor-managers, stanislaus, retorted, aerostatic/aerodynamic,\n", 678 | "Nearest to the: a, his, paroled, its, newly-created, delagoa, stormtrooper, woda,\n", 679 | "Average loss at step 92000: 9.857963\n", 680 | "Average loss at step 94000: 9.855468\n", 681 | "Average loss at step 96000: 9.892065\n", 682 | "Average loss at step 98000: 9.858063\n", 683 | "Average loss at step 100000: 9.912151\n", 684 | "Nearest to .: ;, ,, pear-shaped, of, d'etat, :, and, seaways,\n", 685 | "Nearest to which: that, whom, what, where, who, alushta, redfin, 35.6,\n", 686 | "Nearest to an: rockwell, corrugated, resnick, irwin, boise, 40.4, novgorodians, thane,\n", 687 | "Nearest to as: kunda, triploid, yorkville, racquet, quintessentially, ravens, self-esteem, result,\n", 688 | "Nearest to be: been, being, surpass, replace, easily, was, formally, partially,\n", 689 | "Nearest to first: last, second, earliest, next, fourth, oldest, final, best,\n", 690 | "Nearest to ,: ., ;, —, ''protecteur, and, diphthongisation, chokai, ultimatetv,\n", 691 | "Nearest to (: —, dram, bernard, -, 30.1, –, sarawak, toray,\n", 692 | "Nearest to from: lampboard, phosphorus, rossby, lighter-than-air, lcd, wendt, alchemical, longevity,\n", 693 | "Nearest to for: yamaha, introductions, nasty, freest, seeker, water…and, mistook, reminded,\n", 694 | "Nearest to ;: ., ,, :, durant, --, —, basel-landschaft, >,\n", 695 | "Nearest to have: has, had, contain, having, spend, contribute, rely, 've,\n", 696 | "Nearest to UNK: silent, darts, tselinograd, 4th, berger, jewellery, honour, claudius,\n", 697 | "Nearest to or: formatted, and, slew, desc, sax, preeminent, pre-s2, meaning,\n", 698 | "Nearest to :: differentiating, ;, consumes, pour, freshest, mattila, termed, recounting,\n", 699 | "Nearest to the: a, his, its, their, 1959-1960, this, species-rich, 1.88,\n" 700 | ] 701 | } 702 | ], 703 | "source": [ 704 | "num_steps = 100001\n", 705 | "decay_learning_rate_every = 2000\n", 706 | "skip_gram_loss = [] # Collect the sequential loss values for plotting purposes\n", 707 | "\n", 708 
| "with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:\n", 709 | " tf.global_variables_initializer().run()\n", 710 | " print('Initialized')\n", 711 | " average_loss = 0\n", 712 | " for step in range(num_steps):\n", 713 | " batch_data, batch_labels = generate_batch(\n", 714 | " batch_size, window_size)\n", 715 | " feed_dict = {train_dataset : batch_data}\n", 716 | " for wi in range(2*window_size):\n", 717 | " feed_dict.update({train_labels[wi]:np.reshape(batch_labels[:,wi],(-1,1))})\n", 718 | " \n", 719 | " _, l = session.run([optimizer, loss], feed_dict=feed_dict)\n", 720 | " average_loss += l\n", 721 | " \n", 722 | " if (step+1) % 2000 == 0:\n", 723 | " if step > 0:\n", 724 | " average_loss = average_loss / 2000\n", 725 | " # The average loss is an estimate of the loss over the last 2000 batches.\n", 726 | " print('Average loss at step %d: %f' % (step+1, average_loss))\n", 727 | " skip_gram_loss.append(average_loss)\n", 728 | " average_loss = 0\n", 729 | " # note that this is expensive (~20% slowdown if computed every 500 steps)\n", 730 | " if (step+1) % 10000 == 0:\n", 731 | " sim = similarity.eval()\n", 732 | " for i in range(valid_size):\n", 733 | " valid_word = reverse_dictionary[valid_examples[i]]\n", 734 | " top_k = 8 # number of nearest neighbors\n", 735 | " nearest = (-sim[i, :]).argsort()[1:top_k+1]\n", 736 | " log = 'Nearest to %s:' % valid_word\n", 737 | " for k in range(top_k):\n", 738 | " close_word = reverse_dictionary[nearest[k]]\n", 739 | " log = '%s %s,' % (log, close_word)\n", 740 | " print(log)\n", 741 | " skip_gram_final_embeddings = normalized_embeddings.eval()\n", 742 | "\n", 743 | "# We will save the word vectors learned and the loss over time\n", 744 | "# as this information is required later for comparisons\n", 745 | "np.save('struct_skip_embeddings',skip_gram_final_embeddings)\n", 746 | "\n", 747 | "with open('struct_skip_losses.csv', 'wt') as f:\n", 748 | " writer = csv.writer(f, delimiter=',')\n", 749 | " writer.writerow(skip_gram_loss)" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": null, 755 | "metadata": { 756 | "collapsed": true 757 | }, 758 | "outputs": [], 759 | "source": [] 760 | } 761 | ], 762 | "metadata": { 763 | "kernelspec": { 764 | "display_name": "Python 3", 765 | "language": "python", 766 | "name": "python3" 767 | }, 768 | "language_info": { 769 | "codemirror_mode": { 770 | "name": "ipython", 771 | "version": 3 772 | }, 773 | "file_extension": ".py", 774 | "mimetype": "text/x-python", 775 | "name": "python", 776 | "nbconvert_exporter": "python", 777 | "pygments_lexer": "ipython3", 778 | "version": "3.5.2" 779 | } 780 | }, 781 | "nbformat": 4, 782 | "nbformat_minor": 2 783 | } 784 | -------------------------------------------------------------------------------- /ch5/cnn_sentence_classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sentence Classification with Convolution Neural Networks\n", 8 | "[Paper](https://arxiv.org/pdf/1408.5882.pdf): Convolutional Neural Networks for Sentence Classification by Yoon Kim" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "name": "stderr", 18 | "output_type": "stream", 19 | "text": [ 20 | "c:\\users\\thushan\\documents\\python_virtualenvs\\tensorflow_venv\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the 
second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 21 | " from ._conv import register_converters as _register_converters\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "# These are all the modules we'll be using later. Make sure you can import them\n", 27 | "# before proceeding further.\n", 28 | "%matplotlib inline\n", 29 | "from __future__ import print_function\n", 30 | "import collections\n", 31 | "import math\n", 32 | "import numpy as np\n", 33 | "import os\n", 34 | "import random\n", 35 | "import tensorflow as tf\n", 36 | "import zipfile\n", 37 | "from matplotlib import pylab\n", 38 | "from six.moves import range\n", 39 | "from six.moves.urllib.request import urlretrieve\n", 40 | "import tensorflow as tf" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Downloading and Checking the Dataset\n", 48 | "This [dataset](Dataset: http://cogcomp.cs.illinois.edu/Data/QA/QC/) is composed of questions as inputs and their respective type as the output. For example, (e.g. Who was Abraham Lincon?) and the output or label would be Human." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | "question-classif-data\\train_1000.label\n", 61 | "Found and verified question-classif-data\\train_1000.label\n", 62 | "question-classif-data\\TREC_10.label\n", 63 | "Found and verified question-classif-data\\TREC_10.label\n" 64 | ] 65 | } 66 | ], 67 | "source": [ 68 | "url = 'http://cogcomp.org/Data/QA/QC/'\n", 69 | "dir_name = 'question-classif-data'\n", 70 | "\n", 71 | "def maybe_download(dir_name, filename, expected_bytes):\n", 72 | " \"\"\"Download a file if not present, and make sure it's the right size.\"\"\"\n", 73 | " if not os.path.exists(dir_name):\n", 74 | " os.mkdir(dir_name)\n", 75 | " if not os.path.exists(os.path.join(dir_name,filename)):\n", 76 | " filename, _ = urlretrieve(url + filename, os.path.join(dir_name,filename))\n", 77 | " print(os.path.join(dir_name,filename))\n", 78 | " statinfo = os.stat(os.path.join(dir_name,filename))\n", 79 | " if statinfo.st_size == expected_bytes:\n", 80 | " print('Found and verified %s' % os.path.join(dir_name,filename))\n", 81 | " else:\n", 82 | " print(statinfo.st_size)\n", 83 | " raise Exception(\n", 84 | " 'Failed to verify ' + os.path.join(dir_name,filename) + '. 
Can you get to it with a browser?')\n", 85 | " return filename\n", 86 | "\n", 87 | "filename = maybe_download(dir_name, 'train_1000.label', 60774)\n", 88 | "test_filename = maybe_download(dir_name, 'TREC_10.label',23354)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 3, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "Files found and verified.\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "# Check the existence of files\n", 106 | "filenames = ['train_1000.label','TREC_10.label']\n", 107 | "num_files = len(filenames)\n", 108 | "for i in range(len(filenames)):\n", 109 | " file_exists = os.path.isfile(os.path.join(dir_name,filenames[i]))\n", 110 | " assert file_exists\n", 111 | "print('Files found and verified.')" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "## Loading and Preprocessing Data\n", 119 | "Below we load the text into the program and do some simple preprocessing on data" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "\n", 132 | "Processing file question-classif-data\\train_1000.label\n", 133 | "\tQuestion 0: ['manner', 'how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?']\n", 134 | "\tLabel 0: DESC\n", 135 | "\n", 136 | "\tQuestion 1: ['cremat', 'what', 'films', 'featured', 'the', 'character', 'popeye', 'doyle', '?']\n", 137 | "\tLabel 1: ENTY\n", 138 | "\n", 139 | "\tQuestion 2: ['manner', 'how', 'can', 'i', 'find', 'a', 'list', 'of', 'celebrities', \"'\", 'real', 'names', '?']\n", 140 | "\tLabel 2: DESC\n", 141 | "\n", 142 | "\tQuestion 3: ['animal', 'what', 'fowl', 'grabs', 'the', 'spotlight', 'after', 'the', 'chinese', 'year', 'of', 'the', 'monkey', '?']\n", 143 | "\tLabel 3: ENTY\n", 144 | "\n", 145 | "\tQuestion 4: ['exp', 'what', 'is', 'the', 'full', 'form', 'of', '.com', '?']\n", 146 | "\tLabel 4: ABBR\n", 147 | "\n", 148 | "\n", 149 | "Processing file question-classif-data\\TREC_10.label\n", 150 | "\tQuestion 0: ['manner', 'how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?']\n", 151 | "\tLabel 0: DESC\n", 152 | "\n", 153 | "\tQuestion 1: ['cremat', 'what', 'films', 'featured', 'the', 'character', 'popeye', 'doyle', '?']\n", 154 | "\tLabel 1: ENTY\n", 155 | "\n", 156 | "\tQuestion 2: ['manner', 'how', 'can', 'i', 'find', 'a', 'list', 'of', 'celebrities', \"'\", 'real', 'names', '?']\n", 157 | "\tLabel 2: DESC\n", 158 | "\n", 159 | "\tQuestion 3: ['animal', 'what', 'fowl', 'grabs', 'the', 'spotlight', 'after', 'the', 'chinese', 'year', 'of', 'the', 'monkey', '?']\n", 160 | "\tLabel 3: ENTY\n", 161 | "\n", 162 | "\tQuestion 4: ['exp', 'what', 'is', 'the', 'full', 'form', 'of', '.com', '?']\n", 163 | "\tLabel 4: ABBR\n", 164 | "\n", 165 | "Max Sentence Length: 33\n", 166 | "\n", 167 | "Normalizing all sentences to same length\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "# Records the maximum length of the sentences\n", 173 | "# as we need to pad shorter sentences accordingly\n", 174 | "max_sent_length = 0 \n", 175 | "\n", 176 | "def read_data(filename):\n", 177 | " '''\n", 178 | " Read data from a file with given filename\n", 179 | " Returns a list of strings where each string is a lower case word\n", 180 | " '''\n", 181 | " global max_sent_length\n", 182 | " questions = []\n", 183 | " labels = 
[]\n", 184 | " with open(filename,'r',encoding='latin-1') as f: \n", 185 | " for row in f:\n", 186 | " row_str = row.split(\":\")\n", 187 | " lb,q = row_str[0],row_str[1]\n", 188 | " q = q.lower()\n", 189 | " labels.append(lb)\n", 190 | " questions.append(q.split()) \n", 191 | " if len(questions[-1])>max_sent_length:\n", 192 | " max_sent_length = len(questions[-1])\n", 193 | " return questions,labels\n", 194 | "\n", 195 | "# Process train and Test data\n", 196 | "for i in range(num_files): \n", 197 | " print('\\nProcessing file %s'%os.path.join(dir_name,filenames[i]))\n", 198 | " if i==0:\n", 199 | " # Processing training data\n", 200 | " train_questions,train_labels = read_data(os.path.join(dir_name,filenames[i]))\n", 201 | " # Making sure we got all the questions and corresponding labels\n", 202 | " assert len(train_questions)==len(train_labels)\n", 203 | " elif i==1:\n", 204 | " # Processing testing data\n", 205 | " test_questions,test_labels = read_data(os.path.join(dir_name,filenames[i]))\n", 206 | " # Making sure we got all the questions and corresponding labels.\n", 207 | " assert len(test_questions)==len(test_labels)\n", 208 | " \n", 209 | " # Print some data to see everything is okey\n", 210 | " for j in range(5):\n", 211 | " print('\\tQuestion %d: %s' %(j,train_questions[j]))\n", 212 | " print('\\tLabel %d: %s\\n'%(j,train_labels[j]))\n", 213 | " \n", 214 | "print('Max Sentence Length: %d'%max_sent_length)\n", 215 | "print('\\nNormalizing all sentences to same length')" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "## Padding Shorter Sentences\n", 223 | "We use padding to pad short sentences so that all the sentences are of the same length." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 5, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "Train questions padded\n", 236 | "\n", 237 | "Test questions padded\n", 238 | "\n", 239 | "Sample test question: %s ['dist', 'how', 'far', 'is', 'it', 'from', 'denver', 'to', 'aspen', '?', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']\n" 240 | ] 241 | } 242 | ], 243 | "source": [ 244 | "# Padding training data\n", 245 | "for qi,que in enumerate(train_questions):\n", 246 | " for _ in range(max_sent_length-len(que)):\n", 247 | " que.append('PAD')\n", 248 | " assert len(que)==max_sent_length\n", 249 | " train_questions[qi] = que\n", 250 | "print('Train questions padded')\n", 251 | "\n", 252 | "# Padding testing data\n", 253 | "for qi,que in enumerate(test_questions):\n", 254 | " for _ in range(max_sent_length-len(que)):\n", 255 | " que.append('PAD')\n", 256 | " assert len(que)==max_sent_length\n", 257 | " test_questions[qi] = que\n", 258 | "print('\\nTest questions padded') \n", 259 | "\n", 260 | "# Printing a test question to see if everything is correct\n", 261 | "print('\\nSample test question: %s',test_questions[0])" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "## Building the Dictionaries\n", 269 | "Builds the following. To understand each of these elements, let us also assume the text \"I like to go to school\"\n", 270 | "\n", 271 | "* `dictionary`: maps a string word to an ID (e.g. {I:0, like:1, to:2, go:3, school:4})\n", 272 | "* `reverse_dictionary`: maps an ID to a string word (e.g. 
{0:I, 1:like, 2:to, 3:go, 4:school}\n", 273 | "* `count`: List of list of (word, frequency) elements (e.g. [(I,1),(like,1),(to,2),(go,1),(school,1)]\n", 274 | "* `data` : Contain the string of text we read, where string words are replaced with word IDs (e.g. [0, 1, 2, 3, 2, 4])\n", 275 | "\n", 276 | "We do not replace rare words with \"UNK\" because the vocabulary is already quite small." 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 6, 282 | "metadata": {}, 283 | "outputs": [ 284 | { 285 | "name": "stdout", 286 | "output_type": "stream", 287 | "text": [ 288 | "49500 Words found.\n", 289 | "Found 3369 words in the vocabulary. \n", 290 | "All words (count) [('PAD', 34407), ('?', 1454), ('the', 999), ('what', 963), ('is', 587)]\n", 291 | "\n", 292 | "0th entry in dictionary: %s PAD\n", 293 | "\n", 294 | "Sample data [38, 12, 19, 1977, 1118, 6, 28, 2230, 3107, 686, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", 295 | "\n", 296 | "Sample data [44, 3, 881, 2852, 2, 173, 2113, 2996, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", 297 | "\n", 298 | "Vocabulary: 3369\n", 299 | "\n", 300 | "Number of training questions: 1000\n", 301 | "Number of testing questions: 500\n" 302 | ] 303 | } 304 | ], 305 | "source": [ 306 | "def build_dataset(questions):\n", 307 | " words = []\n", 308 | " data_list = []\n", 309 | " count = []\n", 310 | " \n", 311 | " # First create a large list with all the words in all the questions\n", 312 | " for d in questions:\n", 313 | " words.extend(d)\n", 314 | " print('%d Words found.'%len(words)) \n", 315 | " print('Found %d words in the vocabulary. '%len(collections.Counter(words).most_common()))\n", 316 | " \n", 317 | " # Sort words by there frequency\n", 318 | " count.extend(collections.Counter(words).most_common())\n", 319 | " \n", 320 | " # Create an ID for each word by giving the current length of the dictionary\n", 321 | " # And adding that item to the dictionary\n", 322 | " dictionary = dict()\n", 323 | " for word, _ in count:\n", 324 | " dictionary[word] = len(dictionary)\n", 325 | " \n", 326 | " # Traverse through all the text and \n", 327 | " # replace the string words with the ID \n", 328 | " # of the word found at that index\n", 329 | " for d in questions:\n", 330 | " data = list()\n", 331 | " for word in d:\n", 332 | " index = dictionary[word] \n", 333 | " data.append(index)\n", 334 | " \n", 335 | " data_list.append(data)\n", 336 | " \n", 337 | " reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) \n", 338 | " \n", 339 | " return data_list, count, dictionary, reverse_dictionary\n", 340 | "\n", 341 | "# Create a dataset with both train and test questions\n", 342 | "all_questions = list(train_questions)\n", 343 | "all_questions.extend(test_questions)\n", 344 | "\n", 345 | "# Use the above created dataset to build the vocabulary\n", 346 | "all_question_ind, count, dictionary, reverse_dictionary = build_dataset(all_questions)\n", 347 | "\n", 348 | "# Print some statistics about the processed data\n", 349 | "print('All words (count)', count[:5])\n", 350 | "print('\\n0th entry in dictionary: %s',reverse_dictionary[0])\n", 351 | "print('\\nSample data', all_question_ind[0])\n", 352 | "print('\\nSample data', all_question_ind[1])\n", 353 | "print('\\nVocabulary: ',len(dictionary))\n", 354 | "vocabulary_size = len(dictionary)\n", 355 | "\n", 356 | "print('\\nNumber of training questions: ',len(train_questions))\n", 357 | "print('Number of testing questions: 
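A self-contained version of what `build_dataset()` returns, run on the toy sentence from the description above (an illustrative sketch; the exact IDs depend on the frequency ordering, since the real vocabulary is sorted by word counts).

```
import collections

words = "I like to go to school".split()

count = collections.Counter(words).most_common()            # [(word, frequency), ...]
dictionary = {word: idx for idx, (word, _) in enumerate(count)}
reverse_dictionary = {idx: word for word, idx in dictionary.items()}
data = [dictionary[w] for w in words]                       # the text as word IDs

print(count)   # e.g. [('to', 2), ('I', 1), ('like', 1), ('go', 1), ('school', 1)]
print(data)    # six IDs, with the two occurrences of 'to' sharing one ID
assert [reverse_dictionary[i] for i in data] == words
```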
',len(test_questions))" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "## Generating Batches of Data\n", 365 | "Below I show the code to generate a batch of data from a given set of questions and labels." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 7, 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "name": "stdout", 375 | "output_type": "stream", 376 | "text": [ 377 | "Sample batch labels\n", 378 | "[3 4 3 4 5 2 2 2 3 2 0 3 2 2 4 1]\n", 379 | "[3 0 3 3 0 4 2 3 3 4 2 1 4 1 5 4]\n" 380 | ] 381 | } 382 | ], 383 | "source": [ 384 | "batch_size = 16 # We process 16 questions at a time\n", 385 | "sent_length = max_sent_length\n", 386 | "\n", 387 | "num_classes = 6 # Number of classes\n", 388 | "# All the types of question that are in the dataset\n", 389 | "all_labels = ['NUM','LOC','HUM','DESC','ENTY','ABBR'] \n", 390 | "\n", 391 | "class BatchGenerator(object):\n", 392 | " '''\n", 393 | " Generates a batch of data\n", 394 | " '''\n", 395 | " def __init__(self,batch_size,questions,labels):\n", 396 | " self.questions = questions\n", 397 | " self.labels = labels\n", 398 | " self.text_size = len(questions)\n", 399 | " self.batch_size = batch_size\n", 400 | " self.data_index = 0\n", 401 | " assert len(self.questions)==len(self.labels)\n", 402 | " \n", 403 | " def generate_batch(self):\n", 404 | " '''\n", 405 | " Data generation function. This outputs two matrices\n", 406 | " inputs: a batch of questions where each question is a tensor of size\n", 407 | " [sent_length, vocabulary_size] with each word one-hot-encoded\n", 408 | " labels_ohe: one-hot-encoded labels corresponding to the questions in inputs\n", 409 | " '''\n", 410 | " global sent_length,num_classes\n", 411 | " global dictionary, all_labels\n", 412 | " \n", 413 | " # Numpy arrays holding input and label data\n", 414 | " inputs = np.zeros((self.batch_size,sent_length,vocabulary_size),dtype=np.float32)\n", 415 | " labels_ohe = np.zeros((self.batch_size,num_classes),dtype=np.float32)\n", 416 | " \n", 417 | " # When we reach the end of the dataset\n", 418 | " # start from beginning\n", 419 | " if self.data_index + self.batch_size >= self.text_size:\n", 420 | " self.data_index = 0\n", 421 | " \n", 422 | " # For each question in the dataset\n", 423 | " for qi,que in enumerate(self.questions[self.data_index:self.data_index+self.batch_size]):\n", 424 | " # For each word in the question\n", 425 | " for wi,word in enumerate(que): \n", 426 | " # Set the element at the word ID index to 1\n", 427 | " # this gives the one-hot-encoded vector of that word\n", 428 | " inputs[qi,wi,dictionary[word]] = 1.0\n", 429 | " \n", 430 | " # Set the index corrsponding to that particular class to 1\n", 431 | " labels_ohe[qi,all_labels.index(self.labels[self.data_index + qi])] = 1.0\n", 432 | " \n", 433 | " # Update the data index to get the next batch of data\n", 434 | " self.data_index = (self.data_index + self.batch_size)%self.text_size\n", 435 | " \n", 436 | " return inputs,labels_ohe\n", 437 | " \n", 438 | " def return_index(self):\n", 439 | " # Get the current index of data\n", 440 | " return self.data_index\n", 441 | "\n", 442 | "# Test our batch generator\n", 443 | "sample_gen = BatchGenerator(batch_size,train_questions,train_labels)\n", 444 | "# Generate a single batch\n", 445 | "sample_batch_inputs,sample_batch_labels = sample_gen.generate_batch()\n", 446 | "# Generate another batch\n", 447 | "sample_batch_inputs_2,sample_batch_labels_2 = 
sample_gen.generate_batch()\n", 448 | "\n", 449 | "# Make sure that we infact have the question 0 as the 0th element of our batch\n", 450 | "assert np.all(np.asarray([dictionary[w] for w in train_questions[0]],dtype=np.int32) \n", 451 | " == np.argmax(sample_batch_inputs[0,:,:],axis=1))\n", 452 | "\n", 453 | "# Print some data labels we obtained\n", 454 | "print('Sample batch labels')\n", 455 | "print(np.argmax(sample_batch_labels,axis=1))\n", 456 | "print(np.argmax(sample_batch_labels_2,axis=1))" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "## Sentence Classifying Convolution Neural Network\n", 464 | "We are going to implement a very simple CNN to classify sentences. However you will see that even with this simple structure we achieve good accuracies. Our CNN will have one layer (with 3 different parallel layers). This will be followed by a pooling-over-time layer and finally a fully connected layer that produces the logits." 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "## Defining hyperparameters and inputs" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 8, 477 | "metadata": { 478 | "collapsed": true 479 | }, 480 | "outputs": [], 481 | "source": [ 482 | "tf.reset_default_graph()\n", 483 | "\n", 484 | "batch_size = 32\n", 485 | "# Different filter sizes we use in a single convolution layer\n", 486 | "filter_sizes = [3,5,7] \n", 487 | "\n", 488 | "# inputs and labels\n", 489 | "sent_inputs = tf.placeholder(shape=[batch_size,sent_length,vocabulary_size],dtype=tf.float32,name='sentence_inputs')\n", 490 | "sent_labels = tf.placeholder(shape=[batch_size,num_classes],dtype=tf.float32,name='sentence_labels')\n" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "## Defining Model Parameters\n", 498 | "Our model has following parameters.\n", 499 | "* 3 sets of convolution layer weights and biases (one for each parallel layer)\n", 500 | "* 1 fully connected output layer" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 9, 506 | "metadata": { 507 | "collapsed": true 508 | }, 509 | "outputs": [], 510 | "source": [ 511 | "# 3 filters with different context window sizes (3,5,7)\n", 512 | "# Each of this filter spans the full one-hot-encoded length of each word and the context window width\n", 513 | "\n", 514 | "# Weights of the first parallel layer\n", 515 | "w1 = tf.Variable(tf.truncated_normal([filter_sizes[0],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_1')\n", 516 | "b1 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_1')\n", 517 | "\n", 518 | "# Weights of the second parallel layer\n", 519 | "w2 = tf.Variable(tf.truncated_normal([filter_sizes[1],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_2')\n", 520 | "b2 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_2')\n", 521 | "\n", 522 | "# Weights of the third parallel layer\n", 523 | "w3 = tf.Variable(tf.truncated_normal([filter_sizes[2],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_3')\n", 524 | "b3 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_3')\n", 525 | "\n", 526 | "# Fully connected layer\n", 527 | "w_fc1 = tf.Variable(tf.truncated_normal([len(filter_sizes),num_classes],stddev=0.5,dtype=tf.float32),name='weights_fulcon_1')\n", 528 | "b_fc1 = 
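The essential step inside `BatchGenerator.generate_batch()` above is turning each padded question into a `[sent_length, vocabulary_size]` one-hot matrix. A standalone sketch with a made-up miniature vocabulary (the real vocabulary has 3,369 entries):

```
import numpy as np

dictionary = {'PAD': 0, '?': 1, 'what': 2, 'is': 3, 'the': 4,
              'full': 5, 'form': 6, 'of': 7, '.com': 8, 'exp': 9}
question = ['exp', 'what', 'is', 'the', 'full', 'form', 'of', '.com', '?', 'PAD', 'PAD', 'PAD']

sent_length, vocabulary_size = len(question), len(dictionary)
one_hot = np.zeros((sent_length, vocabulary_size), dtype=np.float32)
for wi, word in enumerate(question):
    one_hot[wi, dictionary[word]] = 1.0

# Recovering the word IDs with argmax, as the notebook's sanity check does
assert np.all(np.argmax(one_hot, axis=1) == np.array([dictionary[w] for w in question]))
print(one_hot.shape)   # (12, 10)
```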
tf.Variable(tf.random_uniform([num_classes],0,0.01,dtype=tf.float32),name='bias_fulcon_1')" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "## Defining Inference of the CNN\n", 536 | "Here we define the CNN inference logic. First compute the convolution output for each parallel layer within the convolution layer. Then perform pooling-over-time over all the convolution outputs. Finally feed the output of the pooling layer to a fully connected layer to obtain the output logits." 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 10, 542 | "metadata": { 543 | "collapsed": true 544 | }, 545 | "outputs": [], 546 | "source": [ 547 | "# Calculate the output for all the filters with a stride 1\n", 548 | "# We use relu activation as the activation function\n", 549 | "h1_1 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w1,stride=1,padding='SAME') + b1)\n", 550 | "h1_2 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w2,stride=1,padding='SAME') + b2)\n", 551 | "h1_3 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w3,stride=1,padding='SAME') + b3)\n", 552 | "\n", 553 | "# Pooling over time operation\n", 554 | "\n", 555 | "# This is doing the max pooling. Thereare two options to do the max pooling\n", 556 | "# 1. Use tf.nn.max_pool operation on a tensor made by concatenating h1_1,h1_2,h1_3 and converting that tensor to 4D\n", 557 | "# (Because max_pool takes a tensor of rank >= 4 )\n", 558 | "# 2. Do the max pooling separately for each filter output and combine them using tf.concat \n", 559 | "# (this is the one used in the code)\n", 560 | "\n", 561 | "h2_1 = tf.reduce_max(h1_1,axis=1)\n", 562 | "h2_2 = tf.reduce_max(h1_2,axis=1)\n", 563 | "h2_3 = tf.reduce_max(h1_3,axis=1)\n", 564 | "\n", 565 | "h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)\n", 566 | "\n", 567 | "# Calculate the fully connected layer output (no activation)\n", 568 | "# Note: since h2 is 2d [batch_size,number of parallel filters] \n", 569 | "# reshaping the output is not required as it usually do in CNNs\n", 570 | "logits = tf.matmul(h2,w_fc1) + b_fc1" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "## Model Loss and the Optimizer\n", 578 | "We compute the cross entropy loss and use the momentum optimizer (which works better than standard gradient descent) to optimize our model" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 11, 584 | "metadata": { 585 | "collapsed": true 586 | }, 587 | "outputs": [], 588 | "source": [ 589 | "# Loss (Cross-Entropy)\n", 590 | "loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=sent_labels,logits=logits))\n", 591 | "\n", 592 | "# Momentum Optimizer\n", 593 | "optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "## Model Predictions\n", 601 | "Note that we are not getting the raw predictions, but the index of the maximally activated element in the prediction vector." 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 12, 607 | "metadata": { 608 | "collapsed": true 609 | }, 610 | "outputs": [], 611 | "source": [ 612 | "predictions = tf.argmax(tf.nn.softmax(logits),axis=1)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "## Running Our Model to Classify Sentences\n", 620 | "\n", 621 | "Below we run our algorithm for 50 epochs. 
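Before the training run, it helps to trace the tensor shapes through the inference code above. The sketch below fakes the three convolution outputs with random values (only the shapes matter) and omits the biases; it is an illustration, not the notebook's code.

```
import numpy as np

batch_size, sent_length, num_classes = 32, 33, 6

# Each parallel conv1d ('SAME' padding, stride 1, one output channel)
# returns a [batch_size, sent_length, 1] tensor.
h1_1, h1_2, h1_3 = (np.random.randn(batch_size, sent_length, 1) for _ in range(3))

# Pooling over time: the maximum activation across the sentence (axis 1)
h2_1, h2_2, h2_3 = (h.max(axis=1) for h in (h1_1, h1_2, h1_3))   # each [batch_size, 1]
h2 = np.concatenate([h2_1, h2_2, h2_3], axis=1)                  # [batch_size, 3]

# Fully connected layer: one weight per parallel filter, per class (bias omitted)
w_fc1 = np.random.randn(3, num_classes)
logits = h2 @ w_fc1                                              # [batch_size, num_classes]
print(h2.shape, logits.shape)   # (32, 3) (32, 6)
```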
With the provided hyperparameters you should achieve around 90% accuracy on the test set. However you are welcome to play around with the hyperparameters." 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 13, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "name": "stdout", 631 | "output_type": "stream", 632 | "text": [ 633 | "Initialized\n", 634 | "\n", 635 | "Train Loss at Epoch 0: 1.75\n", 636 | "Test accuracy at Epoch 0: 13.333\n", 637 | "Train Loss at Epoch 1: 1.69\n", 638 | "Test accuracy at Epoch 1: 13.333\n", 639 | "Train Loss at Epoch 2: 1.63\n", 640 | "Test accuracy at Epoch 2: 26.875\n", 641 | "Train Loss at Epoch 3: 1.58\n", 642 | "Test accuracy at Epoch 3: 28.542\n", 643 | "Train Loss at Epoch 4: 1.53\n", 644 | "Test accuracy at Epoch 4: 30.417\n", 645 | "Train Loss at Epoch 5: 1.49\n", 646 | "Test accuracy at Epoch 5: 34.792\n", 647 | "Train Loss at Epoch 6: 1.45\n", 648 | "Test accuracy at Epoch 6: 40.833\n", 649 | "Train Loss at Epoch 7: 1.42\n", 650 | "Test accuracy at Epoch 7: 45.625\n", 651 | "Train Loss at Epoch 8: 1.39\n", 652 | "Test accuracy at Epoch 8: 47.083\n", 653 | "Train Loss at Epoch 9: 1.37\n", 654 | "Test accuracy at Epoch 9: 48.542\n", 655 | "Train Loss at Epoch 10: 1.34\n", 656 | "Test accuracy at Epoch 10: 48.750\n", 657 | "Train Loss at Epoch 11: 1.33\n", 658 | "Test accuracy at Epoch 11: 48.750\n", 659 | "Train Loss at Epoch 12: 1.29\n", 660 | "Test accuracy at Epoch 12: 50.208\n", 661 | "Train Loss at Epoch 13: 1.27\n", 662 | "Test accuracy at Epoch 13: 53.958\n", 663 | "Train Loss at Epoch 14: 1.23\n", 664 | "Test accuracy at Epoch 14: 57.708\n", 665 | "Train Loss at Epoch 15: 1.16\n", 666 | "Test accuracy at Epoch 15: 61.667\n", 667 | "Train Loss at Epoch 16: 1.10\n", 668 | "Test accuracy at Epoch 16: 65.208\n", 669 | "Train Loss at Epoch 17: 1.04\n", 670 | "Test accuracy at Epoch 17: 64.375\n", 671 | "Train Loss at Epoch 18: 0.98\n", 672 | "Test accuracy at Epoch 18: 63.750\n", 673 | "Train Loss at Epoch 19: 0.93\n", 674 | "Test accuracy at Epoch 19: 63.125\n", 675 | "Train Loss at Epoch 20: 0.88\n", 676 | "Test accuracy at Epoch 20: 63.333\n", 677 | "Train Loss at Epoch 21: 0.83\n", 678 | "Test accuracy at Epoch 21: 63.333\n", 679 | "Train Loss at Epoch 22: 0.80\n", 680 | "Test accuracy at Epoch 22: 63.542\n", 681 | "Train Loss at Epoch 23: 0.77\n", 682 | "Test accuracy at Epoch 23: 65.000\n", 683 | "Train Loss at Epoch 24: 0.74\n", 684 | "Test accuracy at Epoch 24: 69.583\n", 685 | "Train Loss at Epoch 25: 0.70\n", 686 | "Test accuracy at Epoch 25: 72.500\n", 687 | "Train Loss at Epoch 26: 0.67\n", 688 | "Test accuracy at Epoch 26: 75.208\n", 689 | "Train Loss at Epoch 27: 0.64\n", 690 | "Test accuracy at Epoch 27: 76.667\n", 691 | "Train Loss at Epoch 28: 0.61\n", 692 | "Test accuracy at Epoch 28: 78.125\n", 693 | "Train Loss at Epoch 29: 0.58\n", 694 | "Test accuracy at Epoch 29: 80.417\n", 695 | "Train Loss at Epoch 30: 0.55\n", 696 | "Test accuracy at Epoch 30: 82.083\n", 697 | "Train Loss at Epoch 31: 0.53\n", 698 | "Test accuracy at Epoch 31: 83.125\n", 699 | "Train Loss at Epoch 32: 0.51\n", 700 | "Test accuracy at Epoch 32: 83.542\n", 701 | "Train Loss at Epoch 33: 0.48\n", 702 | "Test accuracy at Epoch 33: 84.167\n", 703 | "Train Loss at Epoch 34: 0.47\n", 704 | "Test accuracy at Epoch 34: 85.000\n", 705 | "Train Loss at Epoch 35: 0.44\n", 706 | "Test accuracy at Epoch 35: 85.417\n", 707 | "Train Loss at Epoch 36: 0.43\n", 708 | "Test accuracy at Epoch 36: 85.625\n", 709 | "Train Loss at Epoch 37: 
0.42\n", 710 | "Test accuracy at Epoch 37: 85.833\n", 711 | "Train Loss at Epoch 38: 0.41\n", 712 | "Test accuracy at Epoch 38: 86.667\n", 713 | "Train Loss at Epoch 39: 0.39\n", 714 | "Test accuracy at Epoch 39: 87.292\n", 715 | "Train Loss at Epoch 40: 0.38\n", 716 | "Test accuracy at Epoch 40: 87.292\n", 717 | "Train Loss at Epoch 41: 0.36\n", 718 | "Test accuracy at Epoch 41: 87.500\n", 719 | "Train Loss at Epoch 42: 0.36\n", 720 | "Test accuracy at Epoch 42: 87.917\n", 721 | "Train Loss at Epoch 43: 0.34\n", 722 | "Test accuracy at Epoch 43: 88.542\n", 723 | "Train Loss at Epoch 44: 0.33\n", 724 | "Test accuracy at Epoch 44: 88.542\n", 725 | "Train Loss at Epoch 45: 0.32\n", 726 | "Test accuracy at Epoch 45: 88.542\n", 727 | "Train Loss at Epoch 46: 0.32\n", 728 | "Test accuracy at Epoch 46: 88.333\n", 729 | "Train Loss at Epoch 47: 0.31\n", 730 | "Test accuracy at Epoch 47: 88.542\n", 731 | "Train Loss at Epoch 48: 0.30\n", 732 | "Test accuracy at Epoch 48: 88.542\n", 733 | "Train Loss at Epoch 49: 0.29\n", 734 | "Test accuracy at Epoch 49: 88.750\n" 735 | ] 736 | } 737 | ], 738 | "source": [ 739 | "# With filter widths [3,5,7] and batch_size 32 the algorithm \n", 740 | "# achieves around ~90% accuracy on test dataset (50 epochs). \n", 741 | "# From batch sizes [16,32,64] I found 32 to give best performance\n", 742 | "\n", 743 | "session = tf.InteractiveSession()\n", 744 | "\n", 745 | "num_steps = 50 # Number of epochs the algorithm runs for\n", 746 | "\n", 747 | "# Initialize all variables\n", 748 | "tf.global_variables_initializer().run()\n", 749 | "print('Initialized\\n')\n", 750 | "\n", 751 | "# Define data batch generators for train and test data\n", 752 | "train_gen = BatchGenerator(batch_size,train_questions,train_labels)\n", 753 | "test_gen = BatchGenerator(batch_size,test_questions,test_labels)\n", 754 | "\n", 755 | "# How often do we compute the test accuracy\n", 756 | "test_interval = 1\n", 757 | "\n", 758 | "# Compute accuracy for a given set of predictions and labels\n", 759 | "def accuracy(labels,preds):\n", 760 | " return np.sum(np.argmax(labels,axis=1)==preds)/labels.shape[0]\n", 761 | "\n", 762 | "# Running the algorithm\n", 763 | "for step in range(num_steps):\n", 764 | " avg_loss = []\n", 765 | " \n", 766 | " # A single traverse through the whole training set\n", 767 | " for tr_i in range((len(train_questions)//batch_size)-1):\n", 768 | " # Get a batch of data\n", 769 | " tr_inputs, tr_labels = train_gen.generate_batch()\n", 770 | " # Optimize the network and compute the loss\n", 771 | " l,_ = session.run([loss,optimizer],feed_dict={sent_inputs: tr_inputs, sent_labels: tr_labels})\n", 772 | " avg_loss.append(l)\n", 773 | "\n", 774 | " # Print average loss\n", 775 | " print('Train Loss at Epoch %d: %.2f'%(step,np.mean(avg_loss)))\n", 776 | " test_accuracy = []\n", 777 | " \n", 778 | " # Compute the test accuracy\n", 779 | " if (step+1)%test_interval==0: \n", 780 | " for ts_i in range((len(test_questions)-1)//batch_size):\n", 781 | " # Get a batch of test data\n", 782 | " ts_inputs,ts_labels = test_gen.generate_batch()\n", 783 | " # Get predictions for that batch\n", 784 | " preds = session.run(predictions,feed_dict={sent_inputs: ts_inputs, sent_labels: ts_labels})\n", 785 | " # Compute test accuracy\n", 786 | " test_accuracy.append(accuracy(ts_labels,preds))\n", 787 | " \n", 788 | " # Display the mean test accuracy\n", 789 | " print('Test accuracy at Epoch %d: %.3f'%(step,np.mean(test_accuracy)*100.0))" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | 
"execution_count": null, 795 | "metadata": { 796 | "collapsed": true 797 | }, 798 | "outputs": [], 799 | "source": [] 800 | } 801 | ], 802 | "metadata": { 803 | "kernelspec": { 804 | "display_name": "Python 3", 805 | "language": "python", 806 | "name": "python3" 807 | }, 808 | "language_info": { 809 | "codemirror_mode": { 810 | "name": "ipython", 811 | "version": 3 812 | }, 813 | "file_extension": ".py", 814 | "mimetype": "text/x-python", 815 | "name": "python", 816 | "nbconvert_exporter": "python", 817 | "pygments_lexer": "ipython3", 818 | "version": "3.5.2" 819 | } 820 | }, 821 | "nbformat": 4, 822 | "nbformat_minor": 2 823 | } 824 | -------------------------------------------------------------------------------- /ch8/embeddings.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-TensorFlow/1d432b7e6fceb7819a60c9fd29560c864633a25b/ch8/embeddings.npy -------------------------------------------------------------------------------- /ch8/word2vec.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | import collections 4 | import random 5 | import math 6 | 7 | data_indices = None 8 | data_list = None 9 | reverse_dictionary = None 10 | embedding_size = None 11 | vocabulary_size = None 12 | num_files = None 13 | def define_data_and_hyperparameters(_num_files,_data_list, _reverse_dictionary, _emb_size, _vocab_size): 14 | global num_files, data_indices, data_list, reverse_dictionary 15 | global embedding_size, vocabulary_size 16 | 17 | num_files = _num_files 18 | data_indices = [0 for _ in range(num_files)] 19 | data_list = _data_list 20 | reverse_dictionary = _reverse_dictionary 21 | embedding_size = _emb_size 22 | vocabulary_size = _vocab_size 23 | 24 | 25 | def generate_batch_for_word2vec(data_list, doc_id, batch_size, window_size): 26 | # window_size is the amount of words we're looking at from each side of a given word 27 | # creates a single batch 28 | # doc_id is the ID of the story we want to extract a batch from 29 | 30 | # data_indices[doc_id] is updated by 1 everytime we read a set of data point 31 | # from the document identified by doc_id 32 | global data_indices 33 | 34 | # span defines the total window size, where 35 | # data we consider at an instance looks as follows. 
36 | # [ skip_window target skip_window ] 37 | # e.g if skip_window = 2 then span = 5 38 | span = 2 * window_size + 1 39 | 40 | # two numpy arras to hold target words (batch) 41 | # and context words (labels) 42 | # Note that batch has span-1=2*window_size columns 43 | batch = np.ndarray(shape=(batch_size,span-1), dtype=np.int32) 44 | labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) 45 | 46 | # The buffer holds the data contained within the span 47 | buffer = collections.deque(maxlen=span) 48 | 49 | # Fill the buffer and update the data_index 50 | for _ in range(span): 51 | buffer.append(data_list[doc_id][data_indices[doc_id]]) 52 | data_indices[doc_id] = (data_indices[doc_id] + 1) % len(data_list[doc_id]) 53 | 54 | # Here we do the batch reading 55 | # We iterate through each batch index 56 | # For each batch index, we iterate through span elements 57 | # to fill in the columns of batch array 58 | for i in range(batch_size): 59 | target = window_size # target label at the center of the buffer 60 | target_to_avoid = [ window_size ] # we only need to know the words around a given word, not the word itself 61 | 62 | # add selected target to avoid_list for next time 63 | col_idx = 0 64 | for j in range(span): 65 | # ignore the target word when creating the batch 66 | if j==span//2: 67 | continue 68 | batch[i,col_idx] = buffer[j] 69 | col_idx += 1 70 | labels[i, 0] = buffer[target] 71 | 72 | # Everytime we read a data point, 73 | # we need to move the span by 1 74 | # to update the span 75 | buffer.append(data_list[doc_id][data_indices[doc_id]]) 76 | data_indices[doc_id] = (data_indices[doc_id] + 1) % len(data_list[doc_id]) 77 | 78 | assert batch.shape[0]==batch_size and batch.shape[1]== span-1 79 | return batch, labels 80 | 81 | def print_some_batches(): 82 | global num_files, data_list, reverse_dictionary 83 | 84 | for window_size in [1,2]: 85 | data_indices = [0 for _ in range(num_files)] 86 | batch, labels = generate_batch_for_word2vec(data_list, doc_id=0, batch_size=8, window_size=window_size) 87 | print('\nwith window_size = %d:' % (window_size)) 88 | print(' batch:', [[reverse_dictionary[bii] for bii in bi] for bi in batch]) 89 | print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)]) 90 | 91 | batch_size, embedding_size, window_size = None, None, None 92 | valid_size, valid_window, valid_examples = None, None, None 93 | num_sampled = None 94 | 95 | train_dataset, train_labels = None, None 96 | valid_dataset = None 97 | 98 | softmax_weights, softmax_biases = None, None 99 | 100 | loss, optimizer, similarity, normalized_embeddings = None, None, None, None 101 | 102 | def define_word2vec_tensorflow(): 103 | global batch_size, embedding_size, window_size 104 | global valid_size, valid_window, valid_examples 105 | global num_sampled 106 | global train_dataset, train_labels 107 | global valid_dataset 108 | global softmax_weights, softmax_biases 109 | global loss, optimizer, similarity 110 | global vocabulary_size, embedding_size 111 | global normalized_embeddings 112 | 113 | batch_size = 128 # Data points in a single batch 114 | 115 | # How many words to consider left and right. 116 | # Skip gram by design does not require to have all the context words in a given step 117 | # However, for CBOW that's a requirement, so we limit the window size 118 | window_size = 3 119 | 120 | # We pick a random validation set to sample nearest neighbors 121 | valid_size = 16 # Random set of words to evaluate similarity on. 
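To see what `generate_batch_for_word2vec()` produces, here is a trimmed, self-contained re-run of the same sliding-window logic on a toy document of word IDs (window_size = 1, so each batch row holds the two context IDs and the label is the centre word). This is an illustration of the function above, not part of the module.

```
import collections
import numpy as np

doc = [10, 11, 12, 13, 14, 15]           # a toy "document" already mapped to word IDs
window_size, batch_size = 1, 4
span = 2 * window_size + 1                # context words on both sides plus the target

batch = np.ndarray((batch_size, span - 1), dtype=np.int32)
labels = np.ndarray((batch_size, 1), dtype=np.int32)
buffer = collections.deque(maxlen=span)
data_index = 0

for _ in range(span):                     # prime the buffer with the first span words
    buffer.append(doc[data_index])
    data_index += 1

for i in range(batch_size):
    batch[i] = [buffer[j] for j in range(span) if j != span // 2]   # context word IDs
    labels[i, 0] = buffer[span // 2]                                # the centre word
    buffer.append(doc[data_index % len(doc)])                       # slide the window by one
    data_index += 1

print(batch)      # rows: [10 12], [11 13], [12 14], [13 15]
print(labels.T)   # [[11 12 13 14]]
```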
122 | # We sample valid datapoints randomly from a large window without always being deterministic 123 | valid_window = 50 124 | 125 | # When selecting valid examples, we select some of the most frequent words as well as 126 | # some moderately rare words as well 127 | valid_examples = np.array(random.sample(range(valid_window), valid_size)) 128 | valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0) 129 | 130 | num_sampled = 32 # Number of negative examples to sample. 131 | 132 | tf.reset_default_graph() 133 | 134 | # Training input data (target word IDs). Note that it has 2*window_size columns 135 | train_dataset = tf.placeholder(tf.int32, shape=[batch_size,2*window_size]) 136 | # Training input label data (context word IDs) 137 | train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) 138 | # Validation input data, we don't need a placeholder 139 | # as we have already defined the IDs of the words selected 140 | # as validation data 141 | valid_dataset = tf.constant(valid_examples, dtype=tf.int32) 142 | 143 | # Variables. 144 | 145 | # Embedding layer, contains the word embeddings 146 | embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0,dtype=tf.float32)) 147 | 148 | # Softmax Weights and Biases 149 | softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], 150 | stddev=0.5 / math.sqrt(embedding_size),dtype=tf.float32)) 151 | softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01)) 152 | 153 | # Model. 154 | # Look up embeddings for a batch of inputs. 155 | # Here we do embedding lookups for each column in the input placeholder 156 | # and then average them to produce an embedding_size word vector 157 | stacked_embedings = None 158 | print('Defining %d embedding lookups representing each word in the context'%(2*window_size)) 159 | for i in range(2*window_size): 160 | embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:,i]) 161 | x_size,y_size = embedding_i.get_shape().as_list() 162 | if stacked_embedings is None: 163 | stacked_embedings = tf.reshape(embedding_i,[x_size,y_size,1]) 164 | else: 165 | stacked_embedings = tf.concat(axis=2,values=[stacked_embedings,tf.reshape(embedding_i,[x_size,y_size,1])]) 166 | 167 | assert stacked_embedings.get_shape().as_list()[2]==2*window_size 168 | print("Stacked embedding size: %s"%stacked_embedings.get_shape().as_list()) 169 | mean_embeddings = tf.reduce_mean(stacked_embedings,2,keepdims=False) 170 | print("Reduced mean embedding size: %s"%mean_embeddings.get_shape().as_list()) 171 | 172 | 173 | # Compute the softmax loss, using a sample of the negative labels each time. 174 | # inputs are embeddings of the train words 175 | # with this loss we optimize weights, biases, embeddings 176 | loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=mean_embeddings, 177 | labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)) 178 | # AdamOptimizer. 179 | optimizer = tf.train.AdamOptimizer(0.0005).minimize(loss) 180 | 181 | # Compute the similarity between minibatch examples and all embeddings. 
182 | # We use the cosine distance: 183 | norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True)) 184 | normalized_embeddings = embeddings / norm 185 | valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 186 | similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings)) 187 | 188 | 189 | def run_word2vec(): 190 | global batch_size, embedding_size, window_size 191 | global valid_size, valid_window, valid_examples 192 | global num_sampled 193 | global train_dataset, train_labels 194 | global valid_dataset 195 | global softmax_weights, softmax_biases 196 | global loss, optimizer, similarity, normalized_embeddings 197 | global data_list, num_files, reverse_dictionary 198 | global vocabulary_size, embedding_size 199 | 200 | num_steps = 10 201 | steps_per_doc = 100 202 | 203 | session = tf.InteractiveSession() 204 | 205 | # Initialize the variables in the graph 206 | tf.global_variables_initializer().run() 207 | print('Initialized') 208 | 209 | average_loss = 0 210 | 211 | for step in range(num_steps): 212 | 213 | # Iterate through the documents in a random order 214 | for doc_id in np.random.permutation(num_files): 215 | for doc_step in range(steps_per_doc): 216 | 217 | # Generate a single batch of data from a document 218 | batch_data, batch_labels = generate_batch_for_word2vec(data_list, doc_id, batch_size, window_size) 219 | 220 | # Populate the feed_dict and run the optimizer (minimize loss) 221 | # and compute the loss 222 | feed_dict = {train_dataset : batch_data, train_labels : batch_labels} 223 | _, l = session.run([optimizer, loss], feed_dict=feed_dict) 224 | 225 | average_loss += l 226 | 227 | if (step+1) % 1 == 0: 228 | if step > 0: 229 | # compute the average loss over all batches seen in this step 230 | average_loss = average_loss / (num_files*steps_per_doc) 231 | 232 | print('Average loss at step %d: %f' % (step+1, average_loss)) 233 | average_loss = 0 # reset average loss 234 | 235 | # Evaluating validation set word similarities 236 | if (step+1) % 5 == 0: 237 | sim = similarity.eval() 238 | 239 | # Here we compute the top_k closest words for a given validation word 240 | # in terms of the cosine distance 241 | # We do this for all the words in the validation set 242 | # Note: This is an expensive step 243 | for i in range(valid_size): 244 | valid_word = reverse_dictionary[valid_examples[i]] 245 | top_k = 4 # number of nearest neighbors 246 | nearest = (-sim[i, :]).argsort()[1:top_k+1] 247 | log = 'Nearest to %s:' % valid_word 248 | for k in range(top_k): 249 | close_word = reverse_dictionary[nearest[k]] 250 | log = '%s %s,' % (log, close_word) 251 | print(log) 252 | cbow_final_embeddings = normalized_embeddings.eval() 253 | 254 | # We save the embeddings as embeddings.npy 255 | np.save('embeddings',cbow_final_embeddings) -------------------------------------------------------------------------------- /ch9/correct_spellings.py: -------------------------------------------------------------------------------- 1 | from difflib import SequenceMatcher 2 | 3 | def string_similarity(a, b): 4 | return SequenceMatcher(None, a, b).ratio() 5 | 6 | def correct_wrong_word(cw,gw,cap): 7 | ''' 8 | Spelling correction logic 9 | This is a very simple logic that replaces 10 | words with incorrect spelling with the word that has the highest 11 | similarity. Some words are manually corrected as the words 12 | found to be most similar semantically did not match.
13 | ''' 14 | correct_word = None 15 | found_similar_word = False 16 | sim = string_similarity(gw,cw) 17 | if sim>0.9: 18 | if cw != 'stting' and cw != 'sittign' and cw != 'smilling' and \ 19 | cw!='skiies' and cw!='childi' and cw!='sittion' and cw!='peacefuly' and cw!='stainding' and\ 20 | cw != 'staning' and cw!='lating' and cw!='sking' and cw!='trolly' and cw!='umping' and cw!='earing' and \ 21 | cw !='baters' and cw !='talkes' and cw !='trowing' and cw !='convered' and cw !='onsie' and cw !='slying': 22 | print(gw,' ',cw,' ',sim,' (',cap,')') 23 | correct_word = gw 24 | found_similar_word = True 25 | elif cw == 'stting' or cw == 'sittign' or cw == 'sittion': 26 | correct_word = 'sitting' 27 | found_similar_word = True 28 | elif cw == 'smilling': 29 | correct_word = 'smiling' 30 | found_similar_word = True 31 | elif cw == 'skiies': 32 | correct_word = 'skis' 33 | found_similar_word = True 34 | elif cw == 'childi': 35 | correct_word = 'child' 36 | found_similar_word = True 37 | elif cw == 'peacefuly': 38 | correct_word = 'peacefully' 39 | found_similar_word = True 40 | elif cw == 'stainding' or cw == 'staning': 41 | correct_word = 'standing' 42 | found_similar_word = True 43 | elif cw == 'lating': 44 | correct_word = 'laying' 45 | found_similar_word = True 46 | elif cw == 'sking': 47 | correct_word = 'skiing' 48 | found_similar_word = True 49 | elif cw == 'trolly': 50 | correct_word = 'trolley' 51 | found_similar_word = True 52 | elif cw == 'umping': 53 | correct_word = 'jumping' 54 | found_similar_word = True 55 | elif cw == 'earing': 56 | correct_word = 'eating' 57 | found_similar_word = True 58 | elif cw == 'baters': 59 | correct_word = 'batters' 60 | found_similar_word = True 61 | elif cw == 'talkes': 62 | correct_word = 'talks' 63 | found_similar_word = True 64 | elif cw == 'trowing': 65 | correct_word = 'throwing' 66 | found_similar_word = True 67 | elif cw =='convered': 68 | correct_word = 'covered' 69 | found_similar_word = True 70 | elif cw == 'onsie': 71 | correct_word = cw 72 | found_similar_word = True 73 | elif cw =='slying': 74 | correct_word = 'flying' 75 | found_similar_word = True 76 | else: 77 | raise NotImplementedError 78 | else: 79 | correct_word = cw 80 | found_similar_word = False 81 | 82 | return correct_word, found_similar_word -------------------------------------------------------------------------------- /ch9/image_caption_data/class_names.txt: -------------------------------------------------------------------------------- 1 | tench, Tinca tinca 2 | goldfish, Carassius auratus 3 | great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias 4 | tiger shark, Galeocerdo cuvieri 5 | hammerhead, hammerhead shark 6 | electric ray, crampfish, numbfish, torpedo 7 | stingray 8 | cock 9 | hen 10 | ostrich, Struthio camelus 11 | brambling, Fringilla montifringilla 12 | goldfinch, Carduelis carduelis 13 | house finch, linnet, Carpodacus mexicanus 14 | junco, snowbird 15 | indigo bunting, indigo finch, indigo bird, Passerina cyanea 16 | robin, American robin, Turdus migratorius 17 | bulbul 18 | jay 19 | magpie 20 | chickadee 21 | water ouzel, dipper 22 | kite 23 | bald eagle, American eagle, Haliaeetus leucocephalus 24 | vulture 25 | great grey owl, great gray owl, Strix nebulosa 26 | European fire salamander, Salamandra salamandra 27 | common newt, Triturus vulgaris 28 | eft 29 | spotted salamander, Ambystoma maculatum 30 | axolotl, mud puppy, Ambystoma mexicanum 31 | bullfrog, Rana catesbeiana 32 | tree frog, tree-frog 33 | tailed frog, bell 
toad, ribbed toad, tailed toad, Ascaphus trui 34 | loggerhead, loggerhead turtle, Caretta caretta 35 | leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea 36 | mud turtle 37 | terrapin 38 | box turtle, box tortoise 39 | banded gecko 40 | common iguana, iguana, Iguana iguana 41 | American chameleon, anole, Anolis carolinensis 42 | whiptail, whiptail lizard 43 | agama 44 | frilled lizard, Chlamydosaurus kingi 45 | alligator lizard 46 | Gila monster, Heloderma suspectum 47 | green lizard, Lacerta viridis 48 | African chameleon, Chamaeleo chamaeleon 49 | Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis 50 | African crocodile, Nile crocodile, Crocodylus niloticus 51 | American alligator, Alligator mississipiensis 52 | triceratops 53 | thunder snake, worm snake, Carphophis amoenus 54 | ringneck snake, ring-necked snake, ring snake 55 | hognose snake, puff adder, sand viper 56 | green snake, grass snake 57 | king snake, kingsnake 58 | garter snake, grass snake 59 | water snake 60 | vine snake 61 | night snake, Hypsiglena torquata 62 | boa constrictor, Constrictor constrictor 63 | rock python, rock snake, Python sebae 64 | Indian cobra, Naja naja 65 | green mamba 66 | sea snake 67 | horned viper, cerastes, sand viper, horned asp, Cerastes cornutus 68 | diamondback, diamondback rattlesnake, Crotalus adamanteus 69 | sidewinder, horned rattlesnake, Crotalus cerastes 70 | trilobite 71 | harvestman, daddy longlegs, Phalangium opilio 72 | scorpion 73 | black and gold garden spider, Argiope aurantia 74 | barn spider, Araneus cavaticus 75 | garden spider, Aranea diademata 76 | black widow, Latrodectus mactans 77 | tarantula 78 | wolf spider, hunting spider 79 | tick 80 | centipede 81 | black grouse 82 | ptarmigan 83 | ruffed grouse, partridge, Bonasa umbellus 84 | prairie chicken, prairie grouse, prairie fowl 85 | peacock 86 | quail 87 | partridge 88 | African grey, African gray, Psittacus erithacus 89 | macaw 90 | sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita 91 | lorikeet 92 | coucal 93 | bee eater 94 | hornbill 95 | hummingbird 96 | jacamar 97 | toucan 98 | drake 99 | red-breasted merganser, Mergus serrator 100 | goose 101 | black swan, Cygnus atratus 102 | tusker 103 | echidna, spiny anteater, anteater 104 | platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus 105 | wallaby, brush kangaroo 106 | koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus 107 | wombat 108 | jellyfish 109 | sea anemone, anemone 110 | brain coral 111 | flatworm, platyhelminth 112 | nematode, nematode worm, roundworm 113 | conch 114 | snail 115 | slug 116 | sea slug, nudibranch 117 | chiton, coat-of-mail shell, sea cradle, polyplacophore 118 | chambered nautilus, pearly nautilus, nautilus 119 | Dungeness crab, Cancer magister 120 | rock crab, Cancer irroratus 121 | fiddler crab 122 | king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica 123 | American lobster, Northern lobster, Maine lobster, Homarus americanus 124 | spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish 125 | crayfish, crawfish, crawdad, crawdaddy 126 | hermit crab 127 | isopod 128 | white stork, Ciconia ciconia 129 | black stork, Ciconia nigra 130 | spoonbill 131 | flamingo 132 | little blue heron, Egretta caerulea 133 | American egret, great white heron, Egretta albus 134 | bittern 135 | crane 136 | limpkin, Aramus pictus 137 | European gallinule, Porphyrio porphyrio 138 | American coot, marsh 
hen, mud hen, water hen, Fulica americana 139 | bustard 140 | ruddy turnstone, Arenaria interpres 141 | red-backed sandpiper, dunlin, Erolia alpina 142 | redshank, Tringa totanus 143 | dowitcher 144 | oystercatcher, oyster catcher 145 | pelican 146 | king penguin, Aptenodytes patagonica 147 | albatross, mollymawk 148 | grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus 149 | killer whale, killer, orca, grampus, sea wolf, Orcinus orca 150 | dugong, Dugong dugon 151 | sea lion 152 | Chihuahua 153 | Japanese spaniel 154 | Maltese dog, Maltese terrier, Maltese 155 | Pekinese, Pekingese, Peke 156 | Shih-Tzu 157 | Blenheim spaniel 158 | papillon 159 | toy terrier 160 | Rhodesian ridgeback 161 | Afghan hound, Afghan 162 | basset, basset hound 163 | beagle 164 | bloodhound, sleuthhound 165 | bluetick 166 | black-and-tan coonhound 167 | Walker hound, Walker foxhound 168 | English foxhound 169 | redbone 170 | borzoi, Russian wolfhound 171 | Irish wolfhound 172 | Italian greyhound 173 | whippet 174 | Ibizan hound, Ibizan Podenco 175 | Norwegian elkhound, elkhound 176 | otterhound, otter hound 177 | Saluki, gazelle hound 178 | Scottish deerhound, deerhound 179 | Weimaraner 180 | Staffordshire bullterrier, Staffordshire bull terrier 181 | American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier 182 | Bedlington terrier 183 | Border terrier 184 | Kerry blue terrier 185 | Irish terrier 186 | Norfolk terrier 187 | Norwich terrier 188 | Yorkshire terrier 189 | wire-haired fox terrier 190 | Lakeland terrier 191 | Sealyham terrier, Sealyham 192 | Airedale, Airedale terrier 193 | cairn, cairn terrier 194 | Australian terrier 195 | Dandie Dinmont, Dandie Dinmont terrier 196 | Boston bull, Boston terrier 197 | miniature schnauzer 198 | giant schnauzer 199 | standard schnauzer 200 | Scotch terrier, Scottish terrier, Scottie 201 | Tibetan terrier, chrysanthemum dog 202 | silky terrier, Sydney silky 203 | soft-coated wheaten terrier 204 | West Highland white terrier 205 | Lhasa, Lhasa apso 206 | flat-coated retriever 207 | curly-coated retriever 208 | golden retriever 209 | Labrador retriever 210 | Chesapeake Bay retriever 211 | German short-haired pointer 212 | vizsla, Hungarian pointer 213 | English setter 214 | Irish setter, red setter 215 | Gordon setter 216 | Brittany spaniel 217 | clumber, clumber spaniel 218 | English springer, English springer spaniel 219 | Welsh springer spaniel 220 | cocker spaniel, English cocker spaniel, cocker 221 | Sussex spaniel 222 | Irish water spaniel 223 | kuvasz 224 | schipperke 225 | groenendael 226 | malinois 227 | briard 228 | kelpie 229 | komondor 230 | Old English sheepdog, bobtail 231 | Shetland sheepdog, Shetland sheep dog, Shetland 232 | collie 233 | Border collie 234 | Bouvier des Flandres, Bouviers des Flandres 235 | Rottweiler 236 | German shepherd, German shepherd dog, German police dog, alsatian 237 | Doberman, Doberman pinscher 238 | miniature pinscher 239 | Greater Swiss Mountain dog 240 | Bernese mountain dog 241 | Appenzeller 242 | EntleBucher 243 | boxer 244 | bull mastiff 245 | Tibetan mastiff 246 | French bulldog 247 | Great Dane 248 | Saint Bernard, St Bernard 249 | Eskimo dog, husky 250 | malamute, malemute, Alaskan malamute 251 | Siberian husky 252 | dalmatian, coach dog, carriage dog 253 | affenpinscher, monkey pinscher, monkey dog 254 | basenji 255 | pug, pug-dog 256 | Leonberg 257 | Newfoundland, Newfoundland dog 258 | Great Pyrenees 259 | Samoyed, Samoyede 260 | Pomeranian 261 | 
chow, chow chow 262 | keeshond 263 | Brabancon griffon 264 | Pembroke, Pembroke Welsh corgi 265 | Cardigan, Cardigan Welsh corgi 266 | toy poodle 267 | miniature poodle 268 | standard poodle 269 | Mexican hairless 270 | timber wolf, grey wolf, gray wolf, Canis lupus 271 | white wolf, Arctic wolf, Canis lupus tundrarum 272 | red wolf, maned wolf, Canis rufus, Canis niger 273 | coyote, prairie wolf, brush wolf, Canis latrans 274 | dingo, warrigal, warragal, Canis dingo 275 | dhole, Cuon alpinus 276 | African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus 277 | hyena, hyaena 278 | red fox, Vulpes vulpes 279 | kit fox, Vulpes macrotis 280 | Arctic fox, white fox, Alopex lagopus 281 | grey fox, gray fox, Urocyon cinereoargenteus 282 | tabby, tabby cat 283 | tiger cat 284 | Persian cat 285 | Siamese cat, Siamese 286 | Egyptian cat 287 | cougar, puma, catamount, mountain lion, painter, panther, Felis concolor 288 | lynx, catamount 289 | leopard, Panthera pardus 290 | snow leopard, ounce, Panthera uncia 291 | jaguar, panther, Panthera onca, Felis onca 292 | lion, king of beasts, Panthera leo 293 | tiger, Panthera tigris 294 | cheetah, chetah, Acinonyx jubatus 295 | brown bear, bruin, Ursus arctos 296 | American black bear, black bear, Ursus americanus, Euarctos americanus 297 | ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus 298 | sloth bear, Melursus ursinus, Ursus ursinus 299 | mongoose 300 | meerkat, mierkat 301 | tiger beetle 302 | ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle 303 | ground beetle, carabid beetle 304 | long-horned beetle, longicorn, longicorn beetle 305 | leaf beetle, chrysomelid 306 | dung beetle 307 | rhinoceros beetle 308 | weevil 309 | fly 310 | bee 311 | ant, emmet, pismire 312 | grasshopper, hopper 313 | cricket 314 | walking stick, walkingstick, stick insect 315 | cockroach, roach 316 | mantis, mantid 317 | cicada, cicala 318 | leafhopper 319 | lacewing, lacewing fly 320 | dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk 321 | damselfly 322 | admiral 323 | ringlet, ringlet butterfly 324 | monarch, monarch butterfly, milkweed butterfly, Danaus plexippus 325 | cabbage butterfly 326 | sulphur butterfly, sulfur butterfly 327 | lycaenid, lycaenid butterfly 328 | starfish, sea star 329 | sea urchin 330 | sea cucumber, holothurian 331 | wood rabbit, cottontail, cottontail rabbit 332 | hare 333 | Angora, Angora rabbit 334 | hamster 335 | porcupine, hedgehog 336 | fox squirrel, eastern fox squirrel, Sciurus niger 337 | marmot 338 | beaver 339 | guinea pig, Cavia cobaya 340 | sorrel 341 | zebra 342 | hog, pig, grunter, squealer, Sus scrofa 343 | wild boar, boar, Sus scrofa 344 | warthog 345 | hippopotamus, hippo, river horse, Hippopotamus amphibius 346 | ox 347 | water buffalo, water ox, Asiatic buffalo, Bubalus bubalis 348 | bison 349 | ram, tup 350 | bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis 351 | ibex, Capra ibex 352 | hartebeest 353 | impala, Aepyceros melampus 354 | gazelle 355 | Arabian camel, dromedary, Camelus dromedarius 356 | llama 357 | weasel 358 | mink 359 | polecat, fitch, foulmart, foumart, Mustela putorius 360 | black-footed ferret, ferret, Mustela nigripes 361 | otter 362 | skunk, polecat, wood pussy 363 | badger 364 | armadillo 365 | three-toed sloth, ai, Bradypus tridactylus 366 | orangutan, orang, orangutang, Pongo pygmaeus 367 | gorilla, Gorilla gorilla 368 | chimpanzee, chimp, Pan troglodytes 369 | 
gibbon, Hylobates lar 370 | siamang, Hylobates syndactylus, Symphalangus syndactylus 371 | guenon, guenon monkey 372 | patas, hussar monkey, Erythrocebus patas 373 | baboon 374 | macaque 375 | langur 376 | colobus, colobus monkey 377 | proboscis monkey, Nasalis larvatus 378 | marmoset 379 | capuchin, ringtail, Cebus capucinus 380 | howler monkey, howler 381 | titi, titi monkey 382 | spider monkey, Ateles geoffroyi 383 | squirrel monkey, Saimiri sciureus 384 | Madagascar cat, ring-tailed lemur, Lemur catta 385 | indri, indris, Indri indri, Indri brevicaudatus 386 | Indian elephant, Elephas maximus 387 | African elephant, Loxodonta africana 388 | lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens 389 | giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca 390 | barracouta, snoek 391 | eel 392 | coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch 393 | rock beauty, Holocanthus tricolor 394 | anemone fish 395 | sturgeon 396 | gar, garfish, garpike, billfish, Lepisosteus osseus 397 | lionfish 398 | puffer, pufferfish, blowfish, globefish 399 | abacus 400 | abaya 401 | academic gown, academic robe, judge's robe 402 | accordion, piano accordion, squeeze box 403 | acoustic guitar 404 | aircraft carrier, carrier, flattop, attack aircraft carrier 405 | airliner 406 | airship, dirigible 407 | altar 408 | ambulance 409 | amphibian, amphibious vehicle 410 | analog clock 411 | apiary, bee house 412 | apron 413 | ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin 414 | assault rifle, assault gun 415 | backpack, back pack, knapsack, packsack, rucksack, haversack 416 | bakery, bakeshop, bakehouse 417 | balance beam, beam 418 | balloon 419 | ballpoint, ballpoint pen, ballpen, Biro 420 | Band Aid 421 | banjo 422 | bannister, banister, balustrade, balusters, handrail 423 | barbell 424 | barber chair 425 | barbershop 426 | barn 427 | barometer 428 | barrel, cask 429 | barrow, garden cart, lawn cart, wheelbarrow 430 | baseball 431 | basketball 432 | bassinet 433 | bassoon 434 | bathing cap, swimming cap 435 | bath towel 436 | bathtub, bathing tub, bath, tub 437 | beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon 438 | beacon, lighthouse, beacon light, pharos 439 | beaker 440 | bearskin, busby, shako 441 | beer bottle 442 | beer glass 443 | bell cote, bell cot 444 | bib 445 | bicycle-built-for-two, tandem bicycle, tandem 446 | bikini, two-piece 447 | binder, ring-binder 448 | binoculars, field glasses, opera glasses 449 | birdhouse 450 | boathouse 451 | bobsled, bobsleigh, bob 452 | bolo tie, bolo, bola tie, bola 453 | bonnet, poke bonnet 454 | bookcase 455 | bookshop, bookstore, bookstall 456 | bottlecap 457 | bow 458 | bow tie, bow-tie, bowtie 459 | brass, memorial tablet, plaque 460 | brassiere, bra, bandeau 461 | breakwater, groin, groyne, mole, bulwark, seawall, jetty 462 | breastplate, aegis, egis 463 | broom 464 | bucket, pail 465 | buckle 466 | bulletproof vest 467 | bullet train, bullet 468 | butcher shop, meat market 469 | cab, hack, taxi, taxicab 470 | caldron, cauldron 471 | candle, taper, wax light 472 | cannon 473 | canoe 474 | can opener, tin opener 475 | cardigan 476 | car mirror 477 | carousel, carrousel, merry-go-round, roundabout, whirligig 478 | carpenter's kit, tool kit 479 | carton 480 | car wheel 481 | cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM 482 | cassette 483 | cassette player 
484 | castle 485 | catamaran 486 | CD player 487 | cello, violoncello 488 | cellular telephone, cellular phone, cellphone, cell, mobile phone 489 | chain 490 | chainlink fence 491 | chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour 492 | chain saw, chainsaw 493 | chest 494 | chiffonier, commode 495 | chime, bell, gong 496 | china cabinet, china closet 497 | Christmas stocking 498 | church, church building 499 | cinema, movie theater, movie theatre, movie house, picture palace 500 | cleaver, meat cleaver, chopper 501 | cliff dwelling 502 | cloak 503 | clog, geta, patten, sabot 504 | cocktail shaker 505 | coffee mug 506 | coffeepot 507 | coil, spiral, volute, whorl, helix 508 | combination lock 509 | computer keyboard, keypad 510 | confectionery, confectionary, candy store 511 | container ship, containership, container vessel 512 | convertible 513 | corkscrew, bottle screw 514 | cornet, horn, trumpet, trump 515 | cowboy boot 516 | cowboy hat, ten-gallon hat 517 | cradle 518 | crane 519 | crash helmet 520 | crate 521 | crib, cot 522 | Crock Pot 523 | croquet ball 524 | crutch 525 | cuirass 526 | dam, dike, dyke 527 | desk 528 | desktop computer 529 | dial telephone, dial phone 530 | diaper, nappy, napkin 531 | digital clock 532 | digital watch 533 | dining table, board 534 | dishrag, dishcloth 535 | dishwasher, dish washer, dishwashing machine 536 | disk brake, disc brake 537 | dock, dockage, docking facility 538 | dogsled, dog sled, dog sleigh 539 | dome 540 | doormat, welcome mat 541 | drilling platform, offshore rig 542 | drum, membranophone, tympan 543 | drumstick 544 | dumbbell 545 | Dutch oven 546 | electric fan, blower 547 | electric guitar 548 | electric locomotive 549 | entertainment center 550 | envelope 551 | espresso maker 552 | face powder 553 | feather boa, boa 554 | file, file cabinet, filing cabinet 555 | fireboat 556 | fire engine, fire truck 557 | fire screen, fireguard 558 | flagpole, flagstaff 559 | flute, transverse flute 560 | folding chair 561 | football helmet 562 | forklift 563 | fountain 564 | fountain pen 565 | four-poster 566 | freight car 567 | French horn, horn 568 | frying pan, frypan, skillet 569 | fur coat 570 | garbage truck, dustcart 571 | gasmask, respirator, gas helmet 572 | gas pump, gasoline pump, petrol pump, island dispenser 573 | goblet 574 | go-kart 575 | golf ball 576 | golfcart, golf cart 577 | gondola 578 | gong, tam-tam 579 | gown 580 | grand piano, grand 581 | greenhouse, nursery, glasshouse 582 | grille, radiator grille 583 | grocery store, grocery, food market, market 584 | guillotine 585 | hair slide 586 | hair spray 587 | half track 588 | hammer 589 | hamper 590 | hand blower, blow dryer, blow drier, hair dryer, hair drier 591 | hand-held computer, hand-held microcomputer 592 | handkerchief, hankie, hanky, hankey 593 | hard disc, hard disk, fixed disk 594 | harmonica, mouth organ, harp, mouth harp 595 | harp 596 | harvester, reaper 597 | hatchet 598 | holster 599 | home theater, home theatre 600 | honeycomb 601 | hook, claw 602 | hoopskirt, crinoline 603 | horizontal bar, high bar 604 | horse cart, horse-cart 605 | hourglass 606 | iPod 607 | iron, smoothing iron 608 | jack-o'-lantern 609 | jean, blue jean, denim 610 | jeep, landrover 611 | jersey, T-shirt, tee shirt 612 | jigsaw puzzle 613 | jinrikisha, ricksha, rickshaw 614 | joystick 615 | kimono 616 | knee pad 617 | knot 618 | lab coat, laboratory coat 619 | ladle 620 | lampshade, lamp shade 621 | laptop, laptop computer 622 | lawn mower, mower 623 | lens 
cap, lens cover 624 | letter opener, paper knife, paperknife 625 | library 626 | lifeboat 627 | lighter, light, igniter, ignitor 628 | limousine, limo 629 | liner, ocean liner 630 | lipstick, lip rouge 631 | Loafer 632 | lotion 633 | loudspeaker, speaker, speaker unit, loudspeaker system, speaker system 634 | loupe, jeweler's loupe 635 | lumbermill, sawmill 636 | magnetic compass 637 | mailbag, postbag 638 | mailbox, letter box 639 | maillot 640 | maillot, tank suit 641 | manhole cover 642 | maraca 643 | marimba, xylophone 644 | mask 645 | matchstick 646 | maypole 647 | maze, labyrinth 648 | measuring cup 649 | medicine chest, medicine cabinet 650 | megalith, megalithic structure 651 | microphone, mike 652 | microwave, microwave oven 653 | military uniform 654 | milk can 655 | minibus 656 | miniskirt, mini 657 | minivan 658 | missile 659 | mitten 660 | mixing bowl 661 | mobile home, manufactured home 662 | Model T 663 | modem 664 | monastery 665 | monitor 666 | moped 667 | mortar 668 | mortarboard 669 | mosque 670 | mosquito net 671 | motor scooter, scooter 672 | mountain bike, all-terrain bike, off-roader 673 | mountain tent 674 | mouse, computer mouse 675 | mousetrap 676 | moving van 677 | muzzle 678 | nail 679 | neck brace 680 | necklace 681 | nipple 682 | notebook, notebook computer 683 | obelisk 684 | oboe, hautboy, hautbois 685 | ocarina, sweet potato 686 | odometer, hodometer, mileometer, milometer 687 | oil filter 688 | organ, pipe organ 689 | oscilloscope, scope, cathode-ray oscilloscope, CRO 690 | overskirt 691 | oxcart 692 | oxygen mask 693 | packet 694 | paddle, boat paddle 695 | paddlewheel, paddle wheel 696 | padlock 697 | paintbrush 698 | pajama, pyjama, pj's, jammies 699 | palace 700 | panpipe, pandean pipe, syrinx 701 | paper towel 702 | parachute, chute 703 | parallel bars, bars 704 | park bench 705 | parking meter 706 | passenger car, coach, carriage 707 | patio, terrace 708 | pay-phone, pay-station 709 | pedestal, plinth, footstall 710 | pencil box, pencil case 711 | pencil sharpener 712 | perfume, essence 713 | Petri dish 714 | photocopier 715 | pick, plectrum, plectron 716 | pickelhaube 717 | picket fence, paling 718 | pickup, pickup truck 719 | pier 720 | piggy bank, penny bank 721 | pill bottle 722 | pillow 723 | ping-pong ball 724 | pinwheel 725 | pirate, pirate ship 726 | pitcher, ewer 727 | plane, carpenter's plane, woodworking plane 728 | planetarium 729 | plastic bag 730 | plate rack 731 | plow, plough 732 | plunger, plumber's helper 733 | Polaroid camera, Polaroid Land camera 734 | pole 735 | police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria 736 | poncho 737 | pool table, billiard table, snooker table 738 | pop bottle, soda bottle 739 | pot, flowerpot 740 | potter's wheel 741 | power drill 742 | prayer rug, prayer mat 743 | printer 744 | prison, prison house 745 | projectile, missile 746 | projector 747 | puck, hockey puck 748 | punching bag, punch bag, punching ball, punchball 749 | purse 750 | quill, quill pen 751 | quilt, comforter, comfort, puff 752 | racer, race car, racing car 753 | racket, racquet 754 | radiator 755 | radio, wireless 756 | radio telescope, radio reflector 757 | rain barrel 758 | recreational vehicle, RV, R.V. 
759 | reel 760 | reflex camera 761 | refrigerator, icebox 762 | remote control, remote 763 | restaurant, eating house, eating place, eatery 764 | revolver, six-gun, six-shooter 765 | rifle 766 | rocking chair, rocker 767 | rotisserie 768 | rubber eraser, rubber, pencil eraser 769 | rugby ball 770 | rule, ruler 771 | running shoe 772 | safe 773 | safety pin 774 | saltshaker, salt shaker 775 | sandal 776 | sarong 777 | sax, saxophone 778 | scabbard 779 | scale, weighing machine 780 | school bus 781 | schooner 782 | scoreboard 783 | screen, CRT screen 784 | screw 785 | screwdriver 786 | seat belt, seatbelt 787 | sewing machine 788 | shield, buckler 789 | shoe shop, shoe-shop, shoe store 790 | shoji 791 | shopping basket 792 | shopping cart 793 | shovel 794 | shower cap 795 | shower curtain 796 | ski 797 | ski mask 798 | sleeping bag 799 | slide rule, slipstick 800 | sliding door 801 | slot, one-armed bandit 802 | snorkel 803 | snowmobile 804 | snowplow, snowplough 805 | soap dispenser 806 | soccer ball 807 | sock 808 | solar dish, solar collector, solar furnace 809 | sombrero 810 | soup bowl 811 | space bar 812 | space heater 813 | space shuttle 814 | spatula 815 | speedboat 816 | spider web, spider's web 817 | spindle 818 | sports car, sport car 819 | spotlight, spot 820 | stage 821 | steam locomotive 822 | steel arch bridge 823 | steel drum 824 | stethoscope 825 | stole 826 | stone wall 827 | stopwatch, stop watch 828 | stove 829 | strainer 830 | streetcar, tram, tramcar, trolley, trolley car 831 | stretcher 832 | studio couch, day bed 833 | stupa, tope 834 | submarine, pigboat, sub, U-boat 835 | suit, suit of clothes 836 | sundial 837 | sunglass 838 | sunglasses, dark glasses, shades 839 | sunscreen, sunblock, sun blocker 840 | suspension bridge 841 | swab, swob, mop 842 | sweatshirt 843 | swimming trunks, bathing trunks 844 | swing 845 | switch, electric switch, electrical switch 846 | syringe 847 | table lamp 848 | tank, army tank, armored combat vehicle, armoured combat vehicle 849 | tape player 850 | teapot 851 | teddy, teddy bear 852 | television, television system 853 | tennis ball 854 | thatch, thatched roof 855 | theater curtain, theatre curtain 856 | thimble 857 | thresher, thrasher, threshing machine 858 | throne 859 | tile roof 860 | toaster 861 | tobacco shop, tobacconist shop, tobacconist 862 | toilet seat 863 | torch 864 | totem pole 865 | tow truck, tow car, wrecker 866 | toyshop 867 | tractor 868 | trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi 869 | tray 870 | trench coat 871 | tricycle, trike, velocipede 872 | trimaran 873 | tripod 874 | triumphal arch 875 | trolleybus, trolley coach, trackless trolley 876 | trombone 877 | tub, vat 878 | turnstile 879 | typewriter keyboard 880 | umbrella 881 | unicycle, monocycle 882 | upright, upright piano 883 | vacuum, vacuum cleaner 884 | vase 885 | vault 886 | velvet 887 | vending machine 888 | vestment 889 | viaduct 890 | violin, fiddle 891 | volleyball 892 | waffle iron 893 | wall clock 894 | wallet, billfold, notecase, pocketbook 895 | wardrobe, closet, press 896 | warplane, military plane 897 | washbasin, handbasin, washbowl, lavabo, wash-hand basin 898 | washer, automatic washer, washing machine 899 | water bottle 900 | water jug 901 | water tower 902 | whiskey jug 903 | whistle 904 | wig 905 | window screen 906 | window shade 907 | Windsor tie 908 | wine bottle 909 | wing 910 | wok 911 | wooden spoon 912 | wool, woolen, woollen 913 | worm fence, snake fence, snake-rail fence, Virginia fence 914 | 
wreck 915 | yawl 916 | yurt 917 | web site, website, internet site, site 918 | comic book 919 | crossword puzzle, crossword 920 | street sign 921 | traffic light, traffic signal, stoplight 922 | book jacket, dust cover, dust jacket, dust wrapper 923 | menu 924 | plate 925 | guacamole 926 | consomme 927 | hot pot, hotpot 928 | trifle 929 | ice cream, icecream 930 | ice lolly, lolly, lollipop, popsicle 931 | French loaf 932 | bagel, beigel 933 | pretzel 934 | cheeseburger 935 | hotdog, hot dog, red hot 936 | mashed potato 937 | head cabbage 938 | broccoli 939 | cauliflower 940 | zucchini, courgette 941 | spaghetti squash 942 | acorn squash 943 | butternut squash 944 | cucumber, cuke 945 | artichoke, globe artichoke 946 | bell pepper 947 | cardoon 948 | mushroom 949 | Granny Smith 950 | strawberry 951 | orange 952 | lemon 953 | fig 954 | pineapple, ananas 955 | banana 956 | jackfruit, jak, jack 957 | custard apple 958 | pomegranate 959 | hay 960 | carbonara 961 | chocolate sauce, chocolate syrup 962 | dough 963 | meat loaf, meatloaf 964 | pizza, pizza pie 965 | potpie 966 | burrito 967 | red wine 968 | espresso 969 | cup 970 | eggnog 971 | alp 972 | bubble 973 | cliff, drop, drop-off 974 | coral reef 975 | geyser 976 | lakeside, lakeshore 977 | promontory, headland, head, foreland 978 | sandbar, sand bar 979 | seashore, coast, seacoast, sea-coast 980 | valley, vale 981 | volcano 982 | ballplayer, baseball player 983 | groom, bridegroom 984 | scuba diver 985 | rapeseed 986 | daisy 987 | yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum 988 | corn 989 | acorn 990 | hip, rose hip, rosehip 991 | buckeye, horse chestnut, conker 992 | coral fungus 993 | agaric 994 | gyromitra 995 | stinkhorn, carrion fungus 996 | earthstar 997 | hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa 998 | bolete 999 | ear, spike, capitulum 1000 | toilet tissue, toilet paper, bathroom tissue -------------------------------------------------------------------------------- /ch9/word2vec.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | import collections 4 | import random 5 | import math 6 | import os 7 | 8 | tot_captions, only_captions = None, None 9 | 10 | data_indices = None 11 | reverse_dictionary = None 12 | embedding_size = None 13 | vocabulary_size = None 14 | max_caption_length = None 15 | 16 | def define_data_and_hyperparameters( 17 | _tot_captions, _only_captions, _reverse_dictionary, 18 | _emb_size, _vocab_size, _max_cap_length): 19 | global data_indices, tot_captions, only_captions, reverse_dictionary 20 | global embedding_size, vocabulary_size, max_caption_length 21 | 22 | tot_captions = _tot_captions 23 | only_captions = _only_captions 24 | 25 | data_indices = [0 for _ in range(tot_captions)] 26 | reverse_dictionary = _reverse_dictionary 27 | embedding_size = _emb_size 28 | vocabulary_size = _vocab_size 29 | max_caption_length = _max_cap_length 30 | 31 | 32 | def generate_batch_for_word2vec(batch_size, window_size): 33 | # window_size is the amount of words we're looking at from each side of a given word 34 | # creates a single batch 35 | global data_indices 36 | 37 | span = 2 * window_size + 1 # [ skip_window target skip_window ] 38 | 39 | batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32) 40 | labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) 41 | # e.g if skip_window = 2 then span = 5 42 | # span is the length of the whole frame we 
are considering for a single word (left + word + right) 43 | # skip_window is the length of one side 44 | 45 | caption_ids_for_batch = np.random.randint(0, tot_captions, batch_size) 46 | 47 | for b_i in range(batch_size): 48 | cap_id = caption_ids_for_batch[b_i] 49 | 50 | buffer = only_captions[cap_id, data_indices[cap_id]:data_indices[cap_id] + span] 51 | assert buffer.size == span, 'Buffer length (%d), Current data index (%d), Span(%d)' % ( 52 | buffer.size, data_indices[cap_id], span) 53 | # If we only have EOS tokesn in the sampled text, we sample a new one 54 | while np.all(buffer == 1): 55 | # reset the data_indices for that cap_id 56 | data_indices[cap_id] = 0 57 | # sample a new cap_id 58 | cap_id = np.random.randint(0, tot_captions) 59 | buffer = only_captions[cap_id, data_indices[cap_id]:data_indices[cap_id] + span] 60 | 61 | # fill left and right sides of batch 62 | batch[b_i, :window_size] = buffer[:window_size] 63 | batch[b_i, window_size:] = buffer[window_size + 1:] 64 | 65 | labels[b_i, 0] = buffer[window_size] 66 | 67 | # increase the corresponding index 68 | data_indices[cap_id] = (data_indices[cap_id] + 1) % (max_caption_length - span) 69 | 70 | assert batch.shape[0] == batch_size and batch.shape[1] == span - 1 71 | return batch, labels 72 | 73 | def print_some_batches(): 74 | global data_indices, reverse_dictionary 75 | 76 | for w_size in [1, 2]: 77 | data_indices = [0 for _ in range(tot_captions)] 78 | batch, labels = generate_batch_for_word2vec(batch_size=8, window_size=w_size) 79 | print('\nwith window_size = %d:' %w_size) 80 | print(' batch:', [[reverse_dictionary[bii] for bii in bi] for bi in batch]) 81 | print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)]) 82 | 83 | batch_size, embedding_size, window_size = None, None, None 84 | valid_size, valid_window, valid_examples = None, None, None 85 | num_sampled = None 86 | 87 | train_dataset, train_labels = None, None 88 | valid_dataset = None 89 | 90 | softmax_weights, softmax_biases = None, None 91 | 92 | loss, optimizer, similarity, normalized_embeddings = None, None, None, None 93 | 94 | def define_word2vec_tensorflow(batch_size): 95 | global embedding_size, window_size 96 | global valid_size, valid_window, valid_examples 97 | global num_sampled 98 | global train_dataset, train_labels 99 | global valid_dataset 100 | global softmax_weights, softmax_biases 101 | global loss, optimizer, similarity 102 | global vocabulary_size, embedding_size 103 | global normalized_embeddings 104 | 105 | # How many words to consider left and right. 106 | # Skip gram by design does not require to have all the context words in a given step 107 | # However, for CBOW that's a requirement, so we limit the window size 108 | window_size = 3 109 | 110 | # We pick a random validation set to sample nearest neighbors 111 | valid_size = 16 # Random set of words to evaluate similarity on. 112 | # We sample valid datapoints randomly from a large window without always being deterministic 113 | valid_window = 50 114 | 115 | # When selecting valid examples, we select some of the most frequent words as well as 116 | # some moderately rare words as well 117 | valid_examples = np.array(random.sample(range(valid_window), valid_size)) 118 | valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0) 119 | 120 | num_sampled = 32 # Number of negative examples to sample. 121 | 122 | tf.reset_default_graph() 123 | 124 | # Training input data (target word IDs). 
Note that it has 2*window_size columns 125 | train_dataset = tf.placeholder(tf.int32, shape=[batch_size,2*window_size]) 126 | # Training input label data (context word IDs) 127 | train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) 128 | # Validation input data, we don't need a placeholder 129 | # as we have already defined the IDs of the words selected 130 | # as validation data 131 | valid_dataset = tf.constant(valid_examples, dtype=tf.int32) 132 | 133 | # Variables. 134 | 135 | # Embedding layer, contains the word embeddings 136 | embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0,dtype=tf.float32)) 137 | 138 | # Softmax Weights and Biases 139 | softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], 140 | stddev=0.5 / math.sqrt(embedding_size),dtype=tf.float32)) 141 | softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01)) 142 | 143 | # Model. 144 | # Look up embeddings for a batch of inputs. 145 | # Here we do embedding lookups for each column in the input placeholder 146 | # and then average them to produce an embedding_size word vector 147 | stacked_embedings = None 148 | print('Defining %d embedding lookups representing each word in the context'%(2*window_size)) 149 | for i in range(2*window_size): 150 | embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:,i]) 151 | x_size,y_size = embedding_i.get_shape().as_list() 152 | if stacked_embedings is None: 153 | stacked_embedings = tf.reshape(embedding_i,[x_size,y_size,1]) 154 | else: 155 | stacked_embedings = tf.concat(axis=2,values=[stacked_embedings,tf.reshape(embedding_i,[x_size,y_size,1])]) 156 | 157 | assert stacked_embedings.get_shape().as_list()[2]==2*window_size 158 | print("Stacked embedding size: %s"%stacked_embedings.get_shape().as_list()) 159 | mean_embeddings = tf.reduce_mean(stacked_embedings,2,keepdims=False) 160 | print("Reduced mean embedding size: %s"%mean_embeddings.get_shape().as_list()) 161 | 162 | 163 | # Compute the softmax loss, using a sample of the negative labels each time. 164 | # inputs are embeddings of the train words 165 | # with this loss we optimize weights, biases, embeddings 166 | loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=mean_embeddings, 167 | labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)) 168 | # AdamOptimizer. 169 | optimizer = tf.train.AdamOptimizer(0.0005).minimize(loss) 170 | 171 | # Compute the similarity between minibatch examples and all embeddings. 
172 | # We use the cosine distance: 173 | norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True)) 174 | normalized_embeddings = embeddings / norm 175 | valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 176 | similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings)) 177 | 178 | 179 | def run_word2vec(batch_size): 180 | global embedding_size, window_size 181 | global valid_size, valid_window, valid_examples 182 | global num_sampled 183 | global train_dataset, train_labels 184 | global valid_dataset 185 | global softmax_weights, softmax_biases 186 | global loss, optimizer, similarity, normalized_embeddings 187 | global data_list, num_files, reverse_dictionary 188 | global vocabulary_size, embedding_size 189 | 190 | work_dir = 'image_caption_data' 191 | num_steps = 100001 192 | 193 | session = tf.InteractiveSession() 194 | 195 | tf.global_variables_initializer().run() 196 | print('Initialized') 197 | average_loss = 0 198 | for step in range(num_steps): 199 | 200 | # Load a batch of data 201 | batch_data, batch_labels = generate_batch_for_word2vec(batch_size, window_size) 202 | 203 | # Populate the feed_dict and run the optimizer and get the loss out 204 | feed_dict = {train_dataset: batch_data, train_labels: batch_labels} 205 | _, l = session.run([optimizer, loss], feed_dict=feed_dict) 206 | 207 | average_loss += l 208 | 209 | if (step + 1) % 2000 == 0: 210 | if step > 0: 211 | # The average loss is an estimate of the loss over the last 2000 batches. 212 | average_loss = average_loss / 2000 213 | 214 | print('Average loss at step %d: %f' % (step + 1, average_loss)) 215 | average_loss = 0 # Reset average loss 216 | 217 | if (step + 1) % 10000 == 0: 218 | sim = similarity.eval() 219 | # Calculate the most similar (top_k) words 220 | # to the previosly selected set of valid words 221 | # Note that this is an expensive step 222 | for i in range(valid_size): 223 | valid_word = reverse_dictionary[valid_examples[i]] 224 | top_k = 3 # number of nearest neighbors 225 | nearest = (-sim[i, :]).argsort()[1:top_k + 1] 226 | log = 'Nearest to %s:' % valid_word 227 | for k in range(top_k): 228 | close_word = reverse_dictionary[nearest[k]] 229 | log = '%s %s,' % (log, close_word) 230 | print(log) 231 | 232 | # Get the normalized embeddings we learnt 233 | cbow_final_embeddings = normalized_embeddings.eval() 234 | 235 | # Save the embeddings to the disk as 'caption_embeddings-tmp.npy' 236 | # If you want to use this embeddings in the next steps 237 | # please change the filename to 'caption-embeddings.npy' 238 | np.save(os.path.join(work_dir,'caption-embeddings-tmp'), cbow_final_embeddings) --------------------------------------------------------------------------------
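
Both `word2vec.py` helpers above are meant to be driven from the chapter notebooks rather than run on their own. As a rough sketch of the calling order for the ch9 variant, the snippet below wires the module up with synthetic caption data; the toy array sizes, the dummy vocabulary, and the `import word2vec` path (running from inside `ch9/`) are illustrative assumptions rather than part of the book's code, and a TensorFlow 1.x environment is assumed, as used throughout the repository.

```
import os
import numpy as np
import word2vec  # assumption: this script sits in ch9/ next to word2vec.py

# Toy stand-ins for the real caption data (shapes and sizes are assumptions)
tot_captions, max_cap_length = 100, 20
vocab_size, emb_size = 2000, 128  # vocab must cover the rare-word validation IDs (up to 1049)
only_captions = np.random.randint(2, vocab_size, size=(tot_captions, max_cap_length)).astype(np.int32)
reverse_dictionary = {i: 'word_%d' % i for i in range(vocab_size)}

batch_size = 128

# 1. Register the data and hyperparameters with the module-level globals
word2vec.define_data_and_hyperparameters(
    tot_captions, only_captions, reverse_dictionary,
    emb_size, vocab_size, max_cap_length)

# 2. Sanity-check a few (context, target) batches
word2vec.print_some_batches()

# 3. Build the CBOW graph and train; run_word2vec saves the learned embeddings under image_caption_data/
word2vec.define_word2vec_tensorflow(batch_size)
os.makedirs('image_caption_data', exist_ok=True)  # already present in the repository
word2vec.run_word2vec(batch_size)  # trains for ~100k steps; lower num_steps in the module to smoke-test
```

The ch8 module follows the same pattern, except that its `define_data_and_hyperparameters` takes the number of documents and a list of per-document word-ID sequences, and its `define_word2vec_tensorflow` and `run_word2vec` take no arguments.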