├── .gitignore
├── Neural Language Model.ipynb
├── Neural+Language+Model.py
├── README.md
├── corpus.txt
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb*
3 | chkpnts/*
4 | best_chkpnts/*


--------------------------------------------------------------------------------
/Neural Language Model.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 1. Neural Language Model\n",
  8 |     "If you are here that means you wish to cut the crap and understand how to train your own Neural Language Model. If you are a regular user of frameworks like Keras, Tflearn, etc., then you know how easy it has become these days to build, train and deploy Neural Network Models. If not then you will probably by the end of this post.\n",
  9 |     "\n",
 10 |     "# 2. Prerequisite\n",
 11 |     "1. [Python](https://www.tutorialspoint.com/python/): I will be using Python 3.5 for this tutorial\n",
 12 |     "\n",
 13 |     "2. [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/): If you dont know what LSTMs are, then this is a must read.\n",
 14 |     "\n",
 15 |     "3. [Basics of Machine Learning](https://www.youtube.com/watch?v=2uiulzZxmGg): If you want to dive into Machine Learning/Deep Learning, then I strongly recommend the first 4 lectures from [Stanford's CS231]() by Andrej Karpathy.\n",
 16 |     "\n",
 17 |     "4. [Language Model](https://en.wikipedia.org/wiki/Language_model): If you want to have a basic understanding of Language Models.\n",
 18 |     "\n",
 19 |     "# 3. Frameworks\n",
 20 |     "1. [Tflearn](http://tflearn.org/installation/) 0.3.2\n",
 21 |     "2. [Spacy](https://spacy.io/) 1.9.0\n",
 22 |     "3. [Tensorflow](https://spacy.io/) 1.0.1\n",
 23 |     "\n",
 24 |     "### Note\n",
 25 |     "you can take this post as a hands-on exercise on \"How to build your own Neural Language Model\" from scratch. If you have a ready to use virtualenv with all the dependencies installed then you can skip Section 4 and jump to Section 5. "
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "markdown",
 30 |    "metadata": {},
 31 |    "source": [
 32 |     "# 4. Install Dependencies\n",
 33 |     "We will install everythin in a virtual environment and I would suggest you to run this Jupyter Notebook in the same virtualenv. I have also provided a ```requirements.txt``` file with the [repository](https://github.com/dashayushman/neural-language-model) to make things easier.\n",
 34 |     "\n",
 35 |     "### 4.1 Virtual Environment\n",
 36 |     "You can follow [this](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for a fast guide to Virtual Environments.\n",
 37 |     "\n",
 38 |     "```sh\n",
 39 |     "pip install virtualenv\n",
 40 |     "```\n",
 41 |     "\n",
 42 |     "### 4.2 Tflearn\n",
 43 |     "Follow [this](http://tflearn.org/installation/) and install Tflearn. Make sure to have the versions correct in case you want to avoid weird errors. \n",
 44 |     "\n",
 45 |     "```sh\n",
 46 |     "pip install -Iv tflearn==0.3.2\n",
 47 |     "```\n",
 48 |     "\n",
 49 |     "### 4.3 Tensorflow\n",
 50 |     "Install Tensorflow by following the instructions [here](https://www.tensorflow.org/install/). To make sure of installing the right version, use this\n",
 51 |     "\n",
 52 |     "```sh\n",
 53 |     "pip install -Iv tensorflow-gpu==1.0.1\n",
 54 |     "```\n",
 55 |     "Note that this is the GPU version of Tensorflow. You can even install the CPU version for this tutorial, but I would strongly recommend the GPU version if you intend to intend to scale it to use in the real world.\n",
 56 |     "\n",
 57 |     "### 4.4 Spacy\n",
 58 |     "Install Spacy by following the instructions [here](https://spacy.io/docs/usage/). For the right version use,\n",
 59 |     "\n",
 60 |     "```sh\n",
 61 |     "pip install -Iv spacy==1.9.0\n",
 62 |     "```\n",
 63 |     "\n",
 64 |     "### 4.5 Others\n",
 65 |     "```sh\n",
 66 |     "pip install numpy\n",
 67 |     "```"
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "markdown",
 72 |    "metadata": {},
 73 |    "source": [
 74 |     "# 5. Get the Repo\n",
 75 |     "clone the Neural Language Model GitHub repository onto your computer and start the Jupyter Notebook server.\n",
 76 |     "\n",
 77 |     "```sh\n",
 78 |     "git clone https://github.com/dashayushman/neural-language-model.git\n",
 79 |     "cd neural-language-model\n",
 80 |     "jupyter notebook\n",
 81 |     "```\n",
 82 |     "\n",
 83 |     "Open the notebook names **Neural Language Model** and you can start off."
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "metadata": {},
 89 |    "source": [
 90 |     "# 6. Neural Language Model\n",
 91 |     "We will start building our own Language model using an LSTM Network. To do so we will need a corpus. For the purpose of this tutorial, let us use a toy corpus, which is a text file called ```corpus.txt``` that 0I downloaded from Wikipedia. I will use this to demponstrate how to build your own Neural Language Model, and you can use the same knowledge to extend the model further for a more realistic scenario (I will give pointers to do so too).\n",
 92 |     "\n",
 93 |     "## 6.1 Loading The Corpus\n",
 94 |     "In this section you will load the ```corpus.txt``` and do minimal preprocessing."
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": 1,
100 |    "metadata": {
101 |     "scrolled": true
102 |    },
103 |    "outputs": [
104 |     {
105 |      "name": "stdout",
106 |      "output_type": "stream",
107 |      "text": [
108 |       "Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised.\n",
109 |       "Some representations are loosely based on interpretation of information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain. Research attempts to create efficient systems to learn these representations from large-scale, unlabeled data sets.\n",
110 |       "Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics where they produced results comparable to and in some cases superior to human experts.\n",
111 |       "Deep learning is a class of machine learning algorithms that:\n",
112 |       "use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be supervised or unsupervised and applications include pattern analysis and classification .\n"
113 |      ]
114 |     }
115 |    ],
116 |    "source": [
117 |     "import re\n",
118 |     "\n",
119 |     "with open('corpus.txt', 'r') as cf:\n",
120 |     "    corpus = []\n",
121 |     "    for line in cf: # loops over all the lines in the corpus\n",
122 |     "        line = line.strip() # strips off \\n \\r from the ends \n",
123 |     "        if line: # Take only non empty lines\n",
124 |     "            line = re.sub(r'\\([^)]*\\)', '', line) # Regular Expression to remove text in between brackets\n",
125 |     "            line = re.sub(' +',' ', line) # Removes consecutive spaces\n",
126 |     "            # add more pre-processing steps\n",
127 |     "            corpus.append(line)\n",
128 |     "print(\"\\n\".join(corpus[:5])) # Shows the first 5 lines of the corpus"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "markdown",
133 |    "metadata": {},
134 |    "source": [
135 |     "As you can see that this small piece of code loads the toy text corpus, extracts lines from it, ignores empty lines, and removes text in between brackets. Note that in reality you will not be able to load the entire corpus into memory. You will need to write a [generator](https://wiki.python.org/moin/Generators) to yield text lines from the corpus, or use some advanced features provided by the Deep Learning frameworks like [Tensorflow's Input Pipelines](https://www.tensorflow.org/programmers_guide/reading_data). \n",
136 |     "\n",
137 |     "## 6.2 Tokenizing the Corpus\n",
138 |     "In this section we will see how to tokenize the text lines that we extracted and then create a **Vocabulary**."
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": 2,
144 |    "metadata": {
145 |     "collapsed": true
146 |    },
147 |    "outputs": [],
148 |    "source": [
149 |     "# Load Spacy\n",
150 |     "import spacy\n",
151 |     "import numpy as np\n",
152 |     "nlp = spacy.load('en_core_web_sm')"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": 3,
158 |    "metadata": {},
159 |    "outputs": [
160 |     {
161 |      "name": "stdout",
162 |      "output_type": "stream",
163 |      "text": [
164 |       "['SEQUENCE_BEGIN', 'deep', 'learning', 'is', 'part', 'of', 'a', 'broader', 'family', 'of', 'machine', 'learning', 'methods', 'based', 'on', 'learning', 'data', 'representations', ',', 'as', 'opposed', 'to', 'task', '-', 'specific', 'algorithms', '.', 'SEQUENCE_END', 'SEQUENCE_BEGIN', 'learning']\n",
165 |       "Mean Sentence Length: 31.991413024995747\n",
166 |       "Sentence Length Standard Deviation: 15.024047302248745\n",
167 |       "Max Sentence Length: 179\n"
168 |      ]
169 |     }
170 |    ],
171 |    "source": [
172 |     "def preprocess_corpus(corpus):\n",
173 |     "    corpus_tokens = []\n",
174 |     "    sentence_lengths = []\n",
175 |     "    for line in corpus:\n",
176 |     "        doc = nlp(line) # Parse each line in the corpus\n",
177 |     "        for sent in doc.sents: # Loop over all the sentences in the line\n",
178 |     "            corpus_tokens.append('SEQUENCE_BEGIN')\n",
179 |     "            s_len = 1\n",
180 |     "            for tok in sent: # Loop over all the words in a sentence\n",
181 |     "                if tok.text.strip() != '' and tok.ent_type_ != '': # If the token is a Named Entity then do not lowercase it \n",
182 |     "                    corpus_tokens.append(tok.text)\n",
183 |     "                else:\n",
184 |     "                    corpus_tokens.append(tok.text.lower())\n",
185 |     "                s_len += 1\n",
186 |     "            corpus_tokens.append('SEQUENCE_END')\n",
187 |     "            sentence_lengths.append(s_len+1)\n",
188 |     "    return corpus_tokens, sentence_lengths\n",
189 |     "\n",
190 |     "corpus_tokens, sentence_lengths = preprocess_corpus(corpus)\n",
191 |     "print(corpus_tokens[:30]) # Prints the first 30 tokens\n",
192 |     "mean_sentence_length = np.mean(sentence_lengths)\n",
193 |     "deviation_sentence_length = np.std(sentence_lengths)\n",
194 |     "max_sentence_length = np.max(sentence_lengths)\n",
195 |     "print('Mean Sentence Length: {}\\nSentence Length Standard Deviation: {}\\n'\n",
196 |     "      'Max Sentence Length: {}'.format(mean_sentence_length, deviation_sentence_length, max_sentence_length))"
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "markdown",
201 |    "metadata": {
202 |     "collapsed": true
203 |    },
204 |    "source": [
205 |     "Notice that we did not lowercase the [Named Entities(NEs)](https://en.wikipedia.org/wiki/Named-entity_recognition). This is totally your choice. It part of a normalization step and I believe it is a good idea to let the model learn the Named Entities in the corpus. But do not blindly consider any library for NEs. I chose Spacy as it is very simple to use, fast and efficient. Note that I am using the [**en_core_web_sm**](https://spacy.io/docs/usage/models) model of Spacy, which is very small and good enough for this tutorial. You would probably want to choose your own NE recognizer.\n",
206 |     "\n",
207 |     "Other Normalization steps include [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) which I will not implement because **(1)** I want my Language Model to learn the various forms of a word and their occurances by itself; **(2)** In a real world scenario you will train your Model with a huge corpus with Millions of text lines, and you can assume that the corpus covers the most commonly used terms in Language. Hence, no extra normalization is required. \n",
208 |     "\n",
209 |     "### 6.2.1 SEQUENCE_BEGIN and SEQUENCE_END\n",
210 |     "Along with the naturally occurring terms in the corpus, we will add two new terms called the *SEQUENCE_BEGIN* and **SEQUENCE_END** term. These terms mark the beginning and end of a sentence. We do this because we want our model to learn word occurring at the beginning and at the end of sentences. Note that we are dependent on Spacy's Tokenization algorithm here. You are free to explore other tokenizers and use whichever you find is best.\n",
211 |     "\n",
212 |     "## 6.3 Create a Vocabulary\n",
213 |     "After we have minimally preprocessed the corpus and extracted sequence of terms from it, we will create a vocabulary for our Language Model. This means that we will create two python dictionaries,\n",
214 |     "1. **Word2Idx** : This dictionary has all the unique words(terms) as keys with a corresponding unique ID as values\n",
215 |     "2. **Idx2Word** : This is the reverse of Word2Idx. It has the unique IDs as keys and their corresponding words(terms) as values"
216 |    ]
217 |   },
218 |   {
219 |    "cell_type": "code",
220 |    "execution_count": 4,
221 |    "metadata": {
222 |     "collapsed": true
223 |    },
224 |    "outputs": [],
225 |    "source": [
226 |     "vocab = list(set(corpus_tokens)) # This works well for a very small corpus\n",
227 |     "#print(vocab)"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "markdown",
232 |    "metadata": {},
233 |    "source": [
234 |     "**Alternatively**, if your corpus is huge, you would probably want to iterate through it entirely and generate term frequencies. Once you have the term frequencies, it is better to select the most commonly occuring terms in the vocabulary (as it covers most of the Natural Language)."
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "code",
239 |    "execution_count": 5,
240 |    "metadata": {},
241 |    "outputs": [
242 |     {
243 |      "name": "stdout",
244 |      "output_type": "stream",
245 |      "text": [
246 |       "Vocab Size: 10000\n",
247 |       "[('the', 20158), (',', 17897), ('of', 13340), ('SEQUENCE_BEGIN', 11762), ('SEQUENCE_END', 11762), ('.', 10932), ('and', 9357), ('in', 7566), ('to', 6953), ('a', 6901), ('development', 3632), ('-', 3582), ('that', 3569), ('is', 3077), ('history', 3077), ('for', 2951), ('\"', 2410), ('on', 2057), ('as', 2036), ('with', 2034), (\"'s\", 1801), ('by', 1641), ('[', 1633), (']', 1626), ('it', 1561), ('was', 1525), ('an', 1316), ('this', 1316), ('named', 1301), ('from', 1269), ('at', 1203), ('are', 1203), ('be', 1189), ('has', 1149), ('have', 1116), ('or', 1055), ('not', 881), ('its', 855), ('which', 829), (':', 821), ('but', 820), ('influence', 819), ('his', 809), (';', 804), ('been', 769), ('their', 735), ('were', 708), ('he', 660), ('we', 637), ('who', 620), ('one', 606), ('--', 594), ('after', 562), ('these', 550), ('had', 544), ('more', 536), ('other', 525), ('’s', 507), ('most', 502), ('also', 493), ('will', 490), ('all', 487), ('during', 482), ('can', 480), ('about', 476), ('they', 473), (\"'\", 453), ('i', 432), ('when', 421), ('new', 417), ('such', 410), ('there', 405), ('than', 403), ('ordered', 396), ('into', 390), ('may', 389), ('our', 366), ('first', 362), ('you', 361), ('time', 360), ('would', 348), ('no', 343), ('so', 337), ('only', 327), ('two', 317), ('“', 313), ('early', 311), ('because', 306), ('many', 303), ('some', 302), ('cells', 301), ('if', 299), ('”', 297), ('American', 296), ('years', 293), ('name', 293), ('up', 278), ('over', 278), ('out', 274), ('launched', 273)]\n"
248 |      ]
249 |     }
250 |    ],
251 |    "source": [
252 |     "import collections\n",
253 |     "\n",
254 |     "word_counter = collections.Counter()\n",
255 |     "for term in corpus_tokens:\n",
256 |     "    word_counter.update({term: 1})\n",
257 |     "vocab = word_counter.most_common(10000) # 10000 Most common terms\n",
258 |     "print('Vocab Size: {}'.format(len(vocab))) \n",
259 |     "print(word_counter.most_common(100)) # just to show the top 100 terms"
260 |    ]
261 |   },
262 |   {
263 |    "cell_type": "markdown",
264 |    "metadata": {},
265 |    "source": [
266 |     "This was we make sure to consider the ***top K***(in this case 100) most commonly used terms in the Language (assuming that the corpus represents the Language or domain specific language. For e.g., medical corpora, e-commerce corpora, etc.). In Neural Machine Translation Models, usually a vocabulary size of 10,000 to 100,000 is used. But remember, it all depends on your task, corpus size, and the Language itself. "
267 |    ]
268 |   },
269 |   {
270 |    "cell_type": "markdown",
271 |    "metadata": {},
272 |    "source": [
273 |     "### 6.3.1 UNKNOWN and PAD\n",
274 |     "Along with the vocabulary terms that we generated, we need two more special terms:\n",
275 |     "1. **UNKNOWN**: This term is used for all the words that the model will observe apart from the vocabulary terms.\n",
276 |     "2. **PAD**: The pad term is used to pad the sequences to a maximum length. This is required for feeding variable length sequences into the Network (we use DynamicRnn to handle variable length sequences. So, padding makes no difference. It is just required for feeding the data to Tensorflow)\n",
277 |     "\n",
278 |     "This is required as during inference time there will be many unknown words (words that the model has never seen). It is better to add an **UNKNOWN** token in the vocabulary so that the model will learn to handle terms that are unknown to the Model."
279 |    ]
280 |   },
281 |   {
282 |    "cell_type": "code",
283 |    "execution_count": 6,
284 |    "metadata": {},
285 |    "outputs": [
286 |     {
287 |      "name": "stdout",
288 |      "output_type": "stream",
289 |      "text": [
290 |       "Word2Idx Size: 10002\n",
291 |       "Idx2Word Size: 10002\n"
292 |      ]
293 |     }
294 |    ],
295 |    "source": [
296 |     "vocab.append(('UNKNOWN', 1))\n",
297 |     "Idx = range(1, len(vocab)+1)\n",
298 |     "vocab = [t[0] for t in vocab]\n",
299 |     "\n",
300 |     "Word2Idx = dict(zip(vocab, Idx))\n",
301 |     "Idx2Word = dict(zip(Idx, vocab))\n",
302 |     "\n",
303 |     "Word2Idx['PAD'] = 0\n",
304 |     "Idx2Word[0] = 'PAD'\n",
305 |     "VOCAB_SIZE = len(Word2Idx)\n",
306 |     "print('Word2Idx Size: {}'.format(len(Word2Idx)))\n",
307 |     "print('Idx2Word Size: {}'.format(len(Idx2Word)))"
308 |    ]
309 |   },
310 |   {
311 |    "cell_type": "markdown",
312 |    "metadata": {},
313 |    "source": [
314 |     "## 6.4 Preload Word Vectors\n",
315 |     "Since you are here, I am almost sure that you are familiar with or have atleast heard of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html). Read about it if you don't know. \n",
316 |     "\n",
317 |     "Spacy provides a set of pretrained word vectors. We will make use of these to initialize our embedding layer (details in the following section). "
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": 7,
323 |    "metadata": {
324 |     "scrolled": true
325 |    },
326 |    "outputs": [
327 |     {
328 |      "name": "stdout",
329 |      "output_type": "stream",
330 |      "text": [
331 |       "Shape of w2v: (10002, 300)\n",
332 |       "Some Vectors\n",
333 |       "[ 0.32350999  0.35554001  0.029381    0.15276    -0.14915     0.22169\n",
334 |       "  0.007907   -0.61286002  0.24625     0.094113  ] PAD\n",
335 |       "[  3.73400003e-02   1.01959996e-03   1.12499997e-01  -3.48410010e-01\n",
336 |       "  -1.22720003e-01   8.06659982e-02   4.93220001e-01   7.56980032e-02\n",
337 |       "   4.80910003e-01   2.67359996e+00] time\n"
338 |      ]
339 |     }
340 |    ],
341 |    "source": [
342 |     "w2v = np.random.rand(len(Word2Idx), 300) # We use 300 because Spacy provides us with vectors of size 300\n",
343 |     "\n",
344 |     "for w_i, key in enumerate(Word2Idx):\n",
345 |     "    token = nlp(key)\n",
346 |     "    if token.has_vector:\n",
347 |     "        #print(token.text, Word2Idx[key])\n",
348 |     "        w2v[Word2Idx[key]:] = token.vector\n",
349 |     "EMBEDDING_SIZE = w2v.shape[-1]\n",
350 |     "print('Shape of w2v: {}'.format(w2v.shape))\n",
351 |     "print('Some Vectors')\n",
352 |     "print(w2v[0][:10], Idx2Word[0])\n",
353 |     "print(w2v[80][:10], Idx2Word[80])"
354 |    ]
355 |   },
356 |   {
357 |    "cell_type": "markdown",
358 |    "metadata": {},
359 |    "source": [
360 |     "## 6.5 Splitting the Data\n",
361 |     "We are almost there. Have patience :) We need to split the data into Training and Validation set before we proceed any further. So,"
362 |    ]
363 |   },
364 |   {
365 |    "cell_type": "code",
366 |    "execution_count": 8,
367 |    "metadata": {},
368 |    "outputs": [
369 |     {
370 |      "name": "stdout",
371 |      "output_type": "stream",
372 |      "text": [
373 |       "Train Size: 301026\n",
374 |       "Validation Size: 75256\n"
375 |      ]
376 |     }
377 |    ],
378 |    "source": [
379 |     "train_val_split = int(len(corpus_tokens) * 0.8) # We use 80% of the data for Training and 20% for validating\n",
380 |     "train = corpus_tokens[:train_val_split]\n",
381 |     "validation = corpus_tokens[train_val_split:-1]\n",
382 |     "\n",
383 |     "print('Train Size: {}\\nValidation Size: {}'.format(len(train), len(validation)))"
384 |    ]
385 |   },
386 |   {
387 |    "cell_type": "markdown",
388 |    "metadata": {},
389 |    "source": [
390 |     "## 6.6 Prepare The Training Data\n",
391 |     "We will prepare the data by doing the following fro both train and Validation data:\n",
392 |     "1. Convert word sequences to id sequences (which will be later used in the embedding layer)\n",
393 |     "2. Generate n-grams from the input sequences\n",
394 |     "3. Pad the generated n_grams to a max-length so that it can be fed to Tensorflow"
395 |    ]
396 |   },
397 |   {
398 |    "cell_type": "code",
399 |    "execution_count": 9,
400 |    "metadata": {
401 |     "collapsed": true
402 |    },
403 |    "outputs": [],
404 |    "source": [
405 |     "from tflearn.data_utils import to_categorical, pad_sequences"
406 |    ]
407 |   },
408 |   {
409 |    "cell_type": "code",
410 |    "execution_count": 10,
411 |    "metadata": {},
412 |    "outputs": [
413 |     {
414 |      "name": "stdout",
415 |      "output_type": "stream",
416 |      "text": [
417 |       "Sample Train IDs\n",
418 |       "[1005, 10001, 17, 10001, 17, 8, 10, 10001, 10001]\n",
419 |       "Sample Validation IDs\n",
420 |       "[137, 3630, 10, 2134, 222, 183, 99, 9, 86]\n"
421 |      ]
422 |     }
423 |    ],
424 |    "source": [
425 |     "# A method to convert a sequence of words into a sequence of IDs given a Word2Idx dictionary\n",
426 |     "def word2idseq(data, word2idx):\n",
427 |     "    id_seq = []\n",
428 |     "    for word in data:\n",
429 |     "        if word in word2idx:\n",
430 |     "            id_seq.append(word2idx[word])\n",
431 |     "        else:\n",
432 |     "            id_seq.append(word2idx['UNKNOWN'])\n",
433 |     "    return id_seq\n",
434 |     "\n",
435 |     "# Thanks to http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/\n",
436 |     "# This method generated n-grams\n",
437 |     "def find_ngrams(input_list, n):\n",
438 |     "    return zip(*[input_list[i:] for i in range(n)])\n",
439 |     "\n",
440 |     "train_id_seqs = word2idseq(train, Word2Idx)\n",
441 |     "validation_id_seqs = word2idseq(validation, Word2Idx)\n",
442 |     "\n",
443 |     "print('Sample Train IDs')\n",
444 |     "print(train_id_seqs[-10:-1])\n",
445 |     "print('Sample Validation IDs')\n",
446 |     "print(validation_id_seqs[-10:-1])"
447 |    ]
448 |   },
449 |   {
450 |    "cell_type": "markdown",
451 |    "metadata": {},
452 |    "source": [
453 |     "### 6.6.1 Generating the Targets from N-Grams\n",
454 |     "This might look a little tricky but it is not. Here we take the sequence of ids and generate n-grams. For the purpose of training, we need sequences of terms as the training examples and the next term in the sequence as the target. Not clear right? Let us look at an example. If our sequence of words were ```['hello', 'my', 'friend']```, then we extract extract n-grams, where n=2-3 (that means we split bigrams and trigrams from the sequence). So the sequence is split into ```['hello', 'my'], ['my', 'friend'] and ['hello', 'my', 'friend']```. Well to train our network this is not enough right? We need some objective/target that we can infer about. So to get a target we split the last term of the n-grams out. In the case of our example, the corresponding targets are ```['friend', 'my', 'friend']```. To show you the bigger picture, the input sequence ```['my', 'friend', 'friend']``` is split into n-grams and then split again to pop out a target term.\n",
455 |     "\n",
456 |     "```python\n",
457 |     "bigram['hello', 'my'] --> input['hello'] --> target['my']\n",
458 |     "bigram['my', 'friend'] --> input['my'] --> target['friend']\n",
459 |     "trigram['hello', 'my', 'friend'] --> input['hello', 'my'] --> target['friend']\n",
460 |     "```"
461 |    ]
462 |   },
463 |   {
464 |    "cell_type": "code",
465 |    "execution_count": 11,
466 |    "metadata": {},
467 |    "outputs": [],
468 |    "source": [
469 |     "import random\n",
470 |     "\n",
471 |     "def prepare_data(data, n_grams=5, batch_size=64, n_epochs=10):\n",
472 |     "    X, Y = [], []\n",
473 |     "    buff_size, start, end = 1000, 0, 1000\n",
474 |     "    n_buffer = 0\n",
475 |     "    epoch = 0\n",
476 |     "    while epoch == n_epochs:\n",
477 |     "        if len(X) >= batch_size:\n",
478 |     "            X_batch = X[:batch_size]\n",
479 |     "            Y_batch = Y[:batch_size]\n",
480 |     "            X_batch = pad_sequences(X_batch, maxlen=n_grams, value=0)\n",
481 |     "            Y_batch = to_categorical(Y_batch, VOCAB_SIZE)\n",
482 |     "            yield (X_batch, Y_batch, epoch)\n",
483 |     "            X = X[batch_size:]\n",
484 |     "            Y = Y[batch_size:]\n",
485 |     "            continue\n",
486 |     "        n = random.randrange(2, n_grams)\n",
487 |     "        if len(data) < n: continue\n",
488 |     "        if end > len(data): end = len(data)\n",
489 |     "        grams = find_ngrams(data[start: end], n) # generates the n-grams\n",
490 |     "        splits = list(zip(*grams)) # split it\n",
491 |     "        X += list(zip(*splits[:len(splits)-1])) # from the inputs\n",
492 |     "        X = [list(x) for x in X] \n",
493 |     "        Y += splits[-1] # form the targets\n",
494 |     "        if start + buff_size > len(data):\n",
495 |     "            start = 0\n",
496 |     "            epoch += 1\n",
497 |     "            end = start + buff_size\n",
498 |     "        else:\n",
499 |     "            start = start + buff_size\n",
500 |     "            end = end + buff_size"
501 |    ]
502 |   },
503 |   {
504 |    "cell_type": "markdown",
505 |    "metadata": {},
506 |    "source": [
507 |     "## 6.7 The Model\n",
508 |     "We now define a Dynamic LSTM Model that will be our Language Model. Restart the kernel and run all cells if it does not work (some Tflearn bug). "
509 |    ]
510 |   },
511 |   {
512 |    "cell_type": "code",
513 |    "execution_count": 12,
514 |    "metadata": {
515 |     "collapsed": true
516 |    },
517 |    "outputs": [],
518 |    "source": [
519 |     "# Hyperparameters\n",
520 |     "LR = 0.0001\n",
521 |     "HIDDEN_DIMS = 256\n",
522 |     "BATCH_SIZE = 32\n",
523 |     "N_EPOCHS=100\n",
524 |     "N_GRAMS = 5\n",
525 |     "N_VALIDATE = 10000"
526 |    ]
527 |   },
528 |   {
529 |    "cell_type": "code",
530 |    "execution_count": 13,
531 |    "metadata": {},
532 |    "outputs": [],
533 |    "source": [
534 |     "train = prepare_data(train_id_seqs, N_GRAMS, BATCH_SIZE, N_EPOCHS)\n",
535 |     "validate = prepare_data(validation_id_seqs, N_GRAMS, N_VALIDATE, N_EPOCHS)"
536 |    ]
537 |   },
538 |   {
539 |    "cell_type": "code",
540 |    "execution_count": 14,
541 |    "metadata": {
542 |     "collapsed": true
543 |    },
544 |    "outputs": [],
545 |    "source": [
546 |     "import tensorflow as tf\n",
547 |     "import tflearn"
548 |    ]
549 |   },
550 |   {
551 |    "cell_type": "code",
552 |    "execution_count": 15,
553 |    "metadata": {},
554 |    "outputs": [
555 |     {
556 |      "name": "stderr",
557 |      "output_type": "stream",
558 |      "text": [
559 |       "/home/dash/venvs/exercise/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n",
560 |       "  \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n"
561 |      ]
562 |     },
563 |     {
564 |      "name": "stdout",
565 |      "output_type": "stream",
566 |      "text": [
567 |       "Training epoch 0\n"
568 |      ]
569 |     },
570 |     {
571 |      "ename": "StopIteration",
572 |      "evalue": "",
573 |      "traceback": [
574 |       "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
575 |       "\u001b[0;31mStopIteration\u001b[0m                             Traceback (most recent call last)",
576 |       "\u001b[0;32m<ipython-input-15-9e984de82d49>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     13\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mepoch\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mN_EPOCHS\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     14\u001b[0m     \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Training epoch {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mepoch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m     \u001b[0;34m(\u001b[0m\u001b[0mX_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval_epoch\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     16\u001b[0m     \u001b[0;32mfor\u001b[0m \u001b[0mbatch\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     17\u001b[0m         model.fit(batch[0], batch[1], validation_set=(X_test, Y_test),\n",
577 |       "\u001b[0;31mStopIteration\u001b[0m: "
578 |      ],
579 |      "output_type": "error"
580 |     }
581 |    ],
582 |    "source": [
583 |     "# Build the model\n",
584 |     "embedding_matrix = tf.constant(w2v, dtype=tf.float32)\n",
585 |     "net = tflearn.input_data([None, N_GRAMS], dtype=tf.int32, name='input')\n",
586 |     "net = tflearn.embedding(net, input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE,\n",
587 |     "                        weights_init=embedding_matrix, trainable=True)\n",
588 |     "net = tflearn.lstm(net, HIDDEN_DIMS, dropout=0.8, dynamic=True)\n",
589 |     "net = tflearn.fully_connected(net, VOCAB_SIZE, activation='softmax')\n",
590 |     "net = tflearn.regression(net, optimizer='adam', learning_rate=LR,\n",
591 |     "                         loss='categorical_crossentropy', name='target')\n",
592 |     "model = tflearn.DNN(net, checkpoint_path=\"./chkpnts\", best_checkpoint_path=\"./best_chkpnts\",\n",
593 |     "                    tensorboard_dir='./chkpnts', best_val_accuracy=0.70)\n",
594 |     "\n",
595 |     "for epoch in range(N_EPOCHS):\n",
596 |     "    print('Training epoch {}'.format(epoch))\n",
597 |     "    (X_test, Y_test, val_epoch) = next(validate)\n",
598 |     "    for batch in train:\n",
599 |     "        model.fit(batch[0], batch[1], validation_set=(X_test, Y_test),\n",
600 |     "                  show_metric=True, n_epoch=1)"
601 |    ]
602 |   },
603 |   {
604 |    "cell_type": "markdown",
605 |    "metadata": {
606 |     "collapsed": true
607 |    },
608 |    "source": [
609 |     "# 7. Inference\n",
610 |     "The story does not get over after you train the model. We need to understand how to make inference using this trained model. Well honestly, this model is not even close to trained. We used just one article from Wikipedia to train this Language Model so we cannot expect it to be good. The idea was to realise the steps required actually build a Language Model from scratch. Now let us look at how to make an inference from the model that we just trained.\n",
611 |     "\n",
612 |     "## 7.1 Log Probability of a Sequence \n",
613 |     "Given a new sequence of terms, we would like to know the probability of the occurance of this sequence in the Language. We make use of our trained model (which we assume to be a represenattion of the Langauge) and calculate the n-gram probabilities and aggregate them to find a final probability score."
614 |    ]
615 |   },
616 |   {
617 |    "cell_type": "code",
618 |    "execution_count": null,
619 |    "metadata": {
620 |     "collapsed": true
621 |    },
622 |    "outputs": [],
623 |    "source": [
624 |     "def get_sequence_prob(in_string, n, model):\n",
625 |     "    in_tokens, in_lengths = preprocess_corpus(in_string)\n",
626 |     "    in_ids = word2idseq(in_tokens, Word2Idx)\n",
627 |     "    X, Y_, Y = prepare_data(in_ids, n)\n",
628 |     "    preds = model.predict(X)\n",
629 |     "    log_prob = 0.0\n",
630 |     "    for y_i, y in enumerate(Y):\n",
631 |     "        log_prob += np.log(preds[y_i, y])\n",
632 |     "\n",
633 |     "    log_prob = log_prob/len(Y)\n",
634 |     "    return log_prob\n",
635 |     "\n",
636 |     "in_strings = ['hello I am science', 'blah blah blah', 'deep learning', 'answer',\n",
637 |     "              'Boltzman', 'from the previous layer as input', 'ahcblheb eDHLHW SLcA']\n",
638 |     "for in_string in in_strings:\n",
639 |     "    log_prob = get_sequence_prob(in_string, 5, model)\n",
640 |     "    print(log_prob)"
641 |    ]
642 |   },
643 |   {
644 |    "cell_type": "markdown",
645 |    "metadata": {},
646 |    "source": [
647 |     "To get the probability of the sequence, we take the n-grams of the sequence and we infer the probability of the next term to occur, take it's log and sum it with the log probabilities of all the other n-grams. The final score is the average over all. There can be other ways to look at it too. You can notmalize by n too, where n is the number of grans you considered. "
648 |    ]
649 |   },
650 |   {
651 |    "cell_type": "markdown",
652 |    "metadata": {},
653 |    "source": [
654 |     "# 7.2 Generating a Sequence\n",
655 |     "Since we trained this Language model to predict the next term given the previous 'n' terms, we can sample sequences out of this model too. We start with a random term and feed it to the Model. The Model predicts the next term and then we concat it with our previous term and feed it again to the Model. In this way we can generate arbitarily long sequences from the Model. Let us see how this naive model generates sequences,"
656 |    ]
657 |   },
658 |   {
659 |    "cell_type": "code",
660 |    "execution_count": null,
661 |    "metadata": {
662 |     "collapsed": true
663 |    },
664 |    "outputs": [],
665 |    "source": [
666 |     "def generate_sequences(term, word2idx, idx2word, seq_len, n_grams, model):\n",
667 |     "    if term not in word2idx:\n",
668 |     "        idseq = [[word2idx['UNKNOWN']]]\n",
669 |     "    else:\n",
670 |     "        idseq = [[word2idx[term]]]\n",
671 |     "    for i in range(seq_len-1):\n",
672 |     "        #print(idseq)\n",
673 |     "        padded_idseq = pad_sequences(idseq, maxlen=n_grams, value=0)\n",
674 |     "        next_label = model.predict_label(padded_idseq)\n",
675 |     "        print(next_label)\n",
676 |     "        idseq[0].append(next_label[0][0])\n",
677 |     "    generated_str = []\n",
678 |     "    for id in idseq[0]:\n",
679 |     "        generated_str.append(idx2word[id])\n",
680 |     "    return ' '.join(generated_str)\n",
681 |     "    \n",
682 |     "term = 'SEENCE_BEGIN'\n",
683 |     "seq = generate_sequences(term, Word2Idx, Idx2Word, 10, 5, model)\n",
684 |     "print(seq)"
685 |    ]
686 |   },
687 |   {
688 |    "cell_type": "code",
689 |    "execution_count": null,
690 |    "metadata": {
691 |     "collapsed": true
692 |    },
693 |    "outputs": [],
694 |    "source": [
695 |     ""
696 |    ]
697 |   },
698 |   {
699 |    "cell_type": "code",
700 |    "execution_count": null,
701 |    "metadata": {
702 |     "collapsed": true
703 |    },
704 |    "outputs": [],
705 |    "source": [
706 |     ""
707 |    ]
708 |   }
709 |  ],
710 |  "metadata": {
711 |   "kernelspec": {
712 |    "display_name": "Python 3",
713 |    "language": "python",
714 |    "name": "python3"
715 |   },
716 |   "language_info": {
717 |    "codemirror_mode": {
718 |     "name": "ipython",
719 |     "version": 3.0
720 |    },
721 |    "file_extension": ".py",
722 |    "mimetype": "text/x-python",
723 |    "name": "python",
724 |    "nbconvert_exporter": "python",
725 |    "pygments_lexer": "ipython3",
726 |    "version": "3.4.3"
727 |   }
728 |  },
729 |  "nbformat": 4,
730 |  "nbformat_minor": 0
731 | }


--------------------------------------------------------------------------------
/Neural+Language+Model.py:
--------------------------------------------------------------------------------
  1 | 
  2 | # coding: utf-8
  3 | 
  4 | # # 1. Neural Language Model
  5 | # If you are here that means you wish to cut the crap and understand how to train your own Neural Language Model. If you are a regular user of frameworks like Keras, Tflearn, etc., then you know how easy it has become these days to build, train and deploy Neural Network Models. If not then you will probably by the end of this post.
  6 | # 
  7 | # # 2. Prerequisite
  8 | # 1. [Python](https://www.tutorialspoint.com/python/): I will be using Python 3.5 for this tutorial
  9 | # 
 10 | # 2. [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/): If you dont know what LSTMs are, then this is a must read.
 11 | # 
 12 | # 3. [Basics of Machine Learning](https://www.youtube.com/watch?v=2uiulzZxmGg): If you want to dive into Machine Learning/Deep Learning, then I strongly recommend the first 4 lectures from [Stanford's CS231]() by Andrej Karpathy.
 13 | # 
 14 | # 4. [Language Model](https://en.wikipedia.org/wiki/Language_model): If you want to have a basic understanding of Language Models.
 15 | # 
 16 | # # 3. Frameworks
 17 | # 1. [Tflearn](http://tflearn.org/installation/) 0.3.2
 18 | # 2. [Spacy](https://spacy.io/) 1.9.0
 19 | # 3. [Tensorflow](https://spacy.io/) 1.0.1
 20 | # 
 21 | # ### Note
 22 | # you can take this post as a hands-on exercise on "How to build your own Neural Language Model" from scratch. If you have a ready to use virtualenv with all the dependencies installed then you can skip Section 4 and jump to Section 5. 
 23 | 
 24 | # # 4. Install Dependencies
 25 | # We will install everythin in a virtual environment and I would suggest you to run this Jupyter Notebook in the same virtualenv. I have also provided a ```requirements.txt``` file with the [repository](https://github.com/dashayushman/neural-language-model) to make things easier.
 26 | # 
 27 | # ### 4.1 Virtual Environment
 28 | # You can follow [this](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for a fast guide to Virtual Environments.
 29 | # 
 30 | # ```sh
 31 | # pip install virtualenv
 32 | # ```
 33 | # 
 34 | # ### 4.2 Tflearn
 35 | # Follow [this](http://tflearn.org/installation/) and install Tflearn. Make sure to have the versions correct in case you want to avoid weird errors. 
 36 | # 
 37 | # ```sh
 38 | # pip install -Iv tflearn==0.3.2
 39 | # ```
 40 | # 
 41 | # ### 4.3 Tensorflow
 42 | # Install Tensorflow by following the instructions [here](https://www.tensorflow.org/install/). To make sure of installing the right version, use this
 43 | # 
 44 | # ```sh
 45 | # pip install -Iv tensorflow-gpu==1.0.1
 46 | # ```
 47 | # Note that this is the GPU version of Tensorflow. You can even install the CPU version for this tutorial, but I would strongly recommend the GPU version if you intend to intend to scale it to use in the real world.
 48 | # 
 49 | # ### 4.4 Spacy
 50 | # Install Spacy by following the instructions [here](https://spacy.io/docs/usage/). For the right version use,
 51 | # 
 52 | # ```sh
 53 | # pip install -Iv spacy==1.9.0
 54 | # ```
 55 | # 
 56 | # ### 4.5 Others
 57 | # ```sh
 58 | # pip install numpy
 59 | # ```
 60 | 
 61 | # # 5. Get the Repo
 62 | # clone the Neural Language Model GitHub repository onto your computer and start the Jupyter Notebook server.
 63 | # 
 64 | # ```sh
 65 | # git clone https://github.com/dashayushman/neural-language-model.git
 66 | # cd neural-language-model
 67 | # jupyter notebook
 68 | # ```
 69 | # 
 70 | # Open the notebook names **Neural Language Model** and you can start off.
 71 | 
 72 | # # 6. Neural Language Model
 73 | # We will start building our own Language model using an LSTM Network. To do so we will need a corpus. For the purpose of this tutorial, let us use a toy corpus, which is a text file called ```corpus.txt``` that 0I downloaded from Wikipedia. I will use this to demponstrate how to build your own Neural Language Model, and you can use the same knowledge to extend the model further for a more realistic scenario (I will give pointers to do so too).
 74 | # 
 75 | # ## 6.1 Loading The Corpus
 76 | # In this section you will load the ```corpus.txt``` and do minimal preprocessing.
 77 | 
 78 | # In[1]:
 79 | 
 80 | 
 81 | import re
 82 | 
 83 | with open('corpus.txt', 'r') as cf:
 84 |     corpus = []
 85 |     for line in cf: # loops over all the lines in the corpus
 86 |         line = line.strip() # strips off \n \r from the ends 
 87 |         if line: # Take only non empty lines
 88 |             line = re.sub(r'\([^)]*\)', '', line) # Regular Expression to remove text in between brackets
 89 |             line = re.sub(' +',' ', line) # Removes consecutive spaces
 90 |             # add more pre-processing steps
 91 |             corpus.append(line)
 92 | print("\n".join(corpus[:5])) # Shows the first 5 lines of the corpus
 93 | 
 94 | 
 95 | # As you can see that this small piece of code loads the toy text corpus, extracts lines from it, ignores empty lines, and removes text in between brackets. Note that in reality you will not be able to load the entire corpus into memory. You will need to write a [generator](https://wiki.python.org/moin/Generators) to yield text lines from the corpus, or use some advanced features provided by the Deep Learning frameworks like [Tensorflow's Input Pipelines](https://www.tensorflow.org/programmers_guide/reading_data). 
 96 | # 
 97 | # ## 6.2 Tokenizing the Corpus
 98 | # In this section we will see how to tokenize the text lines that we extracted and then create a **Vocabulary**.
 99 | 
100 | # In[2]:
101 | 
102 | 
103 | # Load Spacy
104 | import spacy
105 | import numpy as np
106 | nlp = spacy.load('en_core_web_sm')
107 | 
108 | 
109 | # In[3]:
110 | 
111 | 
112 | def preprocess_corpus(corpus):
113 |     corpus_tokens = []
114 |     sentence_lengths = []
115 |     for line in corpus:
116 |         doc = nlp(line) # Parse each line in the corpus
117 |         for sent in doc.sents: # Loop over all the sentences in the line
118 |             corpus_tokens.append('SEQUENCE_BEGIN')
119 |             s_len = 1
120 |             for tok in sent: # Loop over all the words in a sentence
121 |                 if tok.text.strip() != '' and tok.ent_type_ != '': # If the token is a Named Entity then do not lowercase it 
122 |                     corpus_tokens.append(tok.text)
123 |                 else:
124 |                     corpus_tokens.append(tok.text.lower())
125 |                 s_len += 1
126 |             corpus_tokens.append('SEQUENCE_END')
127 |             sentence_lengths.append(s_len+1)
128 |     return corpus_tokens, sentence_lengths
129 | 
130 | corpus_tokens, sentence_lengths = preprocess_corpus(corpus)
131 | print(corpus_tokens[:30]) # Prints the first 30 tokens
132 | mean_sentence_length = np.mean(sentence_lengths)
133 | deviation_sentence_length = np.std(sentence_lengths)
134 | max_sentence_length = np.max(sentence_lengths)
135 | print('Mean Sentence Length: {}\nSentence Length Standard Deviation: {}\n'
136 |       'Max Sentence Length: {}'.format(mean_sentence_length, deviation_sentence_length, max_sentence_length))
137 | 
138 | 
139 | # Notice that we did not lowercase the [Named Entities(NEs)](https://en.wikipedia.org/wiki/Named-entity_recognition). This is totally your choice. It part of a normalization step and I believe it is a good idea to let the model learn the Named Entities in the corpus. But do not blindly consider any library for NEs. I chose Spacy as it is very simple to use, fast and efficient. Note that I am using the [**en_core_web_sm**](https://spacy.io/docs/usage/models) model of Spacy, which is very small and good enough for this tutorial. You would probably want to choose your own NE recognizer.
140 | # 
141 | # Other Normalization steps include [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) which I will not implement because **(1)** I want my Language Model to learn the various forms of a word and their occurances by itself; **(2)** In a real world scenario you will train your Model with a huge corpus with Millions of text lines, and you can assume that the corpus covers the most commonly used terms in Language. Hence, no extra normalization is required. 
142 | # 
143 | # ### 6.2.1 SEQUENCE_BEGIN and SEQUENCE_END
144 | # Along with the naturally occurring terms in the corpus, we will add two new terms called the *SEQUENCE_BEGIN* and **SEQUENCE_END** term. These terms mark the beginning and end of a sentence. We do this because we want our model to learn word occurring at the beginning and at the end of sentences. Note that we are dependent on Spacy's Tokenization algorithm here. You are free to explore other tokenizers and use whichever you find is best.
145 | # 
146 | # ## 6.3 Create a Vocabulary
147 | # After we have minimally preprocessed the corpus and extracted sequence of terms from it, we will create a vocabulary for our Language Model. This means that we will create two python dictionaries,
148 | # 1. **Word2Idx** : This dictionary has all the unique words(terms) as keys with a corresponding unique ID as values
149 | # 2. **Idx2Word** : This is the reverse of Word2Idx. It has the unique IDs as keys and their corresponding words(terms) as values
150 | 
151 | # In[4]:
152 | 
153 | 
154 | vocab = list(set(corpus_tokens)) # This works well for a very small corpus
155 | #print(vocab)
156 | 
157 | 
158 | # **Alternatively**, if your corpus is huge, you would probably want to iterate through it entirely and generate term frequencies. Once you have the term frequencies, it is better to select the most commonly occuring terms in the vocabulary (as it covers most of the Natural Language).
159 | 
160 | # In[5]:
161 | 
162 | 
163 | import collections
164 | 
165 | word_counter = collections.Counter()
166 | for term in corpus_tokens:
167 |     word_counter.update({term: 1})
168 | vocab = word_counter.most_common(10000) # 10000 Most common terms
169 | print('Vocab Size: {}'.format(len(vocab))) 
170 | print(word_counter.most_common(100)) # just to show the top 100 terms
171 | 
172 | 
173 | # This was we make sure to consider the ***top K***(in this case 100) most commonly used terms in the Language (assuming that the corpus represents the Language or domain specific language. For e.g., medical corpora, e-commerce corpora, etc.). In Neural Machine Translation Models, usually a vocabulary size of 10,000 to 100,000 is used. But remember, it all depends on your task, corpus size, and the Language itself. 
174 | 
175 | # ### 6.3.1 UNKNOWN and PAD
176 | # Along with the vocabulary terms that we generated, we need two more special terms:
177 | # 1. **UNKNOWN**: This term is used for all the words that the model will observe apart from the vocabulary terms.
178 | # 2. **PAD**: The pad term is used to pad the sequences to a maximum length. This is required for feeding variable length sequences into the Network (we use DynamicRnn to handle variable length sequences. So, padding makes no difference. It is just required for feeding the data to Tensorflow)
179 | # 
180 | # This is required as during inference time there will be many unknown words (words that the model has never seen). It is better to add an **UNKNOWN** token in the vocabulary so that the model will learn to handle terms that are unknown to the Model.
181 | 
182 | # In[6]:
183 | 
184 | 
185 | vocab.append(('UNKNOWN', 1))
186 | Idx = range(1, len(vocab)+1)
187 | vocab = [t[0] for t in vocab]
188 | 
189 | Word2Idx = dict(zip(vocab, Idx))
190 | Idx2Word = dict(zip(Idx, vocab))
191 | 
192 | Word2Idx['PAD'] = 0
193 | Idx2Word[0] = 'PAD'
194 | VOCAB_SIZE = len(Word2Idx)
195 | print('Word2Idx Size: {}'.format(len(Word2Idx)))
196 | print('Idx2Word Size: {}'.format(len(Idx2Word)))
197 | 
198 | 
199 | # ## 6.4 Preload Word Vectors
200 | # Since you are here, I am almost sure that you are familiar with or have atleast heard of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html). Read about it if you don't know. 
201 | # 
202 | # Spacy provides a set of pretrained word vectors. We will make use of these to initialize our embedding layer (details in the following section). 
203 | 
204 | # In[7]:
205 | 
206 | 
207 | w2v = np.random.rand(len(Word2Idx), 300) # We use 300 because Spacy provides us with vectors of size 300
208 | 
209 | for w_i, key in enumerate(Word2Idx):
210 |     token = nlp(key)
211 |     if token.has_vector:
212 |         #print(token.text, Word2Idx[key])
213 |         w2v[Word2Idx[key]:] = token.vector
214 | EMBEDDING_SIZE = w2v.shape[-1]
215 | print('Shape of w2v: {}'.format(w2v.shape))
216 | print('Some Vectors')
217 | print(w2v[0][:10], Idx2Word[0])
218 | print(w2v[80][:10], Idx2Word[80])
219 | 
220 | 
221 | # ## 6.5 Splitting the Data
222 | # We are almost there. Have patience :) We need to split the data into Training and Validation set before we proceed any further. So,
223 | 
224 | # In[8]:
225 | 
226 | 
227 | train_val_split = int(len(corpus_tokens) * 0.8) # We use 80% of the data for Training and 20% for validating
228 | train = corpus_tokens[:train_val_split]
229 | validation = corpus_tokens[train_val_split:-1]
230 | 
231 | print('Train Size: {}\nValidation Size: {}'.format(len(train), len(validation)))
232 | 
233 | 
234 | # ## 6.6 Prepare The Training Data
235 | # We will prepare the data by doing the following fro both train and Validation data:
236 | # 1. Convert word sequences to id sequences (which will be later used in the embedding layer)
237 | # 2. Generate n-grams from the input sequences
238 | # 3. Pad the generated n_grams to a max-length so that it can be fed to Tensorflow
239 | 
240 | # In[9]:
241 | 
242 | 
243 | from tflearn.data_utils import to_categorical, pad_sequences
244 | 
245 | 
246 | # In[10]:
247 | 
248 | 
249 | # A method to convert a sequence of words into a sequence of IDs given a Word2Idx dictionary
250 | def word2idseq(data, word2idx):
251 |     id_seq = []
252 |     for word in data:
253 |         if word in word2idx:
254 |             id_seq.append(word2idx[word])
255 |         else:
256 |             id_seq.append(word2idx['UNKNOWN'])
257 |     return id_seq
258 | 
259 | # Thanks to http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
260 | # This method generated n-grams
261 | def find_ngrams(input_list, n):
262 |     return zip(*[input_list[i:] for i in range(n)])
263 | 
264 | train_id_seqs = word2idseq(train, Word2Idx)
265 | validation_id_seqs = word2idseq(validation, Word2Idx)
266 | 
267 | print('Sample Train IDs')
268 | print(train_id_seqs[-10:-1])
269 | print('Sample Validation IDs')
270 | print(validation_id_seqs[-10:-1])
271 | 
272 | 
273 | # ### 6.6.1 Generating the Targets from N-Grams
274 | # This might look a little tricky but it is not. Here we take the sequence of ids and generate n-grams. For the purpose of training, we need sequences of terms as the training examples and the next term in the sequence as the target. Not clear right? Let us look at an example. If our sequence of words were ```['hello', 'my', 'friend']```, then we extract extract n-grams, where n=2-3 (that means we split bigrams and trigrams from the sequence). So the sequence is split into ```['hello', 'my'], ['my', 'friend'] and ['hello', 'my', 'friend']```. Well to train our network this is not enough right? We need some objective/target that we can infer about. So to get a target we split the last term of the n-grams out. In the case of our example, the corresponding targets are ```['friend', 'my', 'friend']```. To show you the bigger picture, the input sequence ```['my', 'friend', 'friend']``` is split into n-grams and then split again to pop out a target term.
275 | # 
276 | # ```python
277 | # bigram['hello', 'my'] --> input['hello'] --> target['my']
278 | # bigram['my', 'friend'] --> input['my'] --> target['friend']
279 | # trigram['hello', 'my', 'friend'] --> input['hello', 'my'] --> target['friend']
280 | # ```
281 | 
282 | # In[11]:
283 | 
284 | 
285 | import random
286 | 
287 | def prepare_data(data, n_grams=5, batch_size=64, n_epochs=10):
288 |     X, Y = [], []
289 |     buff_size, start, end = 1000, 0, 1000
290 |     n_buffer = 0
291 |     epoch = 0
292 |     while epoch < n_epochs:
293 |         if len(X) >= batch_size:
294 |             X_batch = X[:batch_size]
295 |             Y_batch = Y[:batch_size]
296 |             X_batch = pad_sequences(X_batch, maxlen=n_grams, value=0)
297 |             Y_batch = to_categorical(Y_batch, VOCAB_SIZE)
298 |             yield (X_batch, Y_batch, epoch)
299 |             X = X[batch_size:]
300 |             Y = Y[batch_size:]
301 |             continue
302 |         n = random.randrange(2, n_grams)
303 |         if len(data) < n: continue
304 |         if end > len(data): end = len(data)
305 |         grams = find_ngrams(data[start: end], n) # generates the n-grams
306 |         splits = list(zip(*grams)) # split it
307 |         X += list(zip(*splits[:len(splits)-1])) # from the inputs
308 |         X = [list(x) for x in X] 
309 |         Y += splits[-1] # form the targets
310 |         if start + buff_size > len(data):
311 |             start = 0
312 |             epoch += 1
313 |             end = start + buff_size
314 |         else:
315 |             start = start + buff_size
316 |             end = end + buff_size
317 | 
318 | 
319 | # ## 6.7 The Model
320 | # We now define a Dynamic LSTM Model that will be our Language Model. Restart the kernel and run all cells if it does not work (some Tflearn bug). 
321 | 
322 | # In[12]:
323 | 
324 | 
325 | # Hyperparameters
326 | LR = 0.0001
327 | HIDDEN_DIMS = 256
328 | N_LAYERS = 3
329 | BATCH_SIZE = 10000
330 | N_EPOCHS=100
331 | N_GRAMS = 5
332 | N_VALIDATE = 3000
333 | 
334 | 
335 | # In[13]:
336 | 
337 | 
338 | train = prepare_data(train_id_seqs, N_GRAMS, BATCH_SIZE, N_EPOCHS)
339 | validate = prepare_data(validation_id_seqs, N_GRAMS, N_VALIDATE, N_EPOCHS)
340 | 
341 | 
342 | # In[14]:
343 | 
344 | 
345 | import tensorflow as tf
346 | import tflearn
347 | 
348 | 
349 | # In[15]:
350 | 
351 | 
352 | # Build the model
353 | embedding_matrix = tf.constant(w2v, dtype=tf.float32)
354 | net = tflearn.input_data([None, N_GRAMS], dtype=tf.int32, name='input')
355 | net = tflearn.embedding(net, input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE,
356 |                         weights_init=embedding_matrix, trainable=True)
357 | net = tflearn.lstm(net, HIDDEN_DIMS, dropout=0.8, dynamic=True)
358 | net = tflearn.fully_connected(net, VOCAB_SIZE, activation='softmax')
359 | net = tflearn.regression(net, optimizer='adam', learning_rate=LR,
360 |                          loss='categorical_crossentropy', name='target')
361 | model = tflearn.DNN(net, best_checkpoint_path="./best_chkpnts/",
362 |                     max_checkpoints= 100, tensorboard_dir='./chkpnts/',
363 |                     best_val_accuracy=0.70, tensorboard_verbose=0)
364 | 
365 | prev_epoch = -1
366 | n_batch = 1
367 | for batch in train:
368 |     if batch[2] != prev_epoch:
369 |         n_batch = 1
370 |         prev_epoch = batch[2]
371 |         print('Training Epoch {}'.format(batch[2]))
372 |         (X_test, Y_test, val_epoch) = next(validate)
373 |     print('Fitting Batch: {}'.format(n_batch))
374 |     model.fit(batch[0], batch[1], validation_set=(X_test, Y_test),
375 |               show_metric=True, n_epoch=1)
376 |     n_batch += 1
377 | 
378 | 
379 | # # 7. Inference
380 | # The story does not get over after you train the model. We need to understand how to make inference using this trained model. Well honestly, this model is not even close to trained. We used just one article from Wikipedia to train this Language Model so we cannot expect it to be good. The idea was to realise the steps required actually build a Language Model from scratch. Now let us look at how to make an inference from the model that we just trained.
381 | # 
382 | # ## 7.1 Log Probability of a Sequence 
383 | # Given a new sequence of terms, we would like to know the probability of the occurance of this sequence in the Language. We make use of our trained model (which we assume to be a represenattion of the Langauge) and calculate the n-gram probabilities and aggregate them to find a final probability score.
384 | 
385 | # In[ ]:
386 | 
387 | 
388 | def get_sequence_prob(in_string, n, model):
389 |     in_tokens, in_lengths = preprocess_corpus(in_string)
390 |     in_ids = word2idseq(in_tokens, Word2Idx)
391 |     X, Y_, Y = prepare_data(in_ids, n)
392 |     preds = model.predict(X)
393 |     log_prob = 0.0
394 |     for y_i, y in enumerate(Y):
395 |         log_prob += np.log(preds[y_i, y])
396 | 
397 |     log_prob = log_prob/len(Y)
398 |     return log_prob
399 | 
400 | in_strings = ['hello I am science', 'blah blah blah', 'deep learning', 'answer',
401 |               'Boltzman', 'from the previous layer as input', 'ahcblheb eDHLHW SLcA']
402 | for in_string in in_strings:
403 |     log_prob = get_sequence_prob(in_string, 5, model)
404 |     print(log_prob)
405 | 
406 | 
407 | # To get the probability of the sequence, we take the n-grams of the sequence and we infer the probability of the next term to occur, take it's log and sum it with the log probabilities of all the other n-grams. The final score is the average over all. There can be other ways to look at it too. You can notmalize by n too, where n is the number of grans you considered. 
408 | 
409 | # # 7.2 Generating a Sequence
410 | # Since we trained this Language model to predict the next term given the previous 'n' terms, we can sample sequences out of this model too. We start with a random term and feed it to the Model. The Model predicts the next term and then we concat it with our previous term and feed it again to the Model. In this way we can generate arbitarily long sequences from the Model. Let us see how this naive model generates sequences,
411 | 
412 | # In[ ]:
413 | 
414 | 
415 | def generate_sequences(term, word2idx, idx2word, seq_len, n_grams, model):
416 |     if term not in word2idx:
417 |         idseq = [[word2idx['UNKNOWN']]]
418 |     else:
419 |         idseq = [[word2idx[term]]]
420 |     for i in range(seq_len-1):
421 |         #print(idseq)
422 |         padded_idseq = pad_sequences(idseq, maxlen=n_grams, value=0)
423 |         next_label = model.predict_label(padded_idseq)
424 |         print(next_label)
425 |         idseq[0].append(next_label[0][0])
426 |     generated_str = []
427 |     for id in idseq[0]:
428 |         generated_str.append(idx2word[id])
429 |     return ' '.join(generated_str)
430 |     
431 | term = 'SEENCE_BEGIN'
432 | seq = generate_sequences(term, Word2Idx, Idx2Word, 10, 5, model)
433 | print(seq)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # neural-language-model
2 | A tutorial on how to build your own Neural Language Model
3 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | backports-abc==0.5
 2 | backports.weakref==1.0rc1
 3 | bleach==1.5.0
 4 | certifi==2017.7.27.1
 5 | chardet==3.0.4
 6 | cymem==1.31.2
 7 | cytoolz==0.8.2
 8 | decorator==4.1.2
 9 | dill==0.2.7.1
10 | en-core-web-sm==1.2.0
11 | entrypoints==0.2.3
12 | ftfy==4.4.3
13 | h5py==2.7.0
14 | html5lib==0.9999999
15 | idna==2.6
16 | ipykernel==4.6.1
17 | ipython==6.1.0
18 | ipython-genutils==0.2.0
19 | ipywidgets==7.0.0
20 | jedi==0.10.2
21 | Jinja2==2.9.6
22 | jsonschema==2.6.0
23 | jupyter==1.0.0
24 | jupyter-client==5.1.0
25 | jupyter-console==5.1.0
26 | jupyter-core==4.3.0
27 | Keras==2.0.6
28 | Markdown==2.6.9
29 | MarkupSafe==1.0
30 | mistune==0.7.4
31 | murmurhash==0.26.4
32 | nbconvert==5.2.1
33 | nbformat==4.3.0
34 | notebook==5.0.0
35 | numpy==1.13.1
36 | olefile==0.44
37 | pandocfilters==1.4.2
38 | pathlib==1.0.1
39 | pexpect==4.2.1
40 | pickleshare==0.7.4
41 | Pillow==4.2.1
42 | plac==0.9.6
43 | preshed==1.0.0
44 | prompt-toolkit==1.0.15
45 | protobuf==3.4.0
46 | ptyprocess==0.5.2
47 | Pygments==2.2.0
48 | python-dateutil==2.6.1
49 | PyYAML==3.12
50 | pyzmq==16.0.2
51 | qtconsole==4.3.1
52 | regex==2017.7.28
53 | requests==2.18.4
54 | scipy==0.19.1
55 | simplegeneric==0.8.1
56 | six==1.10.0
57 | spacy==1.9.0
58 | tensorflow==1.3.0
59 | tensorflow-gpu==1.0.1
60 | tensorflow-tensorboard==0.1.4
61 | termcolor==1.1.0
62 | terminado==0.6
63 | testpath==0.3.1
64 | tflearn==0.3.2
65 | Theano==0.9.0
66 | thinc==6.5.2
67 | toolz==0.8.2
68 | tornado==4.5.1
69 | tqdm==4.15.0
70 | traitlets==4.3.2
71 | typing==3.6.2
72 | ujson==1.35
73 | urllib3==1.22
74 | wcwidth==0.1.7
75 | Werkzeug==0.12.2
76 | widgetsnbextension==3.0.0
77 | wrapt==1.10.11
78 | 


--------------------------------------------------------------------------------