├── .gitignore ├── Neural Language Model.ipynb ├── Neural+Language+Model.py ├── README.md ├── corpus.txt └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | .ipynb* 3 | chkpnts/* 4 | best_chkpnts/* -------------------------------------------------------------------------------- /Neural Language Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1. Neural Language Model\n", 8 | "If you are here, that means you wish to cut the crap and understand how to train your own Neural Language Model. If you are a regular user of frameworks like Keras, Tflearn, etc., then you know how easy it has become these days to build, train and deploy Neural Network Models. If not, then you probably will be by the end of this post.\n", 9 | "\n", 10 | "# 2. Prerequisite\n", 11 | "1. [Python](https://www.tutorialspoint.com/python/): I will be using Python 3.5 for this tutorial\n", 12 | "\n", 13 | "2. [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/): If you don't know what LSTMs are, then this is a must read.\n", 14 | "\n", 15 | "3. [Basics of Machine Learning](https://www.youtube.com/watch?v=2uiulzZxmGg): If you want to dive into Machine Learning/Deep Learning, then I strongly recommend the first 4 lectures from [Stanford's CS231n](http://cs231n.stanford.edu/) by Andrej Karpathy.\n", 16 | "\n", 17 | "4. [Language Model](https://en.wikipedia.org/wiki/Language_model): If you want to have a basic understanding of Language Models.\n", 18 | "\n", 19 | "# 3. Frameworks\n", 20 | "1. [Tflearn](http://tflearn.org/installation/) 0.3.2\n", 21 | "2. [Spacy](https://spacy.io/) 1.9.0\n", 22 | "3. [Tensorflow](https://www.tensorflow.org/) 1.0.1\n", 23 | "\n", 24 | "### Note\n", 25 | "You can take this post as a hands-on exercise on \"How to build your own Neural Language Model\" from scratch. If you have a ready-to-use virtualenv with all the dependencies installed, then you can skip Section 4 and jump to Section 5. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# 4. Install Dependencies\n", 33 | "We will install everything in a virtual environment, and I suggest you run this Jupyter Notebook in the same virtualenv. I have also provided a ```requirements.txt``` file with the [repository](https://github.com/dashayushman/neural-language-model) to make things easier.\n", 34 | "\n", 35 | "### 4.1 Virtual Environment\n", 36 | "You can follow [this](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for a fast guide to Virtual Environments.\n", 37 | "\n", 38 | "```sh\n", 39 | "pip install virtualenv\n", 40 | "```\n", 41 | "\n", 42 | "### 4.2 Tflearn\n", 43 | "Follow [this](http://tflearn.org/installation/) and install Tflearn. Make sure you have the correct versions if you want to avoid weird errors. \n", 44 | "\n", 45 | "```sh\n", 46 | "pip install -Iv tflearn==0.3.2\n", 47 | "```\n", 48 | "\n", 49 | "### 4.3 Tensorflow\n", 50 | "Install Tensorflow by following the instructions [here](https://www.tensorflow.org/install/). To make sure you install the right version, use this:\n", 51 | "\n", 52 | "```sh\n", 53 | "pip install -Iv tensorflow-gpu==1.0.1\n", 54 | "```\n", 55 | "Note that this is the GPU version of Tensorflow. You can even install the CPU version for this tutorial, but I would strongly recommend the GPU version if you intend to scale it for real-world use.\n",
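Before moving on, it is worth a quick check that the pinned versions actually landed in the virtualenv; mismatched versions are the most common source of the "weird errors" mentioned above. A minimal sketch (it assumes you installed the GPU build under the package name `tensorflow-gpu`; change that key to `tensorflow` if you went with the CPU build):

```python
# Quick sanity check of the installed versions (run inside the same virtualenv).
# pkg_resources ships with setuptools, so nothing extra needs to be installed.
import pkg_resources

expected = {'tensorflow-gpu': '1.0.1', 'tflearn': '0.3.2', 'spacy': '1.9.0'}
for package, version in expected.items():
    try:
        installed = pkg_resources.get_distribution(package).version
        status = 'OK' if installed == version else 'expected {}'.format(version)
        print('{}: {} ({})'.format(package, installed, status))
    except pkg_resources.DistributionNotFound:
        print('{}: not installed'.format(package))
```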
56 | "\n", 57 | "### 4.4 Spacy\n", 58 | "Install Spacy by following the instructions [here](https://spacy.io/docs/usage/). For the right version, use:\n", 59 | "\n", 60 | "```sh\n", 61 | "pip install -Iv spacy==1.9.0\n", 62 | "```\n", 63 | "\n", 64 | "### 4.5 Others\n", 65 | "```sh\n", 66 | "pip install numpy\n", 67 | "```" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "# 5. Get the Repo\n", 75 | "Clone the Neural Language Model GitHub repository onto your computer and start the Jupyter Notebook server.\n", 76 | "\n", 77 | "```sh\n", 78 | "git clone https://github.com/dashayushman/neural-language-model.git\n", 79 | "cd neural-language-model\n", 80 | "jupyter notebook\n", 81 | "```\n", 82 | "\n", 83 | "Open the notebook named **Neural Language Model** and you can start off." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "# 6. Neural Language Model\n", 91 | "We will start building our own Language Model using an LSTM Network. To do so, we will need a corpus. For the purpose of this tutorial, let us use a toy corpus, which is a text file called ```corpus.txt``` that I downloaded from Wikipedia. I will use this to demonstrate how to build your own Neural Language Model, and you can use the same knowledge to extend the model further for a more realistic scenario (I will give pointers to do so too).\n", 92 | "\n", 93 | "## 6.1 Loading The Corpus\n", 94 | "In this section you will load the ```corpus.txt``` and do minimal preprocessing." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 1, 100 | "metadata": { 101 | "scrolled": true 102 | }, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised.\n", 109 | "Some representations are loosely based on interpretation of information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain. Research attempts to create efficient systems to learn these representations from large-scale, unlabeled data sets.\n", 110 | "Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics where they produced results comparable to and in some cases superior to human experts.\n", 111 | "Deep learning is a class of machine learning algorithms that:\n", 112 | "use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. 
The algorithms may be supervised or unsupervised and applications include pattern analysis and classification .\n" 113 | ] 114 | } 115 | ], 116 | "source": [ 117 | "import re\n", 118 | "\n", 119 | "with open('corpus.txt', 'r') as cf:\n", 120 | " corpus = []\n", 121 | " for line in cf: # loops over all the lines in the corpus\n", 122 | " line = line.strip() # strips off \\n \\r from the ends \n", 123 | " if line: # Take only non empty lines\n", 124 | " line = re.sub(r'\\([^)]*\\)', '', line) # Regular Expression to remove text in between brackets\n", 125 | " line = re.sub(' +',' ', line) # Removes consecutive spaces\n", 126 | " # add more pre-processing steps\n", 127 | " corpus.append(line)\n", 128 | "print(\"\\n\".join(corpus[:5])) # Shows the first 5 lines of the corpus" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "As you can see that this small piece of code loads the toy text corpus, extracts lines from it, ignores empty lines, and removes text in between brackets. Note that in reality you will not be able to load the entire corpus into memory. You will need to write a [generator](https://wiki.python.org/moin/Generators) to yield text lines from the corpus, or use some advanced features provided by the Deep Learning frameworks like [Tensorflow's Input Pipelines](https://www.tensorflow.org/programmers_guide/reading_data). \n", 136 | "\n", 137 | "## 6.2 Tokenizing the Corpus\n", 138 | "In this section we will see how to tokenize the text lines that we extracted and then create a **Vocabulary**." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 2, 144 | "metadata": { 145 | "collapsed": true 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "# Load Spacy\n", 150 | "import spacy\n", 151 | "import numpy as np\n", 152 | "nlp = spacy.load('en_core_web_sm')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 3, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "['SEQUENCE_BEGIN', 'deep', 'learning', 'is', 'part', 'of', 'a', 'broader', 'family', 'of', 'machine', 'learning', 'methods', 'based', 'on', 'learning', 'data', 'representations', ',', 'as', 'opposed', 'to', 'task', '-', 'specific', 'algorithms', '.', 'SEQUENCE_END', 'SEQUENCE_BEGIN', 'learning']\n", 165 | "Mean Sentence Length: 31.991413024995747\n", 166 | "Sentence Length Standard Deviation: 15.024047302248745\n", 167 | "Max Sentence Length: 179\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "def preprocess_corpus(corpus):\n", 173 | " corpus_tokens = []\n", 174 | " sentence_lengths = []\n", 175 | " for line in corpus:\n", 176 | " doc = nlp(line) # Parse each line in the corpus\n", 177 | " for sent in doc.sents: # Loop over all the sentences in the line\n", 178 | " corpus_tokens.append('SEQUENCE_BEGIN')\n", 179 | " s_len = 1\n", 180 | " for tok in sent: # Loop over all the words in a sentence\n", 181 | " if tok.text.strip() != '' and tok.ent_type_ != '': # If the token is a Named Entity then do not lowercase it \n", 182 | " corpus_tokens.append(tok.text)\n", 183 | " else:\n", 184 | " corpus_tokens.append(tok.text.lower())\n", 185 | " s_len += 1\n", 186 | " corpus_tokens.append('SEQUENCE_END')\n", 187 | " sentence_lengths.append(s_len+1)\n", 188 | " return corpus_tokens, sentence_lengths\n", 189 | "\n", 190 | "corpus_tokens, sentence_lengths = preprocess_corpus(corpus)\n", 191 | "print(corpus_tokens[:30]) # Prints the first 30 tokens\n", 192 | 
"mean_sentence_length = np.mean(sentence_lengths)\n", 193 | "deviation_sentence_length = np.std(sentence_lengths)\n", 194 | "max_sentence_length = np.max(sentence_lengths)\n", 195 | "print('Mean Sentence Length: {}\\nSentence Length Standard Deviation: {}\\n'\n", 196 | " 'Max Sentence Length: {}'.format(mean_sentence_length, deviation_sentence_length, max_sentence_length))" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "source": [ 205 | "Notice that we did not lowercase the [Named Entities(NEs)](https://en.wikipedia.org/wiki/Named-entity_recognition). This is totally your choice. It part of a normalization step and I believe it is a good idea to let the model learn the Named Entities in the corpus. But do not blindly consider any library for NEs. I chose Spacy as it is very simple to use, fast and efficient. Note that I am using the [**en_core_web_sm**](https://spacy.io/docs/usage/models) model of Spacy, which is very small and good enough for this tutorial. You would probably want to choose your own NE recognizer.\n", 206 | "\n", 207 | "Other Normalization steps include [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) which I will not implement because **(1)** I want my Language Model to learn the various forms of a word and their occurances by itself; **(2)** In a real world scenario you will train your Model with a huge corpus with Millions of text lines, and you can assume that the corpus covers the most commonly used terms in Language. Hence, no extra normalization is required. \n", 208 | "\n", 209 | "### 6.2.1 SEQUENCE_BEGIN and SEQUENCE_END\n", 210 | "Along with the naturally occurring terms in the corpus, we will add two new terms called the *SEQUENCE_BEGIN* and **SEQUENCE_END** term. These terms mark the beginning and end of a sentence. We do this because we want our model to learn word occurring at the beginning and at the end of sentences. Note that we are dependent on Spacy's Tokenization algorithm here. You are free to explore other tokenizers and use whichever you find is best.\n", 211 | "\n", 212 | "## 6.3 Create a Vocabulary\n", 213 | "After we have minimally preprocessed the corpus and extracted sequence of terms from it, we will create a vocabulary for our Language Model. This means that we will create two python dictionaries,\n", 214 | "1. **Word2Idx** : This dictionary has all the unique words(terms) as keys with a corresponding unique ID as values\n", 215 | "2. **Idx2Word** : This is the reverse of Word2Idx. It has the unique IDs as keys and their corresponding words(terms) as values" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 4, 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "vocab = list(set(corpus_tokens)) # This works well for a very small corpus\n", 227 | "#print(vocab)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "**Alternatively**, if your corpus is huge, you would probably want to iterate through it entirely and generate term frequencies. Once you have the term frequencies, it is better to select the most commonly occuring terms in the vocabulary (as it covers most of the Natural Language)." 
235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 5, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | "Vocab Size: 10000\n", 247 | "[('the', 20158), (',', 17897), ('of', 13340), ('SEQUENCE_BEGIN', 11762), ('SEQUENCE_END', 11762), ('.', 10932), ('and', 9357), ('in', 7566), ('to', 6953), ('a', 6901), ('development', 3632), ('-', 3582), ('that', 3569), ('is', 3077), ('history', 3077), ('for', 2951), ('\"', 2410), ('on', 2057), ('as', 2036), ('with', 2034), (\"'s\", 1801), ('by', 1641), ('[', 1633), (']', 1626), ('it', 1561), ('was', 1525), ('an', 1316), ('this', 1316), ('named', 1301), ('from', 1269), ('at', 1203), ('are', 1203), ('be', 1189), ('has', 1149), ('have', 1116), ('or', 1055), ('not', 881), ('its', 855), ('which', 829), (':', 821), ('but', 820), ('influence', 819), ('his', 809), (';', 804), ('been', 769), ('their', 735), ('were', 708), ('he', 660), ('we', 637), ('who', 620), ('one', 606), ('--', 594), ('after', 562), ('these', 550), ('had', 544), ('more', 536), ('other', 525), ('’s', 507), ('most', 502), ('also', 493), ('will', 490), ('all', 487), ('during', 482), ('can', 480), ('about', 476), ('they', 473), (\"'\", 453), ('i', 432), ('when', 421), ('new', 417), ('such', 410), ('there', 405), ('than', 403), ('ordered', 396), ('into', 390), ('may', 389), ('our', 366), ('first', 362), ('you', 361), ('time', 360), ('would', 348), ('no', 343), ('so', 337), ('only', 327), ('two', 317), ('“', 313), ('early', 311), ('because', 306), ('many', 303), ('some', 302), ('cells', 301), ('if', 299), ('”', 297), ('American', 296), ('years', 293), ('name', 293), ('up', 278), ('over', 278), ('out', 274), ('launched', 273)]\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "import collections\n", 253 | "\n", 254 | "word_counter = collections.Counter()\n", 255 | "for term in corpus_tokens:\n", 256 | " word_counter.update({term: 1})\n", 257 | "vocab = word_counter.most_common(10000) # 10000 Most common terms\n", 258 | "print('Vocab Size: {}'.format(len(vocab))) \n", 259 | "print(word_counter.most_common(100)) # just to show the top 100 terms" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "This was we make sure to consider the ***top K***(in this case 100) most commonly used terms in the Language (assuming that the corpus represents the Language or domain specific language. For e.g., medical corpora, e-commerce corpora, etc.). In Neural Machine Translation Models, usually a vocabulary size of 10,000 to 100,000 is used. But remember, it all depends on your task, corpus size, and the Language itself. " 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "### 6.3.1 UNKNOWN and PAD\n", 274 | "Along with the vocabulary terms that we generated, we need two more special terms:\n", 275 | "1. **UNKNOWN**: This term is used for all the words that the model will observe apart from the vocabulary terms.\n", 276 | "2. **PAD**: The pad term is used to pad the sequences to a maximum length. This is required for feeding variable length sequences into the Network (we use DynamicRnn to handle variable length sequences. So, padding makes no difference. It is just required for feeding the data to Tensorflow)\n", 277 | "\n", 278 | "This is required as during inference time there will be many unknown words (words that the model has never seen). 
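To make the two special terms concrete, here is a toy illustration (the tiny dictionary below is made up for this example; the real Word2Idx is built in the next cell):

```python
# Toy vocabulary, only for illustration; the real Word2Idx is built below.
toy_word2idx = {'PAD': 0, 'deep': 1, 'learning': 2, 'is': 3, 'UNKNOWN': 4}

def to_ids(tokens, word2idx, max_len):
    ids = [word2idx.get(tok, word2idx['UNKNOWN']) for tok in tokens]  # OOV words become UNKNOWN
    return ids + [word2idx['PAD']] * (max_len - len(ids))             # pad up to a fixed length

print(to_ids(['deep', 'learning', 'is', 'awesome'], toy_word2idx, 6))
# -> [1, 2, 3, 4, 0, 0]   ('awesome' is out of vocabulary, so it maps to UNKNOWN;
#                          the trailing 0s are PAD and carry no information)
```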
It is better to add an **UNKNOWN** token in the vocabulary so that the model will learn to handle terms that are unknown to the Model." 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 6, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "name": "stdout", 288 | "output_type": "stream", 289 | "text": [ 290 | "Word2Idx Size: 10002\n", 291 | "Idx2Word Size: 10002\n" 292 | ] 293 | } 294 | ], 295 | "source": [ 296 | "vocab.append(('UNKNOWN', 1))\n", 297 | "Idx = range(1, len(vocab)+1)\n", 298 | "vocab = [t[0] for t in vocab]\n", 299 | "\n", 300 | "Word2Idx = dict(zip(vocab, Idx))\n", 301 | "Idx2Word = dict(zip(Idx, vocab))\n", 302 | "\n", 303 | "Word2Idx['PAD'] = 0\n", 304 | "Idx2Word[0] = 'PAD'\n", 305 | "VOCAB_SIZE = len(Word2Idx)\n", 306 | "print('Word2Idx Size: {}'.format(len(Word2Idx)))\n", 307 | "print('Idx2Word Size: {}'.format(len(Idx2Word)))" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "## 6.4 Preload Word Vectors\n", 315 | "Since you are here, I am almost sure that you are familiar with or have at least heard of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html). Read about it if you don't know. \n", 316 | "\n", 317 | "Spacy provides a set of pretrained word vectors. We will make use of these to initialize our embedding layer (details in the following section). " 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 7, 323 | "metadata": { 324 | "scrolled": true 325 | }, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "Shape of w2v: (10002, 300)\n", 332 | "Some Vectors\n", 333 | "[ 0.32350999 0.35554001 0.029381 0.15276 -0.14915 0.22169\n", 334 | " 0.007907 -0.61286002 0.24625 0.094113 ] PAD\n", 335 | "[ 3.73400003e-02 1.01959996e-03 1.12499997e-01 -3.48410010e-01\n", 336 | " -1.22720003e-01 8.06659982e-02 4.93220001e-01 7.56980032e-02\n", 337 | " 4.80910003e-01 2.67359996e+00] time\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "w2v = np.random.rand(len(Word2Idx), 300) # We use 300 because Spacy provides us with vectors of size 300\n", 343 | "\n", 344 | "for w_i, key in enumerate(Word2Idx):\n", 345 | " token = nlp(key)\n", 346 | " if token.has_vector:\n", 347 | " #print(token.text, Word2Idx[key])\n", 348 | " w2v[Word2Idx[key], :] = token.vector # fill only this word's row with its pretrained vector\n", 349 | "EMBEDDING_SIZE = w2v.shape[-1]\n", 350 | "print('Shape of w2v: {}'.format(w2v.shape))\n", 351 | "print('Some Vectors')\n", 352 | "print(w2v[0][:10], Idx2Word[0])\n", 353 | "print(w2v[80][:10], Idx2Word[80])" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "## 6.5 Splitting the Data\n", 361 | "We are almost there. Have patience :) We need to split the data into Training and Validation sets before we proceed any further. 
So," 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 8, 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "name": "stdout", 371 | "output_type": "stream", 372 | "text": [ 373 | "Train Size: 301026\n", 374 | "Validation Size: 75256\n" 375 | ] 376 | } 377 | ], 378 | "source": [ 379 | "train_val_split = int(len(corpus_tokens) * 0.8) # We use 80% of the data for Training and 20% for validating\n", 380 | "train = corpus_tokens[:train_val_split]\n", 381 | "validation = corpus_tokens[train_val_split:-1]\n", 382 | "\n", 383 | "print('Train Size: {}\\nValidation Size: {}'.format(len(train), len(validation)))" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "## 6.6 Prepare The Training Data\n", 391 | "We will prepare the data by doing the following fro both train and Validation data:\n", 392 | "1. Convert word sequences to id sequences (which will be later used in the embedding layer)\n", 393 | "2. Generate n-grams from the input sequences\n", 394 | "3. Pad the generated n_grams to a max-length so that it can be fed to Tensorflow" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 9, 400 | "metadata": { 401 | "collapsed": true 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "from tflearn.data_utils import to_categorical, pad_sequences" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 10, 411 | "metadata": {}, 412 | "outputs": [ 413 | { 414 | "name": "stdout", 415 | "output_type": "stream", 416 | "text": [ 417 | "Sample Train IDs\n", 418 | "[1005, 10001, 17, 10001, 17, 8, 10, 10001, 10001]\n", 419 | "Sample Validation IDs\n", 420 | "[137, 3630, 10, 2134, 222, 183, 99, 9, 86]\n" 421 | ] 422 | } 423 | ], 424 | "source": [ 425 | "# A method to convert a sequence of words into a sequence of IDs given a Word2Idx dictionary\n", 426 | "def word2idseq(data, word2idx):\n", 427 | " id_seq = []\n", 428 | " for word in data:\n", 429 | " if word in word2idx:\n", 430 | " id_seq.append(word2idx[word])\n", 431 | " else:\n", 432 | " id_seq.append(word2idx['UNKNOWN'])\n", 433 | " return id_seq\n", 434 | "\n", 435 | "# Thanks to http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/\n", 436 | "# This method generated n-grams\n", 437 | "def find_ngrams(input_list, n):\n", 438 | " return zip(*[input_list[i:] for i in range(n)])\n", 439 | "\n", 440 | "train_id_seqs = word2idseq(train, Word2Idx)\n", 441 | "validation_id_seqs = word2idseq(validation, Word2Idx)\n", 442 | "\n", 443 | "print('Sample Train IDs')\n", 444 | "print(train_id_seqs[-10:-1])\n", 445 | "print('Sample Validation IDs')\n", 446 | "print(validation_id_seqs[-10:-1])" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "### 6.6.1 Generating the Targets from N-Grams\n", 454 | "This might look a little tricky but it is not. Here we take the sequence of ids and generate n-grams. For the purpose of training, we need sequences of terms as the training examples and the next term in the sequence as the target. Not clear right? Let us look at an example. If our sequence of words were ```['hello', 'my', 'friend']```, then we extract extract n-grams, where n=2-3 (that means we split bigrams and trigrams from the sequence). So the sequence is split into ```['hello', 'my'], ['my', 'friend'] and ['hello', 'my', 'friend']```. Well to train our network this is not enough right? We need some objective/target that we can infer about. 
So to get a target, we split the last term of the n-grams out. In the case of our example, the corresponding targets are ```['my', 'friend', 'friend']```. To show you the bigger picture, the input sequence ```['hello', 'my', 'friend']``` is split into n-grams and then split again to pop out a target term.\n", 455 | "\n", 456 | "```python\n", 457 | "bigram['hello', 'my'] --> input['hello'] --> target['my']\n", 458 | "bigram['my', 'friend'] --> input['my'] --> target['friend']\n", 459 | "trigram['hello', 'my', 'friend'] --> input['hello', 'my'] --> target['friend']\n", 460 | "```" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 11, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "import random\n", 470 | "\n", 471 | "def prepare_data(data, n_grams=5, batch_size=64, n_epochs=10):\n", 472 | " X, Y = [], []\n", 473 | " buff_size, start, end = 1000, 0, 1000\n", 474 | " n_buffer = 0\n", 475 | " epoch = 0\n", 476 | " while epoch < n_epochs:\n", 477 | " if len(X) >= batch_size:\n", 478 | " X_batch = X[:batch_size]\n", 479 | " Y_batch = Y[:batch_size]\n", 480 | " X_batch = pad_sequences(X_batch, maxlen=n_grams, value=0)\n", 481 | " Y_batch = to_categorical(Y_batch, VOCAB_SIZE)\n", 482 | " yield (X_batch, Y_batch, epoch)\n", 483 | " X = X[batch_size:]\n", 484 | " Y = Y[batch_size:]\n", 485 | " continue\n", 486 | " n = random.randrange(2, n_grams)\n", 487 | " if len(data) < n: continue\n", 488 | " if end > len(data): end = len(data)\n", 489 | " grams = find_ngrams(data[start: end], n) # generates the n-grams\n", 490 | " splits = list(zip(*grams)) # transpose the n-grams\n", 491 | " X += list(zip(*splits[:len(splits)-1])) # form the inputs\n", 492 | " X = [list(x) for x in X] \n", 493 | " Y += splits[-1] # form the targets\n", 494 | " if start + buff_size > len(data):\n", 495 | " start = 0\n", 496 | " epoch += 1\n", 497 | " end = start + buff_size\n", 498 | " else:\n", 499 | " start = start + buff_size\n", 500 | " end = end + buff_size" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "## 6.7 The Model\n", 508 | "We now define a Dynamic LSTM Model that will be our Language Model. Restart the kernel and run all cells if it does not work (some Tflearn bug). " 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 12, 514 | "metadata": { 515 | "collapsed": true 516 | }, 517 | "outputs": [], 518 | "source": [ 519 | "# Hyperparameters\n", 520 | "LR = 0.0001\n", 521 | "HIDDEN_DIMS = 256\n", 522 | "BATCH_SIZE = 32\n", 523 | "N_EPOCHS=100\n", 524 | "N_GRAMS = 5\n", 525 | "N_VALIDATE = 10000" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 13, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "train = prepare_data(train_id_seqs, N_GRAMS, BATCH_SIZE, N_EPOCHS)\n", 535 | "validate = prepare_data(validation_id_seqs, N_GRAMS, N_VALIDATE, N_EPOCHS)" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 14, 541 | "metadata": { 542 | "collapsed": true 543 | }, 544 | "outputs": [], 545 | "source": [ 546 | "import tensorflow as tf\n", 547 | "import tflearn" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 15, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "name": "stderr", 557 | "output_type": "stream", 558 | "text": [ 559 | "/home/dash/venvs/exercise/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
This may consume a large amount of memory.\n", 560 | " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n" 561 | ] 562 | }, 563 | { 564 | "name": "stdout", 565 | "output_type": "stream", 566 | "text": [ 567 | "Training epoch 0\n" 568 | ] 569 | }, 570 | { 571 | "ename": "StopIteration", 572 | "evalue": "", 573 | "traceback": [ 574 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 575 | "\u001b[0;31mStopIteration\u001b[0m Traceback (most recent call last)", 576 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mepoch\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mN_EPOCHS\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Training epoch {}'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mepoch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m \u001b[0;34m(\u001b[0m\u001b[0mX_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval_epoch\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 16\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mbatch\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 17\u001b[0m model.fit(batch[0], batch[1], validation_set=(X_test, Y_test),\n", 577 | "\u001b[0;31mStopIteration\u001b[0m: " 578 | ], 579 | "output_type": "error" 580 | } 581 | ], 582 | "source": [ 583 | "# Build the model\n", 584 | "embedding_matrix = tf.constant(w2v, dtype=tf.float32)\n", 585 | "net = tflearn.input_data([None, N_GRAMS], dtype=tf.int32, name='input')\n", 586 | "net = tflearn.embedding(net, input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE,\n", 587 | " weights_init=embedding_matrix, trainable=True)\n", 588 | "net = tflearn.lstm(net, HIDDEN_DIMS, dropout=0.8, dynamic=True)\n", 589 | "net = tflearn.fully_connected(net, VOCAB_SIZE, activation='softmax')\n", 590 | "net = tflearn.regression(net, optimizer='adam', learning_rate=LR,\n", 591 | " loss='categorical_crossentropy', name='target')\n", 592 | "model = tflearn.DNN(net, checkpoint_path=\"./chkpnts\", best_checkpoint_path=\"./best_chkpnts\",\n", 593 | " tensorboard_dir='./chkpnts', best_val_accuracy=0.70)\n", 594 | "\n", 595 | "for epoch in range(N_EPOCHS):\n", 596 | " print('Training epoch {}'.format(epoch))\n", 597 | " (X_test, Y_test, val_epoch) = next(validate)\n", 598 | " for batch in train:\n", 599 | " model.fit(batch[0], batch[1], validation_set=(X_test, Y_test),\n", 600 | " show_metric=True, n_epoch=1)" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": { 606 | "collapsed": true 607 | }, 608 | "source": [ 609 | "# 7. Inference\n", 610 | "The story does not get over after you train the model. We need to understand how to make inference using this trained model. Well honestly, this model is not even close to trained. We used just one article from Wikipedia to train this Language Model so we cannot expect it to be good. The idea was to realise the steps required actually build a Language Model from scratch. 
Now let us look at how to make an inference from the model that we just trained.\n", 611 | "\n", 612 | "## 7.1 Log Probability of a Sequence \n", 613 | "Given a new sequence of terms, we would like to know the probability of the occurrence of this sequence in the Language. We make use of our trained model (which we assume to be a representation of the Language) and calculate the n-gram probabilities and aggregate them to find a final probability score." 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": null, 619 | "metadata": { 620 | "collapsed": true 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "def get_sequence_prob(in_string, n, model):\n", 625 | " in_tokens, in_lengths = preprocess_corpus([in_string]) # wrap the string so it is treated as one corpus line\n", 626 | " in_ids = word2idseq(in_tokens, Word2Idx)\n", 627 | " grams = [g for k in range(2, n+1) for g in find_ngrams(in_ids, k)] # all 2- to n-grams\n", 628 | " X = pad_sequences([list(g[:-1]) for g in grams], maxlen=n, value=0) # inputs: all but the last term\n", 629 | " Y = [g[-1] for g in grams] # targets: the last term of each n-gram\n", 630 | " preds = model.predict(X)\n", 631 | " log_prob = 0.0\n", 632 | " for y_i, y in enumerate(Y):\n", 633 | " log_prob += np.log(preds[y_i][y])\n", 634 | " return log_prob/len(Y) # average log probability over all the n-grams\n", 635 | "\n", 636 | "in_strings = ['hello I am science', 'blah blah blah', 'deep learning', 'answer',\n", 637 | " 'Boltzman', 'from the previous layer as input', 'ahcblheb eDHLHW SLcA']\n", 638 | "for in_string in in_strings:\n", 639 | " log_prob = get_sequence_prob(in_string, 5, model)\n", 640 | " print(log_prob)" 641 | ] 642 | }, 643 | { 644 | "cell_type": "markdown", 645 | "metadata": {}, 646 | "source": [ 647 | "To get the probability of the sequence, we take the n-grams of the sequence, infer the probability of the next term to occur, take its log and sum it with the log probabilities of all the other n-grams. The final score is the average over all of them. There can be other ways to look at it too. You can normalize by n too, where n is the number of grams you considered. " 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "## 7.2 Generating a Sequence\n", 655 | "Since we trained this Language Model to predict the next term given the previous 'n' terms, we can sample sequences out of this model too. We start with a random term and feed it to the Model. The Model predicts the next term, and then we concatenate it with our previous terms and feed it again to the Model. In this way we can generate arbitrarily long sequences from the Model. 
Let us see how this naive model generates sequences," 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": { 662 | "collapsed": true 663 | }, 664 | "outputs": [], 665 | "source": [ 666 | "def generate_sequences(term, word2idx, idx2word, seq_len, n_grams, model):\n", 667 | " if term not in word2idx:\n", 668 | " idseq = [[word2idx['UNKNOWN']]]\n", 669 | " else:\n", 670 | " idseq = [[word2idx[term]]]\n", 671 | " for i in range(seq_len-1):\n", 672 | " #print(idseq)\n", 673 | " padded_idseq = pad_sequences(idseq, maxlen=n_grams, value=0)\n", 674 | " next_label = model.predict_label(padded_idseq)\n", 675 | " print(next_label)\n", 676 | " idseq[0].append(next_label[0][0])\n", 677 | " generated_str = []\n", 678 | " for id in idseq[0]:\n", 679 | " generated_str.append(idx2word[id])\n", 680 | " return ' '.join(generated_str)\n", 681 | " \n", 682 | "term = 'SEQUENCE_BEGIN'\n", 683 | "seq = generate_sequences(term, Word2Idx, Idx2Word, 10, 5, model)\n", 684 | "print(seq)" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": { 691 | "collapsed": true 692 | }, 693 | "outputs": [], 694 | "source": [ 695 | "" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": { 702 | "collapsed": true 703 | }, 704 | "outputs": [], 705 | "source": [ 706 | "" 707 | ] 708 | } 709 | ], 710 | "metadata": { 711 | "kernelspec": { 712 | "display_name": "Python 3", 713 | "language": "python", 714 | "name": "python3" 715 | }, 716 | "language_info": { 717 | "codemirror_mode": { 718 | "name": "ipython", 719 | "version": 3.0 720 | }, 721 | "file_extension": ".py", 722 | "mimetype": "text/x-python", 723 | "name": "python", 724 | "nbconvert_exporter": "python", 725 | "pygments_lexer": "ipython3", 726 | "version": "3.4.3" 727 | } 728 | }, 729 | "nbformat": 4, 730 | "nbformat_minor": 0 731 | } -------------------------------------------------------------------------------- /Neural+Language+Model.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # # 1. Neural Language Model 5 | # If you are here, that means you wish to cut the crap and understand how to train your own Neural Language Model. If you are a regular user of frameworks like Keras, Tflearn, etc., then you know how easy it has become these days to build, train and deploy Neural Network Models. If not, then you probably will be by the end of this post. 6 | # 7 | # # 2. Prerequisite 8 | # 1. [Python](https://www.tutorialspoint.com/python/): I will be using Python 3.5 for this tutorial 9 | # 10 | # 2. [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/): If you don't know what LSTMs are, then this is a must read. 11 | # 12 | # 3. [Basics of Machine Learning](https://www.youtube.com/watch?v=2uiulzZxmGg): If you want to dive into Machine Learning/Deep Learning, then I strongly recommend the first 4 lectures from [Stanford's CS231n](http://cs231n.stanford.edu/) by Andrej Karpathy. 13 | # 14 | # 4. [Language Model](https://en.wikipedia.org/wiki/Language_model): If you want to have a basic understanding of Language Models. 15 | # 16 | # # 3. Frameworks 17 | # 1. [Tflearn](http://tflearn.org/installation/) 0.3.2 18 | # 2. [Spacy](https://spacy.io/) 1.9.0 19 | # 3. [Tensorflow](https://www.tensorflow.org/) 1.0.1 20 | # 21 | # ### Note 22 | # You can take this post as a hands-on exercise on "How to build your own Neural Language Model" from scratch. 
If you have a ready to use virtualenv with all the dependencies installed then you can skip Section 4 and jump to Section 5. 23 | 24 | # # 4. Install Dependencies 25 | # We will install everythin in a virtual environment and I would suggest you to run this Jupyter Notebook in the same virtualenv. I have also provided a ```requirements.txt``` file with the [repository](https://github.com/dashayushman/neural-language-model) to make things easier. 26 | # 27 | # ### 4.1 Virtual Environment 28 | # You can follow [this](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for a fast guide to Virtual Environments. 29 | # 30 | # ```sh 31 | # pip install virtualenv 32 | # ``` 33 | # 34 | # ### 4.2 Tflearn 35 | # Follow [this](http://tflearn.org/installation/) and install Tflearn. Make sure to have the versions correct in case you want to avoid weird errors. 36 | # 37 | # ```sh 38 | # pip install -Iv tflearn==0.3.2 39 | # ``` 40 | # 41 | # ### 4.3 Tensorflow 42 | # Install Tensorflow by following the instructions [here](https://www.tensorflow.org/install/). To make sure of installing the right version, use this 43 | # 44 | # ```sh 45 | # pip install -Iv tensorflow-gpu==1.0.1 46 | # ``` 47 | # Note that this is the GPU version of Tensorflow. You can even install the CPU version for this tutorial, but I would strongly recommend the GPU version if you intend to intend to scale it to use in the real world. 48 | # 49 | # ### 4.4 Spacy 50 | # Install Spacy by following the instructions [here](https://spacy.io/docs/usage/). For the right version use, 51 | # 52 | # ```sh 53 | # pip install -Iv spacy==1.9.0 54 | # ``` 55 | # 56 | # ### 4.5 Others 57 | # ```sh 58 | # pip install numpy 59 | # ``` 60 | 61 | # # 5. Get the Repo 62 | # clone the Neural Language Model GitHub repository onto your computer and start the Jupyter Notebook server. 63 | # 64 | # ```sh 65 | # git clone https://github.com/dashayushman/neural-language-model.git 66 | # cd neural-language-model 67 | # jupyter notebook 68 | # ``` 69 | # 70 | # Open the notebook names **Neural Language Model** and you can start off. 71 | 72 | # # 6. Neural Language Model 73 | # We will start building our own Language model using an LSTM Network. To do so we will need a corpus. For the purpose of this tutorial, let us use a toy corpus, which is a text file called ```corpus.txt``` that 0I downloaded from Wikipedia. I will use this to demponstrate how to build your own Neural Language Model, and you can use the same knowledge to extend the model further for a more realistic scenario (I will give pointers to do so too). 74 | # 75 | # ## 6.1 Loading The Corpus 76 | # In this section you will load the ```corpus.txt``` and do minimal preprocessing. 77 | 78 | # In[1]: 79 | 80 | 81 | import re 82 | 83 | with open('corpus.txt', 'r') as cf: 84 | corpus = [] 85 | for line in cf: # loops over all the lines in the corpus 86 | line = line.strip() # strips off \n \r from the ends 87 | if line: # Take only non empty lines 88 | line = re.sub(r'\([^)]*\)', '', line) # Regular Expression to remove text in between brackets 89 | line = re.sub(' +',' ', line) # Removes consecutive spaces 90 | # add more pre-processing steps 91 | corpus.append(line) 92 | print("\n".join(corpus[:5])) # Shows the first 5 lines of the corpus 93 | 94 | 95 | # As you can see that this small piece of code loads the toy text corpus, extracts lines from it, ignores empty lines, and removes text in between brackets. Note that in reality you will not be able to load the entire corpus into memory. 
You will need to write a [generator](https://wiki.python.org/moin/Generators) to yield text lines from the corpus, or use some advanced features provided by the Deep Learning frameworks like [Tensorflow's Input Pipelines](https://www.tensorflow.org/programmers_guide/reading_data). 96 | # 97 | # ## 6.2 Tokenizing the Corpus 98 | # In this section we will see how to tokenize the text lines that we extracted and then create a **Vocabulary**. 99 | 100 | # In[2]: 101 | 102 | 103 | # Load Spacy 104 | import spacy 105 | import numpy as np 106 | nlp = spacy.load('en_core_web_sm') 107 | 108 | 109 | # In[3]: 110 | 111 | 112 | def preprocess_corpus(corpus): 113 | corpus_tokens = [] 114 | sentence_lengths = [] 115 | for line in corpus: 116 | doc = nlp(line) # Parse each line in the corpus 117 | for sent in doc.sents: # Loop over all the sentences in the line 118 | corpus_tokens.append('SEQUENCE_BEGIN') 119 | s_len = 1 120 | for tok in sent: # Loop over all the words in a sentence 121 | if tok.text.strip() != '' and tok.ent_type_ != '': # If the token is a Named Entity then do not lowercase it 122 | corpus_tokens.append(tok.text) 123 | else: 124 | corpus_tokens.append(tok.text.lower()) 125 | s_len += 1 126 | corpus_tokens.append('SEQUENCE_END') 127 | sentence_lengths.append(s_len+1) 128 | return corpus_tokens, sentence_lengths 129 | 130 | corpus_tokens, sentence_lengths = preprocess_corpus(corpus) 131 | print(corpus_tokens[:30]) # Prints the first 30 tokens 132 | mean_sentence_length = np.mean(sentence_lengths) 133 | deviation_sentence_length = np.std(sentence_lengths) 134 | max_sentence_length = np.max(sentence_lengths) 135 | print('Mean Sentence Length: {}\nSentence Length Standard Deviation: {}\n' 136 | 'Max Sentence Length: {}'.format(mean_sentence_length, deviation_sentence_length, max_sentence_length)) 137 | 138 | 139 | # Notice that we did not lowercase the [Named Entities(NEs)](https://en.wikipedia.org/wiki/Named-entity_recognition). This is totally your choice. It part of a normalization step and I believe it is a good idea to let the model learn the Named Entities in the corpus. But do not blindly consider any library for NEs. I chose Spacy as it is very simple to use, fast and efficient. Note that I am using the [**en_core_web_sm**](https://spacy.io/docs/usage/models) model of Spacy, which is very small and good enough for this tutorial. You would probably want to choose your own NE recognizer. 140 | # 141 | # Other Normalization steps include [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) which I will not implement because **(1)** I want my Language Model to learn the various forms of a word and their occurances by itself; **(2)** In a real world scenario you will train your Model with a huge corpus with Millions of text lines, and you can assume that the corpus covers the most commonly used terms in Language. Hence, no extra normalization is required. 142 | # 143 | # ### 6.2.1 SEQUENCE_BEGIN and SEQUENCE_END 144 | # Along with the naturally occurring terms in the corpus, we will add two new terms called the *SEQUENCE_BEGIN* and **SEQUENCE_END** term. These terms mark the beginning and end of a sentence. We do this because we want our model to learn word occurring at the beginning and at the end of sentences. Note that we are dependent on Spacy's Tokenization algorithm here. You are free to explore other tokenizers and use whichever you find is best. 
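# A quick way to see what the two marker terms do is to run the function defined
# above on a single line and inspect the result. (This is just a small check, not
# part of the original pipeline; the exact tokens depend on spaCy's tokenizer and
# sentence splitter.)
example_tokens, example_lengths = preprocess_corpus(['Deep learning is great. It is also fun.'])
print(example_tokens)
# Roughly: ['SEQUENCE_BEGIN', 'deep', 'learning', 'is', 'great', '.', 'SEQUENCE_END',
#           'SEQUENCE_BEGIN', 'it', 'is', 'also', 'fun', '.', 'SEQUENCE_END']
print(example_lengths)  # per-sentence token counts, including the two markers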
145 | # 146 | # ## 6.3 Create a Vocabulary 147 | # After we have minimally preprocessed the corpus and extracted sequence of terms from it, we will create a vocabulary for our Language Model. This means that we will create two python dictionaries, 148 | # 1. **Word2Idx** : This dictionary has all the unique words(terms) as keys with a corresponding unique ID as values 149 | # 2. **Idx2Word** : This is the reverse of Word2Idx. It has the unique IDs as keys and their corresponding words(terms) as values 150 | 151 | # In[4]: 152 | 153 | 154 | vocab = list(set(corpus_tokens)) # This works well for a very small corpus 155 | #print(vocab) 156 | 157 | 158 | # **Alternatively**, if your corpus is huge, you would probably want to iterate through it entirely and generate term frequencies. Once you have the term frequencies, it is better to select the most commonly occuring terms in the vocabulary (as it covers most of the Natural Language). 159 | 160 | # In[5]: 161 | 162 | 163 | import collections 164 | 165 | word_counter = collections.Counter() 166 | for term in corpus_tokens: 167 | word_counter.update({term: 1}) 168 | vocab = word_counter.most_common(10000) # 10000 Most common terms 169 | print('Vocab Size: {}'.format(len(vocab))) 170 | print(word_counter.most_common(100)) # just to show the top 100 terms 171 | 172 | 173 | # This was we make sure to consider the ***top K***(in this case 100) most commonly used terms in the Language (assuming that the corpus represents the Language or domain specific language. For e.g., medical corpora, e-commerce corpora, etc.). In Neural Machine Translation Models, usually a vocabulary size of 10,000 to 100,000 is used. But remember, it all depends on your task, corpus size, and the Language itself. 174 | 175 | # ### 6.3.1 UNKNOWN and PAD 176 | # Along with the vocabulary terms that we generated, we need two more special terms: 177 | # 1. **UNKNOWN**: This term is used for all the words that the model will observe apart from the vocabulary terms. 178 | # 2. **PAD**: The pad term is used to pad the sequences to a maximum length. This is required for feeding variable length sequences into the Network (we use DynamicRnn to handle variable length sequences. So, padding makes no difference. It is just required for feeding the data to Tensorflow) 179 | # 180 | # This is required as during inference time there will be many unknown words (words that the model has never seen). It is better to add an **UNKNOWN** token in the vocabulary so that the model will learn to handle terms that are unknown to the Model. 181 | 182 | # In[6]: 183 | 184 | 185 | vocab.append(('UNKNOWN', 1)) 186 | Idx = range(1, len(vocab)+1) 187 | vocab = [t[0] for t in vocab] 188 | 189 | Word2Idx = dict(zip(vocab, Idx)) 190 | Idx2Word = dict(zip(Idx, vocab)) 191 | 192 | Word2Idx['PAD'] = 0 193 | Idx2Word[0] = 'PAD' 194 | VOCAB_SIZE = len(Word2Idx) 195 | print('Word2Idx Size: {}'.format(len(Word2Idx))) 196 | print('Idx2Word Size: {}'.format(len(Idx2Word))) 197 | 198 | 199 | # ## 6.4 Preload Word Vectors 200 | # Since you are here, I am almost sure that you are familiar with or have atleast heard of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html). Read about it if you don't know. 201 | # 202 | # Spacy provides a set of pretrained word vectors. We will make use of these to initialize our embedding layer (details in the following section). 
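# Before building the embedding matrix in the next cell, it can be useful to check
# how much of the vocabulary is actually covered by spaCy's pretrained vectors;
# any term without a vector will keep its random initialization. (A small
# diagnostic sketch, not part of the original tutorial code.)
covered = sum(1 for key in Word2Idx if nlp(key).has_vector)
print('Pretrained vectors found for {} of {} vocabulary terms'.format(covered, len(Word2Idx)))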
203 | 204 | # In[7]: 205 | 206 | 207 | w2v = np.random.rand(len(Word2Idx), 300) # We use 300 because Spacy provides us with vectors of size 300 208 | 209 | for w_i, key in enumerate(Word2Idx): 210 | token = nlp(key) 211 | if token.has_vector: 212 | #print(token.text, Word2Idx[key]) 213 | w2v[Word2Idx[key], :] = token.vector # fill only this word's row with its pretrained vector 214 | EMBEDDING_SIZE = w2v.shape[-1] 215 | print('Shape of w2v: {}'.format(w2v.shape)) 216 | print('Some Vectors') 217 | print(w2v[0][:10], Idx2Word[0]) 218 | print(w2v[80][:10], Idx2Word[80]) 219 | 220 | 221 | # ## 6.5 Splitting the Data 222 | # We are almost there. Have patience :) We need to split the data into Training and Validation sets before we proceed any further. So, 223 | 224 | # In[8]: 225 | 226 | 227 | train_val_split = int(len(corpus_tokens) * 0.8) # We use 80% of the data for Training and 20% for validating 228 | train = corpus_tokens[:train_val_split] 229 | validation = corpus_tokens[train_val_split:-1] 230 | 231 | print('Train Size: {}\nValidation Size: {}'.format(len(train), len(validation))) 232 | 233 | 234 | # ## 6.6 Prepare The Training Data 235 | # We will prepare the data by doing the following for both the train and validation data: 236 | # 1. Convert word sequences to id sequences (which will be later used in the embedding layer) 237 | # 2. Generate n-grams from the input sequences 238 | # 3. Pad the generated n_grams to a max-length so that they can be fed to Tensorflow 239 | 240 | # In[9]: 241 | 242 | 243 | from tflearn.data_utils import to_categorical, pad_sequences 244 | 245 | 246 | # In[10]: 247 | 248 | 249 | # A method to convert a sequence of words into a sequence of IDs given a Word2Idx dictionary 250 | def word2idseq(data, word2idx): 251 | id_seq = [] 252 | for word in data: 253 | if word in word2idx: 254 | id_seq.append(word2idx[word]) 255 | else: 256 | id_seq.append(word2idx['UNKNOWN']) 257 | return id_seq 258 | 259 | # Thanks to http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/ 260 | # This method generates n-grams 261 | def find_ngrams(input_list, n): 262 | return zip(*[input_list[i:] for i in range(n)]) 263 | 264 | train_id_seqs = word2idseq(train, Word2Idx) 265 | validation_id_seqs = word2idseq(validation, Word2Idx) 266 | 267 | print('Sample Train IDs') 268 | print(train_id_seqs[-10:-1]) 269 | print('Sample Validation IDs') 270 | print(validation_id_seqs[-10:-1]) 271 | 272 | 273 | # ### 6.6.1 Generating the Targets from N-Grams 274 | # This might look a little tricky but it is not. Here we take the sequence of ids and generate n-grams. For the purpose of training, we need sequences of terms as the training examples and the next term in the sequence as the target. Not clear, right? Let us look at an example. If our sequence of words were ```['hello', 'my', 'friend']```, then we extract n-grams, where n=2-3 (that means we split bigrams and trigrams from the sequence). So the sequence is split into ```['hello', 'my'], ['my', 'friend'] and ['hello', 'my', 'friend']```. Well, to train our network this is not enough, right? We need some objective/target that we can infer about. So to get a target, we split the last term of the n-grams out. In the case of our example, the corresponding targets are ```['my', 'friend', 'friend']```. To show you the bigger picture, the input sequence ```['hello', 'my', 'friend']``` is split into n-grams and then split again to pop out a target term. 
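# The schematic below summarizes this split. Here is the same idea run concretely
# with the find_ngrams helper defined above (a small sanity check, not part of the
# original training pipeline).
toy_sequence = ['hello', 'my', 'friend']
for k in (2, 3):  # bigrams and trigrams
    for gram in find_ngrams(toy_sequence, k):
        inputs, target = list(gram[:-1]), gram[-1]
        print('{}-gram: {} --> input {} --> target {}'.format(k, list(gram), inputs, target))
# Expected output:
# 2-gram: ['hello', 'my'] --> input ['hello'] --> target my
# 2-gram: ['my', 'friend'] --> input ['my'] --> target friend
# 3-gram: ['hello', 'my', 'friend'] --> input ['hello', 'my'] --> target friend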
275 | # 276 | # ```python 277 | # bigram['hello', 'my'] --> input['hello'] --> target['my'] 278 | # bigram['my', 'friend'] --> input['my'] --> target['friend'] 279 | # trigram['hello', 'my', 'friend'] --> input['hello', 'my'] --> target['friend'] 280 | # ``` 281 | 282 | # In[11]: 283 | 284 | 285 | import random 286 | 287 | def prepare_data(data, n_grams=5, batch_size=64, n_epochs=10): 288 | X, Y = [], [] 289 | buff_size, start, end = 1000, 0, 1000 290 | n_buffer = 0 291 | epoch = 0 292 | while epoch < n_epochs: 293 | if len(X) >= batch_size: 294 | X_batch = X[:batch_size] 295 | Y_batch = Y[:batch_size] 296 | X_batch = pad_sequences(X_batch, maxlen=n_grams, value=0) 297 | Y_batch = to_categorical(Y_batch, VOCAB_SIZE) 298 | yield (X_batch, Y_batch, epoch) 299 | X = X[batch_size:] 300 | Y = Y[batch_size:] 301 | continue 302 | n = random.randrange(2, n_grams) 303 | if len(data) < n: continue 304 | if end > len(data): end = len(data) 305 | grams = find_ngrams(data[start: end], n) # generates the n-grams 306 | splits = list(zip(*grams)) # split it 307 | X += list(zip(*splits[:len(splits)-1])) # from the inputs 308 | X = [list(x) for x in X] 309 | Y += splits[-1] # form the targets 310 | if start + buff_size > len(data): 311 | start = 0 312 | epoch += 1 313 | end = start + buff_size 314 | else: 315 | start = start + buff_size 316 | end = end + buff_size 317 | 318 | 319 | # ## 6.7 The Model 320 | # We now define a Dynamic LSTM Model that will be our Language Model. Restart the kernel and run all cells if it does not work (some Tflearn bug). 321 | 322 | # In[12]: 323 | 324 | 325 | # Hyperparameters 326 | LR = 0.0001 327 | HIDDEN_DIMS = 256 328 | N_LAYERS = 3 329 | BATCH_SIZE = 10000 330 | N_EPOCHS=100 331 | N_GRAMS = 5 332 | N_VALIDATE = 3000 333 | 334 | 335 | # In[13]: 336 | 337 | 338 | train = prepare_data(train_id_seqs, N_GRAMS, BATCH_SIZE, N_EPOCHS) 339 | validate = prepare_data(validation_id_seqs, N_GRAMS, N_VALIDATE, N_EPOCHS) 340 | 341 | 342 | # In[14]: 343 | 344 | 345 | import tensorflow as tf 346 | import tflearn 347 | 348 | 349 | # In[15]: 350 | 351 | 352 | # Build the model 353 | embedding_matrix = tf.constant(w2v, dtype=tf.float32) 354 | net = tflearn.input_data([None, N_GRAMS], dtype=tf.int32, name='input') 355 | net = tflearn.embedding(net, input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE, 356 | weights_init=embedding_matrix, trainable=True) 357 | net = tflearn.lstm(net, HIDDEN_DIMS, dropout=0.8, dynamic=True) 358 | net = tflearn.fully_connected(net, VOCAB_SIZE, activation='softmax') 359 | net = tflearn.regression(net, optimizer='adam', learning_rate=LR, 360 | loss='categorical_crossentropy', name='target') 361 | model = tflearn.DNN(net, best_checkpoint_path="./best_chkpnts/", 362 | max_checkpoints= 100, tensorboard_dir='./chkpnts/', 363 | best_val_accuracy=0.70, tensorboard_verbose=0) 364 | 365 | prev_epoch = -1 366 | n_batch = 1 367 | for batch in train: 368 | if batch[2] != prev_epoch: 369 | n_batch = 1 370 | prev_epoch = batch[2] 371 | print('Training Epoch {}'.format(batch[2])) 372 | (X_test, Y_test, val_epoch) = next(validate) 373 | print('Fitting Batch: {}'.format(n_batch)) 374 | model.fit(batch[0], batch[1], validation_set=(X_test, Y_test), 375 | show_metric=True, n_epoch=1) 376 | n_batch += 1 377 | 378 | 379 | # # 7. Inference 380 | # The story does not get over after you train the model. We need to understand how to make inference using this trained model. Well honestly, this model is not even close to trained. 
We used just one article from Wikipedia to train this Language Model so we cannot expect it to be good. The idea was to realise the steps required actually build a Language Model from scratch. Now let us look at how to make an inference from the model that we just trained. 381 | # 382 | # ## 7.1 Log Probability of a Sequence 383 | # Given a new sequence of terms, we would like to know the probability of the occurance of this sequence in the Language. We make use of our trained model (which we assume to be a represenattion of the Langauge) and calculate the n-gram probabilities and aggregate them to find a final probability score. 384 | 385 | # In[ ]: 386 | 387 | 388 | def get_sequence_prob(in_string, n, model): 389 | in_tokens, in_lengths = preprocess_corpus(in_string) 390 | in_ids = word2idseq(in_tokens, Word2Idx) 391 | X, Y_, Y = prepare_data(in_ids, n) 392 | preds = model.predict(X) 393 | log_prob = 0.0 394 | for y_i, y in enumerate(Y): 395 | log_prob += np.log(preds[y_i, y]) 396 | 397 | log_prob = log_prob/len(Y) 398 | return log_prob 399 | 400 | in_strings = ['hello I am science', 'blah blah blah', 'deep learning', 'answer', 401 | 'Boltzman', 'from the previous layer as input', 'ahcblheb eDHLHW SLcA'] 402 | for in_string in in_strings: 403 | log_prob = get_sequence_prob(in_string, 5, model) 404 | print(log_prob) 405 | 406 | 407 | # To get the probability of the sequence, we take the n-grams of the sequence and we infer the probability of the next term to occur, take it's log and sum it with the log probabilities of all the other n-grams. The final score is the average over all. There can be other ways to look at it too. You can notmalize by n too, where n is the number of grans you considered. 408 | 409 | # # 7.2 Generating a Sequence 410 | # Since we trained this Language model to predict the next term given the previous 'n' terms, we can sample sequences out of this model too. We start with a random term and feed it to the Model. The Model predicts the next term and then we concat it with our previous term and feed it again to the Model. In this way we can generate arbitarily long sequences from the Model. 
Let us see how this naive model generates sequences, 411 | 412 | # In[ ]: 413 | 414 | 415 | def generate_sequences(term, word2idx, idx2word, seq_len, n_grams, model): 416 | if term not in word2idx: 417 | idseq = [[word2idx['UNKNOWN']]] 418 | else: 419 | idseq = [[word2idx[term]]] 420 | for i in range(seq_len-1): 421 | #print(idseq) 422 | padded_idseq = pad_sequences(idseq, maxlen=n_grams, value=0) 423 | next_label = model.predict_label(padded_idseq) 424 | print(next_label) 425 | idseq[0].append(next_label[0][0]) 426 | generated_str = [] 427 | for id in idseq[0]: 428 | generated_str.append(idx2word[id]) 429 | return ' '.join(generated_str) 430 | 431 | term = 'SEENCE_BEGIN' 432 | seq = generate_sequences(term, Word2Idx, Idx2Word, 10, 5, model) 433 | print(seq) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # neural-language-model 2 | A tutorial on how to build your own Neural Language Model 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | backports-abc==0.5 2 | backports.weakref==1.0rc1 3 | bleach==1.5.0 4 | certifi==2017.7.27.1 5 | chardet==3.0.4 6 | cymem==1.31.2 7 | cytoolz==0.8.2 8 | decorator==4.1.2 9 | dill==0.2.7.1 10 | en-core-web-sm==1.2.0 11 | entrypoints==0.2.3 12 | ftfy==4.4.3 13 | h5py==2.7.0 14 | html5lib==0.9999999 15 | idna==2.6 16 | ipykernel==4.6.1 17 | ipython==6.1.0 18 | ipython-genutils==0.2.0 19 | ipywidgets==7.0.0 20 | jedi==0.10.2 21 | Jinja2==2.9.6 22 | jsonschema==2.6.0 23 | jupyter==1.0.0 24 | jupyter-client==5.1.0 25 | jupyter-console==5.1.0 26 | jupyter-core==4.3.0 27 | Keras==2.0.6 28 | Markdown==2.6.9 29 | MarkupSafe==1.0 30 | mistune==0.7.4 31 | murmurhash==0.26.4 32 | nbconvert==5.2.1 33 | nbformat==4.3.0 34 | notebook==5.0.0 35 | numpy==1.13.1 36 | olefile==0.44 37 | pandocfilters==1.4.2 38 | pathlib==1.0.1 39 | pexpect==4.2.1 40 | pickleshare==0.7.4 41 | Pillow==4.2.1 42 | plac==0.9.6 43 | preshed==1.0.0 44 | prompt-toolkit==1.0.15 45 | protobuf==3.4.0 46 | ptyprocess==0.5.2 47 | Pygments==2.2.0 48 | python-dateutil==2.6.1 49 | PyYAML==3.12 50 | pyzmq==16.0.2 51 | qtconsole==4.3.1 52 | regex==2017.7.28 53 | requests==2.18.4 54 | scipy==0.19.1 55 | simplegeneric==0.8.1 56 | six==1.10.0 57 | spacy==1.9.0 58 | tensorflow==1.3.0 59 | tensorflow-gpu==1.0.1 60 | tensorflow-tensorboard==0.1.4 61 | termcolor==1.1.0 62 | terminado==0.6 63 | testpath==0.3.1 64 | tflearn==0.3.2 65 | Theano==0.9.0 66 | thinc==6.5.2 67 | toolz==0.8.2 68 | tornado==4.5.1 69 | tqdm==4.15.0 70 | traitlets==4.3.2 71 | typing==3.6.2 72 | ujson==1.35 73 | urllib3==1.22 74 | wcwidth==0.1.7 75 | Werkzeug==0.12.2 76 | widgetsnbextension==3.0.0 77 | wrapt==1.10.11 78 | --------------------------------------------------------------------------------