├── Deep Learning for Supervised Language Identification ├── Identification of Language.html ├── Identification of Language.ipynb └── ReadMe.md ├── Dropout Analysis for Deep Nets ├── Dropout+Analysis.html ├── Dropout+Analysis.ipynb └── ReadMe.md ├── README.md └── TowardsWeightInitilizationInDeepNets ├── README.md ├── Weight Initialization.html └── WeightInitialization.ipynb /Deep Learning for Supervised Language Identification/Identification of Language.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Language Identification from Written Text\n", 8 | "\n", 9 | "The goal of this Notebook is to perform language identification from a written document. To this extent I use the Genesis dataset from NLTK which has six languages : Finnish, English, German, French, Swedish and Portuguese." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# -*- coding: utf-8 -*-\n", 21 | "from nltk.corpus import genesis as dataset\n", 22 | "languages = [\"finnish\", \"german\", \"portuguese\",\"english\", \"french\", \"swedish\"]" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Approach 1: Consider Stop Words as Features of Language.\n", 30 | "\n", 31 | "Hypothesis 1: For the given corpus, analyze top-K words and see how many of these are in a pre-defined stop words list. The stop-words of the language containing the most top-K words is the language of the document.\n", 32 | "For experiments, we use the Genesis Dataset (from NLTK) which has text in multiple languages.\n", 33 | "\n", 34 | "Hypothesis 1 Verification: For the above approach to work, the frequency distribution of the documents considered should follow Zipf's law. In the next step, I analyze Frequency Distribution of the dataset for verification." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/html": [ 47 | "" 48 | ], 49 | "text/plain": [ 50 | "" 51 | ] 52 | }, 53 | "metadata": {}, 54 | "output_type": "display_data" 55 | }, 56 | { 57 | "data": { 58 | "text/html": [ 59 | "
" 60 | ], 61 | "text/plain": [ 62 | "" 63 | ] 64 | }, 65 | "metadata": {}, 66 | "output_type": "display_data" 67 | } 68 | ], 69 | "source": [ 70 | "from nltk.probability import FreqDist\n", 71 | "from plotly.graph_objs import *\n", 72 | "from plotly.offline import init_notebook_mode, iplot\n", 73 | "init_notebook_mode(connected=True)\n", 74 | "\n", 75 | "corpus_words = {\"finnish\" : dataset.words('finnish.txt') , \"german\":dataset.words('german.txt'), \n", 76 | " \"portuguese\": dataset.words('portuguese.txt'), \"english\": dataset.words('english-web.txt'),\n", 77 | " \"french\": dataset.words('french.txt'), \"swedish\": dataset.words('swedish.txt')}\n", 78 | "\n", 79 | "distributions = {}\n", 80 | "for lang in corpus_words.keys():\n", 81 | " dist = dict(FreqDist(w.lower() for w in corpus_words[lang])) \n", 82 | " distributions[lang] = (sorted(dist.values(),reverse=True))\n", 83 | " \n", 84 | " \n", 85 | "data = []\n", 86 | "for lang in distributions.keys():\n", 87 | " data.append(Scatter(\n", 88 | " x = range(1, len(distributions[lang])+1),\n", 89 | " y = distributions[lang],\n", 90 | " name = lang))\n", 91 | " \n", 92 | "iplot({ 'data' : data,\n", 93 | " 'layout': Layout(title = \"Word Frequency Distribution\") \n", 94 | " })\n" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "The above charts show that the considered languages(corpus) are following a Zipf's law and hence, will have certain words that are used more frequently the others.\n", 102 | "\n", 103 | "In NLP literature, these words are termed as Stop-Words.\n", 104 | "\n", 105 | "**NOTE: In order to Enable-Disable a Language curve from the graph just click on the language at the top right-corner.** " 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 3, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "from nltk.corpus import stopwords\n", 117 | "\n", 118 | "stop_words_list = {}\n", 119 | "for lang in languages:\n", 120 | " stop_words_list[lang] = set(stopwords.words(lang)) # It is faster to search in a set(hashing O(1)) \n", 121 | " # than a list(linear O(n))." 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "The NLTK genesis corpus has sentences >1000 for each language under-consideration. For making test cases, I consider One Document in each langugage to be 50 sentences long i.e. for each language, there will be more 1000/40 = 20 documents." 
129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 4, 134 | "metadata": { 135 | "collapsed": false 136 | }, 137 | "outputs": [ 138 | { 139 | "name": "stdout", 140 | "output_type": "stream", 141 | "text": [ 142 | "Number of Sentences in:\n", 143 | " finnish 2160\n", 144 | " german 1900\n", 145 | " portuguese 1669\n", 146 | " english-web 2232\n", 147 | " french 2004\n", 148 | " swedish 1386\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "#Printing Number of Sentences in each language\n", 154 | "print \"Number of Sentences in:\"\n", 155 | "for lang in languages:\n", 156 | " if lang == \"english\":\n", 157 | " lang+=\"-web\" \n", 158 | " print \" \",lang, len(dataset.sents(lang+\".txt\"))" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 5, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "230\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "doc_set = []\n", 178 | "for lang in languages:\n", 179 | " if lang == \"english\":\n", 180 | " lang+=\"-web\" \n", 181 | " sentences = list(dataset.sents(lang+\".txt\"))\n", 182 | " doc_set+=[(lang, sentences[i:i+50]) for i in range(0, len(sentences), 50)]\n", 183 | "print len(doc_set)\n", 184 | " " 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 6, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "import string #for punctuations\n", 196 | "\n", 197 | "def predict_lang(lang_sentences):\n", 198 | " \n", 199 | " \"\"\"Returns the predicted language of a set of sentences.\n", 200 | " Input: \n", 201 | " Output: \n", 202 | " \"\"\"\n", 203 | " k = 30\n", 204 | " sentences = lang_sentences[1]\n", 205 | " words = [] \n", 206 | " for sentence in sentences:\n", 207 | " for word in sentence: \n", 208 | " if word not in string.punctuation:\n", 209 | " words += [word.lower()]\n", 210 | " \n", 211 | " dist = dict(FreqDist(words))\n", 212 | " top_k_words = sorted(dist.items(),key=lambda x:x[1],reverse=True)[:k]\n", 213 | " top_k_words = map(lambda x: x[0], top_k_words)\n", 214 | " language, max_score = None, -0.1\n", 215 | " for lang in languages:\n", 216 | " score = float(len(stop_words_list[lang].intersection(top_k_words)))/k\n", 217 | " if score > max_score:\n", 218 | " language = lang\n", 219 | " max_score = score\n", 220 | " return language " 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 7, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "from random import shuffle\n", 232 | "def test_stopwords_approach(doc_set):\n", 233 | " shuffle(doc_set)\n", 234 | " y_pred, y_actual = [],[]\n", 235 | " for i in doc_set:\n", 236 | " y_pred += [predict_lang(i)]\n", 237 | " if i[0] == 'english-web': \n", 238 | " y_actual += ['english']\n", 239 | " else:\n", 240 | " y_actual += [i[0]]\n", 241 | " return [y_actual, y_pred]\n" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 12, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | " precision recall f1-score support\n", 256 | "\n", 257 | " english 1.00 1.00 1.00 45\n", 258 | " finnish 1.00 0.95 0.98 44\n", 259 | " french 0.95 1.00 0.98 41\n", 260 | " german 1.00 1.00 1.00 38\n", 261 | " portuguese 1.00 1.00 1.00 34\n", 262 | " swedish 1.00 1.00 1.00 28\n", 263 | "\n", 264 | 
"avg / total 0.99 0.99 0.99 230\n", 265 | "\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "from sklearn.metrics import confusion_matrix, classification_report\n", 271 | "y_actual, y_pred = test_stopwords_approach(doc_set)\n", 272 | "print classification_report(y_actual, y_pred)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "**Observations:**\n", 280 | "1. The above approach seems to be working well and perfectly for 4 languages: English, German, Swedish and Portuguese. However, by looking at the numbers, it seems French and Finnish are similar.\n", 281 | "2. Can there be similar or same Stop Words because of which this is occuring? Upon taking a look the stop words, it seems French and Finnish have 7 common Stop Words." 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 13, 287 | "metadata": { 288 | "collapsed": false 289 | }, 290 | "outputs": [ 291 | { 292 | "name": "stdout", 293 | "output_type": "stream", 294 | "text": [ 295 | "Number of Stop Words common in: \n", 296 | " Finnish and French: 7\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "print \"Number of Stop Words common in: \"\n", 302 | "print \" Finnish and French: \",len(stop_words_list['finnish'].intersection(stop_words_list['french']))" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "**Advantages of the approach**:\n", 310 | "1. Impressive performance with respect to F1 Score and other classification metrics.\n", 311 | "2. Robust.\n", 312 | "3. Unsupervised. No training required." 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "**Limitations of the approach :**\n", 320 | "1. This approach will not work for short documents like Tweets, Search Queries etc as these datasets don't have Stop-Words." 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": { 326 | "collapsed": true 327 | }, 328 | "source": [ 329 | "### Hypothesis 2:\n", 330 | "The stop-words approach will not work with small text samples.\n", 331 | "\n", 332 | "Using the same Genesis Dataset, I created a dataset such that each document is three words. If the stop words would work, the result should be similar. These Tri-gram Language Pairs still have stop-words." 
333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 14, 338 | "metadata": { 339 | "collapsed": false 340 | }, 341 | "outputs": [ 342 | { 343 | "name": "stdout", 344 | "output_type": "stream", 345 | "text": [ 346 | "             precision    recall  f1-score   support\n", 347 | "\n", 348 | "    english       0.14      0.23      0.17     32600\n", 349 | "    finnish       0.17      0.01      0.02     36178\n", 350 | "     french       0.18      0.53      0.27     40281\n", 351 | "     german       0.00      0.00      0.00     36141\n", 352 | " portuguese       0.33      0.49      0.40     40117\n", 353 | "    swedish       0.00      0.00      0.00     48222\n", 354 | "\n", 355 | "avg / total       0.13      0.21      0.14    233539\n", 356 | "\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "from nltk import ngrams\n", 362 | "from string import punctuation\n", 363 | "\n", 364 | "def create_trigram_dataset(doc_set):\n", 365 | "    \"\"\"\n", 366 | "    Input: doc_set : list of (language, document) pairs\n", 367 | "    Output: list of (language, word-trigram) pairs\n", 368 | "    \"\"\"\n", 369 | "    \n", 370 | "    result = []\n", 371 | "    for i in doc_set: \n", 372 | "        for k in i[1]: \n", 373 | "            punc = list(punctuation) \n", 374 | "            temp = filter(lambda x: x not in punc+[',', \"'\", \"!\", \":\"], k)  # drop punctuation tokens\n", 375 | "            temp = map(lambda x: x.lower(), temp)\n", 376 | "            trigrams = ngrams(temp,3)\n", 377 | "            for j in trigrams:\n", 378 | "                result.append([i[0],j])\n", 379 | "    \n", 380 | "    \n", 381 | "    \n", 382 | "    return result\n", 383 | "\n", 384 | "tri_gram_dataset = create_trigram_dataset(doc_set)\n", 385 | "\n", 386 | "y_actual, y_pred = test_stopwords_approach(tri_gram_dataset)\n", 387 | "print classification_report(y_actual, y_pred)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "### Hypothesis 2 Results: \n", 395 | "As we can see, the stop-words approach failed when the length of the documents was three words.\n", 396 | "\n", 397 | "**Next Steps:**\n", 398 | "  1. What if we could look at 'popular' character n-grams instead of stop words (popular words)?" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "## Approach 2: Identification of language using most popular char-n-grams\n", 406 | "\n", 407 | "In [1], the authors used the most popular char-n-grams for language detection.\n", 408 | "\n", 409 | "**Hypothesis 3:** There are certain char-n-grams which are more frequent in a language than most other char-n-grams.\n", 410 | "\n", 411 | "**Hypothesis 3 Validation:** Create char-trigram frequency distributions to see whether Zipf's law also holds for char-n-grams."
412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 16, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [ 421 | { 422 | "ename": "KeyboardInterrupt", 423 | "evalue": "", 424 | "output_type": "error", 425 | "traceback": [ 426 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 427 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 428 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0mtri_grams\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mword\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mcorpus_words\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mlang\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mtri_grams\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtri_grams\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mn_grams\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mdist\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mFreqDist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtri_grams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0mchar_trigrams\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mlang\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0msorted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdist\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mreverse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 429 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "#corpus_words was declared earlier in the notebook - contains words of the 6 languages being considered.\n", 435 | "\n", 436 | "def n_grams(s, n=3):\n", 437 | " \"\"\"\"Returns char-n-grams of a words \n", 438 | " \"\"\"\n", 439 | " s = \"#\"+ s + \"#\"\n", 440 | " return [s[i:i+n] for i in range(len(s)-n+1)]\n", 441 | "\n", 442 | "char_trigrams = {}\n", 443 | "for lang in corpus_words.keys(): \n", 444 | " tri_grams = []\n", 445 | " for word in corpus_words[lang]:\n", 446 | " tri_grams = tri_grams + n_grams(word.lower())\n", 447 | " dist = dict(FreqDist(tri_grams))\n", 448 | " char_trigrams[lang] = (sorted(dist.values(),reverse=True))\n", 449 | " \n", 450 | "data = []\n", 451 | "for lang in char_trigrams.keys():\n", 452 | " data.append(Scatter(\n", 453 | " x = range(1, len(char_trigrams[lang])+1),\n", 454 | " y = char_trigrams[lang],\n", 455 | " name = lang))\n", 456 | " \n", 457 | "iplot({ 'data' : data,\n", 458 | " 'layout': Layout(title = \"Char-tri-gram Frequency Distribution\") \n", 459 | " })\n" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "**Hypothesis 3** is validated. There are certain n-grams (tri-grams) which are more frequent than most other char-n-grams.\n", 467 | "\n", 468 | "**Method:** Divide the data into train-test division (80%-20%). From the training set, most frequent top-k character n-grams. 
For each document (word-trigrams) in the test set, extract the char-tri-grams and select the language with which there is maximum overlap." 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": { 475 | "collapsed": false 476 | }, 477 | "outputs": [], 478 | "source": [ 479 | "from random import shuffle\n", 480 | "from sklearn.cross_validation import train_test_split\n", 481 | "\n", 482 | "shuffle(tri_gram_dataset)\n", 483 | "train_set, test_set = train_test_split(tri_gram_dataset, test_size = 0.20)\n", 484 | "print len(train_set), len(test_set)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": { 491 | "collapsed": false 492 | }, 493 | "outputs": [], 494 | "source": [ 495 | "def get_char_ngram(trigram,k=3):\n", 496 | "    tri_grams = []\n", 497 | "    for word in trigram:\n", 498 | "        tri_grams = tri_grams + n_grams(word.lower())\n", 499 | "    return tri_grams\n", 500 | "\n", 501 | "def top_k_ngrams_features(n=3, k=50):\n", 502 | "    \"\"\"Input: n of the char-n-grams;\n", 503 | "              k of top-k\n", 504 | "       Processes train_set (list of (language, word-trigram) pairs) defined above\n", 505 | "       Returns the top-k character-n-grams of each language in the form of\n", 506 | "       {language: set of top-k char-n-grams}\n", 507 | "    \"\"\"    \n", 508 | "    char_trigrams = {} \n", 509 | "    for i in train_set:\n", 510 | "        \n", 511 | "        if i[0] in char_trigrams:\n", 512 | "            char_trigrams[i[0]] += get_char_ngram(i[1])\n", 513 | "        else:\n", 514 | "            char_trigrams[i[0]] = get_char_ngram(i[1])\n", 515 | "    \n", 516 | "    for lang in char_trigrams.keys():\n", 517 | "        dist = dict(FreqDist(char_trigrams[lang]))  \n", 518 | "        top_k_char_n_gram = sorted(dist, key=dist.get, reverse=True)[:k]  # sort n-grams by frequency, keep the top-k\n", 519 | "        char_trigrams[lang] = set(top_k_char_n_gram)\n", 520 | "    return char_trigrams\n", 521 | "\n", 522 | "char_trigrams = top_k_ngrams_features()\n", 523 | "\n", 524 | "    " 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "collapsed": false 532 | }, 533 | "outputs": [], 534 | "source": [ 535 | "def predict_language_char_ngrams(trigram):\n", 536 | "    language, max_score = None, -0.1\n", 537 | "    char_ngrams = get_char_ngram(trigram)    \n", 538 | "    \n", 539 | "    for lang in languages:\n", 540 | "        if lang == 'english':\n", 541 | "            lang = 'english-web'\n", 542 | "        score = float(len(char_trigrams[lang].intersection(char_ngrams)))/float(len(char_ngrams))\n", 543 | "        if score > max_score:\n", 544 | "            language = lang\n", 545 | "            max_score = score\n", 546 | "    return language    \n", 547 | "    \n", 548 | "\n", 549 | "y_actual, y_pred = [], []\n", 550 | "\n", 551 | "for i in test_set:\n", 552 | "    y_actual.append(i[0])\n", 553 | "    y_pred.append(predict_language_char_ngrams(i[1]))\n", 554 | "    \n", 555 | "print classification_report(y_actual, y_pred)\n", 556 | "    " 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "metadata": { 563 | "collapsed": false 564 | }, 565 | "outputs": [], 566 | "source": [ 567 | "# Checking scores for the stop-words approach on the test set only\n", 568 | "y_actual, y_pred = test_stopwords_approach(test_set)\n", 569 | "print classification_report(y_actual, y_pred)" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": { 575 | "collapsed": true 576 | }, 577 | "source": [ 578 | "#### Observations:\n", 579 | "1. There is a noticeable increase in precision over the stop-words approach on the same test set; however, recall has increased by only 0.02.\n", 580 | "2. 
The character n-gram approach might still work better if more n-grams and more training data are used." 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": {}, 586 | "source": [ 587 | "## Approach 3: Using Distributed Character-Level Word Representation\n", 588 | "\n", 589 | "### Background:\n", 590 | "#### Skip-Gram Model: \n", 591 | "In [4], the authors proposed two neural network models for distributed representation of words - CBOW (Continuous Bag-of-Words Model) and Skip-Gram. \n", 592 | "* The CBOW model takes as input the 'n' words before and after a position and predicts the middle word.\n", 593 | "* The Skip-Gram model takes a word as input and predicts the 'n' words before and after it.\n", 594 | "![title](CBOW-SkipGram.png)\n", 595 | "\n", 596 | "In more recent work, FastText [2, 3], the CBOW and Skip-Gram models were extended to incorporate character-level information for a better understanding of text. The major contribution of that work is to model a word vector as the sum of the vectors of its character n-grams." 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": { 603 | "collapsed": false 604 | }, 605 | "outputs": [], 606 | "source": [ 607 | "import fasttext\n", 608 | "\n", 609 | "def create_train_file(doc_set,fname):\n", 610 | "    \"\"\" Creates a text file with one line per tri-gram: '__label__language trigram-text',\n", 611 | "        where the label is the language. FastText takes a file as input for training.\n", 612 | "        Returns: File name of the created file.\n", 613 | "    \"\"\"    \n", 614 | "    train_file = open(fname,\"w+\")\n", 615 | "    for i in doc_set: \n", 616 | "        label = \"__label__\"+i[0]\n", 617 | "        text = \" \".join(i[1])\n", 618 | "        train_file.write(label.encode('utf8')+ \" \" +text.encode('utf8')+\"\\n\")\n", 619 | "        \n", 620 | "    train_file.close()\n", 621 | "    return fname\n", 622 | "\n", 623 | "train_filename = create_train_file(train_set,\"Train_File.txt\")\n" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": { 630 | "collapsed": false 631 | }, 632 | "outputs": [], 633 | "source": [ 634 | "model = fasttext.supervised(train_filename, 'model',min_count=1,epoch=10,ws=3,\n", 635 | "                            label_prefix='__label__',dim=50)\n", 636 | "# For sanity checks\n", 637 | "print model.labels" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "collapsed": true 645 | }, 646 | "outputs": [], 647 | "source": [] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 55, 652 | "metadata": { 653 | "collapsed": false 654 | }, 655 | "outputs": [ 656 | { 657 | "name": "stdout", 658 | "output_type": "stream", 659 | "text": [ 660 | "             precision    recall  f1-score   support\n", 661 | "\n", 662 | "english-web       1.00      1.00      1.00      6413\n", 663 | "    finnish       0.98      0.99      0.99      7201\n", 664 | "     french       0.99      0.99      0.99      8008\n", 665 | "     german       0.99      0.99      0.99      7193\n", 666 | " portuguese       0.99      0.99      0.99      8110\n", 667 | "    swedish       0.99      0.99      0.99      9783\n", 668 | "\n", 669 | "avg / total       0.99      0.99      0.99     46708\n", 670 | "\n" 671 | ] 672 | } 673 | ], 674 | "source": [ 675 | "def get_test_pred(test_set):\n", 676 | "    \"\"\"\n", 677 | "    Input: test_set : list of (language, word-trigram) pairs\n", 678 | "    Output: [y_actual, y_pred]\n", 679 | "    \"\"\"\n", 680 | "    y_actual, y_pred = [], []\n", 681 | "    for i in test_set:\n", 682 | "        y_actual.append(i[0])\n", 683 | "        pred = model.predict([\" \".join(i[1])])[0][0]\n", 684 | "        y_pred.append(pred)\n", 685 | "    return [y_actual, y_pred]\n", 686 | "\n",
687 | "\n", 688 | "y_actual, y_pred = get_test_pred(test_set)\n", 689 | "print classification_report(y_actual, y_pred)" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": { 695 | "collapsed": false 696 | }, 697 | "source": [ 698 | "## Observations:\n", 699 | "\n", 700 | "1. The word-embedding-based classification gave really impressive results even with such a small amount of training data.\n", 701 | "\n", 702 | "#### Note: The reason for not employing the original Skip-Gram and CBOW models for language identification is their inability to handle unseen words, a limitation addressed in the FastText paper.\n" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "## Conclusions:\n", 710 | "1. The stop-words approach is robust and doesn't require any training, but it fails when it comes to short texts.\n", 711 | "2. The char-n-gram approach showed some improvement on short texts, but the results were not impressive.\n", 712 | "3. Char-n-gram-based word embeddings (FastText) successfully identified the actual language of short texts.\n", 713 | "4. Hence, for long texts the stop-words approach could be the best fit, while for short texts a pre-trained model based on char-n-grams gives better results." 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "References:\n", 721 | "1. Cavnar, William B., and John M. Trenkle. \"N-gram-based text categorization.\" Ann Arbor MI 48113.2 (1994): 161-175.\n", 722 | "2. Bojanowski, Piotr, et al. \"Enriching Word Vectors with Subword Information.\" arXiv preprint arXiv:1607.04606 (2016).\n", 723 | "3. Joulin, Armand, et al. \"Bag of Tricks for Efficient Text Classification.\" arXiv preprint arXiv:1607.01759 (2016).\n", 724 | "4. Mikolov, Tomas, et al. \"Efficient estimation of word representations in vector space.\" arXiv preprint arXiv:1301.3781 (2013)." 725 | ] 726 | } 727 | ], 728 | "metadata": { 729 | "kernelspec": { 730 | "display_name": "Python 2", 731 | "language": "python", 732 | "name": "python2" 733 | }, 734 | "language_info": { 735 | "codemirror_mode": { 736 | "name": "ipython", 737 | "version": 2 738 | }, 739 | "file_extension": ".py", 740 | "mimetype": "text/x-python", 741 | "name": "python", 742 | "nbconvert_exporter": "python", 743 | "pygments_lexer": "ipython2", 744 | "version": "2.7.6" 745 | } 746 | }, 747 | "nbformat": 4, 748 | "nbformat_minor": 0 749 | } 750 | -------------------------------------------------------------------------------- /Deep Learning for Supervised Language Identification/ReadMe.md: -------------------------------------------------------------------------------- 1 | # Deep Learning for Supervised Language Identification 2 | In this post, I primarily build approaches to see how deep learning performs in comparison to standard baselines for language identification of written text. Please note that the text is in computer-readable form, i.e. this is not an OCR (Optical Character Recognition) experiment. 3 | 4 | I employed the [FastText](https://github.com/facebookresearch/fastText) Python package to perform the task at hand. To learn more, please head to the [blog post @ medium](https://medium.com/@amarbudhiraja/supervised-language-identification-for-short-and-long-texts-with-code-626f9c78c47c#.64uegxk0m).
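A minimal sketch of the training and prediction calls, following the notebook's usage of the older `fasttext` Python wrapper (the training-file name and the example snippet below are placeholders):

```python
# Sketch only: assumes a training file where each line is
#   __label__<language> <text>    e.g. "__label__swedish och gud sade"
import fasttext

# Train a supervised classifier; character-n-gram (subword) information is built in.
model = fasttext.supervised("Train_File.txt", "model",
                            min_count=1, epoch=10, ws=3,
                            label_prefix="__label__", dim=50)

# Predict the language of a short snippet (returns one list of labels per input text).
print(model.predict(["och gud sade"])[0][0])
```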
5 | -------------------------------------------------------------------------------- /Dropout Analysis for Deep Nets/ReadMe.md: -------------------------------------------------------------------------------- 1 | # Dropout Analysis for Deep Nets 2 | 3 | In this notebook, I try to see how dropout works in deep neural networks. I used the CIFAR-10 dataset and a convolutional network, and trained it with dropout rates from 0.0 (no dropout) to 0.9. Validation accuracy and loss are plotted for each setting. 4 | For more detail, please refer to the [blog post @medium](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5#.298s1htjv) 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning Experiments 2 | 3 | In this repository, I will primarily push code related to the deep learning experiments that I do. This ReadMe will keep an updated list of the blog posts for the experiments as well as the code. 4 | 5 | ### Contents 6 | 1. Towards Weight Initialization in Deep Neural Networks : [Blog Post (@Medium)](https://medium.com/@amarbudhiraja/towards-weight-initialization-in-deep-neural-networks-908d3d9f1e02) and [Code (on GitHub)](https://github.com/budhiraja/DeepLearningExperiments/tree/master/TowardsWeightInitilizationInDeepNets) 7 | 2. Deep Learning for Supervised Language Identification for written text : [Blog Post (@Medium)](https://medium.com/@amarbudhiraja/supervised-language-identification-for-short-and-long-texts-with-code-626f9c78c47c) and [Code (on Github)](https://github.com/budhiraja/DeepLearningExperiments/tree/master/Deep%20Learning%20for%20Supervised%20Language%20Identification) 8 | 3. Dropout in (Deep) Machine learning : [Blog Post (@Medium)](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5#.298s1htjv) and [Code (on Github)](https://github.com/budhiraja/DeepLearningExperiments/tree/master/Dropout%20Analysis%20for%20Deep%20Nets) 9 | -------------------------------------------------------------------------------- /TowardsWeightInitilizationInDeepNets/README.md: -------------------------------------------------------------------------------- 1 | # Weight Initialization for Deep Networks 2 | 3 | The repo contains a Jupyter notebook where I compare different weight initializations for deep nets. 4 | I use a convolutional DNN for image classification on the MNIST dataset. 5 | The notebook uses Keras. 6 | 7 | A blog post can be found here: https://medium.com/@amarbudhiraja/towards-weight-initialization-in-deep-neural-networks-908d3d9f1e02#.6ik0pvnwu 8 | --------------------------------------------------------------------------------