├── Introduction to NLP test.ipynb
├── README.md
└── NLP workshop - IBM Developer 2022.ipynb

/Introduction to NLP test.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Introduction to Natural Language Processing\n",
 8 |     "7th December, 2019\n",
 9 |     "\n",
10 |     "Created by Grishma Jena\n"
11 |    ]
12 |   },
13 |   {
14 |    "cell_type": "code",
15 |    "execution_count": null,
16 |    "metadata": {},
17 |    "outputs": [],
18 |    "source": [
19 |     "!pip install wikipedia\n",
20 |     "try:\n",
21 |     "    import numpy as np\n",
22 |     "    import pandas as pd\n",
23 |     "    import sklearn, lxml, requests, matplotlib, wikipedia, gensim, wordcloud\n",
24 |     "    import nltk\n",
25 |     "    nltk.download('punkt')\n",
26 |     "    nltk.download('gutenberg')\n",
27 |     "    print('Done! Everything works as expected.')\n",
28 |     "except ImportError as error:  # Report which import failed instead of failing silently\n",
29 |     "    print('Error!', error)"
30 |    ]
31 |   },
32 |   {
33 |    "cell_type": "code",
34 |    "execution_count": null,
35 |    "metadata": {},
36 |    "outputs": [],
37 |    "source": [
38 |     "from nltk.corpus import gutenberg\n",
39 |     "\n",
40 |     "print(gutenberg.raw('austen-emma.txt')[:50]) # Printing the first 50 characters of austen-emma.txt"
41 |    ]
42 |   }
43 |  ],
44 |  "metadata": {
45 |   "kernelspec": {
46 |    "display_name": "Python 3",
47 |    "language": "python",
48 |    "name": "python3"
49 |   },
50 |   "language_info": {
51 |    "codemirror_mode": {
52 |     "name": "ipython",
53 |     "version": 3
54 |    },
55 |    "file_extension": ".py",
56 |    "mimetype": "text/x-python",
57 |    "name": "python",
58 |    "nbconvert_exporter": "python",
59 |    "pygments_lexer": "ipython3",
60 |    "version": "3.7.3"
61 |   }
62 |  },
63 |  "nbformat": 4,
64 |  "nbformat_minor": 2
65 | }
66 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Tutorial: Natural Language Processing
 2 | Instructor: [Grishma Jena](https://gjena.github.io)
 3 | 
 4 | How can computers interpret something as human as language? Can they actually understand what we are saying, or are they hiding behind a façade of rules and algorithms? How do these systems of zeroes and ones make sense of words? This workshop introduces Natural Language Processing in Python and sheds light on how computers interpret our language. Attendees are introduced to NLTK and Gensim, libraries that help them tokenize, process and represent textual data. We will see how data is distilled into different linguistic features that power Machine Learning applications like text classifiers, sentiment analyzers and topic modelers.
 5 | 
 6 | We will be using Jupyter to execute Python code for the purpose of this Natural Language Processing tutorial. It is highly recommended to use Python 3, as Python 2 ~~will be~~ has been sunset on January 1, 2020. A virtual environment can be used to manage and isolate the packages for our project. Please follow these instructions to have all the dependencies ready before the tutorial, as that will enable us to hit the ground running.
 7 | 
 8 | __Pre-requisites__
 9 | 
10 | *Using Jupyter on your local machine*
11 | 
12 | Requires installation of packages but you will be able to use Jupyter and run code offline.
13 | 1. Ensure that pip is installed and upgrade it. Pip should already be available if you are using Python 2 >= 2.7.9 or Python 3 >= 3.4 downloaded from python.org. For further installation instructions, check [this](https://pip.pypa.io/en/stable/installing/).
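
For example, to check which pip you have and then upgrade it (a minimal sketch, assuming the `python3` launcher is on your PATH):

```
python3 -m pip --version
python3 -m pip install --upgrade pip --user
```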
14 | 
15 | 2. Optional: If you plan on using a virtual environment, ensure virtualenv (Python 2) or venv (Python 3) is installed. Create a virtual environment and activate it. Detailed instructions [here](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
16 | 
17 | 3. Install the required packages using pip in the terminal:
18 | 
19 | * Python 3:
20 | 
21 | ```
22 | python3 -m pip install jupyter nltk lxml requests matplotlib scikit-learn
23 | wikipedia gensim wordcloud --user
24 | ```
25 | 
26 | If you face problems installing NLTK, take a look at [this](https://www.nltk.org/install.html).
27 | 
28 | 4. Open a Jupyter notebook by running `jupyter notebook` in your terminal. This opens in your browser at the default port 8888.
29 | 
30 | 5. Download the sample notebook ‘Introduction to NLP test’ and open it in Jupyter. Execute the code by clicking on Cell -> Run Cells. Check out [this video](https://www.youtube.com/watch?v=jZ952vChhuI) for a quick introduction to Jupyter.
31 | 
32 | 
33 | Feel free to contact me in case of any queries.
34 | 
--------------------------------------------------------------------------------
/NLP workshop - IBM Developer 2022.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# A hands-on introduction to Natural Language Processing\n",
 8 |     "#### 23rd March, 2022 for IBM Call for Code\n",
 9 |     "## Pre-requisites\n",
10 |     "Please have the pre-requisites ready on your machine. \n",
11 |     "\n",
12 |     "![Typing cat gif](https://media.giphy.com/media/o0vwzuFwCGAFO/giphy.gif)\n",
13 |     "\n",
14 |     "```jupyter notebook``` opens up a Jupyter notebook in your browser at the default port 8888."
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "markdown",
19 |    "metadata": {},
20 |    "source": [
21 |     "## NLTK\n",
22 |     "The Natural Language Toolkit (NLTK) is the most popular collection of libraries and programs for doing NLP. You can find out more about it here: http://www.nltk.org/book/ch00.html. We will mainly be using NLTK to perform different tasks, along with a few other packages."
23 |    ]
24 |   },
25 |   {
26 |    "cell_type": "code",
27 |    "execution_count": 14,
28 |    "metadata": {},
29 |    "outputs": [
30 |     {
31 |      "name": "stderr",
32 |      "output_type": "stream",
33 |      "text": [
34 |       "[nltk_data] Downloading package punkt to /Users/gjena/nltk_data...\n",
35 |       "[nltk_data]   Package punkt is already up-to-date!\n",
36 |       "[nltk_data] Downloading package gutenberg to /Users/gjena/nltk_data...\n",
37 |       "[nltk_data]   Package gutenberg is already up-to-date!\n"
38 |      ]
39 |     }
40 |    ],
41 |    "source": [
42 |     "import nltk\n",
43 |     "nltk.download('punkt')\n",
44 |     "nltk.download('gutenberg')\n",
45 |     "\n",
46 |     "import warnings\n",
47 |     "warnings.filterwarnings('ignore')  # Ignoring warnings"
48 |    ]
49 |   },
50 |   {
51 |    "cell_type": "markdown",
52 |    "metadata": {},
53 |    "source": [
54 |     "NLTK comes pre-loaded with texts from the *Project Gutenberg* archive that you can use. It also has a collection of informal text from discussion forums, conversations, chat sessions, movie scripts, etc. NLTK has corpora in other languages as well."
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 150, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "austen-emma.txt [Emma by Jane Austen 1816]\n", 69 | "\n", 70 | "VOLUME I\n", 71 | "\n", 72 | "CHAPTER I\n", 73 | "\n", 74 | "\n", 75 | "Emma Woodhouse, handsome, clever, and rich, with a ...\n", 76 | "austen-persuasion.txt [Persuasion by Jane Austen 1818]\n", 77 | "\n", 78 | "\n", 79 | "Chapter 1\n", 80 | "\n", 81 | "\n", 82 | "Sir Walter Elliot, of Kellynch Hall, in Somersetshire ...\n", 83 | "austen-sense.txt [Sense and Sensibility by Jane Austen 1811]\n", 84 | "\n", 85 | "CHAPTER 1\n", 86 | "\n", 87 | "\n", 88 | "The family of Dashwood had long been settle ...\n", 89 | "bible-kjv.txt [The King James Bible]\n", 90 | "\n", 91 | "The Old Testament of the King James Bible\n", 92 | "\n", 93 | "The First Book of Moses: Called ...\n", 94 | "blake-poems.txt [Poems by William Blake 1789]\n", 95 | "\n", 96 | " \n", 97 | "SONGS OF INNOCENCE AND OF EXPERIENCE\n", 98 | "and THE BOOK of THEL\n", 99 | "\n", 100 | "\n", 101 | " SONGS ...\n", 102 | "bryant-stories.txt [Stories to Tell to Children by Sara Cone Bryant 1918] \r\n", 103 | "\r\n", 104 | "\r\n", 105 | "TWO LITTLE RIDDLES IN RHYME\r\n", 106 | "\r\n", 107 | "\r\n", 108 | " T ...\n", 109 | "burgess-busterbrown.txt [The Adventures of Buster Bear by Thornton W. Burgess 1920]\r\n", 110 | "\r\n", 111 | "I\r\n", 112 | "\r\n", 113 | "BUSTER BEAR GOES FISHING\r\n", 114 | "\r\n", 115 | "\r\n", 116 | "Bu ...\n", 117 | "carroll-alice.txt [Alice's Adventures in Wonderland by Lewis Carroll 1865]\n", 118 | "\n", 119 | "CHAPTER I. Down the Rabbit-Hole\n", 120 | "\n", 121 | "Alice was ...\n", 122 | "chesterton-ball.txt [The Ball and The Cross by G.K. Chesterton 1909]\n", 123 | "\n", 124 | "\n", 125 | "I. A DISCUSSION SOMEWHAT IN THE AIR\n", 126 | "\n", 127 | "The flying s ...\n", 128 | "chesterton-brown.txt [The Wisdom of Father Brown by G. K. Chesterton 1914]\n", 129 | "\n", 130 | "\n", 131 | "I. The Absence of Mr Glass\n", 132 | "\n", 133 | "\n", 134 | "THE consulting- ...\n", 135 | "chesterton-thursday.txt [The Man Who Was Thursday by G. K. Chesterton 1908]\n", 136 | "\n", 137 | "To Edmund Clerihew Bentley\n", 138 | "\n", 139 | "A cloud was on the ...\n", 140 | "edgeworth-parents.txt [The Parent's Assistant, by Maria Edgeworth]\r\n", 141 | "\r\n", 142 | "\r\n", 143 | "THE ORPHANS.\r\n", 144 | "\r\n", 145 | "Near the ruins of the castle of Ro ...\n", 146 | "melville-moby_dick.txt [Moby Dick by Herman Melville 1851]\r\n", 147 | "\r\n", 148 | "\r\n", 149 | "ETYMOLOGY.\r\n", 150 | "\r\n", 151 | "(Supplied by a Late Consumptive Usher to a Gr ...\n", 152 | "milton-paradise.txt [Paradise Lost by John Milton 1667] \n", 153 | " \n", 154 | " \n", 155 | "Book I \n", 156 | " \n", 157 | " \n", 158 | "Of Man's first disobedience, and the fruit \n", 159 | "Of ...\n", 160 | "shakespeare-caesar.txt [The Tragedie of Julius Caesar by William Shakespeare 1599]\n", 161 | "\n", 162 | "\n", 163 | "Actus Primus. Scoena Prima.\n", 164 | "\n", 165 | "Enter Fla ...\n", 166 | "shakespeare-hamlet.txt [The Tragedie of Hamlet by William Shakespeare 1599]\n", 167 | "\n", 168 | "\n", 169 | "Actus Primus. Scoena Prima.\n", 170 | "\n", 171 | "Enter Barnardo a ...\n", 172 | "shakespeare-macbeth.txt [The Tragedie of Macbeth by William Shakespeare 1603]\n", 173 | "\n", 174 | "\n", 175 | "Actus Primus. 
Scoena Prima.\n", 176 | "\n", 177 | "Thunder and Lig ...\n", 178 | "whitman-leaves.txt [Leaves of Grass by Walt Whitman 1855]\n", 179 | "\n", 180 | "\n", 181 | "Come, said my soul,\n", 182 | "Such verses for my Body let us write, ( ...\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "from nltk.corpus import gutenberg\n", 188 | "\n", 189 | "# Printing the first 100 characters of each of the files\n", 190 | "for fileid in gutenberg.fileids():\n", 191 | " print(fileid, gutenberg.raw(fileid)[:100], '...')\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## Getting the data\n", 199 | "![Fetch data](https://media.giphy.com/media/4FQMuOKR6zQRO/giphy.gif)\n", 200 | "\n", 201 | "Data can come from a variety of sources in different formats. Natural language can be in the form of text or speech. For the purpose of this tutorial, we will be focusing on text-based processing as opposed to speech recognition and synthesis. Textual data can be stored in databases, dataframes, text files, webpages, etc. A list of text datasets can be found here: https://github.com/niderhoff/nlp-datasets.
\n", 202 | "\n", 203 | "Let's create a list with a few sentences that will serve as the sample data." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 15, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "\n", 216 | "Sample data ['Today is 23rd March. I am in San Francisco, California. Currently I am attending a Natural Language Processing workshop.']\n" 217 | ] 218 | } 219 | ], 220 | "source": [ 221 | "sample_data = [\"Today is 23rd March. I am in San Francisco, California. Currently I am \" \\\n", 222 | " \"attending a Natural Language Processing workshop.\"]\n", 223 | "print('\\nSample data', sample_data)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "You can also scrape a webpage using the `requests` and `lxml` libraries. Let's trying scraping a paragraph from the landing page for IBM Call for Code. Use 'Inspect' functionality of your browser for the webpage to get the XPath for a particular element. " 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 11, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "\n", 243 | "Conference data: ['Through Call for Code, top solutions are actively supported to bring the technology into communities in need, working with partners like the United Nations and the Linux Foundation. Deployments are underway across the globe.\\xa0']\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "from lxml import html\n", 249 | "import requests\n", 250 | "\n", 251 | "# Scraping data from a webpage element\n", 252 | "page = requests.get('https://developer.ibm.com/callforcode/')\n", 253 | "tree = html.fromstring(page.content)\n", 254 | "webpage_data = tree.xpath('////*[@id=\"about\"]/div/div/div[1]/p/text()') \n", 255 | "\n", 256 | "# Iterating over all the elements\n", 257 | "conference_data = []\n", 258 | "for item in webpage_data:\n", 259 | " conference_data.append(item)\n", 260 | "print('\\nConference data:', conference_data)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "## Sentence segmentation\n", 268 | "A paragraph is nothing but a collection of sentences. Also called sentence tokenization or sentence boundary disambiguation, this process breaks up sentences by deciding where a sentence starts and ends. Challenges include recognizing ambiguous puncutation marks. For example, `.` can be used for a decimal point, an ellipsis or a period. Let's use ```sent_tokenize``` from ```nltk.tokenize``` to get sentences." 
269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 16, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "from nltk.tokenize import sent_tokenize\n", 278 | "\n", 279 | "def get_sent_tokens(data):\n", 280 | " \"\"\"Sentence tokenization\"\"\"\n", 281 | " sentences = []\n", 282 | " for sent in data:\n", 283 | " sentences.extend(sent_tokenize(sent))\n", 284 | " print('\\nSentence tokens:', sentences)\n", 285 | " return sentences" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 17, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "\n", 298 | "Sentence tokens: ['Today is 23rd March.', 'I am in San Francisco, California.', 'Currently I am attending a Natural Language Processing workshop.']\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "sample_sentences = get_sent_tokens(sample_data)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 18, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "\n", 316 | "Sentence tokens: ['Through Call for Code, top solutions are actively supported to bring the technology into communities in need, working with partners like the United Nations and the Linux Foundation.', 'Deployments are underway across the globe.']\n" 317 | ] 318 | } 319 | ], 320 | "source": [ 321 | "conference_sentences = get_sent_tokens(conference_data)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "## Word tokenization\n", 329 | "A sentence is a collection of words. Word tokenization is similar to sentence tokenization, but works on words. Let's use ```word_tokenize``` from ```nltk.tokenize``` to get the words. 
" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 19, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "from nltk.tokenize import word_tokenize\n", 339 | "\n", 340 | "def get_word_tokens(sentences):\n", 341 | " '''Word tokenization'''\n", 342 | " words = []\n", 343 | " for sent in sentences:\n", 344 | " words.extend(word_tokenize(sent))\n", 345 | " print('\\nWord tokens:', words)\n", 346 | " return(words)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 20, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "\n", 359 | "Word tokens: ['Today', 'is', '23rd', 'March', '.', 'I', 'am', 'in', 'San', 'Francisco', ',', 'California', '.', 'Currently', 'I', 'am', 'attending', 'a', 'Natural', 'Language', 'Processing', 'workshop', '.']\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "sample_words = get_word_tokens(sample_sentences)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 21, 370 | "metadata": {}, 371 | "outputs": [ 372 | { 373 | "name": "stdout", 374 | "output_type": "stream", 375 | "text": [ 376 | "\n", 377 | "Word tokens: ['Through', 'Call', 'for', 'Code', ',', 'top', 'solutions', 'are', 'actively', 'supported', 'to', 'bring', 'the', 'technology', 'into', 'communities', 'in', 'need', ',', 'working', 'with', 'partners', 'like', 'the', 'United', 'Nations', 'and', 'the', 'Linux', 'Foundation', '.', 'Deployments', 'are', 'underway', 'across', 'the', 'globe', '.']\n" 378 | ] 379 | } 380 | ], 381 | "source": [ 382 | "conference_words = get_word_tokens(conference_sentences)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "## Frequency distribution\n", 390 | "Calculates the frequency distribution for each word in the data. Use ```nltk.probability``` from ```FreqDist``` and ```matplotlib```." 
393 |   {
394 |    "cell_type": "code",
395 |    "execution_count": 22,
396 |    "metadata": {},
397 |    "outputs": [],
398 |    "source": [
399 |     "import matplotlib\n",
400 |     "from nltk.probability import FreqDist\n",
401 |     "matplotlib.use('TkAgg')\n",
402 |     "\n",
403 |     "def plot_freq_dist(words, num_words=20):\n",
404 |     "    '''Frequency distribution'''\n",
405 |     "    fdist = FreqDist(words)\n",
406 |     "    fdist.plot(num_words, cumulative=False)"
407 |    ]
408 |   },
409 |   {
410 |    "cell_type": "code",
411 |    "execution_count": 23,
412 |    "metadata": {
413 |     "scrolled": true
414 |    },
415 |    "outputs": [
416 |     {
417 |      "data": {
418 |       "image/png": "[base64-encoded PNG omitted: bar plot of the 10 most frequent sample tokens]",
419 |       "text/plain": [
420 |        "<Figure size 432x288 with 1 Axes>"
421 |       ]
422 |      },
423 |      "metadata": {
424 |       "needs_background": "light"
425 |      },
426 |      "output_type": "display_data"
427 |     }
428 |    ],
429 |    "source": [
430 |     "%matplotlib inline\n",
431 |     "plot_freq_dist(sample_words, num_words=10)"
432 |    ]
433 |   },
" 421 | ] 422 | }, 423 | "metadata": { 424 | "needs_background": "light" 425 | }, 426 | "output_type": "display_data" 427 | } 428 | ], 429 | "source": [ 430 | "%matplotlib inline\n", 431 | "plot_freq_dist(sample_words, num_words=10)" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 24, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "data": { 441 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAE+CAYAAACA8heHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAA7hUlEQVR4nO2deZgcZbX/P99M1slCEoiQn0Aii1wVAZlhU5RF8bogKirugleN4oYiitsF8XpVrrggKKgsgnCVVSERRWRfBJ2whsVLAGWRLQkhy2TP+f3xvj1TU109XT3dPd2ZOp/n6We6qk6derumu877vue858jMcBzHcYrLqFY3wHEcx2ktbggcx3EKjhsCx3GcguOGwHEcp+C4IXAcxyk4o1vdgFrZYostbPbs2UM6d9WqVUyYMKGhsq7TdbpO19luOrOYP3/+IjObkXnQzDapV1dXlw2Vnp6ehsu6TtfpOl1nu+nMAuixCs9VnxpyHMcpOG4IHMdxCo4bAsdxnILjhsBxHKfguCFwHMcpOE03BJI6JN0haV7GsXGSLpC0UNJtkmY3uz2O4zjOQIZjRHAUcH+FYx8BnjOzHYAfAicOQ3scx3GcBE01BJK2Bt4MnFFB5K3AOfH9xcBrJakZbfnelQ/wsbnP8Ls7nmiGesdxnE0WWRPrEUi6GPgOMBk4xswOTh1fALzBzB6P2w8Be5nZopTcHGAOwMyZM7vmzp1bc1vOvnMZ8x7s5fBdJnPIThOryvf29tLZ2dkwOdfpOl2n6xwunVl0d3fPN7PuzIOVVprV+wIOBn4a3+8PzMuQWQBsndh+CNhiML1DXVl86jUP2qxj59l3rrg/l/xIW2noOl2n6yyOzixo0criVwGHSPoH8BvgQEnnpWSeALYBkDQa2AxY3IzGTJ84FoDnVq5thnrHcZxNlqYZAjP7ipltbWazgfcA15jZB1JilwOHx/fvjDJNmaua1hkMwWI3BI7jOAMY9uyjkr5JGKJcDpwJ/ErSQmAJwWA0hb4RQa8bAsdxnCTDYgjM7Drguvj+uMT+1cC7hqMNPjXkOI6TTWFWFpcMgU8NOY7jDKQwhmCzCWMQ8PyqdazfsLHVzXEcx2kbCmMIOkaJyWPDWrWlq9a1uDWO4zjtQ2EMAcDkceHjLvHpIcdxnD4KZQimuCFwHMcpww2B4zhOwSmUIZg81g2B4zhOmkIZgtKIwNcSOI7j9FNIQ+BrCRzHcfoppCHwNBOO4zj9FNIQuI/AcRynn0IZAncWO47jlFMoQ+DOYsdxnHIKZghCionFK9fSpLIHjuM4mxyFMgTjOsS40aNYs34jq9ZtaHVzHMdx2oJCGQJJbB7TUbufwHEcJ1AoQwAwzQ2B4zjOAApnCKa7IXAcxxmAGwLHcZyCUzhDMK3TDYHjOE6SwhmCkrPY00w4juMECmcI3FnsOI4zkMIZAg8fdRzHGUjhDIGPCBzHcQbSNEMgabykv0q6S9K9kk7IkDlC0rOS7oyvjzarPSU8ashxHGcgo5uoew1woJmtkDQGuEnSH8zs1pTcBWb26Sa2YwDT+5zF64brko7jOG1N00YEFlgRN8fEV8szvU2dMAYIUUMbNra8OY7jOC1HzczCKakDmA/sAPzEzI5NHT8C+A7wLPB/wOfN7LEMPXOAOQAzZ87smjt37pDa09vbS2dnJ4df9jQr1hpnH/KCvtTUlWTz6qzl+q7TdbpO19lMnVl0d3fPN7PuzINm1vQXMBW4Ftg5tX9zYFx8/3Hgmmq6urq6bKj09PSYmdkB37vWZh07zx58enlV2bw6GynrOl2n63Sd9cqmAXqswnN1WKKGzGxpNARvSO1fbGZr4uYZQNdwtMcjhxzHcfppZtTQDElT4/sJwEHAAymZmYnNQ4D7m9WeJB455DiO008zo4ZmAudEP8Eo4EIzmyfpm4QhyuXAZyUdAqwHlgBHNLE9fUz3fEOO4zh9NM0QmNndwCsy9h+XeP8V4CvNakMlpnm+IcdxnD4Kt7IYPM2E4zhOkkIaAncWO47j9FNIQ+AjAsdxnH4KaQh8ROA4jtNPIQ2BRw05juP0U0xDMMmjhhzHcUoU0hBMHNvB2I5R9K7dwOp1G1rdHMdxnJZSSEMgyVcXO47jRAppCMAdxo7jOCUKawimTwx1CdwQOI5TdApsCMYB7jB2HMcpriHoDCOCxSvcEDiOU2yKawh8ROA4jgMU2hC4j8BxHAcKbAg8ashxHCdQWEPg6wgcx3ECbgjcEDiOU3CKawg6Pd+Q4zgOFNgQ9JerXMfGjdbi1jiO47SOwhqCMR2jmDx+NBs2GstWr2t1cxzHcVpGYQ0BeKUyx3EcKLgh8BBSx3GcghsCr1TmOI5TdEMw0SOHHMdxmmYIJI2X9FdJd0m6V9IJGTLjJF0gaaGk2yTNblZ7sigZgsU+InAcp8A0c0SwBjjQzHYFdgPeIGnvlMxHgOfMbAfgh8CJTWxPGX0jAjcEjuMUmKYZAgusiJtj4isdsP9W4Jz4/mLgtZLUrDal6XcWe/io4zjFRWbNW0wlqQOYD+wA/MTMjk0dXwC8wcwej9sPAXuZ2aKU3BxgDsDMmTO75s6dO6T29Pb20tnZ2bf9t3+t5rs3L2X3rcbxtVdPG1Q2r85GyLpO1+k6XWe9smm6u7vnm1l35kEza/oLmApcC+yc2r8A2Dqx/RCwxWC6urq6bKj09PQM2J7/zyU269h5dsipN1WVzauzEbKu03W6TtdZr2waoMcqPFeHJWrIzJZGQ/CG1KEngG0AJI0GNgMWD0ebIBk+uma4Luk4jtN2NDNqaIakqfH9BOAg4IGU2OXA4fH9O4FrouUaFvryDbmPwHGcAjO6ibpnAudEP8Eo4EIzmyfpm4QhyuXAmcCvJC0ElgDvaWJ7ypgyfjSjR4kVa9azZv0Gxo3uGM7LO47jtAVNMwRmdjfwioz9xyXerwbe1aw2VEMS0yaO5dnla3hu5Tq22swNgeM4xaPQK4vBE885juMU3hBM83xDjuMUnMIbgr6SlZ5vyHGcguKGwNNMOI5TcApvCKZ54jnHcQpO4Q3B5j4icByn4BTeEExzH4HjOAWn8IagL83ECjcEjuMUEzcEXqXMcZyC44bAncWO4xScwhuCaRPHAMFZPIz57hzHcdqGwhuCcaM7mDRuNOs
3GsvXrG91cxzHcYadmg2BpGmSdmlGY1pFaVTgDmPHcYpILkMg6TpJUyRNB24HfiHpB81t2vAxfeI4wENIHccpJnlHBJuZ2TLgUOBcM9sLeF3zmjW8TO/0EYHjOMUlryEYLWkmcBgwr4ntaQm+qMxxnCKT1xCcAFwJLDSzv0naDniwec0aXjzNhOM4RSZvhbInzazPQWxmD48kH8E0L07jOE6ByTsiOCXnvk0Sr1LmOE6RGXREIGkf4JXADElHJw5NAUZMgd9SlTJPM+E4ThGpNjU0FpgU5SYn9i8D3tmsRg03nmbCcZwiM6ghMLPrgesl/dLM/jlMbRp2vEqZ4zhFJq+zeJyknwOzk+eY2YHNaNRw4yMCx3GKTF5DcBFwOnAGsKF5zWkNU8aPoWOUWL56Pes2bGRMR+FTMDmOUyDyGoL1ZnZaLYolbQOcC2wJGPBzMzs5JbM/cBnwSNx1qZl9s5brNIJRo8S0zjEsWrGW53rX8oLJ44e7CY7jOC0jryGYK+mTwG+BNaWdZrZkkHPWA18ws9slTQbmS7rKzO5Lyd1oZgfX1OomMK1zLItWrGXJSjcEjuMUi7yG4PD494uJfQZsV+kEM3sSeDK+Xy7pfuCFQNoQtAXTfS2B4zgFRcNRjEXSbOAGYOeYvK60f3/gEuBx4F/AMWZ2b8b5c4A5ADNnzuyaO3fukNrR29tLZ2dn5rHv3fIctz6xhqP33oxXbTNhUNm8Oocq6zpdp+t0nfXKpunu7p5vZt2ZB82s6gv4UNYr57mTgPnAoRnHpgCT4vs3AQ9W09fV1WVDpaenp+Kxr1x6t806dp6de8sjVWXz6hyqrOt0na7TddYrmwbosQrP1bxTQ3sk3o8HXkuoS3DuYCdJGkPo8Z9vZpdmGKFlifdXSPqppC3MbFHOdjWM/jQT64b70o7jOC0llyEws88ktyVNBX4z2DmSBJwJ3G9mmQnqJG0FPG1mJmlPQu6jxXna1GhKaSaWrFxTRdJxHGdkkXdEkGYl8KIqMq8CPgjcI+nOuO+rwLYAZnY6IU3FkZLWA6uA98QhzLCz+aRSTQIfETiOUyxyGQJJcwlRQhCSzb0EuHCwc8zsJkBVZE4FTs3ThmbTl3jOo4YcxykYeUcEJyXerwf+aWaPN6E9LcPTTDiOU1Ry5VKwkHzuAUIG0mnAiHtaeuI5x3GKSi5DIOkw4K/Auwh1i2+TNGLSUMPABWUtclM4juO0hLxTQ18D9jCzZwAkzQD+DFzcrIYNN+PHdNA5toPetRtYuXbE5dVzHMepSN40m6NKRiCyuIZzNxncYew4ThHJOyL4o6QrgV/H7XcDVzSnSa1j+sSxPLF0lTuMHccpFNVqFu8AbGlmX5R0KLBvPPQX4PxmN264STqMp7S4LY7jOMNFtemdHxHqE2Nml5rZ0WZ2NCEd9Y+a27Thx0NIHccpItUMwZZmdk96Z9w3uyktaiEeQuo4ThGpZgimDnJsQgPb0Rb0hZD2uiFwHKc4VDMEPZI+lt4p6aOE1NIjir7EcyvcEDiOUxyqRQ19DvitpPfT/+DvBsYCb29iu1rCwBHBoGmSHMdxRgyDGgIzexp4paQDgJ3j7t+b2TVNb1kLGOgjGNfaxjiO4wwTeesRXAtc2+S2tJzpE8cApbrFbggcxykGI251cD1Mnxge/u4sdhynSLghSLDZhDFIsLR3HRs2euI5x3GKgRuCBB2j1Bc5tGLtxha3xnEcZ3hwQ5BiWmfwEyxb6yMCx3GKgRuCFKXIoWVrfETgOE4xcEOQwg2B4zhFww1BCjcEjuMUDTcEKUqGYLkbAsdxCoIbghSlqKFlHjXkOE5BcEOQwqeGHMcpGk0zBJK2kXStpPsk3SvpqAwZSfqxpIWS7pa0e7Pakxc3BI7jFI28NYuHwnrgC2Z2u6TJwHxJV5nZfQmZNwI7xtdewGnxb8voSzy3eiPPLFtdVf65VRtyyQFsNF+b4DhO+9E0Q2BmTwJPxvfLJd0PvBBIGoK3AueamQG3SpoqaWY8tyWUDMGjz69nz29fne+kefnkdt1yLJd1D7VljuM4zUE2DL1USbOBG4CdzWxZYv884LtmdlPcvho41sx6UufPAeYAzJw5s2vu3LlDakdvby+dnZ2Dymw048Sbl/LgkrVI1WsSmFkuuaWrNyLg14duyZiOweXztLNWWdfpOl1ncXRm0d3dPd/MsruiZtbUFzCJUNTm0Ixj84B9E9tXA92D6evq6rKh0tPT03DZvHKv+Z9rbNax8+yBJ5c1TGctsq7TdbrO4ujMAuixCs/VpkYNSRoDXAKcb2aXZog8AWyT2N467htx7DBjEgAPPbuixS1xHMcZSDOjhgScCdxvZj+oIHY58KEYPbQ38Ly10D/QTHZ4QTAEC59xQ+A4TnvRzKihVwEfBO6RdGfc91VgWwAzOx24AngTsBDoBT7cxPa0lO3dEDiO06Y0M2roJqpUgI/zVp9qVhvaCR8ROI7TrvjK4mFi++gjeHjRCjZ69TPHcdoINwTDxGYTxjB1/ChWr9vIE0tXtbo5juM4fbghGEa2nhxm4hZ65JDjOG2EG4JhZOspwRA85H4Cx3HaCDcEw8gLJ3cA7jB2HKe9cEMwjLywNCLwqSHHcdoINwTDSGlqyEcEjuO0E24IhpHp40cxadxonutdx+IVa1rdHMdxHMANwbAiie1nTAR8VOA4TvvghmCY6Us14X4Cx3HaBDcEw0wp1cRDz6xscUscx3ECbgiGmVI6ah8ROI7TLrghGGb6RwRuCBzHaQ/cEAwz207vZEyHeGLpKlauWd/q5jiO47ghGG5Gd4xi9uYhcuiRRe4ncByn9bghaAFem8BxnHbCDUELcEPgOE474YagBZSK1LghcBynHXBD0AJ28EVljuO0EW4IWsB2Mc3EPxevZN2GjS1ujeM4RccNQQvoHDuaF06dwLoNxqNLelvdHMdxCo4bghbhDmPHcdoFNwQtwh3GjuO0C24IWoSnmnAcp11omiGQdJakZyQtqHB8f0nPS7ozvo5rVlvakT5D4JFDjuO0mNFN1P1L4FTg3EFkbjSzg5vYhral3xCsxMyQ1OIWOY5TVJo2IjCzG4AlzdK/qTN94limdY5hxZr1PLVsdaub4zhOgZGZNU+5NBuYZ2Y7ZxzbH7gEeBz4F3CMmd1bQc8cYA7AzJkzu+bOnTuk9vT29tLZ2dlQ2Xp0fv3axdy/aB3HvWYau245rm3b6Tpdp+vctHRm0d3dPd/MujMPmlnTXsBsYEGFY1OASfH9m4AH8+js6uqyodLT09Nw2Xp0fvmSu2zWsfPs7JsebpjOeuVcp+t0nZu+ziyAHqvwXG1Z1JCZLTOzFfH9FcAYSVu0qj2tYHuvVuY4ThvQMkMgaStFD6mkPWNbFreqPa1ge19U5jhOG9C0qCFJvwb2B7aQ9DhwPDAGwMxOB94JHClpPbAKeE8cvhSGvvrFXsjecZwW0jRDYGbvrXL8VEJ4aWF54dQJTBjTwaIVa3i+dx2bdY5pdZMcxykgvrK4hYwapb5MpO4ncBynVbghaDGeasJxnFbjhqDFeO
SQ4zitxg1Bi/F01I7jtBo3BC3Gk885jtNq3BC0mFmbd9IxSjy2pJfV6za0ujmO4xQQNwQtZtzoDrad3slGg0cW+XoCx3GGHzcEbYBXK3Mcp5W4IWgD3GHsOE4rcUPQBrjD2HGcVuKGoA3YvrS62EcEjuO0ADcEbUApC+nDi1ayYWOh8u45jtMGuCFoA6aMH8OWU8axdv1GHn+ut9XNcRynYLghaBPcYew4TqtwQ9AmlGoTuMPYcZzhxg1Bm+DVyhzHaRVuCNqEHXxRmeM4LcINQZuQ9BEUrGKn4zgtxg1BmzBj8jgmjx/NstXreX7NxlY3x3GcAuGGoE2Q1Jdz6PFl61vcGsdxioQbgjaiND30+HJPR+04zvDhhqCNKBmCJ3xE4DjOMOKGoI3YwaeGHMdpAW4I2ojSWoInlrshcBxn+GiaIZB0lqRnJC2ocFySfixpoaS7Je3erLZsKmwzbQJjO0axeNVGVqxxY+A4zvAwuom6fwmcCpxb4fgbgR3jay/gtPi3sIzuGMWLtpjI359ezlX3PcV2W0yqes7CJesY/djShsm5TtfpOttX52PL1tOVS7I21MzFS5JmA/PMbOeMYz8DrjOzX8ftvwP7m9mTg+ns7u62np6eIbVn/vz5dHXlu415ZRut81Pn387v7xn0FjiOU1B2nD6Gq770+iGdK2m+mXVnHWvmiKAaLwQeS2w/HveVPQUlzQHmAMycOZP58+cP6YK9vb25z80r22ide2++lgendbAB5dK5ceNGRo2qPsOXV851uk7X2b46t+pkyM+/QTGzpr2A2cCCCsfmAfsmtq8Guqvp7OrqsqHS09PTcFnX6Tpdp+tsN51ZAD1W4bnayqihJ4BtEttbx32O4zjOMNJKQ3A58KEYPbQ38LxV8Q84juM4jadpPgJJvwb2B7aQ9DhwPDAGwMxOB64A3gQsBHqBDzerLY7jOE5lmmYIzOy9VY4b8KlmXd9xHMfJh68sdhzHKThuCBzHcQqOGwLHcZyC44bAcRyn4DQ1xUQzkPQs8M8hnr4FsKjBsq7TdbpO19luOrOYZWYzMo9UWmk2El8MsrJuqLKu03W6TtfZbjprffnUkOM4TsFxQ+A4jlNwimYIft4EWdfpOl2n62w3nTWxyTmLHcdxnMZStBGB4ziOk8INgeM4TsFxQ+A4jlNwWlmqsu2RNBNYYmZrmnydccA7CBXd+v4nZvbNOnR2AOea2fvrbqADhHtqZhta3Q6nWEja3MwWN/MaI94QSNoS+Dbw/8zsjZJeCuxjZmfmOP1XwPaSLjGzYxI6Bbwf2M7MvilpW2ArM/trzjZtZWZPJXZdBjwPzAcyjY6ke4Asz74IWb13Se40sw2SZkkaa2Zrq7Snpnsk6VBg39iem8zstxkyE4FVZrZR0ouBfwP+YGbrMmQvBc6MxzdmHJ9b4bOXPushg3yuPeLmX83smQpy0zN2L89o64OSLgHONrP7KrUnofcQ4DVx83ozm5sh8z/At4BVwB+BXYDPm9l5FXS+kvIOw7mDfA4ScktSurLu6/NAD/AzM1ud99opuXcBfzSz5ZK+DuwOfMvMbk/JdQJfALY1s49J2hHYyczmJWRq/UzbA4+b2RpJ+xPu57lmtnQwPYMh6XVm9ufUvsPN7JwM2VcB3wBmEe5T6fe5XYbcnWa2UtIHCPfoZDPLyppwq6Q7gbMJv5GGR/iM+KghSX8g3MCvmdmukkYDd5jZy3OeL+ClZnZvYt9pwEbgQDN7iaRpwJ/MbI9KelI6f29mb05sLzCznaucM2uw41lfIEnnAi8hVINbmZD9QUou9z2S9FNgB+DXcde7gYfM7FMpufnAq4FpwM3A34C1WSMUSa8jFCbaG7iI8KD9e+L4fvHtocBWQOkh+V7gaTP7fIbOw4DvAdcRfoyvBr5oZhdnyP6DUDb1uSg7FXgKeBr4mJnNj3KTgffEto4CzgJ+Y2bLMnR+B9gTOD/R1r+Z2VdTcnea2W6S3g4cDBwN3GBmu2bo/BWwPXAnUBqZmJl9Nh5/hPBgF7Bt6vM8amYvSuk7GZjBwP/lsqhjipl9MO+1U3rvNrNdJO1LMHLfA44zs71SchcQOj8fMrOdo2G4xcx2S8jU+pnuBLoJBusKQifrZWb2pnh8OYN3KqZkfJ4bgHuBY4BJwBnAGjN7Z4bsA8Dn4+fqGz2me/SS7gZ2JRiqX0adh5nZfqSIz6DXAf9B6NhcCPzSzP6v0ueomWYsV26nF+HHB+HBVtp3Z506b8/QeVcd+n4OvLwJn/34rFc99wh4gNiBiNujgPsHuUefAb6U574DmwGfAB4DbiE8cMckjpctr8/aV/p/AC9IbM+o9D8CfgH8e2L79cDPCIbptgrn7Eeosb0SOAfYIXX8bmBUYrsDuDtDz4L49wzgDYN9l4D7k/d+kPv4C+BNie03Enr4mf/3Ct+Fe4dy7eT3CPgO8L70dyv9v8vzO6rhM5W+d18EPjPItf8L+CQwGZgCHAl8s8K1RTACD8bXewf57Jnfl0HaeRzwkeS+KucdEL93S4HrCSP3up8VI35qCFgpaXNiL6BUH7lOneviHHxJ5wzCCKEmEtM9o4EPS3qYMDVUNt0zSE+mJFvWkzGzE+K5k+L2igpNqeUeLST0zEojkG3ivoyPp30IU2gfifs6KugkXv8DwAeBOwg96X2BwwklTwEmStrOzB6O57wImFhB5SgbOBW0mMrBEXub2cdKG2b2J0knmdnHo/+m1MYO4M0EAzUb+H5s56sJvc8Xp/ROBUpTF5tVuPa82ItcBRwZv0urK8guIIyIqtX2Tn+eP8QpqDSTJG1rZo8CxCnOSfFYejox77UBnpD0M+Ag4MR4D7Pu/VpJE+j/3m1PhanRGj7TOknvJXxv3hL3jcmQO8QGjrpOk3QX4cGcZhphdPcQsDUwS5IsPplTXCvpe8Clyc9iqWkxYLmkrxC+86+RNKpCO9O/jacJnavLgd0II+gXZZ1XC0UwBEcTbtr2km4m9AzLhnQ18mPgt8ALJP131Pf1Ieg5OK+gmU2uVbmknQl+julxexFhGH5vSrSWezQZuF9SyR+yB9Aj6fLYztJ8/VHAV4Dfmtm9krYDrq3Qzt8CO8W2vsXMSg+bCyT1JEQ/D1wXDaYI87Afr9DOP0i6koHTHldUkH1S0rHAbxKyT8cHf9LAPxg/w/fM7JbE/oslvYaBfAe4Q9K1sa2vIdyPAZjZl+MD7XkLfp1e4K0V2rkFcF+898mHTNpH8q84N1+aQns/8K8MfV8AbpL0UGzji4BPRv/OOTDAjzA557UBDgPeAJxkZksVgi6+mCF3PMEvso2k84FXAUdU+Ox5P9OHCaPK/zazR2Jn4VcZcislvZ/wPzfC1N3KDDmAW4HvmtlZ0XCdSJjufGWGbGn6qzuxz4ADU3LvBt5HGA08FY3w9ypc/y/xM7zNzB5P7O+RdHqFc2pixPsIAOKc906EL/vfLcNhWYOuUYQpgyXAa6POq83s/jp05nVWJs95ATC+tF3q1aVkbiHM+18bt/cHvm1mZV/gvPcoMV+fiZldP9jxCjoPK
LUxh+w4guMZ4AGrENEl6UTgNsKoAuBGQq/y2AzZLQgPpZLszcAJhFHRtma2MMpNGmRUldWGmQx0Vj+VIdNJMMTbmtkcZThME7KZ9z59z+P36XiC8THgBsK0x5L0uan7+XcrdxDX9P+OxvNeM/u3Cqek2ynC70mEB+5kM3ukgmyuz5QHSbOBkwnGxwj/88+Z2T8yZLdN/74kvcbMbhjKteP5J6a/i1n74v5Ko4+GURRDkCvaoQZ9d5jZKxrQtJK+f5DDWRllDyFMSfw/4BlCr/h+M3tZht67UsPfAfsUon8qYmaXVmhv1WgchUihYyi/7wcmZGq6fuKhOcsqRJkkZG83s91T++62VHRVLUj6ccbu5wlz3ZelZK82s9fm2FfVYVpHeyeaWaVebkkmbyRQLQ+uywjz82Wdk5TczcAbLTrbJb0EuMgGCZyo9pmUM2qnFjJGexCUlhkCSZvRb7AgzON/08yeT8nl/n7G6cIvAS9jYOcvPcoYMiN+akgVoh2AIRsC4GpJ7wAubZClvgq42MyuBJD0esK6grOBn9I/3ITg5Nob+LOZvULSAYT5wywelvSf9A+NPwA8nDhemkN9AWGYe03cPoDgrC0zBCqPxjlFUlY0zkXA6QQnaKXY+7dU2A/hf5S+/tmEh+Y+cfuJeJ1kuOGRBCfgdgqRGSUmE3p9ZeQxWpHxhN7zRXH7HcAjwK5xVPM5SeOBTmALhWgyRdkpwAszLr+9mb07zmtjZr2SlBSQdJOZ7atyP1Gmfyg+3M8gzPdvK2lX4ONm9smUXC2/jYOA9EP/jRn7IMyp3xunkZLRaulppG8DcyW9iXBfzyVM+ZSR9zMRwpDLonYy9M0APkb5//w/MsST01rjCf6C+ZRP90CIJFtAmB6DMK9/NiHirdr3MzndmOR84ALCVPInCP6PZyt9tiFRzZu8qb+oIdqhBp3LCXPHawnhdsuBZXXouydj393x752p/aVIi7uIUSmkIi2AX8W/RxP8GbfH14+AaRnX+hMwM7E9E7iyQltzReMA85vwv6waZUJwys4m+AZmJV7TB9F7FyFqZE+gq/TKkLsV6EhsjybM33YA98V9RxGMwxqC0X0kvu4CPp2h8xZgAv1RJNsTRln13KfbCCPM5H1akCFX9bcR78s9hAf63YnXI8B5Fc7ZL+tVQfZt8R7cA7y4AZ8pb9TOLYS5/sMIBv0dwDtynrsNcEmFY3cOtm+I38/58e/diX1lEV/1vEb8iIDaoh1yYWaT45zljiSGanWQ11kJsFQhCugG4HxJz1Du5OqS9P8IPYcDiD3HeEyUs431O2ghTEltW6GteaNx5kr6JMGpnnQuZs1T513QVjXKxMIQ/HmC8y8v683stBxy0wg90tIwfyLhB7xB0pp4/ZOBkyV9xsxOyaGzFodpbszssdTAIqt3nOe38b/AHwjO7y8n9i/P+l/Ga1+vsO5lRzP7c5zu6osYk3QKA0c2mxEicj4tCctYm1DDZ8obtdNpGdNaOXmcsD4ni1WS9jWzm6BvqmpVoh1938/4+96S0KGYFH1QWdNpJX/dk5LeTHCSD7rQrlZGrCHQ0KId8ur+KKHntzVhWL03oYfx2kFOG4z3ER4Iv4vbN8d9HcQhpqQdCF+atxK+WJ8nDKNnEcLJkpwOXA1sR1gl2td0wj1Jz5derfIImz+TTd5onMPj3+SwOuvaEBbUnA18LW7/H2EonDYE36D8ofnhCu2shbxG63+AOyVdR38k0LcVomwG3C8zOyXP/LuZXSXpdvodpkeZ2VBr0pZ4LF7bJI0hfFezghnyRCGZmf1D0qfSJ0uaXsGwfwyYQ3hYbU+YEjud/t9HT+qU+VQn72fKG7UzT9KbzKxSJFkfKcM1ihC2mTYsJY4Ezom+AhGCSo7I0Plpwvf5afo7e0ZYYJbmW1HfF4BTCNOMZYso62HEOotjtIMIw78vJQ8BJ1pqlWONuu8hOEtvtbAq9N8I0TiDOj/rQdI84Ctmdk9q/8vjtcvm2yWdZmZH5tT/dvodXDdYRtqIKJc7Gicvkv5mZnsknfCKK24zZDcnEWXSgIdmafVqGrMMB6NCJNCecfNvZpYVwlhx/t36VwH/m5k9IGn3rPMzerC5UYiCOpmwGlWEqb+jrHx1a9UoJEnzzOxgDVzhmxDNvEd3Eu7RbYn/5z2WczV/DZ/ps5VGJTn0LSeM6NYQetwV1+NIOjyxuR74h5ll+psS50whKCxbdR6PLwT2Sv9PKsiOt1Q0V6MZsSOC0pdZ0hgrD3GbUKf61Wa2WhKSxsUf9E5DVaYQa15mkW2gs3LLtBGIMvcohMKVkdcIRG4hfMkNGCxn0kHxod/nyJV0AimnYey1HUm/cbmOsBI0Kyw114I29Ufd/D5j35CxVJqCKuxBWEAGoSeXaQgIPdKXWuWe1tGEXvP3s5pEtiMyF9E4Vk02mP5dVJAprXW5mRABc6OZPVDltDVmtrY0jaMQmtx3HyRdaGaHqUL+LMuO7NrJUulJ4rTLzal9uaJ2apnetYycQmkkfcDMzpN0dGp/SccPUqc8Rv6FrQskPU3odN1IyO9V76LYAYxYQzCU6JEaeFzSVMJUzlWSnqN/pe1QOCbxfjzBcbU+JTN1kPPrMmzKEQk0hPt5GmGl5E/j9gfjvo9myH6BQRa0qfZInFxIOtDMrlGFMFYrD1/9LsEQlPIHfVbSPpbKHxQZdP7dzObEt29M9/bi5x0yClFQpxE6DztL2oWwkvZb8XhNUUiRMwkG8JTom7mdYBROzpC9XtJXgQmSDiJ8b5IJ946Kf3MvqCRMiaRHT1n7Bo3aKVHL9K7yhaSWVrhnLfzM6gw8TFgc+XsGTsulDQZmtoPCgrNXE1a2/0TS0qwR81AZyVNDmxGce7mdXEO8zn4EZ9cfrUqWzxr1/tXM9kxs/xq4xsx+kZL7KKGX/u46rnVX1PFM3J5BCE/dNSFT0/1UlTUMGfIVF7RJOgr4HGHtxBP0G4JlwC/M7NTaPnGf3hPM7HhJZ2ccNkuFEkYDuJvFDKnR2XdHVg82jvJ2I4yuKvqmlB1PXravFiRdT/DN/CwxNVM1sWEOvR0EQ3gAIYxxlWUsHFNYdPkRQs4mAVcCZwwyOhrsmvsQQps/B/wwcWgK8PaM71jZlGKFfbmnd5UzkVyUfVV62qjCvuOzPq/FtDAp2a0JRmA/QqK6JYRRwXeydAyFETsisKFFjwzlOjWvpE2jgSuLRxHCFzdLiX0O+K3CsviSc60bGAu8vc4mVI0EGsL93CBpezN7CEAhxURmXHd8wP4GuKAkn7r2yZJOBb5qZv+V8/pViUZgFCG174U5T5tK9fxBEHqQFZG0FWE0M0HSKxg4yunM2ZZKdJrZXzUwwiY9wkTSRywVmSXpu2b25QzZqwm93r8Qpif2sAppvaOh/EV8lZExEuk7RPmIZCwhUms0A3vby8hOgzJo1E6CWqZ3nzezP1Q4libXyMX684B1mllvFZ2PErL3ftvMPpGzHTUxYg3BJsZ8+h1x6wkx2h9JCpjZ08ArFRaQlXp2vzeza6ifPyp/Xp68fJEQypfM
C1Qpwuct8ZoXStpIiBi60BKhdBZCNA8lLKhrGBbqJXyJkNq3Gln5g8oemlFvtQ7CvxOiSbYGktMBy4GsqaZaWBSnb0o+l3eSPUX1Dkmrzez8KPcTKk8z3k3ooOxM6BAslfQXMyt7yFabSrEa8mbF+3i9pF9adq7+NLmidqhterdqSGpi5DIj5SeYQkayxSh/JtUXyAG8ghCc8T5JXybkvLo+bcTrYcRODTnVUQxJNbOb1V9sBkKK2/Ozeuc16h9HmO6BMN1TtdKbQtqI/wTeb2YdqWMnEXqkjVrRXdL7XWARwQAlV8JmTXkNmj+o1vl3Se8ws0sa80n6dG5HSG3+SkLakkcI9/OfKbkJBN/MWYQkcUvN7CgGQaEmwxEEv9ZWZjYuQyb3VEqOz/IjCyu2M4sTpafaEucNGrWTkh10ejca/oxLD0iXsh8hS+4nCKGyJZYDc83swZTO2wgjmsvzTN8prB3alzBF9IHYgFnVPlte3BC0AaotwqaR1605JDWHzpocsInzZhFGBe8mPDwuMLPvp2RKIX8bCMP9wZybtbS5FBqZbut28fig8/U2hFBP9UeZfKHCtcuchjXoHkd4yMwmxPIvCypD6dPUVORkQq/4ZmIK5goG8NOEh1AX8A9iBEvWiFTSbVZHeHZKV5eZzVeVUFdViNpJyA35ftaCpFl5Ri6le6SBIdOZPjSFDLzjCM7s0n2vJzilDJ8aag9qibBpJDWHpOZgP0LOoiwjYmTnL7qN8PkvAt5lsd5ARrtqTsWdk5cSIltK5TdvZGCvLivEs69ZDC3UsxRlMmlQqaFxGWFUdzvZ4a2lqcgSIkSjvJnKi/7GE6aw5ptZmb8hRd7VvVWxmHAxx1RbrVE7VRmicemNn71agrhcC+SiD+tkM8tKpd0wfETQBtQaYdPA6z5oZjtWOLbQzHaoQ/eLLJVOOGtf3L+TJUpTVtF7CImRk2VkHh1CWy8k9JpLYaHvAzYzs8Mqn9W+5IkQig+YfdLRLA26ftWplCHozF0LOCNCp2xfzmt+3Mx+ViHCp2+ElTrnT4QpxmNIJIiz8sytuRb9RdkeM+tO728kbgjaAIUUA++ygRE2F1sdIYQ5r9vMkNSssMj5ZtaV2K6px6XyOP73EhLRlRV8qbGt95nZS3Psa/gUnmrLgplX58+BU7JGeym5vmmJRqJEFbnB9tWoM28t4IaH41Zoz+fM7EcZ++ebWZcSKaUVV87Xca3cPqyh4lND7cEx5I+waSSfo8EhqQrx2C8DNkv5CaZQvoKz1uH8mxgYx38OoaxlXYYAuF3S3mZ2a9S7F+X5cKA5U3iXEaai/swgaZNrZF/giOj7yCx9Gml0OvUSF1MeQnkRwb8wVAYN4aw1aqcBHE3I5psmV4I4hcppn6G8A5Dl/C51yJL5nipN4Q0JNwQtRmGRzq6Epe41RdjUizUnJHUnworRqQz0Eywn9HyT1/9ZfPvnrOF8Bf1TyRfHXxX1pzgYA9wi6dG4PQvISqOwR2q67hqFxXj1UE8WzEq8MafcxwkPtA2S6na+19gJqJVqfoda1xvUS1YWX8ifIO53hPDRuVSpd261pUAZEj411AYotYp4JKCQeuEvOWVzDecVird8l1A3uC+O38wuGGIbBw2/S0dmNGMKT9K3CBXJ6l230XIkvZVQX+AQQlhqieXAb2xgnedadZf8DqUHVsloHZiSyxW1Uy+SHjWzSqna85yfO7JKNZQzHXJ73BC0Hkk/JPRK03OAQ85A2WoU0jZkhUX+R0KmpvQB8ZxSHL8Rsn+W1QFuFpJeS8hbM2AKz3LWW66gM3cWzGbQaOd7HOEea2bfrrtxQV9pmqfUAzdCda6bKgQeNKysowZfAT3BzEYnZNM1FgZgqRoLkt5HmAX4E1Uiq9TEcqYlfGqoPdgt/k1GIdSVgbINSD5QxhN8DulQxqEM5/ehP8xzNKGGwLBgZleXemNxV91TeE0Mia1KhvP9qBhhM2Sfi4UV4G8jFBpqBFn3ZxbwNUnfMLPfpI41rKxjjf+bkk/pVYRw5NIo9V3AfRnyLyf4mA5kYD2CrN981XKm9eIjAmdYiOGKN5nZKzOO5RrOS/opsAMDU2E8ZGZlRVOahXIWe69B32uy9ltGYfRGoxqS6NWot+kjXIVFcX+uFJnWyKidGtt1K7CvxbUWMdLsRjPbOyW3kJCmvGqiSkmlrKg3m9nuCulDft3I6WQfEbQBCitB30H5A6YsTnkTZkfgBRWOnSHpXWa2FEAh1fRvzOzfU3IHAi8pRbjEqKF7m9TeMlRbsfe81FIYvRlMpUHO9wS7xb9NG+Ga2ZIKveKml3WswjTC1Gbpnk6K+9IsINz7zMR9KY6nCeVMk7ghaA8uIyTymk+qBu+mSmJ+tVQe8ylSxWsSbFEyAgBm9pykLKOxkFBLuTR62CbuGy6qFZupGUul8ZC0Ddlhic3g24TQ2etg8CR6tWBmB9Sroxox0u25jENNL+tYhe9SnpjwGxlyU4EHJP2NKiV0rTnlTAfgU0NtgBqQK35TRtJ8gnP40bg9mxDbnh72X0+Y0y5VUNuDMDf7PFROQNbAdl5EKI84WLH3eq8h4N70YrYmXes8Qn3o5wj5gxrmfI+98bTDtuYRrrKrmE0n9PQ/ZNWrpQ07CinGSxFBt2XdU+UrE9rwHFeV8BFBe3CLpJdXWwm6KTDEL+/XgJvig16E5GZzMuSOq7+FdZGn2HtNqLbC6I2mVHXsEMKU1x2SbrDsqmO5kXQ6oabCAcAZBMf/YOVPByNdxcyAxWa2MrlT0mDfDbMG1rHIQQfBQT0aeLGkF2f4fHYBzjOzrFFNiVKOq/GE0ehdhN/HLoQO0D6NarCPCFqIpAWEiIHRhDn0hxl8JWjbo+w8MyXK4r4T572A8PC/g5AT/5ksh2nsbe1Ja8JHq/bihqDz8MRmrsLojUQ5q47VqPNuM9sl8XcSofjPq6uePPRrfiFj90RCXY/NzawZyf2y2nEiIYjhXhLRQOnOQlw/8h6C0T8LuLLSlKOkS4HjSx1FSTsD3zCzhi2Uc0PQQhSKYexW6fhwLIxpB5RdP/YvGYuFPkoYFVxDMJb7EQqTnzWsDR4hqLzq2E1WoepYjXpLKZZvJdQKXkyY7hpyEsMarz+Z8H36CKHg0Pcb8blyXvvvwC55worjNODrCelkugltPdNSdUAk3WtmL6u2rx58aqi1PDJSH/aqLUHbUfTXjz1AsX5shtwXgVdYTDQmaXNCjvZhMQQKaRNOJEQ/iQYs/pJ0MKHqWjqr5nAsKMtddaxG5ilU//of+nNYnVGnzqrEkNKjgfcD5wC7V5l6aQYPE0JnqxoCMzNJTxECKdYToosulnSVmX0pIXq3pDOA8+L2+wn/u4bhI4IWIulxBpYpHIANUzGNZhC/uGMIP0gIi2c2mFlZgrZSnLekO4G9zGxNhV7QLcD+pdhrSWMJq2HL1iY0gxj7/RYzK8sbX6fOQ4F7GhmNVGMbqlYdq1HfBEIn4NX013c4zcxW19nUwa75PcJ9/DnwEzNb0axrVWnHJYTcYVcz0I+UXll8FPAhQlb
RM4Dfmdm6uN7mQTPbPiE7noGdqhto8P30EUFr6SDEGTd0lWCbUEuCtrz1YxcCt0m6jPCAeSuht3Q0DIvhfLqRRiDyGLCgFUZA5VXHziI8tOvlHEJ+oR/H7fcR1lo0s77DFwgP3q8TVh2X9g9ryg5CjqXLq0qFyKdD0zMCFmpoH5zat5qQgiWZhqWh+IighagJedLbBQ0xQZsGqR+r7AIhfZjZCfW1umKbSpk09wO2IhisZG8vs/xmTt17EKaGrk/pbPpoUNIxhAd/nqpjtejNVd+h6CgUrC850G80s8yOksqL8gD9pVQbgY8IWstIHAmUSNZYgLBqumqNhcEicJr1oM9BctFXL8HBVyKz/GYN/DewghAiOLYOPTVjZic1SXXe+g4jDlWpf52Q+ywhSq703TlP0s/N7JQMtWeSUZSnkfiIoIVImm4NrDLUTkh6F3AlwQC8jRDz/LV6FsHE0NSsH9kmm5xvJC4mlHQ/ITHfo3HXtsDfCQ7RTTIsOi8xgKHEeELSuelmdlxK7m5CqdCVcXsiIVKu7N6ohpTVQ8VHBC1kpBqByH+a2UWSphBi1E8iVPOq5wt9TOL9eEJ+poZNaVQj5jY6ygbmRPq+1VFWErhC0uvN7E+NaGOb8IZWN6BVWHnN4R/FlfPpBW9iYO9+A5VnCKoV5akbNwROsyh9yd8M/MLMfh8X0QwZM5uf2nVzXOU7XOxi5TmR6q35eyRwjKSW1CNoBiM1JDoPqZX1owjrA7Kes2cTAh9KadTfRpgCyqLUeSqV+izl72rYSNgNgdMsnpD0M+Ag4ESFDKuj6lEY48RLlH5kjcqYmYdRkqaVYtNje+r6DVkL6xE4TeH7iffrCdFYZdFSZvYDhWR/+8ZdHzazOyrovC5jX0Pn9N0QOM3iMMIUwUlmtlShstgXq5xTjfn0ZzRdR/iRfaROnbXwfeAvMfkchPnf/65XqaRdKE9BXo8D2mkRViXzaqoz84/46jtWYbo4uSZiPCH/UkPDmN1Z7GwySDqMEFa6TNJ/ArsD/9XIudIcbXgp/UPya8wsq/pULfrOIiQRS+emqcfv4LQIhRTYx9O/+Ot6QhqU5+PxUlRRlj/A8oSExtH1lWa2f0MajRsCZxMikcRsX0Ls/UnAcc2OqEi1YV9gRzM7W6E+7iTLqJ1bgz6Prx9BxJXFCxi4on5XMzu08lk1X2MaIeFiw3I3+dSQsynRcAd0LcQFbd2E0MizCSk0ziNUjBoqf5H00npHFk7bsL2ZvSOxfUJMnVJGXKhYqr99o5n9roJcsiZDBzCDgdXf6sYNgbMp0XAHdI28HXgFsV6Amf0r5umph3MJxuApNvEU5A4AqyTta2Y3Qd+q4LIkfiqvv/0JSQdZdv3tZMqJ9YRUJw0Nm3ZD4GxKNMMBXQtrY8bIUs3kiQ3QeSZh+uAe+n0EzqbLJ4Bzo68AQvW3wzPkctffHo5wXDcEziaDmfWSSOdgoWRk08pGZnBhHJFMlfQx4D+AX9Sp81kzy5OkzGljJG1rZo/GfEG7xoWUmNmyCqe0uv72ANwQOE5+ZgAXA8sIfoLjgNfVqfMOSf8LzKVBieyclvA7QhQbki5J+QmymAzcHxdEGqHqXo+ky6H59bfTeNSQ4+QkK1tsKZKpDp1nZ+z28NFNDEl3mNkr0u8Hkc8se1pisOSLzcBHBI5TBUlHAp8EtovJwkpMBuqqL2xmVTOyOpsEVuF9trDZ9ZJmEUKR/xyL+Yw2s+VNa+Eg+IjAcaoQHX/TgO8AX04cWl5v4kBJWwOn0B+CeiMhsd3j9eh1hhdJG4CVhKivCYR05VAhd1T0Mc0hZCbdXtKOwOlm9tphbHZ/e9wQOE7rkHQV8L/Ar+KuDwDvN7ODWtcqp9nEtQV7ArclppTuMbOXt6I9wxmD7ThOOTPM7GwzWx9fvyQ4pZ2RzZpkBT5Jo2lwIrlacEPgOK1lsaQPSOqIrw8A6Zz2zsjjeklfBSZIOgi4iBA51hJ8ashxWkh0GJ5CqOBmwC3AZ8zssZY2zGkqkkYRMue+nuBHuBI4w1r0QHZD4DgtJK4o/VyqxsFJHj468olJCzGzZ1vdFp8acpzWskvJCEBf+dJ6q545bYoC35C0iFDH+e+SnpWULmU5rLghcJzWMiqmFQYaU/XMaWs+TwgV3sPMppvZdEIpyldJ+nyrGuVTQ47TQiR9CPgqwVkIseqZmf2q8lnOpoqkO4CDzGxRav8M4E/VViQ3C+95OE4LMbNzJfXQX/XsUK9NMKIZkzYCEPwEksa0okHghsBxWk588PvDvxisHeKxpuJTQ47jOMNEIhVF2SFgvJm1ZFTghsBxHKfgeNSQ4zhOwXFD4DiOU3DcEDiFRtLXJN0r6W5Jd0raq4nXuk5Sd7P0O85Q8aghp7BI2gc4GNjdzNZI2gIY2+JmOc6w4yMCp8jMBBaZ2RoAM1tkZv+SdJykv0laIOnnkgR9PfofSuqRdL+kPSRdKulBSd+KMrMlPSDp/ChzsaTO9IUlvV7SXyTdLukiSZPi/u9Kui+OUE4axnvhFBg3BE6R+ROwjaT/k/TTRB3ZU81sDzPbmVBt6uDEOWvNrBs4HbgM+BSwM3CEpM2jzE7AT83sJYRC959MXjSOPL4OvC7WQO4Bjo7nvx14WayD/K0mfGbHKcMNgVNYzGwF0EUoGfgscIGkI4ADJN0m6R7Cit+XJU67PP69B7jXzJ6MI4qHgW3iscfMrFTL+Dxg39Sl9wZeCtwcK1UdDswCngdWA2dKOpT+coeO01TcR+AUGjPbAFwHXBcf/B8HdgG6zewxSd8AxidOWRP/bky8L22Xfk/pxTnpbQFXmdl70+2RtCfwWuCdwKfpTz3hOE3DRwROYZG0UywaXmI3QmpggEVx3v6dQ1C9bXREA7wPuCl1/FZCtskdYjsmSnpxvN5mZnYFIUvlrkO4tuPUjI8InCIzCThF0lRgPbCQME20FFgAPAX8bQh6/w58StJZhBxCpyUPxgRjRwC/ljQu7v46sBy4TNJ4wqjh6CFc23FqxlNMOE4DkTQbmBcdzY6zSeBTQ47jOAXHRwSO4zgFx0cEjuM4BccNgeM4TsFxQ+A4jlNw3BA4juMUHDcEjuM4Bef/A/cTLY1j3cBDAAAAAElFTkSuQmCC\n", 442 | "text/plain": [ 443 | "
" 444 | ] 445 | }, 446 | "metadata": { 447 | "needs_background": "light" 448 | }, 449 | "output_type": "display_data" 450 | } 451 | ], 452 | "source": [ 453 | "plot_freq_dist(conference_words, num_words=30)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## Cleaning the data\n", 461 | "![Cleaning data](https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif)\n", 462 | "\n", 463 | "Oops, we missed a crucial step! Real world data is often messy and needs to undergo cleaning. You can do a bunch of preprocessing to ensure the data is clean, like:\n", 464 | "- Removing special characters and numbers - These are usually not important when trying to derive the semantics\n", 465 | "- Removing stopwords - A special category of words that don't have any significance on their own and are often used as filler words or to ensure correct grammer. Eg. the, and, but, of, is, or, those, her, \n", 466 | "- Removing HTML tags - Raw data from webpages can often be laden with HTML tags. Use a library like `BeautifulSoup` to process and remove the tags.\n", 467 | "- Standardizing words - This aims to consolidate different versions of the same version Eg. SMS/Twitter language, slang, misspellings \n", 468 | "- Converting to lower case - To ensure uniformity across all words\n" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 25, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stderr", 478 | "output_type": "stream", 479 | "text": [ 480 | "[nltk_data] Downloading package stopwords to /Users/gjena/nltk_data...\n", 481 | "[nltk_data] Unzipping corpora/stopwords.zip.\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "import re\n", 487 | "nltk.download('stopwords')\n", 488 | "from nltk.corpus import stopwords\n", 489 | "\n", 490 | "stop_words = set(stopwords.words('english'))\n", 491 | "\n", 492 | "def get_clean_sentences(sentences, remove_digits=False):\n", 493 | " '''Cleaning sentences by removing special characters and optionally digits'''\n", 494 | " clean_sentences = []\n", 495 | " for sent in sentences:\n", 496 | " pattern = r'[^a-zA-Z0-9\\s]' if not remove_digits else r'[^a-zA-Z\\s]' \n", 497 | " clean_text = re.sub(pattern, '', sent)\n", 498 | " clean_text = clean_text.lower() # Converting to lower case\n", 499 | " clean_sentences.append(clean_text)\n", 500 | " print('\\nClean sentences:', clean_sentences)\n", 501 | " return clean_sentences\n", 502 | "\n", 503 | "def filter_stopwords(words):\n", 504 | " '''Removing stopwords from given words'''\n", 505 | " filtered_words = [w for w in words if w not in stop_words]\n", 506 | " print('\\nFiltered words:', filtered_words)\n", 507 | " return filtered_words" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 26, 513 | "metadata": {}, 514 | "outputs": [ 515 | { 516 | "name": "stdout", 517 | "output_type": "stream", 518 | "text": [ 519 | "\n", 520 | "Clean sentences: ['today is rd march', 'i am in san francisco california', 'currently i am attending a natural language processing workshop']\n", 521 | "\n", 522 | "Word tokens: ['today', 'is', 'rd', 'march', 'i', 'am', 'in', 'san', 'francisco', 'california', 'currently', 'i', 'am', 'attending', 'a', 'natural', 'language', 'processing', 'workshop']\n", 523 | "\n", 524 | "Filtered words: ['today', 'rd', 'march', 'san', 'francisco', 'california', 'currently', 'attending', 'natural', 'language', 'processing', 'workshop']\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "sample_sentences = 
get_clean_sentences(sample_sentences, remove_digits = True)\n",
530 | "sample_words = get_word_tokens(sample_sentences)\n",
531 | "sample_words = filter_stopwords(sample_words)"
532 | ]
533 | },
534 | {
535 | "cell_type": "code",
536 | "execution_count": 27,
537 | "metadata": {},
538 | "outputs": [
539 | {
540 | "name": "stdout",
541 | "output_type": "stream",
542 | "text": [
543 | "\n",
544 | "Clean sentences: ['through call for code top solutions are actively supported to bring the technology into communities in need working with partners like the united nations and the linux foundation', 'deployments are underway across the globe']\n",
545 | "\n",
546 | "Word tokens: ['through', 'call', 'for', 'code', 'top', 'solutions', 'are', 'actively', 'supported', 'to', 'bring', 'the', 'technology', 'into', 'communities', 'in', 'need', 'working', 'with', 'partners', 'like', 'the', 'united', 'nations', 'and', 'the', 'linux', 'foundation', 'deployments', 'are', 'underway', 'across', 'the', 'globe']\n",
547 | "\n",
548 | "Filtered words: ['call', 'code', 'top', 'solutions', 'actively', 'supported', 'bring', 'technology', 'communities', 'need', 'working', 'partners', 'like', 'united', 'nations', 'linux', 'foundation', 'deployments', 'underway', 'across', 'globe']\n"
549 | ]
550 | }
551 | ],
552 | "source": [
553 | "conference_sentences = get_clean_sentences(conference_sentences)\n",
554 | "conference_words = get_word_tokens(conference_sentences)\n",
555 | "conference_words = filter_stopwords(conference_words)"
556 | ]
557 | },
558 | {
559 | "cell_type": "markdown",
560 | "metadata": {},
561 | "source": [
562 | "After cleaning the text and using tokenization, we are left with words. Words have certain properties which we'll be exploring in the next few sections. These characteristics can often be used as features for a Machine Learning model."
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {},
568 | "source": [
569 | "## POS tagging\n",
570 | "\n",
571 | "The English language is formed of different parts of speech (POS) like nouns, verbs, pronouns, adjectives, etc. POS tagging analyzes the words in a sentence and associates each word with a POS tag depending on how it is used. It is also called grammatical tagging or word-category disambiguation. Use ```nltk.pos_tag``` for the process. Several tagsets are in use, the most common being the Penn Treebank tagset and the Universal tagset. 
\n", 572 | "\n", 573 | "![Penn POS tags](https://slideplayer.com/slide/6855236/23/images/11/Penn+TreeBank+POS+Tag+set.jpg)" 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": 28, 579 | "metadata": {}, 580 | "outputs": [ 581 | { 582 | "name": "stderr", 583 | "output_type": "stream", 584 | "text": [ 585 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n", 586 | "[nltk_data] /Users/gjena/nltk_data...\n", 587 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", 588 | "[nltk_data] date!\n" 589 | ] 590 | } 591 | ], 592 | "source": [ 593 | "nltk.download('averaged_perceptron_tagger')\n", 594 | "\n", 595 | "def get_pos_tags(words):\n", 596 | " '''Get the part of speech (POS) tags for the words'''\n", 597 | " tags=[]\n", 598 | " for word in words:\n", 599 | " tags.append(nltk.pos_tag([word]))\n", 600 | " return tags" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 29, 606 | "metadata": {}, 607 | "outputs": [ 608 | { 609 | "data": { 610 | "text/plain": [ 611 | "[[('today', 'NN')],\n", 612 | " [('rd', 'NN')],\n", 613 | " [('march', 'NN')],\n", 614 | " [('san', 'NN')],\n", 615 | " [('francisco', 'NN')],\n", 616 | " [('california', 'NN')],\n", 617 | " [('currently', 'RB')],\n", 618 | " [('attending', 'VBG')],\n", 619 | " [('natural', 'JJ')],\n", 620 | " [('language', 'NN')],\n", 621 | " [('processing', 'NN')],\n", 622 | " [('workshop', 'NN')]]" 623 | ] 624 | }, 625 | "execution_count": 29, 626 | "metadata": {}, 627 | "output_type": "execute_result" 628 | } 629 | ], 630 | "source": [ 631 | "sample_tags = get_pos_tags(sample_words)\n", 632 | "sample_tags" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 30, 638 | "metadata": { 639 | "scrolled": true 640 | }, 641 | "outputs": [ 642 | { 643 | "data": { 644 | "text/plain": [ 645 | "[[('call', 'NN')],\n", 646 | " [('code', 'NN')],\n", 647 | " [('top', 'NN')],\n", 648 | " [('solutions', 'NNS')],\n", 649 | " [('actively', 'RB')],\n", 650 | " [('supported', 'VBN')],\n", 651 | " [('bring', 'NN')],\n", 652 | " [('technology', 'NN')],\n", 653 | " [('communities', 'NNS')],\n", 654 | " [('need', 'NN')],\n", 655 | " [('working', 'VBG')],\n", 656 | " [('partners', 'NNS')],\n", 657 | " [('like', 'IN')],\n", 658 | " [('united', 'JJ')],\n", 659 | " [('nations', 'NNS')],\n", 660 | " [('linux', 'NN')],\n", 661 | " [('foundation', 'NN')],\n", 662 | " [('deployments', 'NNS')],\n", 663 | " [('underway', 'RB')],\n", 664 | " [('across', 'IN')],\n", 665 | " [('globe', 'NN')]]" 666 | ] 667 | }, 668 | "execution_count": 30, 669 | "metadata": {}, 670 | "output_type": "execute_result" 671 | } 672 | ], 673 | "source": [ 674 | "conference_tags = get_pos_tags(conference_words)\n", 675 | "conference_tags" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "## Text processing\n", 683 | "Text processing approaches like stemming and lemmatization help in reducing inflectional forms of words. \n", 684 | "### Dictionary and thesaurus\n", 685 | "WordNet is a lexical database that also has relationships between different words. You can use synsets to find definitions, synonyms and antonyms for words. You can also find hyponyms and hypernyms using the same process. 
A hypernym is a generalized concept like 'programming language', whereas a hyponym is a specific concept like 'Python' or 'Java'.\n",
686 | "\n",
687 | "![Hypernym and hyponym](https://upload.wikimedia.org/wikipedia/en/thumb/1/1f/Hyponymsandhypernyms.jpg/300px-Hyponymsandhypernyms.jpg)\n"
688 | ]
689 | },
690 | {
691 | "cell_type": "code",
692 | "execution_count": 31,
693 | "metadata": {
694 | "scrolled": true
695 | },
696 | "outputs": [
697 | {
698 | "name": "stderr",
699 | "output_type": "stream",
700 | "text": [
701 | "[nltk_data] Downloading package wordnet to /Users/gjena/nltk_data...\n",
702 | "[nltk_data] Package wordnet is already up-to-date!\n"
703 | ]
704 | }
705 | ],
706 | "source": [
707 | "nltk.download('wordnet')\n",
708 | "from nltk.corpus import wordnet \n",
709 | "\n",
710 | "def get_wordnet_properties(words):\n",
711 | "    '''Returns definition, synonyms and antonyms of words'''\n",
712 | "    for word in words:\n",
713 | "        synonyms = []\n",
714 | "        antonyms = []\n",
715 | "#         hyponyms = []\n",
716 | "#         hypernyms = []\n",
717 | "        definitions = []\n",
718 | "        for syn in wordnet.synsets(word):\n",
719 | "            for lm in syn.lemmas():\n",
720 | "                synonyms.append(lm.name())\n",
721 | "                if lm.antonyms(): \n",
722 | "                    antonyms.append(lm.antonyms()[0].name())\n",
723 | "#             hyponyms.append(syn.hyponyms())\n",
724 | "#             hypernyms.append(syn.hypernyms())\n",
725 | "#             definitions.append(syn.definition())\n",
726 | "        \n",
727 | "        print(word)\n",
728 | "        print('Synonyms:', synonyms, '\\nAntonyms:', antonyms, '\\n')\n",
729 | "#         print('Definition:', definitions, '\\n')"
730 | ]
731 | },
732 | {
733 | "cell_type": "markdown",
734 | "metadata": {},
735 | "source": [
736 | "Have you watched the series 'Friends'? Do you remember the [episode](https://youtu.be/B1tOqZUNebs?t=100) where Joey has to write a letter of recommendation for Monica and Chandler for the adoption agency? He uses a thesaurus to make himself sound smarter in the letter! 
Let's see if we get the same results:\n", 737 | "\n", 738 | "'They are warm, nice people with big hearts' -> 'They are humid, prepossessing Homo Sapiens with full-sized aortic pumps'\n", 739 | "\n", 740 | "![Joey Friends](https://media.giphy.com/media/VEsfbW0pBu145PPhOi/giphy.gif)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": 32, 746 | "metadata": { 747 | "scrolled": true 748 | }, 749 | "outputs": [ 750 | { 751 | "name": "stdout", 752 | "output_type": "stream", 753 | "text": [ 754 | "they\n", 755 | "Synonyms: [] \n", 756 | "Antonyms: [] \n", 757 | "\n", 758 | "are\n", 759 | "Synonyms: ['are', 'ar', 'be', 'be', 'be', 'exist', 'be', 'be', 'equal', 'be', 'constitute', 'represent', 'make_up', 'comprise', 'be', 'be', 'follow', 'embody', 'be', 'personify', 'be', 'be', 'live', 'be', 'cost', 'be'] \n", 760 | "Antonyms: ['differ'] \n", 761 | "\n", 762 | "warm\n", 763 | "Synonyms: ['warm', 'warm_up', 'warm', 'warm', 'warm', 'warm', 'affectionate', 'fond', 'lovesome', 'tender', 'warm', 'strong', 'warm', 'quick', 'warm', 'ardent', 'warm', 'warm', 'warm', 'warm', 'warmly', 'warm'] \n", 764 | "Antonyms: ['cool', 'cool', 'cool'] \n", 765 | "\n", 766 | "nice\n", 767 | "Synonyms: ['Nice', 'nice', 'decent', 'nice', 'nice', 'skillful', 'dainty', 'nice', 'overnice', 'prissy', 'squeamish', 'courteous', 'gracious', 'nice'] \n", 768 | "Antonyms: ['nasty'] \n", 769 | "\n", 770 | "people\n", 771 | "Synonyms: ['people', 'citizenry', 'people', 'people', 'multitude', 'masses', 'mass', 'hoi_polloi', 'people', 'the_great_unwashed', 'people', 'people'] \n", 772 | "Antonyms: [] \n", 773 | "\n", 774 | "with\n", 775 | "Synonyms: [] \n", 776 | "Antonyms: [] \n", 777 | "\n", 778 | "big\n", 779 | "Synonyms: ['large', 'big', 'big', 'bad', 'big', 'big', 'big', 'large', 'prominent', 'big', 'heavy', 'boastful', 'braggart', 'bragging', 'braggy', 'big', 'cock-a-hoop', 'crowing', 'self-aggrandizing', 'self-aggrandising', 'big', 'swelled', 'vainglorious', 'adult', 'big', 'full-grown', 'fully_grown', 'grown', 'grownup', 'big', 'big', 'large', 'magnanimous', 'big', 'bighearted', 'bounteous', 'bountiful', 'freehanded', 'handsome', 'giving', 'liberal', 'openhanded', 'big', 'enceinte', 'expectant', 'gravid', 'great', 'large', 'heavy', 'with_child', 'big', 'boastfully', 'vauntingly', 'big', 'large', 'big', 'big'] \n", 780 | "Antonyms: ['small', 'little', 'small'] \n", 781 | "\n", 782 | "hearts\n", 783 | "Synonyms: ['hearts', 'Black_Maria', 'heart', 'bosom', 'heart', 'pump', 'ticker', 'heart', 'mettle', 'nerve', 'spunk', 'center', 'centre', 'middle', 'heart', 'eye', 'kernel', 'substance', 'core', 'center', 'centre', 'essence', 'gist', 'heart', 'heart_and_soul', 'inwardness', 'marrow', 'meat', 'nub', 'pith', 'sum', 'nitty-gritty', 'heart', 'spirit', 'heart', 'heart', 'affection', 'affectionateness', 'fondness', 'tenderness', 'heart', 'warmness', 'warmheartedness', 'philia', 'heart'] \n", 784 | "Antonyms: [] \n", 785 | "\n" 786 | ] 787 | } 788 | ], 789 | "source": [ 790 | "joey_dialogue = ['they', 'are', 'warm', 'nice', 'people', 'with', 'big', 'hearts']\n", 791 | "get_wordnet_properties(joey_dialogue)" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "## Word Sense Disambiguation\n", 799 | "\n", 800 | "These synsets are also used for disambiguation, particularly Word Sense Disambiguation using Lesk Algorithm. 
See: http://www.nltk.org/howto/wsd.html"
801 | ]
802 | },
803 | {
804 | "cell_type": "code",
805 | "execution_count": 33,
806 | "metadata": {},
807 | "outputs": [
808 | {
809 | "name": "stdout",
810 | "output_type": "stream",
811 | "text": [
812 | "Synset('savings_bank.n.02')\n"
813 | ]
814 | }
815 | ],
816 | "source": [
817 | "from nltk.wsd import lesk\n",
818 | "sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']\n",
819 | "print(lesk(sent, 'bank', 'n'))"
820 | ]
821 | },
822 | {
823 | "cell_type": "code",
824 | "execution_count": 34,
825 | "metadata": {},
826 | "outputs": [
827 | {
828 | "name": "stdout",
829 | "output_type": "stream",
830 | "text": [
831 | "Synset('bank.n.06')\n"
832 | ]
833 | }
834 | ],
835 | "source": [
836 | "sent = ['I', 'was', 'sitting', 'by', 'the', 'bank', '.']\n",
837 | "print(lesk(sent, 'bank', 'n'))"
838 | ]
839 | },
840 | {
841 | "cell_type": "markdown",
842 | "metadata": {},
843 | "source": [
844 | "## Stemming\n",
845 | "Stemming chops off the ends of words in the hope of deriving the base form. Stems aren't always real words. Use ```PorterStemmer``` from ```nltk.stem```."
846 | ]
847 | },
848 | {
849 | "cell_type": "code",
850 | "execution_count": 35,
851 | "metadata": {},
852 | "outputs": [],
853 | "source": [
854 | "from nltk.stem import PorterStemmer\n",
855 | "\n",
856 | "def get_stems(words):\n",
857 | "    '''Reduce the words to their base word (stem) by cutting off the ends'''\n",
858 | "    ps = PorterStemmer()\n",
859 | "    stems = []\n",
860 | "    for word in words:\n",
861 | "        stems.append(ps.stem(word))\n",
862 | "    print(stems)\n",
863 | "    return stems"
864 | ]
865 | },
866 | {
867 | "cell_type": "code",
868 | "execution_count": 36,
869 | "metadata": {
870 | "scrolled": true
871 | },
872 | "outputs": [
873 | {
874 | "name": "stdout",
875 | "output_type": "stream",
876 | "text": [
877 | "['today', 'rd', 'march', 'san', 'francisco', 'california', 'current', 'attend', 'natur', 'languag', 'process', 'workshop']\n"
878 | ]
879 | }
880 | ],
881 | "source": [
882 | "sample_stems = get_stems(sample_words)"
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": 37,
888 | "metadata": {},
889 | "outputs": [
890 | {
891 | "name": "stdout",
892 | "output_type": "stream",
893 | "text": [
894 | "['call', 'code', 'top', 'solut', 'activ', 'support', 'bring', 'technolog', 'commun', 'need', 'work', 'partner', 'like', 'unit', 'nation', 'linux', 'foundat', 'deploy', 'underway', 'across', 'globe']\n"
895 | ]
896 | }
897 | ],
898 | "source": [
899 | "conference_stems = get_stems(conference_words)"
900 | ]
901 | },
902 | {
903 | "cell_type": "markdown",
904 | "metadata": {},
905 | "source": [
906 | "## Lemmatization\n",
907 | "Lemmatization groups the different inflected forms of a word so they can be mapped to the same base. Lemmas are real words. It is more complex than stemming because the context of words is also analyzed. It uses WordNet, a lexical English database. \n",
908 | "Use ```WordNetLemmatizer``` from ```nltk.stem``` and provide it the POS tag along with the word. NLTK’s POS tags are in a format different from that of the WordNet lemmatizer, so a mapping is needed. 
https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python\n"
909 | ]
910 | },
911 | {
912 | "cell_type": "code",
913 | "execution_count": 38,
914 | "metadata": {},
915 | "outputs": [
916 | {
917 | "name": "stderr",
918 | "output_type": "stream",
919 | "text": [
920 | "[nltk_data] Downloading package wordnet to /Users/gjena/nltk_data...\n",
921 | "[nltk_data] Package wordnet is already up-to-date!\n"
922 | ]
923 | }
924 | ],
925 | "source": [
926 | "nltk.download('wordnet')\n",
927 | "from nltk.stem import WordNetLemmatizer\n",
928 | "from nltk.corpus import wordnet\n",
929 | "\n",
930 | "def get_lemma(word_tags):\n",
931 | "    '''Reduce the words to their base word (lemma) by using a lexicon'''\n",
932 | "    wordnet_lemmatizer = WordNetLemmatizer()\n",
933 | "    lemma = []\n",
934 | "    for element in word_tags:\n",
935 | "        word = element[0][0]\n",
936 | "        pos = element[0][1]  # tag from the earlier tagging (recomputed below)\n",
937 | "        tag = nltk.pos_tag([word])[0][1][0].upper()\n",
938 | "        tag_dict = {\"J\": wordnet.ADJ, # Mapping NLTK POS tags to WordNet POS tags\n",
939 | "                    \"N\": wordnet.NOUN,\n",
940 | "                    \"V\": wordnet.VERB,\n",
941 | "                    \"R\": wordnet.ADV}\n",
942 | "\n",
943 | "        wordnet_pos = tag_dict.get(tag, wordnet.NOUN)\n",
944 | "        lemma.append(wordnet_lemmatizer.lemmatize(word, wordnet_pos))\n",
945 | "    print(lemma)\n",
946 | "    return lemma"
947 | ]
948 | },
949 | {
950 | "cell_type": "code",
951 | "execution_count": 39,
952 | "metadata": {
953 | "scrolled": true
954 | },
955 | "outputs": [
956 | {
957 | "name": "stdout",
958 | "output_type": "stream",
959 | "text": [
960 | "['today', 'rd', 'march', 'san', 'francisco', 'california', 'currently', 'attend', 'natural', 'language', 'processing', 'workshop']\n"
961 | ]
962 | }
963 | ],
964 | "source": [
965 | "sample_lemma = get_lemma(sample_tags)"
966 | ]
967 | },
968 | {
969 | "cell_type": "code",
970 | "execution_count": 40,
971 | "metadata": {
972 | "scrolled": true
973 | },
974 | "outputs": [
975 | {
976 | "name": "stdout",
977 | "output_type": "stream",
978 | "text": [
979 | "['call', 'code', 'top', 'solution', 'actively', 'support', 'bring', 'technology', 'community', 'need', 'work', 'partner', 'like', 'united', 'nation', 'linux', 'foundation', 'deployment', 'underway', 'across', 'globe']\n"
980 | ]
981 | }
982 | ],
983 | "source": [
984 | "conference_lemma = get_lemma(conference_tags)"
985 | ]
986 | },
987 | {
988 | "cell_type": "markdown",
989 | "metadata": {},
990 | "source": [
991 | "These processes can create features that act as inputs to predictive models. They also save memory by making the data smaller and reducing the size of the vocabulary. Oftentimes, these normalized words are sufficient to convey the semantics of the original sentences."
992 | ]
993 | },
994 | {
995 | "cell_type": "markdown",
996 | "metadata": {},
997 | "source": [
998 | "## Distances \n",
999 | "You can calculate distances between words. There are a variety of distance metrics available: https://en.wikipedia.org/wiki/String_metric. The most common ones are Levenshtein distance, cosine distance and Jaccard similarity. Applications include spell checking, correction for OCRs and Machine Translation. 
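As an illustration — a minimal sketch, not part of the original notebook, computing two of these metrics with NLTK's built-in functions:

```python
from nltk.metrics.distance import edit_distance, jaccard_distance

# Levenshtein (edit) distance: minimum number of single-character
# insertions, deletions and substitutions to turn one string into the other
print(edit_distance('kitten', 'sitting'))  # 3

# Jaccard distance over character sets: 1 - |A intersect B| / |A union B|
print(jaccard_distance(set('language'), set('languages')))  # ~0.14
```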
For an implementation of a spell checker, see here: https://norvig.com/spell-correct.html\n",
1000 | "\n",
1001 | "![Edit distance](https://i.stack.imgur.com/5Pjr7.png)\n"
1002 | ]
1003 | },
1004 | {
1005 | "cell_type": "markdown",
1006 | "metadata": {},
1007 | "source": [
1008 | "## Named Entity Recognition (NER) \n",
1009 | "\n",
1010 | "Named Entity Recognition (NER), also known as entity chunking or entity extraction, is a sub-task of information extraction. It identifies mentions of named entities and classifies them into sub-categories like person name, organization, location, time, etc. In other words, NER labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. \n",
1011 | "\n",
1012 | "Some of the most popular NER models are here: https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175.
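NLTK itself ships a simple chunker-based NER. A hedged sketch, not part of the original notebook (the example sentence is made up; `punkt` and the POS tagger were downloaded in earlier cells):

```python
import nltk
nltk.download('maxent_ne_chunker')  # pre-trained named-entity chunker
nltk.download('words')              # word list the chunker relies on

sentence = 'Grishma presented an NLP workshop for IBM in San Francisco.'
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))  # NER builds on POS tags
for subtree in tree.subtrees():
    if subtree.label() != 'S':  # skip the root node, keep only entity chunks
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
```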
\n", 1013 | "\n", 1014 | "Example use-cases include customer support, search engine, news classification. Another emerging application is for redacting personally identifiable information (PII). A great demo of NER in action is here: https://explosion.ai/demos/displacy-ent\n" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "markdown", 1019 | "metadata": {}, 1020 | "source": [ 1021 | "## Text representation\n", 1022 | "### Bag of words\n", 1023 | "Bag of words is an approach for text feature extraction. Just imagine a bag of popcorn, \n", 1024 | "and each popcorn kernel represents a word that is present in the text. Each sentence can be represented as a vector\n", 1025 | "of all the words present in a vocabulary. If a word is present in the sentence, it is 1, otherwise 0.\n", 1026 | "\n", 1027 | "![Bag of words](https://cdn-images-1.medium.com/max/1600/1*zMdHVQQ7HYv_mMZ5Ne-2yQ.png)\n", 1028 | "\n", 1029 | "### TF-IDF\n", 1030 | "Term-frequency inverse document frequency assigns scores to words inside a document. Commonly occuring words in all documents would have less weightage.\n", 1031 | "![TF IDF](https://cdn-images-1.medium.com/max/1600/1*g9r9kJy71m-Ry_Y6XUYLLA.gif)" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": 42, 1037 | "metadata": {}, 1038 | "outputs": [], 1039 | "source": [ 1040 | "import sklearn\n", 1041 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1042 | "\n", 1043 | "def get_bag_of_words(sentences):\n", 1044 | " '''Compute bag of words for sentences'''\n", 1045 | " vectorizer = CountVectorizer()\n", 1046 | " print('\\nBag of words:', vectorizer.fit_transform(sentences).todense())\n", 1047 | " print('\\nDictionary:', vectorizer.vocabulary_) " 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": 43, 1053 | "metadata": {}, 1054 | "outputs": [ 1055 | { 1056 | "name": "stdout", 1057 | "output_type": "stream", 1058 | "text": [ 1059 | "\n", 1060 | "Bag of words: [[1 2 1 1 1 1 1 1 1 1 1 1 1 1 1]]\n", 1061 | "\n", 1062 | "Dictionary: {'today': 13, 'is': 7, '23rd': 0, 'march': 9, 'am': 1, 'in': 6, 'san': 12, 'francisco': 5, 'california': 3, 'currently': 4, 'attending': 2, 'natural': 10, 'language': 8, 'processing': 11, 'workshop': 14}\n" 1063 | ] 1064 | } 1065 | ], 1066 | "source": [ 1067 | "get_bag_of_words(sample_data)" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "execution_count": 44, 1073 | "metadata": { 1074 | "scrolled": true 1075 | }, 1076 | "outputs": [ 1077 | { 1078 | "name": "stdout", 1079 | "output_type": "stream", 1080 | "text": [ 1081 | "\n", 1082 | "Bag of words: [[1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1]]\n", 1083 | "\n", 1084 | "Dictionary: {'through': 23, 'call': 5, 'for': 9, 'code': 6, 'top': 25, 'solutions': 19, 'are': 3, 'actively': 1, 'supported': 20, 'to': 24, 'bring': 4, 'the': 22, 'technology': 21, 'into': 13, 'communities': 7, 'in': 12, 'need': 17, 'working': 29, 'with': 28, 'partners': 18, 'like': 14, 'united': 27, 'nations': 16, 'and': 2, 'linux': 15, 'foundation': 10, 'deployments': 8, 'underway': 26, 'across': 0, 'globe': 11}\n" 1085 | ] 1086 | } 1087 | ], 1088 | "source": [ 1089 | "get_bag_of_words(conference_data)" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "markdown", 1094 | "metadata": {}, 1095 | "source": [ 1096 | "### Word embeddings - Word2Vec\n", 1097 | "Vector space model - represent words and sentences as vectors to get semantic relationships. 
A really good tutorial for Word2Vec is here: https://www.kaggle.com/alvations/word2vec-embedding-using-gensim-and-nltk\n",
1098 | "\n",
1099 | "![Word2Vec](http://www.flyml.net/wp-content/uploads/2016/11/w2v-3-samples.png)"
1100 | ]
1101 | },
1102 | {
1103 | "cell_type": "markdown",
1104 | "metadata": {},
1105 | "source": [
1106 | "## Machine Learning using Natural Language Processing\n",
1107 | "Machine Learning includes two approaches: supervised and unsupervised. Supervised learning works on data that already has labels, i.e. the labels provide supervision to the model. Eg. Classification, Regression. Unsupervised learning finds the inherent structure present in the data without any labels, i.e. no supervision. Eg. Clustering.\n",
1108 | "\n",
1109 | "Example application: Creating an application for community members to submit issues to a governing body. Classification can be used to predict which type of issue it is - sanitation, infrastructure, safety, etc.\n",
1110 | "\n",
1111 | "A lot of these text properties can be used as features for Machine Learning systems. One specific case is text classification. A more detailed resource is here: https://www.nltk.org/book/ch06.html\n",
1112 | "\n",
1113 | "The 20 newsgroups dataset comprises around 18000 newsgroup posts on 20 topics. More details here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html"
1114 | ]
1115 | },
1116 | {
1117 | "cell_type": "code",
1118 | "execution_count": 45,
1119 | "metadata": {
1120 | "scrolled": true
1121 | },
1122 | "outputs": [
1123 | {
1124 | "name": "stdout",
1125 | "output_type": "stream",
1126 | "text": [
1127 | "Classes present: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n",
1128 | "Number of classes present: 20\n",
1129 | "Number of data points: 18846\n"
1130 | ]
1131 | }
1132 | ],
1133 | "source": [
1134 | "from sklearn.datasets import fetch_20newsgroups\n",
1135 | "\n",
1136 | "news = fetch_20newsgroups(subset='all')\n",
1137 | "print('Classes present:', news.target_names)\n",
1138 | "print('Number of classes present:', len(news.target_names))\n",
1139 | "print('Number of data points:', len(news.data))"
1140 | ]
1141 | },
1142 | {
1143 | "cell_type": "code",
1144 | "execution_count": 46,
1145 | "metadata": {},
1146 | "outputs": [
1147 | {
1148 | "name": "stdout",
1149 | "output_type": "stream",
1150 | "text": [
1151 | "[rec.sport.hockey]:\t\t \"From: Mamatha Devineni Ratnam ...\"\n",
1152 | "[comp.sys.ibm.pc.hardware]:\t\t \"From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson) ...\"\n",
1153 | "[talk.politics.mideast]:\t\t \"From: hilmi-er@dsv.su.se (Hilmi Eren) ...\"\n",
1154 | "[comp.sys.ibm.pc.hardware]:\t\t \"From: guyd@austin.ibm.com (Guy Dawson) ...\"\n",
1155 | "[comp.sys.mac.hardware]:\t\t \"From: Alexander Samuel McDiarmid ...\"\n",
1156 | "[sci.electronics]:\t\t \"From: tell@cs.unc.edu (Stephen Tell) ...\"\n",
1157 | "[comp.sys.mac.hardware]:\t\t \"From: lpa8921@tamuts.tamu.edu (Louis Paul Adams) ...\"\n",
1158 | "[rec.sport.hockey]:\t\t \"From: dchhabra@stpl.ists.ca (Deepak Chhabra) ...\"\n",
1159 | "[rec.sport.hockey]:\t\t \"From: dchhabra@stpl.ists.ca (Deepak Chhabra) ...\"\n",
1160 | "[talk.religion.misc]:\t\t \"From: 
arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) ...\"\n" 1161 | ] 1162 | } 1163 | ], 1164 | "source": [ 1165 | "# Printing the first few characters for each category\n", 1166 | "\n", 1167 | "for text, num_label in zip(news.data[:10], news.target[:10]):\n", 1168 | " print('[%s]:\\t\\t \"%s ...\"' % (news.target_names[num_label], text[:500].split('\\n')[0]))" 1169 | ] 1170 | }, 1171 | { 1172 | "cell_type": "markdown", 1173 | "metadata": {}, 1174 | "source": [ 1175 | "### Dividing into training and test data sets\n", 1176 | "Think of it as learning in class (training) and then taking an exam (testing) to evaluate your performance. The testing is done on unseen data to know the actual abilities of the classifier i.e. preventing memorization or rote-learning. Test data set is usually 20-25% of the data set. You can also use cross-validation to ensure robustness of classifier." 1177 | ] 1178 | }, 1179 | { 1180 | "cell_type": "code", 1181 | "execution_count": 47, 1182 | "metadata": {}, 1183 | "outputs": [], 1184 | "source": [ 1185 | "from sklearn.model_selection import train_test_split\n", 1186 | " \n", 1187 | "def train(classifier, X, y):\n", 1188 | " \"\"\"Train given classifier\"\"\"\n", 1189 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)\n", 1190 | " classifier.fit(X_train, y_train)\n", 1191 | " print(\"\\nAccuracy:\", classifier.score(X_test, y_test))\n", 1192 | " return classifier" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "markdown", 1197 | "metadata": {}, 1198 | "source": [ 1199 | "### Naive Bayes\n", 1200 | "Probabilistic classifier based on Bayes theorem. Assumes independence among the features. Details here: https://en.wikipedia.org/wiki/Naive_Bayes_classifier\n", 1201 | "\n", 1202 | "We'll be using Pipeline to apply transformations sequentially: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html\n", 1203 | "\n", 1204 | "Adapted from: https://nlpforhackers.io/text-classification/\n" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "markdown", 1209 | "metadata": {}, 1210 | "source": [ 1211 | "#### Feature: TF-IDF " 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "code", 1216 | "execution_count": 48, 1217 | "metadata": {}, 1218 | "outputs": [], 1219 | "source": [ 1220 | "from sklearn.naive_bayes import MultinomialNB\n", 1221 | "from sklearn.pipeline import Pipeline\n", 1222 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1223 | " \n", 1224 | "model_1 = Pipeline([('vectorizer', TfidfVectorizer()),('classifier', MultinomialNB())])" 1225 | ] 1226 | }, 1227 | { 1228 | "cell_type": "code", 1229 | "execution_count": 49, 1230 | "metadata": {}, 1231 | "outputs": [ 1232 | { 1233 | "name": "stdout", 1234 | "output_type": "stream", 1235 | "text": [ 1236 | "Results for TF-IDF as feature:\n", 1237 | "\n", 1238 | "Accuracy: 0.8463497453310697\n" 1239 | ] 1240 | }, 1241 | { 1242 | "data": { 1243 | "text/plain": [ 1244 | "Pipeline(steps=[('vectorizer', TfidfVectorizer()),\n", 1245 | " ('classifier', MultinomialNB())])" 1246 | ] 1247 | }, 1248 | "execution_count": 49, 1249 | "metadata": {}, 1250 | "output_type": "execute_result" 1251 | } 1252 | ], 1253 | "source": [ 1254 | "print('Results for TF-IDF as feature:')\n", 1255 | "train(model_1, news.data, news.target)" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "markdown", 1260 | "metadata": {}, 1261 | "source": [ 1262 | "#### Feature: TF i.e. 
Removing IDF from TF-IDF" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": 50, 1268 | "metadata": {}, 1269 | "outputs": [], 1270 | "source": [ 1271 | "model_2 = Pipeline([('vectorizer', TfidfVectorizer(use_idf=False)),\n", 1272 | " ('classifier', MultinomialNB())])" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "code", 1277 | "execution_count": 51, 1278 | "metadata": {}, 1279 | "outputs": [ 1280 | { 1281 | "name": "stdout", 1282 | "output_type": "stream", 1283 | "text": [ 1284 | "Results for TF as feature, removing IDF\n", 1285 | "\n", 1286 | "Accuracy: 0.756578947368421\n" 1287 | ] 1288 | }, 1289 | { 1290 | "data": { 1291 | "text/plain": [ 1292 | "Pipeline(steps=[('vectorizer', TfidfVectorizer(use_idf=False)),\n", 1293 | " ('classifier', MultinomialNB())])" 1294 | ] 1295 | }, 1296 | "execution_count": 51, 1297 | "metadata": {}, 1298 | "output_type": "execute_result" 1299 | } 1300 | ], 1301 | "source": [ 1302 | "print('Results for TF as feature, removing IDF')\n", 1303 | "train(model_2, news.data, news.target)" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": {}, 1309 | "source": [ 1310 | "So, IDF does make a huge difference!" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": {}, 1316 | "source": [ 1317 | "#### Feature: TF-IDF + stopwords removal" 1318 | ] 1319 | }, 1320 | { 1321 | "cell_type": "code", 1322 | "execution_count": 52, 1323 | "metadata": {}, 1324 | "outputs": [], 1325 | "source": [ 1326 | "model_3 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),\n", 1327 | " ('classifier', MultinomialNB())])" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "code", 1332 | "execution_count": 53, 1333 | "metadata": {}, 1334 | "outputs": [ 1335 | { 1336 | "name": "stdout", 1337 | "output_type": "stream", 1338 | "text": [ 1339 | "Results for TF-IDF as feature, using stopword removal:\n", 1340 | "\n", 1341 | "Accuracy: 0.8777589134125636\n" 1342 | ] 1343 | }, 1344 | { 1345 | "data": { 1346 | "text/plain": [ 1347 | "Pipeline(steps=[('vectorizer',\n", 1348 | " TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we',\n", 1349 | " 'our', 'ours', 'ourselves', 'you',\n", 1350 | " \"you're\", \"you've\", \"you'll\",\n", 1351 | " \"you'd\", 'your', 'yours',\n", 1352 | " 'yourself', 'yourselves', 'he',\n", 1353 | " 'him', 'his', 'himself', 'she',\n", 1354 | " \"she's\", 'her', 'hers', 'herself',\n", 1355 | " 'it', \"it's\", 'its', 'itself', ...])),\n", 1356 | " ('classifier', MultinomialNB())])" 1357 | ] 1358 | }, 1359 | "execution_count": 53, 1360 | "metadata": {}, 1361 | "output_type": "execute_result" 1362 | } 1363 | ], 1364 | "source": [ 1365 | "print('Results for TF-IDF as feature, using stopword removal:')\n", 1366 | "train(model_3, news.data, news.target)" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "markdown", 1371 | "metadata": {}, 1372 | "source": [ 1373 | "#### Feature: TF-IDF + stopwords removal + ignoring words with frequency < 5\n", 1374 | "Trying simple things may work too!" 
1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "code", 1379 | "execution_count": 54, 1380 | "metadata": {}, 1381 | "outputs": [], 1382 | "source": [ 1383 | "model_4 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'), min_df=5)),\n", 1384 | " ('classifier', MultinomialNB())])" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": 55, 1390 | "metadata": { 1391 | "scrolled": false 1392 | }, 1393 | "outputs": [ 1394 | { 1395 | "name": "stdout", 1396 | "output_type": "stream", 1397 | "text": [ 1398 | "Results for TF-IDF + stopwords removal + ignoring words with frequency < 5:\n", 1399 | "\n", 1400 | "Accuracy: 0.8820033955857386\n" 1401 | ] 1402 | }, 1403 | { 1404 | "data": { 1405 | "text/plain": [ 1406 | "Pipeline(steps=[('vectorizer',\n", 1407 | " TfidfVectorizer(min_df=5,\n", 1408 | " stop_words=['i', 'me', 'my', 'myself', 'we',\n", 1409 | " 'our', 'ours', 'ourselves', 'you',\n", 1410 | " \"you're\", \"you've\", \"you'll\",\n", 1411 | " \"you'd\", 'your', 'yours',\n", 1412 | " 'yourself', 'yourselves', 'he',\n", 1413 | " 'him', 'his', 'himself', 'she',\n", 1414 | " \"she's\", 'her', 'hers', 'herself',\n", 1415 | " 'it', \"it's\", 'its', 'itself', ...])),\n", 1416 | " ('classifier', MultinomialNB())])" 1417 | ] 1418 | }, 1419 | "execution_count": 55, 1420 | "metadata": {}, 1421 | "output_type": "execute_result" 1422 | } 1423 | ], 1424 | "source": [ 1425 | "print('Results for TF-IDF + stopwords removal + ignoring words with frequency < 5:')\n", 1426 | "train(model_4, news.data, news.target)" 1427 | ] 1428 | }, 1429 | { 1430 | "cell_type": "markdown", 1431 | "metadata": {}, 1432 | "source": [ 1433 | "#### Feature: TF-IDF + stopwords removal + ignoring words with frequency < 10" 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "code", 1438 | "execution_count": 56, 1439 | "metadata": {}, 1440 | "outputs": [], 1441 | "source": [ 1442 | "model_5 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'), min_df=10)),\n", 1443 | " ('classifier', MultinomialNB())])" 1444 | ] 1445 | }, 1446 | { 1447 | "cell_type": "code", 1448 | "execution_count": 57, 1449 | "metadata": { 1450 | "scrolled": false 1451 | }, 1452 | "outputs": [ 1453 | { 1454 | "name": "stdout", 1455 | "output_type": "stream", 1456 | "text": [ 1457 | "Results for TF-IDF + stopwords removal + ignoring words with frequency < 10:\n", 1458 | "\n", 1459 | "Accuracy: 0.8745755517826825\n" 1460 | ] 1461 | }, 1462 | { 1463 | "data": { 1464 | "text/plain": [ 1465 | "Pipeline(steps=[('vectorizer',\n", 1466 | " TfidfVectorizer(min_df=10,\n", 1467 | " stop_words=['i', 'me', 'my', 'myself', 'we',\n", 1468 | " 'our', 'ours', 'ourselves', 'you',\n", 1469 | " \"you're\", \"you've\", \"you'll\",\n", 1470 | " \"you'd\", 'your', 'yours',\n", 1471 | " 'yourself', 'yourselves', 'he',\n", 1472 | " 'him', 'his', 'himself', 'she',\n", 1473 | " \"she's\", 'her', 'hers', 'herself',\n", 1474 | " 'it', \"it's\", 'its', 'itself', ...])),\n", 1475 | " ('classifier', MultinomialNB())])" 1476 | ] 1477 | }, 1478 | "execution_count": 57, 1479 | "metadata": {}, 1480 | "output_type": "execute_result" 1481 | } 1482 | ], 1483 | "source": [ 1484 | "print('Results for TF-IDF + stopwords removal + ignoring words with frequency < 10:')\n", 1485 | "train(model_5, news.data, news.target)" 1486 | ] 1487 | }, 1488 | { 1489 | "cell_type": "markdown", 1490 | "metadata": {}, 1491 | "source": [ 1492 | "Make sure to not go overboard with simple steps!" 
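Rather than trying `min_df` values by hand, the search can be automated. A hedged sketch using scikit-learn's `GridSearchCV` — not part of the original notebook; it also tunes the smoothing parameter `alpha` that the next section introduces, and reuses the `news` data loaded above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords

pipeline = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
                     ('classifier', MultinomialNB())])

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid = {'vectorizer__min_df': [1, 5, 10],
              'classifier__alpha': [1.0, 0.1, 0.05]}

search = GridSearchCV(pipeline, param_grid, cv=3)  # 3-fold cross-validation
search.fit(news.data, news.target)
print(search.best_params_, search.best_score_)
```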
1493 | ]
1494 | },
1495 | {
1496 | "cell_type": "markdown",
1497 | "metadata": {},
1498 | "source": [
1499 | "#### Feature: TF-IDF + stopwords removal + ignoring words with frequency < 5 + tuning hyperparameter alpha\n",
1500 | "Alpha is the additive-smoothing hyperparameter of Multinomial NB; unlike the changes above, which alter the features, it controls the model itself."
1501 | ]
1502 | },
1503 | {
1504 | "cell_type": "code",
1505 | "execution_count": 58,
1506 | "metadata": {},
1507 | "outputs": [],
1508 | "source": [
1509 | "import string\n",
1510 | "\n",
1511 | "model_6 = Pipeline([('vectorizer', TfidfVectorizer(min_df = 5,\n",
1512 | "                     stop_words=stopwords.words('english') + list(string.punctuation))),\n",
1513 | "                    ('classifier', MultinomialNB(alpha=0.05))])"
1514 | ]
1515 | },
1516 | {
1517 | "cell_type": "code",
1518 | "execution_count": 59,
1519 | "metadata": {
1520 | "scrolled": false
1521 | },
1522 | "outputs": [
1523 | {
1524 | "name": "stdout",
1525 | "output_type": "stream",
1526 | "text": [
1527 | "Results for TF-IDF + stopwords removal + ignoring words with frequency < 5 + tuning hyperparameter alpha:\n",
1528 | "\n",
1529 | "Accuracy: 0.9028013582342954\n"
1530 | ]
1531 | },
1532 | {
1533 | "data": {
1534 | "text/plain": [
1535 | "Pipeline(steps=[('vectorizer',\n",
1536 | "                 TfidfVectorizer(min_df=5,\n",
1537 | "                                 stop_words=['i', 'me', 'my', 'myself', 'we',\n",
1538 | "                                             'our', 'ours', 'ourselves', 'you',\n",
1539 | "                                             \"you're\", \"you've\", \"you'll\",\n",
1540 | "                                             \"you'd\", 'your', 'yours',\n",
1541 | "                                             'yourself', 'yourselves', 'he',\n",
1542 | "                                             'him', 'his', 'himself', 'she',\n",
1543 | "                                             \"she's\", 'her', 'hers', 'herself',\n",
1544 | "                                             'it', \"it's\", 'its', 'itself', ...])),\n",
1545 | "                ('classifier', MultinomialNB(alpha=0.05))])"
1546 | ]
1547 | },
1548 | "execution_count": 59,
1549 | "metadata": {},
1550 | "output_type": "execute_result"
1551 | }
1552 | ],
1553 | "source": [
1554 | "print('Results for TF-IDF + stopwords removal + ignoring words with frequency < 5 + tuning hyperparameter alpha:')\n",
1555 | "train(model_6, news.data, news.target)"
1556 | ]
1557 | },
1558 | {
1559 | "cell_type": "markdown",
1560 | "metadata": {},
1561 | "source": [
1562 | "#### Feature: TF-IDF + stopwords removal + ignoring words with frequency < 5 + tuning hyperparameter alpha + stemming\n",
1563 | "Let's check if stemming the words makes any difference." 
1564 | ] 1565 | }, 1566 | { 1567 | "cell_type": "code", 1568 | "execution_count": 60, 1569 | "metadata": {}, 1570 | "outputs": [], 1571 | "source": [ 1572 | "import string\n", 1573 | "from nltk.stem import PorterStemmer\n", 1574 | "from nltk import word_tokenize\n", 1575 | " \n", 1576 | "def stem_tokenizer(text):\n", 1577 | " \"\"\"Computing stem for each word in text\"\"\"\n", 1578 | " stemmer = PorterStemmer()\n", 1579 | " return [stemmer.stem(word) for word in word_tokenize(text)]\n", 1580 | "\n", 1581 | "# Stemming the stopwords as text is stemmed first, and then stopwords are removed\n", 1582 | "stemmed_stopwords = [PorterStemmer().stem(word) for word in stopwords.words('english')]\n", 1583 | " \n", 1584 | "model_7 = Pipeline([('vectorizer', TfidfVectorizer(tokenizer=stem_tokenizer, min_df = 5,\n", 1585 | " stop_words=stemmed_stopwords + list(string.punctuation))),\n", 1586 | " ('classifier', MultinomialNB(alpha=0.05))])" 1587 | ] 1588 | }, 1589 | { 1590 | "cell_type": "code", 1591 | "execution_count": 61, 1592 | "metadata": { 1593 | "scrolled": true 1594 | }, 1595 | "outputs": [ 1596 | { 1597 | "name": "stdout", 1598 | "output_type": "stream", 1599 | "text": [ 1600 | "Results for TF-IDF + stopwords removal + ignoring words with frequency < 5 + tuning hyperparameter alpha + stemming:\n", 1601 | "\n", 1602 | "Accuracy: 0.9036502546689303\n" 1603 | ] 1604 | }, 1605 | { 1606 | "data": { 1607 | "text/plain": [ 1608 | "Pipeline(steps=[('vectorizer',\n", 1609 | " TfidfVectorizer(min_df=5,\n", 1610 | " stop_words=['i', 'me', 'my', 'myself', 'we',\n", 1611 | " 'our', 'our', 'ourselv', 'you',\n", 1612 | " \"you'r\", \"you'v\", \"you'll\",\n", 1613 | " \"you'd\", 'your', 'your',\n", 1614 | " 'yourself', 'yourselv', 'he',\n", 1615 | " 'him', 'hi', 'himself', 'she',\n", 1616 | " \"she'\", 'her', 'her', 'herself',\n", 1617 | " 'it', \"it'\", 'it', 'itself', ...],\n", 1618 | " tokenizer=)),\n", 1619 | " ('classifier', MultinomialNB(alpha=0.05))])" 1620 | ] 1621 | }, 1622 | "execution_count": 61, 1623 | "metadata": {}, 1624 | "output_type": "execute_result" 1625 | } 1626 | ], 1627 | "source": [ 1628 | "print('Results for TF-IDF + stopwords removal + ignoring words with frequency < 5 +', \n", 1629 | " 'tuning hyperparameter alpha + stemming:')\n", 1630 | "train(model_7, news.data, news.target)" 1631 | ] 1632 | }, 1633 | { 1634 | "cell_type": "markdown", 1635 | "metadata": {}, 1636 | "source": [ 1637 | "Feel free to experiment with other features and see how well the classifier performs!" 1638 | ] 1639 | }, 1640 | { 1641 | "cell_type": "markdown", 1642 | "metadata": {}, 1643 | "source": [ 1644 | "## Sentiment analysis\n", 1645 | "\n", 1646 | "NLTK's VADER algorithm is used to detect polarity of words and establish the overall sentiment (compound score) for sentences. We're using a small sample of tweets.\n", 1647 | "\n", 1648 | "Example application: Detect messages or posts that could have hate or alarming speech." 
1649 | ]
1650 | },
1651 | {
1652 | "cell_type": "code",
1653 | "execution_count": 75,
1654 | "metadata": {},
1655 | "outputs": [
1656 | {
1657 | "name": "stderr",
1658 | "output_type": "stream",
1659 | "text": [
1660 | "[nltk_data] Downloading package vader_lexicon to\n",
1661 | "[nltk_data] /Users/gjena/nltk_data...\n",
1662 | "[nltk_data] Package vader_lexicon is already up-to-date!\n"
1663 | ]
1664 | }
1665 | ],
1666 | "source": [
1667 | "import nltk\n",
1668 | "nltk.download('vader_lexicon')\n",
1669 | "from nltk.sentiment.vader import SentimentIntensityAnalyzer \n",
1670 | "\n",
1671 | "def get_sentiment(data):\n",
1672 | "    '''Get sentiment of sentences using VADER algorithm'''\n",
1673 | "    scorer = SentimentIntensityAnalyzer()\n",
1674 | "    for sentence in data:  # iterate over the argument, not the global 'reviews'\n",
1675 | "        print(sentence)\n",
1676 | "        ss = scorer.polarity_scores(sentence)\n",
1677 | "        for k in ss:\n",
1678 | "            print('{0}: {1}, ' .format(k, ss[k]), end='')\n",
1679 | "        print('\\n') "
1680 | ]
1681 | },
1682 | {
1683 | "cell_type": "code",
1684 | "execution_count": 83,
1685 | "metadata": {},
1686 | "outputs": [
1687 | {
1688 | "name": "stdout",
1689 | "output_type": "stream",
1690 | "text": [
1691 | "I'm doing good.\n",
1692 | "neg: 0.0, neu: 0.408, pos: 0.592, compound: 0.4404, \n",
1693 | "\n",
1694 | "You are a loser! Go to hell!\n",
1695 | "neg: 0.682, neu: 0.318, pos: 0.0, compound: -0.8619, \n",
1696 | "\n",
1697 | "That's a great movie, you should definitely watch it\n",
1698 | "neg: 0.0, neu: 0.469, pos: 0.531, compound: 0.7783, \n",
1699 | "\n",
1700 | "You are absolutely disgusting, you should be punished\n",
1701 | "neg: 0.527, neu: 0.473, pos: 0.0, compound: -0.7713, \n",
1702 | "\n"
1703 | ]
1704 | }
1705 | ],
1706 | "source": [
1707 | "reviews = [\"I'm doing good.\",\n",
1708 | "           \"You are a loser! Go to hell!\",\n",
1709 | "           \"That's a great movie, you should definitely watch it\",\n",
1710 | "           \"You are absolutely disgusting, you should be punished\"]\n",
1711 | "\n",
1712 | "get_sentiment(reviews)"
1713 | ]
1714 | },
1715 | {
1716 | "cell_type": "markdown",
1717 | "metadata": {},
1718 | "source": [
1719 | "## Topic modeling\n",
1720 | "Topic modeling is an unsupervised ML method used to find inherent structure in documents. It learns\n",
1721 | "representations of topics in documents which allows grouping of different documents together. We will\n",
1722 | "use the ```Gensim``` library and Latent Dirichlet Allocation (LDA) for this.\n",
1723 | "\n",
1724 | "LDA’s approach to topic modeling is that it considers each document as a collection of topics in a certain proportion, 
and each topic as a collection of keywords, again, in a certain proportion.\n",
1725 | "\n",
1726 | "Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.\n",
1727 | "\n",
1728 | "Adapted from https://kleiber.me/blog/2017/07/22/tutorial-lda-wikipedia/\n",
1729 | "\n",
1730 | "Example application: Help first responders by identifying tweets with helpful information from people affected by an incident.\n",
1731 | "\n",
1732 | "![Topic modeling](https://i.stack.imgur.com/vI8Lc.png)\n",
1733 | "\n"
1734 | ]
1735 | },
1736 | {
1737 | "cell_type": "code",
1738 | "execution_count": 93,
1739 | "metadata": {},
1740 | "outputs": [],
1741 | "source": [
1742 | "import wikipedia, random\n",
1743 | "\n",
1744 | "def fetch_data(article_names):\n",
1745 | "    '''Fetching the data from given Wikipedia articles'''\n",
1746 | "    wikipedia_random_articles = wikipedia.random(2)\n",
1747 | "    wikipedia_random_articles.extend(article_names)\n",
1748 | "    # Two random article titles plus the requested ones\n",
1749 | "    print(wikipedia_random_articles)\n",
1750 | "    \n",
1751 | "    wikipedia_articles = []\n",
1752 | "    for wikipedia_article in wikipedia_random_articles:\n",
1753 | "        wikipedia_articles.append([wikipedia_article, \n",
1754 | "                                   wikipedia.page(wikipedia_article).content])\n",
1755 | "    return wikipedia_articles"
1756 | ]
1757 | },
1758 | {
1759 | "cell_type": "code",
1760 | "execution_count": 94,
1761 | "metadata": {},
1762 | "outputs": [
1763 | {
1764 | "name": "stderr",
1765 | "output_type": "stream",
1766 | "text": [
1767 | "[nltk_data] Downloading package stopwords to /Users/gjena/nltk_data...\n",
1768 | "[nltk_data] Package stopwords is already up-to-date!\n"
1769 | ]
1770 | }
1771 | ],
1772 | "source": [
1773 | "nltk.download('stopwords') \n",
1774 | "from nltk.stem.porter import PorterStemmer\n",
1775 | "from nltk.corpus import stopwords\n",
1776 | "from nltk.tokenize import RegexpTokenizer\n",
1777 | "\n",
1778 | "def clean(article):\n",
1779 | "    '''Cleaning the article contents and getting the word stems'''\n",
1780 | "    title, document = article\n",
1781 | "    tokens = RegexpTokenizer(r'\\w+').tokenize(document.lower())\n",
1782 | "    tokens_clean = [token for token in tokens if token not in \n",
1783 | "                    stopwords.words('english')]\n",
1784 | "    tokens_stemmed = [PorterStemmer().stem(token) for token \n",
1785 | "                      in tokens_clean]\n",
1786 | "    return (title, tokens_stemmed)"
1787 | ]
1788 | },
1789 | {
1790 | "cell_type": "code",
1791 | "execution_count": 95,
1792 | "metadata": {},
1793 | "outputs": [
1794 | {
1795 | "name": "stdout",
1796 | "output_type": "stream",
1797 | "text": [
1798 | "['Berrysbridge', 'Hardaspur, Raebareli', 'Disaster', 'Government', 'English language', 'Computer Programming', 'Tsunami']\n"
1799 | ]
1800 | }
1801 | ],
1802 | "source": [
1803 | "from gensim import corpora, models\n",
1804 | "import gensim\n",
1805 | "\n",
1806 | "article_names = ['Disaster', 'Government', 'English language', \n",
1807 | "                 'Computer Programming', 'Tsunami']\n",
1808 | "wikipedia_articles = fetch_data(article_names)\n",
1809 | "# wikipedia_articles is a list of [title, content] pairs\n",
1810 | "wikipedia_articles_clean = list(map(clean, wikipedia_articles))"
1811 | ]
1812 | },
1813 | {
1814 | "cell_type": "code",
1815 | "execution_count": 96,
1816 | "metadata": {},
1817 | "outputs": [],
1818 | "source": [
1819 | "article_contents = [article[1] for article in wikipedia_articles_clean]\n", 
1820 | "dictionary = corpora.Dictionary(article_contents)\n", 1821 | "corpus = [dictionary.doc2bow(article) for article in \n", 1822 | " article_contents[:-1]] # All except the last one" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "code", 1827 | "execution_count": 97, 1828 | "metadata": {}, 1829 | "outputs": [ 1830 | { 1831 | "data": { 1832 | "text/plain": [ 1833 | "[(0,\n", 1834 | " '0.000*\"villag\" + 0.000*\"devon\" + 0.000*\"berrysbridg\" + 0.000*\"link\" + 0.000*\"wikimedia\"'),\n", 1835 | " (1,\n", 1836 | " '0.028*\"program\" + 0.024*\"languag\" + 0.014*\"comput\" + 0.014*\"code\" + 0.010*\"use\"'),\n", 1837 | " (2,\n", 1838 | " '0.028*\"disast\" + 0.008*\"natur\" + 0.007*\"human\" + 0.007*\"hazard\" + 0.006*\"caus\"'),\n", 1839 | " (3,\n", 1840 | " '0.040*\"govern\" + 0.012*\"state\" + 0.012*\"form\" + 0.010*\"polit\" + 0.008*\"democraci\"'),\n", 1841 | " (4,\n", 1842 | " '0.000*\"villag\" + 0.000*\"devon\" + 0.000*\"berrysbridg\" + 0.000*\"link\" + 0.000*\"england\"'),\n", 1843 | " (5,\n", 1844 | " '0.048*\"english\" + 0.019*\"languag\" + 0.013*\"use\" + 0.013*\"word\" + 0.011*\"verb\"')]" 1845 | ] 1846 | }, 1847 | "execution_count": 97, 1848 | "metadata": {}, 1849 | "output_type": "execute_result" 1850 | } 1851 | ], 1852 | "source": [ 1853 | "lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, \n", 1854 | " id2word = dictionary, \n", 1855 | " passes=100)\n", 1856 | "\n", 1857 | "topic_results = lda_model.print_topics(num_topics=6, num_words=5)\n", 1858 | "topic_results" 1859 | ] 1860 | }, 1861 | { 1862 | "cell_type": "code", 1863 | "execution_count": 100, 1864 | "metadata": {}, 1865 | "outputs": [ 1866 | { 1867 | "ename": "ModuleNotFoundError", 1868 | "evalue": "No module named 'wordcloud'", 1869 | "output_type": "error", 1870 | "traceback": [ 1871 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 1872 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 1873 | "\u001b[0;32m/var/folders/hl/lf36z_nd0vg5lgls9n1b_6100000gn/T/ipykernel_57937/1611871103.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mmatplotlib\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mpyplot\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mwordcloud\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mWordCloud\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSTOPWORDS\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmatplotlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolors\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mmcolors\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'matplotlib'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'inline'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mcols\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mcolor\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolor\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mmcolors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTABLEAU_COLORS\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;31m# more colors: 'mcolors.XKCD_COLORS'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 1874 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'wordcloud'" 1875 | ] 1876 | } 1877 | ], 1878 | "source": [ 1879 | "from matplotlib import pyplot as plt\n", 1880 | "from wordcloud import WordCloud, STOPWORDS\n", 1881 | "import matplotlib.colors as mcolors\n", 1882 | "%matplotlib inline\n", 1883 | "cols = [color for name, color in mcolors.TABLEAU_COLORS.items()] # more colors: 'mcolors.XKCD_COLORS'\n", 1884 | "\n", 1885 | "cloud = WordCloud(background_color='white',\n", 1886 | " width=2500,\n", 1887 | " height=1800,\n", 1888 | " max_words=10,\n", 1889 | " colormap='tab10',\n", 1890 | " color_func=lambda *args, **kwargs: cols[i],\n", 1891 | " prefer_horizontal=1.0)\n", 1892 | "\n", 1893 | "topics = lda_model.show_topics(formatted=False)\n", 1894 | "\n", 1895 | "fig, axes = plt.subplots(3, 2, figsize=(10,10), sharex=True, sharey=True)\n", 1896 | "\n", 1897 | "for i, ax in enumerate(axes.flatten()):\n", 1898 | " fig.add_subplot(ax)\n", 1899 | " topic_words = dict(topics[i][1])\n", 1900 | " cloud.generate_from_frequencies(topic_words, max_font_size=300)\n", 1901 | " plt.gca().imshow(cloud)\n", 1902 | " plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))\n", 1903 | " plt.gca().axis('off')\n", 1904 | "\n", 1905 | "\n", 1906 | "plt.subplots_adjust(wspace=0, hspace=0)\n", 1907 | "plt.axis('off')\n", 1908 | "plt.margins(x=0, y=0)\n", 1909 | "plt.tight_layout()\n", 1910 | "plt.show()" 1911 | ] 1912 | }, 1913 | { 1914 | "cell_type": "code", 1915 | "execution_count": 99, 1916 | "metadata": {}, 1917 | "outputs": [ 1918 | { 1919 | "name": "stdout", 1920 | "output_type": "stream", 1921 | "text": [ 1922 | "Similarity to each of the topics: [(0, 0.012798238), (1, 0.121735476), (2, 0.4074674), (3, 0.11926399), (5, 0.3291508)]\n", 1923 | "Given topic is most similar to topic 2 with a similarity of 0.4074674\n" 1924 | ] 1925 | } 1926 | ], 1927 | "source": [ 1928 | "from operator import itemgetter\n", 1929 | "\n", 1930 | "similarity = list(lda_model[[dictionary.doc2bow(article_contents[-1])]])\n", 1931 | "print('Similarity to each of the topics:', similarity[0])\n", 1932 | "match = max(similarity[0], key=itemgetter(1))\n", 1933 | "print('Given topic is most similar to topic', match[0], ' with a similarity of', match[1])" 1934 | ] 1935 | }, 1936 | { 1937 | "cell_type": "markdown", 1938 | "metadata": {}, 1939 | "source": [ 1940 | "## Resources\n", 1941 | "\n", 1942 | "Thank you for attending! Would appreciate if you could give your [feedback](https://docs.google.com/forms/d/1wh7oz6UA4v3DI30FbRyPmnDqa4hZlJeZUuBDn0JcnE0/prefill). 
\n", 1943 | "\n", 1944 | "- More on NLP https://monkeylearn.com/blog/definitive-guide-natural-language-processing/\n", 1945 | "- A very comprehensive list of resources by Penn https://www.seas.upenn.edu/~romap/nlp-resources.html\n", 1946 | "- Peter Norvig's spell corrector http://norvig.com/spell-correct.html\n", 1947 | "- Applications and datasets https://machinelearningmastery.com/datasets-natural-language-processing/\n", 1948 | "- More datasets https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/\n", 1949 | "- https://towardsdatascience.com/text-analytics-topic-modelling-on-music-genres-song-lyrics-deb82c86caa2\n", 1950 | "- Collection of tutorials https://medium.com/machine-learning-in-practice/over-200-of-the-best-machine-learning-nlp-and-python-tutorials-2018-edition-dd8cf53cb7dc\n", 1951 | "- Text classification https://textminingonline.com/dive-into-nltk-part-vii-a-preliminary-study-on-text-classification\n" 1952 | ] 1953 | }, 1954 | { 1955 | "cell_type": "markdown", 1956 | "metadata": {}, 1957 | "source": [ 1958 | "## Contact\n", 1959 | "\n", 1960 | "You can contact me at:\n", 1961 | "\n", 1962 | "- Personal website: https://gjena.github.io/\n", 1963 | "- LinkedIn: https://www.linkedin.com/in/grishmajena/\n", 1964 | "- Twitter: @DebateLover" 1965 | ] 1966 | } 1967 | ], 1968 | "metadata": { 1969 | "kernelspec": { 1970 | "display_name": "Python 3", 1971 | "language": "python", 1972 | "name": "python3" 1973 | }, 1974 | "language_info": { 1975 | "codemirror_mode": { 1976 | "name": "ipython", 1977 | "version": 3 1978 | }, 1979 | "file_extension": ".py", 1980 | "mimetype": "text/x-python", 1981 | "name": "python", 1982 | "nbconvert_exporter": "python", 1983 | "pygments_lexer": "ipython3", 1984 | "version": "3.9.7" 1985 | } 1986 | }, 1987 | "nbformat": 4, 1988 | "nbformat_minor": 2 1989 | } 1990 | --------------------------------------------------------------------------------