├── LICENSE ├── README.md ├── Section5 ├── Reviews.rar ├── plot.html └── section5_video3_video4_training_visualizing_wordembedding.ipynb ├── section1 └── video3 │ └── section1_video3_install_corpora.ipynb ├── section2 ├── video 2 │ └── section_2_video_2_cleaning.ipynb ├── video 3 │ └── section_2_video_3_tokenizing.ipynb └── video 4 │ └── section_2_video_4_ngrams.ipynb ├── section3 ├── ner_dataset.rar ├── section3_video3_pretrained_models.ipynb └── section3_video4_training_ner.ipynb └── section4 └── section4_video3_basic_classifier.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Text Mining with Machine Learning and Python [Video] 5 | This is the code repository for [Text Mining with Machine Learning and Python [Video]](https://www.packtpub.com/application-development/text-mining-machine-learning-and-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781789137361), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish. 6 | ## About the Video Course 7 | Text is one of the most actively researched and widely spread types of data in the Data Science field today. New advances in machine learning and deep learning techniques now make it possible to build fantastic data products on text sources. New exciting text data sources pop up all the time. You'll build your own toolbox of know-how, packages, and working code snippets so you can perform your own text mining analyses. 8 | 9 | You'll start by understanding the fundamentals of modern text mining and move on to some exciting processes involved in it. You'll learn how machine learning is used to extract meaningful information from text and the different processes involved in it. You will learn to read and process text features. Then you'll learn how to extract information from text and work on pre-trained models, while also delving into text classification, and entity extraction and classification. 
 You will explore the process of word embedding by working on Skip-grams, CBOW, and X2Vec with some additional and important text mining processes. 10 | 11 | Starting from the basics of preprocessing text features, we’ll take a look at how we can extract relevant features from text and classify documents through Machine Learning. Since Word Embeddings have become indispensable in today’s NLP world, we’ll dive deeper into their inner workings and have a go at training our own embedding models. 12 | 13 | By the end of the course, you will have learned the various aspects of text mining with ML, gained a high-level understanding of the components involved in a current-day NLP pipeline along with a set of working code to build further upon, and begun your journey as an effective text miner. 14 | 15 | 16 | 17 | 
 ## What You Will Learn 18 | 19 | 
27 | 28 | ## Instructions and Navigation 29 | ### Assumed Knowledge 30 | To fully benefit from the coverage included in this course, you will need:
 31 | 32 | ● Working experience with Python and Jupyter Notebooks 33 | 34 | ● Initial experience with data analytics in Python 35 | 36 | ● A first encounter with Machine Learning (scikit-learn experience is a plus) 37 | 38 | 39 | ### Technical Requirements 40 | This course has the following software requirements:
      
41 | 42 | ● Anaconda distribution of latest Python 3 43 | 44 | ● Separate conda env with Python 3 installed 45 | 46 | ○ available to set up once Anaconda is installed 47 | 48 | ● Jupyter notebook 49 | 50 | ○ available to activate once Anaconda is installed 51 | 52 | ● Extra packages: 53 | 54 | ○ NLTK (pip install nltk==3.2.2) 55 | 56 | ○ Spacy (pip install spacy==2.0.3) 57 | 58 | ○ Gensim (pip install gensim==3.3.0) 59 | 60 | ○ Scikit-learn (pip install scikit-learn==0.19.1) 61 | 62 | ○ Tensorflow (for CPU) (pip install tensorflow==1.4.0) 63 | 64 | ○ Keras (pip install keras==2.1.3) 65 | 66 | ○ python-crfsuite (pip install python-crfsuite==0.9.5) 67 | 68 | This course has been tested on the following system configuration: 69 | 70 | ● OS: Windows 10 71 | 72 | ● Processor: Quad Core 2.8 Ghz 73 | 74 | ● Memory: 16GB 75 | 76 | ● Hard Disk Space: 3GB 77 | 78 | 79 | 80 | ## Related Products 81 | * [Hands-On Machine Learning with Python and Scikit-Learn [Video]](https://www.packtpub.com/big-data-and-business-intelligence/hands-machine-learning-python-and-scikit-learn-video?utm_source=github&utm_medium=repository&utm_campaign=9781788991056) 82 | 83 | * [Machine Learning with scikit-learn and Tensorflow [Video]](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-scikit-learn-and-tensorflow-video?utm_source=github&utm_medium=repository&utm_campaign=9781788629928) 84 | 85 | * [Kali Linux Advanced Wireless Penetration Testing [Video]](https://www.packtpub.com/networking-and-servers/kali-linux-advanced-wireless-penetration-testing-video?utm_source=github&utm_medium=repository&utm_campaign=9781788832342) 86 | 87 | -------------------------------------------------------------------------------- /Section5/Reviews.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Text-Mining-with-Machine-Learning-and-Python/31fbe17da4e984f9c3b5e6a590ec53df4d0b1c05/Section5/Reviews.rar -------------------------------------------------------------------------------- /Section5/section5_video3_video4_training_visualizing_wordembedding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 0. Imports" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n", 22 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "import gensim\n", 29 | "import spacy\n", 30 | "from tqdm import tqdm" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "tqdm.pandas(desc=\"Progress\")" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "nlp_en = spacy.load(\"en_core_web_md\")" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## 1. 
Train word embeddings" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "#### 1.1 Get data" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "pd_data = pd.read_csv(\"Reviews.csv\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 17, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "
\n", 91 | "\n", 104 | "\n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | "
IdProductIdUserIdProfileNameHelpfulnessNumeratorHelpfulnessDenominatorScoreTimeSummaryTexttokens
01B001E4KFG0A3SGXH7AUHU8GWdelmartian1151303862400Good Quality Dog FoodI have bought several of the Vitality canned d...[I, have, bought, several, of, the, Vitality, ...
12B00813GRG4A1D87F6ZCVE5NKdll pa0011346976000Not as AdvertisedProduct arrived labeled as Jumbo Salted Peanut...[Product, arrived, labeled, as, Jumbo, Salted,...
23B000LQOCH0ABXLMWJIXXAINNatalia Corres \"Natalia Corres\"1141219017600\"Delight\" says it allThis is a confection that has been around a fe...[This, is, a, confection, that, has, been, aro...
\n", 166 | "
" 167 | ], 168 | "text/plain": [ 169 | " Id ProductId UserId ProfileName \\\n", 170 | "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n", 171 | "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n", 172 | "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n", 173 | "\n", 174 | " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n", 175 | "0 1 1 5 1303862400 \n", 176 | "1 0 0 1 1346976000 \n", 177 | "2 1 1 4 1219017600 \n", 178 | "\n", 179 | " Summary Text \\\n", 180 | "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n", 181 | "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n", 182 | "2 \"Delight\" says it all This is a confection that has been around a fe... \n", 183 | "\n", 184 | " tokens \n", 185 | "0 [I, have, bought, several, of, the, Vitality, ... \n", 186 | "1 [Product, arrived, labeled, as, Jumbo, Salted,... \n", 187 | "2 [This, is, a, confection, that, has, been, aro... " 188 | ] 189 | }, 190 | "execution_count": 17, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "pd_data.head(3)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "#### 1.2. Process data" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 6, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "def get_tokens(sentence):\n", 215 | " return [x.text for x in nlp_en(sentence)]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 7, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "name": "stderr", 227 | "output_type": "stream", 228 | "text": [ 229 | "Progress: 100%|██████████████████████████████████████████████████████████████| 568454/568454 [4:14:40<00:00, 37.20it/s]\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "pd_data[\"tokens\"] = pd_data[\"Text\"].progress_apply(get_tokens)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 11, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "pd_data.to_pickle(\"pd_data_tokenized.pickle\")" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 2, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "pd_data = pd.read_pickle(\"pd_data_tokenized.pickle\")" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "#### 1.3. Train word embeddings using word2vec" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 8, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "model_w2v = gensim.models.Word2Vec(pd_data[\"tokens\"].tolist(), min_count=5, window = 9, size = 100)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "#### 1.4. Train word embeddings using fasttext" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 9, 287 | "metadata": { 288 | "collapsed": true 289 | }, 290 | "outputs": [], 291 | "source": [ 292 | "model_ft = gensim.models.FastText(pd_data[\"tokens\"].tolist(), min_count=5, window = 9, size = 100)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "#### 1.5. 
Persistence" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": true 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "model_w2v.save(\"model_w2v.model\")\n", 311 | "model_ft.save(\"model_ft.model\")" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 2, 317 | "metadata": { 318 | "collapsed": true 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "model_w2v = gensim.models.Word2Vec.load(\"model_w2v.model\")\n", 323 | "model_ft = gensim.models.FastText.load(\"model_ft.model\")" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "#### 1.6. Similarity" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 7, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [ 340 | { 341 | "name": "stderr", 342 | "output_type": "stream", 343 | "text": [ 344 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 345 | " if __name__ == '__main__':\n" 346 | ] 347 | }, 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "[('fish', 0.8536328077316284),\n", 352 | " ('tuna', 0.7662709951400757),\n", 353 | " ('chicken', 0.7630202174186707),\n", 354 | " ('seafood', 0.7627329230308533),\n", 355 | " ('turkey', 0.7592297792434692)]" 356 | ] 357 | }, 358 | "execution_count": 7, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "model_w2v.most_similar(\"salmon\", topn=5)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 8, 370 | "metadata": { 371 | "collapsed": false 372 | }, 373 | "outputs": [ 374 | { 375 | "name": "stderr", 376 | "output_type": "stream", 377 | "text": [ 378 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 379 | " if __name__ == '__main__':\n" 380 | ] 381 | }, 382 | { 383 | "data": { 384 | "text/plain": [ 385 | "[('cheddar', 0.7746697068214417),\n", 386 | " ('mozzarella', 0.7572810649871826),\n", 387 | " ('parmesan', 0.7331867218017578),\n", 388 | " ('chedder', 0.7296013236045837),\n", 389 | " ('mayo', 0.7252874374389648)]" 390 | ] 391 | }, 392 | "execution_count": 8, 393 | "metadata": {}, 394 | "output_type": "execute_result" 395 | } 396 | ], 397 | "source": [ 398 | "model_w2v.most_similar(positive=['cheese'], topn=5)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "#### 1.7. 
Correlation" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 1, 411 | "metadata": { 412 | "collapsed": false 413 | }, 414 | "outputs": [ 415 | { 416 | "ename": "NameError", 417 | "evalue": "name 'model_w2v' is not defined", 418 | "output_type": "error", 419 | "traceback": [ 420 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 421 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 422 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmodel_w2v\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmost_similar\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpositive\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'pea'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'salsa'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnegative\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'tomato'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtopn\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 423 | "\u001b[0;31mNameError\u001b[0m: name 'model_w2v' is not defined" 424 | ] 425 | } 426 | ], 427 | "source": [ 428 | "model_w2v.most_similar(positive=['pea', 'salsa'], negative=['tomato'], topn=3)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 14, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [ 438 | { 439 | "name": "stderr", 440 | "output_type": "stream", 441 | "text": [ 442 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 443 | " if __name__ == '__main__':\n" 444 | ] 445 | }, 446 | { 447 | "data": { 448 | "text/plain": [ 449 | "[('tequila', 0.7341920137405396),\n", 450 | " ('lemonade', 0.7284362316131592),\n", 451 | " ('juice', 0.7281173467636108)]" 452 | ] 453 | }, 454 | "execution_count": 14, 455 | "metadata": {}, 456 | "output_type": "execute_result" 457 | } 458 | ], 459 | "source": [ 460 | "model_w2v.most_similar(positive=['lemon', 'water'], topn=3)" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 15, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [ 470 | { 471 | "name": "stderr", 472 | "output_type": "stream", 473 | "text": [ 474 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 475 | " if __name__ == '__main__':\n" 476 | ] 477 | }, 478 | { 479 | "data": { 480 | "text/plain": [ 481 | "[('bread', 0.7283815145492554),\n", 482 | " ('pizza', 0.7018527388572693),\n", 483 | " ('dough', 0.6836484670639038)]" 484 | ] 485 | }, 486 | "execution_count": 15, 487 | "metadata": {}, 488 | "output_type": "execute_result" 489 | } 490 | ], 491 | "source": [ 492 | "model_w2v.most_similar(positive=['salami', 'crust'], topn=3)" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 16, 498 | "metadata": { 499 | "collapsed": false 500 | }, 501 | "outputs": [ 502 | { 503 | "name": "stderr", 504 | "output_type": "stream", 505 | "text": [ 506 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() 
instead).\n", 507 | " if __name__ == '__main__':\n" 508 | ] 509 | }, 510 | { 511 | "data": { 512 | "text/plain": [ 513 | "[('hamburger', 0.814429521560669),\n", 514 | " ('ham', 0.795830488204956),\n", 515 | " ('sausage', 0.7887133359909058)]" 516 | ] 517 | }, 518 | "execution_count": 16, 519 | "metadata": {}, 520 | "output_type": "execute_result" 521 | } 522 | ], 523 | "source": [ 524 | "model_w2v.most_similar(positive=['beef', 'bun'], topn=3)" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": { 530 | "collapsed": true 531 | }, 532 | "source": [ 533 | "## 2. Visualise them" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 33, 539 | "metadata": { 540 | "collapsed": false 541 | }, 542 | "outputs": [], 543 | "source": [ 544 | "from sklearn.manifold import TSNE\n", 545 | "import matplotlib.pyplot as plt\n", 546 | "from bokeh.plotting import figure, output_file, show\n", 547 | "from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 27, 553 | "metadata": { 554 | "collapsed": false 555 | }, 556 | "outputs": [ 557 | { 558 | "name": "stdout", 559 | "output_type": "stream", 560 | "text": [ 561 | "Wall time: 2min 7s\n" 562 | ] 563 | } 564 | ], 565 | "source": [ 566 | "%%time\n", 567 | "model_w2v = gensim.models.Word2Vec(pd_data[\"tokens\"].tolist(), min_count=500, window = 9, size = 100)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 29, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "name": "stderr", 579 | "output_type": "stream", 580 | "text": [ 581 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:5: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n" 582 | ] 583 | } 584 | ], 585 | "source": [ 586 | "tokens = []\n", 587 | "labels = []\n", 588 | "\n", 589 | "for x in model_w2v.wv.vocab:\n", 590 | " tokens.append(model_w2v[x])\n", 591 | " labels.append(x)" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 30, 597 | "metadata": { 598 | "collapsed": false 599 | }, 600 | "outputs": [ 601 | { 602 | "name": "stdout", 603 | "output_type": "stream", 604 | "text": [ 605 | "Wall time: 2min 47s\n" 606 | ] 607 | } 608 | ], 609 | "source": [ 610 | "%%time\n", 611 | "tsne_model = TSNE(n_components=2, random_state=11)\n", 612 | "fitted = tsne_model.fit_transform(tokens)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 34, 618 | "metadata": { 619 | "collapsed": true 620 | }, 621 | "outputs": [], 622 | "source": [ 623 | "output_file(\"plot.html\")\n", 624 | " \n", 625 | "p = figure(plot_width=1000, plot_height=1000)\n", 626 | "\n", 627 | "lst = list(model_w2v.wv.vocab)\n", 628 | "\n", 629 | "\n", 630 | "\n", 631 | "p.circle(fitted[:, 0], fitted[:, 1], size=2, color=\"navy\", alpha=0.5)\n", 632 | "\n", 633 | "texts = lst\n", 634 | "\n", 635 | "\n", 636 | "source = ColumnDataSource(data=dict(x=fitted[:, 0], y=fitted[:, 1], text=texts))\n", 637 | "\n", 638 | "labels = LabelSet(x='x', y='y', text='text',\n", 639 | " x_offset=5, y_offset=5, source=source)\n", 640 | "p.add_layout(labels)\n", 641 | "\n", 642 | "\n", 643 | "\n", 644 | "show(p)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "collapsed": true 652 | }, 653 | "outputs": [], 654 | "source": [] 655 | } 656 | ], 657 | "metadata": { 658 | 
"kernelspec": { 659 | "display_name": "Python 3", 660 | "language": "python", 661 | "name": "python3" 662 | }, 663 | "language_info": { 664 | "codemirror_mode": { 665 | "name": "ipython", 666 | "version": 3 667 | }, 668 | "file_extension": ".py", 669 | "mimetype": "text/x-python", 670 | "name": "python", 671 | "nbconvert_exporter": "python", 672 | "pygments_lexer": "ipython3", 673 | "version": "3.6.0" 674 | } 675 | }, 676 | "nbformat": 4, 677 | "nbformat_minor": 2 678 | } 679 | -------------------------------------------------------------------------------- /section1/video3/section1_video3_install_corpora.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "nltk.download()" 23 | ] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.6.0" 43 | } 44 | }, 45 | "nbformat": 4, 46 | "nbformat_minor": 2 47 | } 48 | -------------------------------------------------------------------------------- /section2/video 2/section_2_video_2_cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Demo data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "sent1 = \"Feeling loved, even when I'm sick🍫☕️💓#likeforlike #chocolate #bf #iloveyou #aftereight #couplegoals\"" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Remove punctuation" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import string" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "translator = str.maketrans('', '', string.punctuation)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Feeling loved even when Im sick🍫☕️💓likeforlike chocolate bf iloveyou aftereight couplegoals\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "sent_pun = sent1.translate(translator)\n", 67 | "print(sent_pun)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Remove unicode" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "import regex" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "Method 1" 93 | ] 
94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 6, 98 | "metadata": { 99 | "collapsed": false 100 | }, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | "Feeling loved even when Im sicklikeforlike chocolate bf iloveyou aftereight couplegoals\n" 107 | ] 108 | } 109 | ], 110 | "source": [ 111 | "sent_pun_uni = sent_pun.encode('ascii', 'ignore').decode(\"utf-8\")\n", 112 | "print(sent_pun_uni)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Method 2" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 7, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "Feeling loved even when Im sick ️ likeforlike chocolate bf iloveyou aftereight couplegoals\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "emoji_pattern = regex.compile(\"\"\"\\p{So}\\p{Sk}*\"\"\")\n", 139 | "sent_pun_uni = emoji_pattern.sub(r' ', sent_pun)\n", 140 | "print(sent_pun_uni)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### Remove URL" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 8, 153 | "metadata": { 154 | "collapsed": true 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "import re" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "subject = \"Omg, check out these fabulous shoes https://thiswebsitedoesntexistsodontbother.com/omgshoesss yes they can be yours\"" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 10, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "Omg, check out these fabulous shoes yes they can be yours\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "result = re.sub(r\"http\\S+\", \"\", subject)\n", 189 | "print(result)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### Remove Stopwords" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 11, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "from nltk.corpus import stopwords\n", 208 | "stop_en = stopwords.words(\"english\")" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": true 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "subject = [\"i\",\"have\",\"a\",\"cat\",\"named\",\"mr\",\"whiskers\",\"he\",\"is\",\"a\",\"very\",\"hungry\",\"cat\"]" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "subject = [x for x in subject if not x in stop_en]\n", 231 | "print(subject)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "collapsed": true 239 | }, 240 | "outputs": [], 241 | "source": [] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python 3", 247 | "language": "python", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 3 254 | }, 255 | 
"file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython3", 260 | "version": "3.6.0" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 2 265 | } 266 | -------------------------------------------------------------------------------- /section2/video 3/section_2_video_3_tokenizing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from pprint import pprint" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Tokenization" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Using NLTK" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import nltk" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "test_sentence = \"It's too cold outside, we'd be better watering our neighbour's plants tomorrow\"" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['It',\n", 61 | " \"'s\",\n", 62 | " 'too',\n", 63 | " 'cold',\n", 64 | " 'outside',\n", 65 | " ',',\n", 66 | " 'we',\n", 67 | " \"'d\",\n", 68 | " 'be',\n", 69 | " 'better',\n", 70 | " 'watering',\n", 71 | " 'our',\n", 72 | " 'neighbour',\n", 73 | " \"'s\",\n", 74 | " 'plants',\n", 75 | " 'tomorrow']" 76 | ] 77 | }, 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "output_type": "execute_result" 81 | } 82 | ], 83 | "source": [ 84 | "nltk.word_tokenize(test_sentence)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### Using Spacy" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 5, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "import spacy\n", 103 | "nlp_en = spacy.load(\"en_core_web_sm\")" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "doc = nlp_en(test_sentence)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 7, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "['It',\n", 128 | " \"'s\",\n", 129 | " 'too',\n", 130 | " 'cold',\n", 131 | " 'outside',\n", 132 | " ',',\n", 133 | " 'we',\n", 134 | " \"'d\",\n", 135 | " 'be',\n", 136 | " 'better',\n", 137 | " 'watering',\n", 138 | " 'our',\n", 139 | " 'neighbour',\n", 140 | " \"'s\",\n", 141 | " 'plants',\n", 142 | " 'tomorrow']" 143 | ] 144 | }, 145 | "execution_count": 7, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "[x.text for x in doc]" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "# POS tagging" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### Using NLTK" 166 | ] 167 | }, 168 | { 169 | "cell_type": 
"code", 170 | "execution_count": 8, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "data": { 177 | "text/plain": [ 178 | "[('It', 'PRP'),\n", 179 | " (\"'s\", 'VBZ'),\n", 180 | " ('too', 'RB'),\n", 181 | " ('cold', 'JJ'),\n", 182 | " ('outside', 'JJ'),\n", 183 | " (',', ','),\n", 184 | " ('we', 'PRP'),\n", 185 | " (\"'d\", 'MD'),\n", 186 | " ('be', 'VB'),\n", 187 | " ('better', 'RB'),\n", 188 | " ('watering', 'VBG'),\n", 189 | " ('our', 'PRP$'),\n", 190 | " ('neighbour', 'NN'),\n", 191 | " (\"'s\", 'POS'),\n", 192 | " ('plants', 'NNS'),\n", 193 | " ('tomorrow', 'NN')]" 194 | ] 195 | }, 196 | "execution_count": 8, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "tokens = nltk.word_tokenize(test_sentence)\n", 203 | "nltk.pos_tag(tokens)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### Using Spacy" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "[('It', 'PRON'),\n", 224 | " (\"'s\", 'VERB'),\n", 225 | " ('too', 'ADV'),\n", 226 | " ('cold', 'ADJ'),\n", 227 | " ('outside', 'ADV'),\n", 228 | " (',', 'PUNCT'),\n", 229 | " ('we', 'PRON'),\n", 230 | " (\"'d\", 'VERB'),\n", 231 | " ('be', 'VERB'),\n", 232 | " ('better', 'ADJ'),\n", 233 | " ('watering', 'VERB'),\n", 234 | " ('our', 'ADJ'),\n", 235 | " ('neighbour', 'NOUN'),\n", 236 | " (\"'s\", 'PART'),\n", 237 | " ('plants', 'NOUN'),\n", 238 | " ('tomorrow', 'NOUN')]" 239 | ] 240 | }, 241 | "execution_count": 9, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "[(x.text, x.pos_) for x in doc]" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "# Lemmatization" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Using NLTK" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 10, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "import nltk\n", 273 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 274 | "from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 11, 280 | "metadata": { 281 | "collapsed": true 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "lemmatizer = WordNetLemmatizer()" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 12, 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "tokens = nltk.word_tokenize(test_sentence)\n", 297 | "tags = nltk.pos_tag(tokens)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 13, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "It\n", 312 | "'s\n", 313 | "too\n", 314 | "cold\n", 315 | "outside\n", 316 | ",\n", 317 | "we\n", 318 | "'d\n", 319 | "be\n", 320 | "better\n", 321 | "water\n", 322 | "our\n", 323 | "neighbour\n", 324 | "'s\n", 325 | "plant\n", 326 | "tomorrow\n" 327 | ] 328 | } 329 | ], 330 | "source": [ 331 | "for i, token in enumerate(tokens):\n", 332 | " pos_tag = tags[i][1]\n", 333 | "\n", 334 | " if pos_tag.startswith(\"N\"):\n", 335 | " lemma = 
lemmatizer.lemmatize(token, pos=NOUN)\n", 336 | " elif pos_tag.startswith(\"V\"):\n", 337 | " lemma = lemmatizer.lemmatize(token, pos=VERB)\n", 338 | " elif pos_tag.startswith(\"J\"):\n", 339 | " lemma = lemmatizer.lemmatize(token, pos=ADJ)\n", 340 | " else:\n", 341 | " lemma = token\n", 342 | " \n", 343 | " print(lemma)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "### Using Spacy" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 14, 356 | "metadata": { 357 | "collapsed": false 358 | }, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "['-PRON-',\n", 364 | " 'have',\n", 365 | " 'too',\n", 366 | " 'cold',\n", 367 | " 'outside',\n", 368 | " ',',\n", 369 | " '-PRON-',\n", 370 | " 'would',\n", 371 | " 'be',\n", 372 | " 'well',\n", 373 | " 'water',\n", 374 | " '-PRON-',\n", 375 | " 'neighbour',\n", 376 | " 'have',\n", 377 | " 'plant',\n", 378 | " 'tomorrow']" 379 | ] 380 | }, 381 | "execution_count": 14, 382 | "metadata": {}, 383 | "output_type": "execute_result" 384 | } 385 | ], 386 | "source": [ 387 | "[x.lemma_ for x in doc]" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": { 394 | "collapsed": true 395 | }, 396 | "outputs": [], 397 | "source": [] 398 | } 399 | ], 400 | "metadata": { 401 | "kernelspec": { 402 | "display_name": "Python 3", 403 | "language": "python", 404 | "name": "python3" 405 | }, 406 | "language_info": { 407 | "codemirror_mode": { 408 | "name": "ipython", 409 | "version": 3 410 | }, 411 | "file_extension": ".py", 412 | "mimetype": "text/x-python", 413 | "name": "python", 414 | "nbconvert_exporter": "python", 415 | "pygments_lexer": "ipython3", 416 | "version": "3.6.0" 417 | } 418 | }, 419 | "nbformat": 4, 420 | "nbformat_minor": 2 421 | } 422 | -------------------------------------------------------------------------------- /section2/video 4/section_2_video_4_ngrams.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# No intelligence" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from nltk import ngrams" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "name": "stdout", 30 | "output_type": "stream", 31 | "text": [ 32 | "('oh', 'my')\n", 33 | "('my', 'god,')\n", 34 | "('god,', 'the')\n", 35 | "('the', 'chocolate')\n", 36 | "('chocolate', 'is')\n", 37 | "('is', '15%')\n", 38 | "('15%', 'of')\n", 39 | "('of', 'today')\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "sentence = \"oh my god, the chocolate is 15% of today\"\n", 45 | "n = 2\n", 46 | "bigrams = ngrams(sentence.split(), n)\n", 47 | "for grams in bigrams:\n", 48 | " print( grams)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "# Some intelligence" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 3, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "from nltk import RegexpParser" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "chunkbiGram = r\"\"\"NA: { }\n", 78 | " AN: { 
}\n", 79 | " NN: { }\n", 80 | " \"\"\"" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 5, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "chunkparserbigram = RegexpParser(chunkbiGram)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 6, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "example = [('oh', 'INTJ'),\n", 103 | " ('my', 'INTJ'),\n", 104 | " ('god', 'INTJ'),\n", 105 | " (',', 'PUNCT'),\n", 106 | " ('the', 'DET'),\n", 107 | " ('dark', 'ADJ'),\n", 108 | " ('chocolate', 'NOUN'),\n", 109 | " ('is', 'VERB'),\n", 110 | " ('15', 'NUM'),\n", 111 | " ('%', 'NOUN'),\n", 112 | " ('of', 'ADP'),\n", 113 | " ('today', 'NOUN')]" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 7, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "chunked = chunkparserbigram.parse(example)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 8, 130 | "metadata": { 131 | "collapsed": false, 132 | "scrolled": true 133 | }, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "found adjective + noun\n", 140 | "['dark', 'chocolate']\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "for subtree in chunked.subtrees():\n", 146 | " if subtree.label() == 'NA':\n", 147 | " print('found noun + adjective')\n", 148 | " print([leaf[0] for leaf in subtree.leaves()])\n", 149 | " elif subtree.label() == 'AN':\n", 150 | " print('found adjective + noun')\n", 151 | " print([leaf[0] for leaf in subtree.leaves()])\n", 152 | " elif subtree.label() == 'NN':\n", 153 | " print('found noun + noun')\n", 154 | " print([leaf[0] for leaf in subtree.leaves()])" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "# Intelligence: statistical approach" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Using NLTK" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 9, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "import itertools\n", 180 | "from nltk.corpus import genesis\n", 181 | "from nltk.collocations import BigramCollocationFinder\n", 182 | "from nltk.metrics import BigramAssocMeasures" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 10, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):\n", 194 | " bigram_finder = BigramCollocationFinder.from_words(words)\n", 195 | " bigrams = bigram_finder.nbest(score_fn, n)\n", 196 | " return bigrams" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 11, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "bigrams = bigram_word_feats(genesis.words('english-web.txt'), n=25)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 12, 213 | "metadata": { 214 | "collapsed": false, 215 | "scrolled": true 216 | }, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "[('Allon', 'Bacuth'),\n", 222 | " ('Ashteroth', 'Karnaim'),\n", 223 | " ('Baal', 'Hanan'),\n", 224 | " ('Beer', 'Lahai'),\n", 225 | " ('Ben', 'Ammi'),\n", 226 | " ('En', 'Mishpat'),\n", 227 | " ('Jegar', 
'Sahadutha'),\n", 228 | " ('Kiriath', 'Arba'),\n", 229 | " ('Lahai', 'Roi'),\n", 230 | " ('Most', 'High'),\n", 231 | " ('Salt', 'Sea'),\n", 232 | " ('Whoever', 'sheds'),\n", 233 | " ('appoint', 'overseers'),\n", 234 | " ('aromatic', 'resin'),\n", 235 | " ('cutting', 'instrument'),\n", 236 | " ('direct', 'descendants'),\n", 237 | " ('droves', 'apart'),\n", 238 | " ('during', 'mating'),\n", 239 | " ('falls', 'backward'),\n", 240 | " ('fig', 'leaves'),\n", 241 | " ('flaming', 'torch'),\n", 242 | " ('fresh', 'poplar'),\n", 243 | " ('fully', 'pay'),\n", 244 | " ('fury', 'turns'),\n", 245 | " ('gray', 'hairs')]" 246 | ] 247 | }, 248 | "execution_count": 12, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "bigrams" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Using Spacy" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 13, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "from nltk.corpus import inaugural" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 14, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [ 282 | { 283 | "name": "stderr", 284 | "output_type": "stream", 285 | "text": [ 286 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n", 287 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n" 288 | ] 289 | } 290 | ], 291 | "source": [ 292 | "from gensim.models.phrases import Phraser, Phrases" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 15, 298 | "metadata": { 299 | "collapsed": true 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "all_words = [inaugural.words(x) for x in inaugural.fileids()]" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 16, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "phrases = Phrases(all_words, min_count= 100, threshold= 10)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 17, 320 | "metadata": { 321 | "collapsed": true 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "bigram = Phraser(phrases)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 18, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "['Finest', 'people', 'of', 'the', 'United_States']" 339 | ] 340 | }, 341 | "execution_count": 18, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "bigram[[\"Finest\",\"people\",\"of\",\"the\",\"United\",\"States\"]]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [] 358 | } 359 | ], 360 | "metadata": { 361 | "kernelspec": { 362 | "display_name": "Python 3", 363 | "language": "python", 364 | "name": "python3" 365 | }, 366 | "language_info": { 367 | "codemirror_mode": { 368 | "name": "ipython", 369 | "version": 3 370 | }, 371 | "file_extension": ".py", 372 | "mimetype": "text/x-python", 373 | "name": "python", 374 | "nbconvert_exporter": "python", 375 | "pygments_lexer": "ipython3", 376 | "version": "3.6.0" 377 | } 378 | }, 379 | "nbformat": 4, 380 | "nbformat_minor": 2 381 | } 382 | 
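 The ngrams notebook above ends by using gensim's `Phrases`/`Phraser` to merge frequent word pairs such as "United States" into a single `United_States` token. As a minimal sketch, not part of the course files and assuming the gensim 3.x API pinned in the README, the same transform can be stacked to promote frequent trigrams as well; the corpus and the `min_count`/`threshold` values below are illustrative only. ```python from nltk.corpus import inaugural from gensim.models.phrases import Phrases, Phraser # First pass: learn frequent bigrams, as in the notebook above. sentences = [inaugural.words(f) for f in inaugural.fileids()] bigram = Phraser(Phrases(sentences, min_count=100, threshold=10)) # Second pass: run Phrases over the bigrammed corpus, so that a frequent # (bigram, word) pair can be merged into a trigram token. trigram = Phraser(Phrases(bigram[sentences], min_count=50, threshold=10)) tokens = ["the", "people", "of", "the", "United", "States", "of", "America"] # e.g. [..., 'United_States', 'of', 'America']; longer merges only appear # if the combination is frequent enough in the corpus. print(trigram[bigram[tokens]]) ``` Stacking passes this way is the usual pattern, since each `Phrases` model only ever joins two adjacent tokens at a time. 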
-------------------------------------------------------------------------------- /section3/ner_dataset.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Text-Mining-with-Machine-Learning-and-Python/31fbe17da4e984f9c3b5e6a590ec53df4d0b1c05/section3/ner_dataset.rar -------------------------------------------------------------------------------- /section3/section3_video3_pretrained_models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from pprint import pprint" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Using NLTK" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "from nltk import word_tokenize, pos_tag, ne_chunk" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### All good!" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "(S\n", 51 | " (PERSON Mark/NNP)\n", 52 | " is/VBZ\n", 53 | " working/VBG\n", 54 | " at/IN\n", 55 | " the/DT\n", 56 | " (LOCATION South/NNP Africa/NNP)\n", 57 | " offices/NNS\n", 58 | " at/IN\n", 59 | " (ORGANIZATION Google/NNP))\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "sentence = \"Mark is working at the South Africa offices at Google\"\n", 65 | "ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))\n", 66 | "print(ne_tree)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Not so good..." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "(S\n", 88 | " (GPE Donald/NNP)\n", 89 | " is/VBZ\n", 90 | " working/VBG\n", 91 | " at/IN\n", 92 | " the/DT\n", 93 | " (GPE Netherlands/NNP)\n", 94 | " offices/NNS\n", 95 | " of/IN\n", 96 | " (GPE Google/NNP))\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "sentence = \"Donald is working at the Netherlands offices of Google\"\n", 102 | "print(ne_chunk(pos_tag(word_tokenize(sentence))))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Include BILOU / IOB" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 5, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "from nltk.chunk import conlltags2tree, tree2conlltags" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 6, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "[('Mark', 'NNP', 'B-PERSON'),\n", 135 | " ('is', 'VBZ', 'O'),\n", 136 | " ('working', 'VBG', 'O'),\n", 137 | " ('at', 'IN', 'O'),\n", 138 | " ('the', 'DT', 'O'),\n", 139 | " ('South', 'NNP', 'B-LOCATION'),\n", 140 | " ('Africa', 'NNP', 'I-LOCATION'),\n", 141 | " ('offices', 'NNS', 'O'),\n", 142 | " ('at', 'IN', 'O'),\n", 143 | " ('Google', 'NNP', 'B-ORGANIZATION')]\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "iob_tagged = tree2conlltags(ne_tree)\n", 149 | "pprint (iob_tagged)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "# Using Spacy" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "import spacy\n", 168 | "nlp = spacy.load(\"en_core_web_md\")" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 8, 174 | "metadata": { 175 | "collapsed": false 176 | }, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "[('Mark', 'PERSON'), ('South Africa', 'GPE'), ('Google', 'ORG')]\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "doc = nlp(\"Mark is working at the South Africa offices at Google\")\n", 188 | "pprint([(x.text, x.label_) for x in doc.ents])" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 9, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "[('Donald', 'PERSON'), ('Netherlands', 'GPE'), ('Google', 'ORG')]\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "doc = nlp(\"Donald is working at the Netherlands offices of Google\")\n", 208 | "pprint([(x.text, x.label_) for x in doc.ents])" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### BILOU tags" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 10, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "[(Donald, 'B', 'PERSON'),\n", 230 | " (is, 'O', ''),\n", 231 | " (working, 'O', ''),\n", 232 | " (at, 'O', ''),\n", 233 | " (the, 'O', ''),\n", 234 | " 
(Netherlands, 'B', 'GPE'),\n", 235 | " (offices, 'O', ''),\n", 236 | " (of, 'O', ''),\n", 237 | " (Google, 'B', 'ORG')]\n" 238 | ] 239 | } 240 | ], 241 | "source": [ 242 | "pprint([(x, x.ent_iob_, x.ent_type_) for x in doc])" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": true 250 | }, 251 | "outputs": [], 252 | "source": [] 253 | } 254 | ], 255 | "metadata": { 256 | "kernelspec": { 257 | "display_name": "Python 3", 258 | "language": "python", 259 | "name": "python3" 260 | }, 261 | "language_info": { 262 | "codemirror_mode": { 263 | "name": "ipython", 264 | "version": 3 265 | }, 266 | "file_extension": ".py", 267 | "mimetype": "text/x-python", 268 | "name": "python", 269 | "nbconvert_exporter": "python", 270 | "pygments_lexer": "ipython3", 271 | "version": "3.6.0" 272 | } 273 | }, 274 | "nbformat": 4, 275 | "nbformat_minor": 2 276 | } 277 | -------------------------------------------------------------------------------- /section3/section3_video4_training_ner.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "from pprint import pprint\n", 13 | "import random\n", 14 | "\n", 15 | "import spacy\n", 16 | "from spacy.gold import GoldParse\n", 17 | "\n", 18 | "import nltk\n", 19 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 20 | "from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ\n", 21 | "\n", 22 | "import pycrfsuite" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "# Let's detect natural disasters!" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## 0. Get the data" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/html": [ 56 | "
\n", 57 | "\n", 70 | "\n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | "
Sentence #WordPOSTag
0Sentence: 1ThousandsNNSO
1NaNofINO
2NaNdemonstratorsNNSO
3NaNhaveVBPO
4NaNmarchedVBNO
5NaNthroughINO
6NaNLondonNNPB-geo
7NaNtoTOO
8NaNprotestVBO
9NaNtheDTO
\n", 153 | "
" 154 | ], 155 | "text/plain": [ 156 | " Sentence # Word POS Tag\n", 157 | "0 Sentence: 1 Thousands NNS O\n", 158 | "1 NaN of IN O\n", 159 | "2 NaN demonstrators NNS O\n", 160 | "3 NaN have VBP O\n", 161 | "4 NaN marched VBN O\n", 162 | "5 NaN through IN O\n", 163 | "6 NaN London NNP B-geo\n", 164 | "7 NaN to TO O\n", 165 | "8 NaN protest VB O\n", 166 | "9 NaN the DT O" 167 | ] 168 | }, 169 | "execution_count": 2, 170 | "metadata": {}, 171 | "output_type": "execute_result" 172 | } 173 | ], 174 | "source": [ 175 | "df_dataset = pd.read_csv(\"ner_dataset.csv\", encoding=\"latin1\")\n", 176 | "df_dataset.head(10)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 3, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "last_sent_id= 0\n", 188 | "for i, row in df_dataset.iterrows(): \n", 189 | " if not pd.isnull(row[\"Sentence #\"]):\n", 190 | " last_sent_id = int(row[\"Sentence #\"][10:])\n", 191 | " row[\"Sentence #\"] = last_sent_id\n", 192 | " else:\n", 193 | " row[\"Sentence #\"] = last_sent_id" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "### Find those with 'nat' tag:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 4, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "Sentence # object\n", 214 | "Word object\n", 215 | "POS object\n", 216 | "Tag object\n", 217 | "dtype: object" 218 | ] 219 | }, 220 | "execution_count": 4, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "df_dataset.dtypes" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": { 233 | "collapsed": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "sent_id = df_dataset[df_dataset[\"Tag\"].str.contains(\"nat\")][\"Sentence #\"].unique()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 6, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "df_dataset_nat = df_dataset[df_dataset[\"Sentence #\"].isin(sent_id)]" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "### Remap tags" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "lst_tags = df_dataset_nat[\"Tag\"].unique().tolist()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 9, 272 | "metadata": { 273 | "collapsed": false 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "lst_tags.remove(\"I-nat\")\n", 278 | "lst_tags.remove(\"B-nat\")" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 10, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "dict_tags = {}\n", 290 | "for i in lst_tags:\n", 291 | " dict_tags[i] = \"O\"\n", 292 | " \n", 293 | "dict_tags[\"I-nat\"] = \"I-NAT\"\n", 294 | "dict_tags[\"B-nat\"] = \"B-NAT\"" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 11, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [ 304 | { 305 | "name": "stderr", 306 | "output_type": "stream", 307 | "text": [ 308 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: SettingWithCopyWarning: \n", 309 | "A value is trying to 
be set on a copy of a slice from a DataFrame.\n", 310 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 311 | "\n", 312 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 313 | " if __name__ == '__main__':\n" 314 | ] 315 | } 316 | ], 317 | "source": [ 318 | "df_dataset_nat[\"Tag remapped\"] = df_dataset_nat[\"Tag\"].map(dict_tags)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 12, 324 | "metadata": { 325 | "collapsed": false 326 | }, 327 | "outputs": [ 328 | { 329 | "data": { 330 | "text/plain": [ 331 | "['O', 'B-NAT', 'I-NAT']" 332 | ] 333 | }, 334 | "execution_count": 12, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "df_dataset_nat[\"Tag remapped\"].unique().tolist()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "## 1. Using Spacy" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 13, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "LABEL = 'NAT'\n", 359 | "MAX_ITERATIONS = 50" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Training format" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 14, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "def join_space(values):\n", 378 | " return \" \".join(values).strip()" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 15, 384 | "metadata": { 385 | "collapsed": true 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "df_sentences_1 = df_dataset_nat.groupby(\"Sentence #\")[\"Word\"].apply(list).reset_index()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 16, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "df_sentences_2 = df_dataset_nat.groupby(\"Sentence #\")[\"Tag remapped\"].apply(list).reset_index()" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 17, 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "df_sentences = pd.merge(left=df_sentences_1, right = df_sentences_2, on = \"Sentence #\")" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 18, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [ 421 | { 422 | "data": { 423 | "text/html": [ 424 | "
\n", 425 | "\n", 438 | "\n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | "
Sentence #WordTag remapped
0121[Officials, say, the, 27-year, old, man, from,...[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
1206[Humans, are, usually, infected, with, bird, f...[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
2227[One, of, the, 2008, Olympic, mascots, is, mod...[O, O, O, O, O, O, O, O, O, O, O, O, B-NAT, I-...
3229[Sam, Beattie, reports, from, Jing, Jing, 's, ...[O, O, O, O, B-NAT, I-NAT, O, O, O, O, O, O]
\n", 474 | "
" 475 | ], 476 | "text/plain": [ 477 | " Sentence # Word \\\n", 478 | "0 121 [Officials, say, the, 27-year, old, man, from,... \n", 479 | "1 206 [Humans, are, usually, infected, with, bird, f... \n", 480 | "2 227 [One, of, the, 2008, Olympic, mascots, is, mod... \n", 481 | "3 229 [Sam, Beattie, reports, from, Jing, Jing, 's, ... \n", 482 | "\n", 483 | " Tag remapped \n", 484 | "0 [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... \n", 485 | "1 [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... \n", 486 | "2 [O, O, O, O, O, O, O, O, O, O, O, O, B-NAT, I-... \n", 487 | "3 [O, O, O, O, B-NAT, I-NAT, O, O, O, O, O, O] " 488 | ] 489 | }, 490 | "execution_count": 18, 491 | "metadata": {}, 492 | "output_type": "execute_result" 493 | } 494 | ], 495 | "source": [ 496 | "df_sentences.head(4)" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 19, 502 | "metadata": { 503 | "collapsed": true 504 | }, 505 | "outputs": [], 506 | "source": [ 507 | "train_data = []\n", 508 | "\n", 509 | "for i, row in df_sentences.iterrows():\n", 510 | " raw_sent = \" \".join(row[\"Word\"]).replace(\" ,\", \",\")\n", 511 | " \n", 512 | " tags = list(zip(row[\"Word\"],row[\"Tag remapped\"]))\n", 513 | " advance = 0\n", 514 | "\n", 515 | " new_ents = []\n", 516 | "\n", 517 | " for i in range(len(tags)):\n", 518 | " tag = tags[i]\n", 519 | "\n", 520 | " word = tag[0]\n", 521 | " ent = tag[1]\n", 522 | "\n", 523 | " ent = ent.replace(\"B-\", \"\")\n", 524 | " ent = ent.replace(\"I-\", \"\")\n", 525 | " ent = ent.replace(\"L-\", \"\")\n", 526 | " ent = ent.replace(\"O-\", \"\")\n", 527 | " ent = ent.replace(\"U-\", \"\")\n", 528 | "\n", 529 | " ent_range = [advance, advance + len(word), ent]\n", 530 | "\n", 531 | " advance += len(word)\n", 532 | " if i < (len(tags) - 1):\n", 533 | " if tags[i + 1][0] != ',':\n", 534 | " advance += 1\n", 535 | "\n", 536 | " if not ent_range[2] == \"O\":\n", 537 | " new_ents.append(ent_range)\n", 538 | "\n", 539 | " new_ents_merged = []\n", 540 | "\n", 541 | " for j in range(len(new_ents)):\n", 542 | " if len(new_ents_merged) == 0:\n", 543 | " new_ents_merged.append(new_ents[j])\n", 544 | "\n", 545 | " if new_ents_merged[-1][2] == new_ents[j][2]:\n", 546 | " new_ents_merged[-1][1] = new_ents[j][1]\n", 547 | " else:\n", 548 | " new_ents_merged.append(new_ents[j])\n", 549 | "\n", 550 | " new_ents_merged_tuples = [tuple(item) for item in new_ents_merged]\n", 551 | " train_data.append((raw_sent, {\"entities\": new_ents_merged_tuples}))" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 20, 557 | "metadata": { 558 | "collapsed": false 559 | }, 560 | "outputs": [ 561 | { 562 | "name": "stdout", 563 | "output_type": "stream", 564 | "text": [ 565 | "[(\"Officials say the 27-year old man from Vietnam 's northern Ninh Binh \"\n", 566 | " 'province died late Thursday and tested positive for the H5N1 strain of bird '\n", 567 | " 'flu .',\n", 568 | " {'entities': [(125, 129, 'NAT')]}),\n", 569 | " ('Humans are usually infected with bird flu by direct contact with infected '\n", 570 | " 'poultry, but experts fear the H5N1 virus may mutate into a form easily '\n", 571 | " 'transmitted between people .',\n", 572 | " {'entities': [(104, 108, 'NAT')]})]\n" 573 | ] 574 | } 575 | ], 576 | "source": [ 577 | "pprint(train_data[:2])" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "### Split" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 21, 590 | "metadata": { 591 | "collapsed": true 
592 | }, 593 | "outputs": [], 594 | "source": [ 595 | "test_data = train_data[155:]\n", 596 | "train_data = train_data[:155]" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "### Train" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 22, 609 | "metadata": { 610 | "collapsed": false 611 | }, 612 | "outputs": [], 613 | "source": [ 614 | "nlp = spacy.load(\"en_core_web_sm\")" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 23, 620 | "metadata": { 621 | "collapsed": false 622 | }, 623 | "outputs": [], 624 | "source": [ 625 | "if 'ner' not in nlp.pipe_names:\n", 626 | " ner = nlp.create_pipe('ner')\n", 627 | " nlp.add_pipe(ner)\n", 628 | "\n", 629 | "else:\n", 630 | " ner = nlp.get_pipe('ner')" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 24, 636 | "metadata": { 637 | "collapsed": true 638 | }, 639 | "outputs": [], 640 | "source": [ 641 | "ner.add_label(LABEL)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 25, 647 | "metadata": { 648 | "collapsed": true 649 | }, 650 | "outputs": [], 651 | "source": [ 652 | "other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 26, 658 | "metadata": { 659 | "collapsed": false 660 | }, 661 | "outputs": [ 662 | { 663 | "name": "stdout", 664 | "output_type": "stream", 665 | "text": [ 666 | "{'ner': 1151.5631708588069}\n", 667 | "{'ner': 970.3055948485708}\n", 668 | "Wall time: 1min 32s\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "%%time\n", 674 | "with nlp.disable_pipes(*other_pipes): # only train NER\n", 675 | " optimizer = nlp.begin_training()\n", 676 | " for itn in range(2):\n", 677 | " random.shuffle(train_data)\n", 678 | " losses = {}\n", 679 | " for text, annotations in train_data:\n", 680 | " nlp.update([text], [annotations], sgd=optimizer, drop=0.35,losses=losses)\n", 681 | " print(losses)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 27, 687 | "metadata": { 688 | "collapsed": true 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "nlp.meta['name'] = \"en_core_web_sm_newlabel\"" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": 28, 698 | "metadata": { 699 | "collapsed": true 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "nlp.to_disk(\"models\")" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "### Test it out" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": 29, 716 | "metadata": { 717 | "collapsed": false 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "nlp2 = spacy.load(\"models\")" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": 30, 727 | "metadata": { 728 | "collapsed": false 729 | }, 730 | "outputs": [], 731 | "source": [ 732 | "y_true = []\n", 733 | "\n", 734 | "for i, test in enumerate(test_data):\n", 735 | " y_true.append([test[0][j[0]:j[1]] for j in test[1][\"entities\"]])" 736 | ] 737 | }, 738 | { 739 | "cell_type": "code", 740 | "execution_count": 31, 741 | "metadata": { 742 | "collapsed": true 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "y_predict = []\n", 747 | "for test in test_data:\n", 748 | " doc = nlp2(test[0])\n", 749 | " y_predict.append([ent.text for ent in doc.ents if ent.label_ == \"NAT\"])" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 32, 
755 | "metadata": { 756 | "collapsed": false 757 | }, 758 | "outputs": [ 759 | { 760 | "data": { 761 | "text/plain": [ 762 | "['Rita']" 763 | ] 764 | }, 765 | "execution_count": 32, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "y_predict[0]" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 33, 777 | "metadata": { 778 | "collapsed": true 779 | }, 780 | "outputs": [], 781 | "source": [ 782 | "def evaluate(y_predict, y_true):\n", 783 | " correct = 0\n", 784 | " for j, val in enumerate(y_predict):\n", 785 | " if val == y_true[j]:\n", 786 | " correct += 1\n", 787 | " \n", 788 | " return correct / len(y_predict)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 34, 794 | "metadata": { 795 | "collapsed": false 796 | }, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/plain": [ 801 | "0.64" 802 | ] 803 | }, 804 | "execution_count": 34, 805 | "metadata": {}, 806 | "output_type": "execute_result" 807 | } 808 | ], 809 | "source": [ 810 | "evaluate(y_predict=y_predict, y_true=y_true)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "## 3. Using PyCRF" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "### Training format" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 35, 830 | "metadata": { 831 | "collapsed": true 832 | }, 833 | "outputs": [], 834 | "source": [ 835 | "lemmatizer = WordNetLemmatizer()" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 36, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [], 845 | "source": [ 846 | "train_data = []\n", 847 | "for index, row in df_sentences.iterrows():\n", 848 | " \n", 849 | " train_data_sentence = []\n", 850 | " \n", 851 | " raw_sent = row[\"Word\"]\n", 852 | " tokens = nltk.pos_tag(raw_sent)\n", 853 | "\n", 854 | " for i, val in enumerate(tokens):\n", 855 | " train_data_word = []\n", 856 | " \n", 857 | " word = raw_sent[i]\n", 858 | " label = row[\"Tag remapped\"][i]\n", 859 | " pos_tag = tokens[i][1]\n", 860 | "\n", 861 | " if pos_tag.startswith(\"N\"):\n", 862 | " lemma = lemmatizer.lemmatize(word.lower(), pos=NOUN)\n", 863 | " elif pos_tag.startswith(\"V\"):\n", 864 | " lemma = lemmatizer.lemmatize(word.lower(), pos=VERB)\n", 865 | " elif pos_tag.startswith(\"J\"):\n", 866 | " lemma = lemmatizer.lemmatize(word.lower(), pos=ADJ)\n", 867 | " else:\n", 868 | " lemma = word\n", 869 | " \n", 870 | " train_data_word.append(word)\n", 871 | " train_data_word.append(pos_tag)\n", 872 | " train_data_word.append(lemma)\n", 873 | " train_data_word.append(label)\n", 874 | " \n", 875 | " train_data_sentence.append(train_data_word)\n", 876 | " \n", 877 | " train_data.append(train_data_sentence)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "### Feature engineering" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 37, 890 | "metadata": { 891 | "collapsed": true 892 | }, 893 | "outputs": [], 894 | "source": [ 895 | "def word2features(sent, i, embed={}, use_gazetteers=False):\n", 896 | " word = sent[i][0]\n", 897 | " postag = sent[i][-3]\n", 898 | " lemma = sent[i][-2].lower()\n", 899 | " features = [\n", 900 | " 'bias',\n", 901 | " 'word.lower=' + word.lower(),\n", 902 | " 'word[-3:]=' + word[-3:],\n", 903 | " 'word[-2:]=' + word[-2:],\n", 904 | " 'word.isupper=%s' % 
word.isupper(),\n", 905 | " 'word.istitle=%s' % word.istitle(),\n", 906 | " 'word.isdigit=%s' % word.isdigit(),\n", 907 | " 'postag=' + postag,\n", 908 | " 'postag[:2]=' + postag[:2]\n", 909 | " ]\n", 910 | " if embed != {}:\n", 911 | " features.extend(['word.embed=%s' % embed.get(word, len(embed))])\n", 912 | " if use_gazetteers:\n", 913 | " features.extend(['word.measures=%s' % str(word.lower() in UNIT_GAZETTEER or lemma in UNIT_GAZETTEER),\n", 914 | " 'word.products=%s' % str(word.lower() in PRODUCTS_GAZETTEER or lemma in PRODUCTS_GAZETTEER)])\n", 915 | "\n", 916 | " if i > 0:\n", 917 | " word1 = sent[i - 1][0]\n", 918 | " postag1 = sent[i - 1][-3]\n", 919 | " lemma1 = sent[i - 1][-2].lower()\n", 920 | " features.extend([\n", 921 | " '-1:word.lower=' + word1.lower(),\n", 922 | " '-1:word.istitle=%s' % word1.istitle(),\n", 923 | " '-1:word.isupper=%s' % word1.isupper(),\n", 924 | " '-1:postag=' + postag1,\n", 925 | " '-1:postag[:2]=' + postag1[:2]\n", 926 | " ])\n", 927 | " if embed != {}:\n", 928 | " features.extend(['-1:word.embed=%s' % embed.get(word1, len(embed))])\n", 929 | " if use_gazetteers:\n", 930 | " features.extend(['-1:word.measures=%s' % str(word1.lower() in UNIT_GAZETTEER or lemma1 in UNIT_GAZETTEER),\n", 931 | " '-1:word.products=%s' % str(word1.lower() in PRODUCTS_GAZETTEER or lemma1 in PRODUCTS_GAZETTEER)])\n", 932 | "\n", 933 | " else:\n", 934 | " features.append('BOS')\n", 935 | "\n", 936 | " if i < len(sent) - 1:\n", 937 | " word1 = sent[i + 1][0]\n", 938 | " postag1 = sent[i + 1][-3]\n", 939 | " lemma1 = sent[i + 1][-2].lower()\n", 940 | " features.extend([\n", 941 | " '+1:word.lower=' + word1.lower(),\n", 942 | " '+1:word.istitle=%s' % word1.istitle(),\n", 943 | " '+1:word.isupper=%s' % word1.isupper(),\n", 944 | " '+1:postag=' + postag1,\n", 945 | " '+1:postag[:2]=' + postag1[:2]\n", 946 | " ])\n", 947 | " if use_gazetteers:\n", 948 | " features.extend(['+1:word.measures=%s' % str(word1.lower() in UNIT_GAZETTEER or lemma1 in UNIT_GAZETTEER),\n", 949 | " '+1:word.products=%s' % str(word1.lower() in PRODUCTS_GAZETTEER or lemma1 in PRODUCTS_GAZETTEER)])\n", 950 | "\n", 951 | " else:\n", 952 | " features.append('EOS')\n", 953 | "\n", 954 | " return features" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 38, 960 | "metadata": { 961 | "collapsed": true 962 | }, 963 | "outputs": [], 964 | "source": [ 965 | "def sent2features(sent, embed={}, use_gazetteers=False):\n", 966 | "\n", 967 | " return [word2features(sent, i, embed=embed, use_gazetteers=use_gazetteers) for i in range(len(sent))]" 968 | ] 969 | }, 970 | { 971 | "cell_type": "code", 972 | "execution_count": 39, 973 | "metadata": { 974 | "collapsed": true 975 | }, 976 | "outputs": [], 977 | "source": [ 978 | "train_data_formatted = [sent2features(x) for x in train_data]" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "### Labels" 986 | ] 987 | }, 988 | { 989 | "cell_type": "code", 990 | "execution_count": 40, 991 | "metadata": { 992 | "collapsed": true 993 | }, 994 | "outputs": [], 995 | "source": [ 996 | "y_data = df_sentences[\"Tag remapped\"].tolist()" 997 | ] 998 | }, 999 | { 1000 | "cell_type": "markdown", 1001 | "metadata": {}, 1002 | "source": [ 1003 | "### Split" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": 41, 1009 | "metadata": { 1010 | "collapsed": false 1011 | }, 1012 | "outputs": [], 1013 | "source": [ 1014 | "x_test = train_data_formatted[155:]\n", 1015 | "y_test = y_data[155:]\n", 
1016 | "\n", 1017 | "x_train = train_data_formatted[:155]\n", 1018 | "y_train = y_data[:155]" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "### Model training" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": 42, 1031 | "metadata": { 1032 | "collapsed": false 1033 | }, 1034 | "outputs": [], 1035 | "source": [ 1036 | "def train(X_train, y_train, model_name):\n", 1037 | " \"\"\" Trains a CRF on the given training data and saves the model. \"\"\"\n", 1038 | " print(\"Training\", model_name)\n", 1039 | " trainer = pycrfsuite.Trainer(verbose=False)\n", 1040 | "\n", 1041 | " for xseq, yseq in zip(X_train, y_train):\n", 1042 | " trainer.append(xseq, yseq)\n", 1043 | "\n", 1044 | " trainer.set_params({\n", 1045 | " 'c1': 0.1, # coefficient for L1 penalty\n", 1046 | " 'c2': 1e-3, # coefficient for L2 penalty\n", 1047 | " 'feature.possible_transitions': True\n", 1048 | " })\n", 1049 | "\n", 1050 | " trainer.train(model_name)" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 43, 1056 | "metadata": { 1057 | "collapsed": false 1058 | }, 1059 | "outputs": [ 1060 | { 1061 | "name": "stdout", 1062 | "output_type": "stream", 1063 | "text": [ 1064 | "Training pycrfmodel.model\n", 1065 | "Wall time: 6.36 s\n" 1066 | ] 1067 | } 1068 | ], 1069 | "source": [ 1070 | "%%time\n", 1071 | "train(x_train, y_train, 'pycrfmodel.model')" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": 44, 1077 | "metadata": { 1078 | "collapsed": false 1079 | }, 1080 | "outputs": [], 1081 | "source": [ 1082 | "def tag(X_test,model_name):\n", 1083 | " \"\"\" Labels test data with the model saved in model_name. \"\"\"\n", 1084 | " tagger = pycrfsuite.Tagger()\n", 1085 | " tagger.open(model_name)\n", 1086 | "\n", 1087 | " return [tagger.tag(seq) for seq in X_test]" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "code", 1092 | "execution_count": 45, 1093 | "metadata": { 1094 | "collapsed": false 1095 | }, 1096 | "outputs": [ 1097 | { 1098 | "data": { 1099 | "text/plain": [ 1100 | "['O',\n", 1101 | " 'O',\n", 1102 | " 'O',\n", 1103 | " 'B-NAT',\n", 1104 | " 'I-NAT',\n", 1105 | " 'I-NAT',\n", 1106 | " 'O',\n", 1107 | " 'O',\n", 1108 | " 'O',\n", 1109 | " 'O',\n", 1110 | " 'O',\n", 1111 | " 'O',\n", 1112 | " 'O',\n", 1113 | " 'O',\n", 1114 | " 'O',\n", 1115 | " 'O',\n", 1116 | " 'O',\n", 1117 | " 'O',\n", 1118 | " 'O',\n", 1119 | " 'O',\n", 1120 | " 'O',\n", 1121 | " 'O',\n", 1122 | " 'O',\n", 1123 | " 'O',\n", 1124 | " 'O']" 1125 | ] 1126 | }, 1127 | "execution_count": 45, 1128 | "metadata": {}, 1129 | "output_type": "execute_result" 1130 | } 1131 | ], 1132 | "source": [ 1133 | "tag(x_test, 'pycrfmodel.model')[0]" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 46, 1139 | "metadata": { 1140 | "collapsed": false 1141 | }, 1142 | "outputs": [], 1143 | "source": [ 1144 | "def evaluate(y_predict, y_true, ignore_bio = True):\n", 1145 | " correct = 0\n", 1146 | " total = 0\n", 1147 | " for i, y_pred in enumerate(y_predict):\n", 1148 | " for j, y in enumerate(y_pred):\n", 1149 | " if ignore_bio:\n", 1150 | " if y[2:] == y_true[i][j][2:]:\n", 1151 | " correct += 1\n", 1152 | " \n", 1153 | " else:\n", 1154 | " if y == y_true[i][j]:\n", 1155 | " correct += 1\n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " total += len(y_pred)\n", 1160 | " \n", 1161 | " return correct / total\n", 1162 | " " 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | 
"execution_count": 47, 1168 | "metadata": { 1169 | "collapsed": false 1170 | }, 1171 | "outputs": [ 1172 | { 1173 | "data": { 1174 | "text/plain": [ 1175 | "0.9840213049267643" 1176 | ] 1177 | }, 1178 | "execution_count": 47, 1179 | "metadata": {}, 1180 | "output_type": "execute_result" 1181 | } 1182 | ], 1183 | "source": [ 1184 | "evaluate(tag(x_test, 'pycrfmodel.model'), y_test, ignore_bio=True)" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": null, 1190 | "metadata": { 1191 | "collapsed": true 1192 | }, 1193 | "outputs": [], 1194 | "source": [] 1195 | } 1196 | ], 1197 | "metadata": { 1198 | "kernelspec": { 1199 | "display_name": "Python 3", 1200 | "language": "python", 1201 | "name": "python3" 1202 | }, 1203 | "language_info": { 1204 | "codemirror_mode": { 1205 | "name": "ipython", 1206 | "version": 3 1207 | }, 1208 | "file_extension": ".py", 1209 | "mimetype": "text/x-python", 1210 | "name": "python", 1211 | "nbconvert_exporter": "python", 1212 | "pygments_lexer": "ipython3", 1213 | "version": "3.6.0" 1214 | } 1215 | }, 1216 | "nbformat": 4, 1217 | "nbformat_minor": 2 1218 | } 1219 | -------------------------------------------------------------------------------- /section4/section4_video3_basic_classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 0. Imports" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 109, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_20newsgroups\n", 19 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 20 | "from sklearn.svm import LinearSVC\n", 21 | "from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score\n", 22 | "from sklearn.model_selection import GridSearchCV" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 32, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import spacy\n", 34 | "from nltk.corpus import stopwords\n", 35 | "import string" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 106, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "from collections import Counter\n", 47 | "import matplotlib.pyplot as plt\n", 48 | "import seaborn as sns\n", 49 | "import numpy as np" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## 1. Get data" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "newsgroups_train = fetch_20newsgroups(subset='train')\n", 68 | "newsgroups_test = fetch_20newsgroups(subset='test')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## 2. 
Data processing" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 4, 81 | "metadata": { 82 | "collapsed": true 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "x_train = newsgroups_train.data\n", 87 | "y_train = newsgroups_train.target" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 60, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "x_test = newsgroups_test.data\n", 99 | "y_test = newsgroups_test.target" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 23, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "'From: shd2001@andy.bgsu.edu (Sherlette Dixon)\\nSubject: Christianity & Atheism: an update\\nOrganization: BGSU\\nLines: 32\\n\\nFirst, I would like to thank all who sent me their opinions on the matter\\nat hand. All advice was taken to heart, if not directly used. My friend\\nfound out about the matter quite accidently. After reading some of my\\nmail, I quit from the mail reader & went about my business. I must have\\ntrashed my mail improperly, because he got on the same terminal the next\\nday & saw my old messages. He thought they were responses to a post he\\nplaced in alt.atheism earlier that week, so he read some of them before\\nrealizing that they were for me. I got a message from him the next day; he\\napologized for reading my mail & said that he did not want to appear to be\\na snoop. He said that he would be willing to talk to me about his views &\\ndidn\\'t mind doing so, especially with a friend. So we did. I neither\\nchanged his mind nor did he change mine, as that was not the point. Now he\\nknows where I\\'m coming from & now I know where he\\'s coming from. And all\\nthat I can do is pray for him, as I\\'ve always done.\\n\\nI believe the reason that he & I \"click\" instead of \"bash\" heads is because\\nI see Christianity as a tool for revolution, & not a tool for maintaining\\nthe status quo. To be quite blunt, I have more of a reason to reject God\\nthan he does just by the fact that I am an African-American female. \\nChristianity & religion have been used as tools to separate my people from\\nthe true knowledge of our history & the wealth of our contributions to the\\nworld society. The \"kitchen of heaven\" was all we had to look forward to\\nduring the slave days, & this mentality & second-class status still exists\\ntoday. I, too, have rejected\\nan aspect of Christianity----that of the estabished church. Too much\\nhypocricy exists behind the walls of \"God\\'s house\" beginning with the\\nimages of a white Jesus to that of the members: praise God on Sunday &\\nraise hell beginning Monday. God-willing, I will find a church home where\\nI can feel comfortable & at-home, but I don\\'t see it happening anytime\\nsoon.\\n\\nSherlette \\n'" 113 | ] 114 | }, 115 | "execution_count": 23, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "x_train[120]" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "So most of the data processing:\n", 129 | "- stopwords\n", 130 | "- punctuation\n", 131 | "- punctuation chains\n", 132 | "- single character words\n", 133 | "- stuff like '\\n\\t\\t\\t\\t\\t\\t' and '\\n'" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## 3. 
Document processing" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 8, 146 | "metadata": { 147 | "collapsed": true 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "nlp = spacy.load(\"en_core_web_md\")" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 25, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [ 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "Wall time: 17min 34s\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "%%time\n", 171 | "x_train_nlp = [[x.lemma_ for x in nlp(y)] for y in x_train]" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 61, 177 | "metadata": { 178 | "collapsed": false 179 | }, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "Wall time: 11min 31s\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "%%time\n", 191 | "x_test_nlp = [[x.lemma_ for x in nlp(y)] for y in x_test]" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## 2 bis. Data processing: take 2" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "### 2.1. Remove stopwords" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 28, 211 | "metadata": { 212 | "collapsed": true 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "stop_en = stopwords.words(\"english\")" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 29, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "x_cleaned_1 = []\n", 228 | "for x in x_train_nlp:\n", 229 | " x_cleaned_1.append([y for y in x if not y in stop_en])" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 62, 235 | "metadata": { 236 | "collapsed": true 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "x_cleaned_1_test = []\n", 241 | "for x in x_test_nlp:\n", 242 | " x_cleaned_1_test.append([y for y in x if not y in stop_en])" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### 2.2. 
Remove punct" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 33, 255 | "metadata": { 256 | "collapsed": true 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "x_cleaned_2 = []\n", 261 | "for x in x_cleaned_1:\n", 262 | " x_cleaned_2.append([y for y in x if not y in list(string.punctuation)])" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 63, 268 | "metadata": { 269 | "collapsed": true 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "x_cleaned_2_test = []\n", 274 | "for x in x_cleaned_1_test:\n", 275 | " x_cleaned_2_test.append([y for y in x if not y in list(string.punctuation)])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "### 2.3 Remove other useless stuff" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 35, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "useless = [\"-PRON-\"]" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 36, 299 | "metadata": { 300 | "collapsed": false 301 | }, 302 | "outputs": [], 303 | "source": [ 304 | "x_cleaned_3 = []\n", 305 | "for x in x_cleaned_2:\n", 306 | " x_cleaned_3.append([y for y in x if not y in useless])" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 64, 312 | "metadata": { 313 | "collapsed": true 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "x_cleaned_3_test = []\n", 318 | "for x in x_cleaned_2_test:\n", 319 | " x_cleaned_3_test.append([y for y in x if not y in useless])" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "### 2.4 Remove \\n and '--'" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 37, 332 | "metadata": { 333 | "collapsed": true 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "x_cleaned_4 = []\n", 338 | "for x in x_cleaned_3:\n", 339 | " x_cleaned_4.append([y for y in x if not (\"--\" in y or '\\n' in y) ])" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 65, 345 | "metadata": { 346 | "collapsed": true 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "x_cleaned_4_test = []\n", 351 | "for x in x_cleaned_3_test:\n", 352 | " x_cleaned_4_test.append([y for y in x if not (\"--\" in y or '\\n' in y) ])" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "### 2.5 Join together" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 39, 365 | "metadata": { 366 | "collapsed": false 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "x_cleaned = [\" \".join(y) for y in x_cleaned_4]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 66, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "x_cleaned_test = [\" \".join(y) for y in x_cleaned_4_test]" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 124, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. 
It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\"" 395 | ] 396 | }, 397 | "execution_count": 124, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 402 | "source": [ 403 | "x_train[0]" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 123, 409 | "metadata": { 410 | "collapsed": false 411 | }, 412 | "outputs": [ 413 | { 414 | "data": { 415 | "text/plain": [ 416 | "'lerxst@wam.umd.edu thing subject car nntp posting host rac3.wam.umd.edu organization university maryland college park line 15 wonder anyone enlighten car see day 2-door sport car look late 60s/ early 70 call bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car make history whatev info funky look car please e mail thank il bring neighborhood lerxst'" 417 | ] 418 | }, 419 | "execution_count": 123, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "x_cleaned[0]" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "## 4. Splitting" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "Already taken care of :-)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 43, 445 | "metadata": { 446 | "collapsed": false 447 | }, 448 | "outputs": [], 449 | "source": [ 450 | "cnt = Counter(y_train)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 48, 456 | "metadata": { 457 | "collapsed": false 458 | }, 459 | "outputs": [ 460 | { 461 | "data": { 462 | "text/plain": [ 463 | "Counter({0: 480,\n", 464 | " 1: 584,\n", 465 | " 2: 591,\n", 466 | " 3: 590,\n", 467 | " 4: 578,\n", 468 | " 5: 593,\n", 469 | " 6: 585,\n", 470 | " 7: 594,\n", 471 | " 8: 598,\n", 472 | " 9: 597,\n", 473 | " 10: 600,\n", 474 | " 11: 595,\n", 475 | " 12: 591,\n", 476 | " 13: 594,\n", 477 | " 14: 593,\n", 478 | " 15: 599,\n", 479 | " 16: 546,\n", 480 | " 17: 564,\n", 481 | " 18: 465,\n", 482 | " 19: 377})" 483 | ] 484 | }, 485 | "execution_count": 48, 486 | "metadata": {}, 487 | "output_type": "execute_result" 488 | } 489 | ], 490 | "source": [ 491 | "cnt" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "## 5. 
Feature representation" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 55, 504 | "metadata": { 505 | "collapsed": false 506 | }, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/plain": [ 511 | "(11314, 130107)" 512 | ] 513 | }, 514 | "execution_count": 55, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "vec = TfidfVectorizer()\n", 521 | "x_train_vec = vec.fit_transform(x_train)\n", 522 | "x_train_vec.shape" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 75, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "(11314, 119777)" 536 | ] 537 | }, 538 | "execution_count": 75, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "vec = TfidfVectorizer()\n", 545 | "x_train_vec = vec.fit_transform(x_cleaned)\n", 546 | "x_train_vec.shape" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "## 6. Metric and algo" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 99, 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "clf = LinearSVC(C=1, multi_class='ovr', dual=True)" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 100, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "Wall time: 2.24 s\n" 579 | ] 580 | }, 581 | { 582 | "data": { 583 | "text/plain": [ 584 | "LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,\n", 585 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n", 586 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", 587 | " verbose=0)" 588 | ] 589 | }, 590 | "execution_count": 100, 591 | "metadata": {}, 592 | "output_type": "execute_result" 593 | } 594 | ], 595 | "source": [ 596 | "%%time\n", 597 | "clf.fit(x_train_vec, y_train)" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "## 7. 
Validation" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 101, 610 | "metadata": { 611 | "collapsed": true 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "x_test_vec = vec.transform(x_cleaned_test)" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 102, 621 | "metadata": { 622 | "collapsed": true 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "y_predict = clf.predict(x_test_vec)" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 103, 632 | "metadata": { 633 | "collapsed": false 634 | }, 635 | "outputs": [ 636 | { 637 | "name": "stdout", 638 | "output_type": "stream", 639 | "text": [ 640 | "accuracy: 0.8538236856080722\n", 641 | "precision: 0.8538236856080722\n", 642 | "recall: 0.8538236856080722\n", 643 | "f1: 0.8538236856080722\n" 644 | ] 645 | } 646 | ], 647 | "source": [ 648 | "print(\"accuracy: \", accuracy_score(y_pred=y_predict, y_true=y_test))\n", 649 | "print(\"precision: \", precision_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))\n", 650 | "print(\"recall: \", recall_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))\n", 651 | "print(\"f1: \", f1_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 107, 657 | "metadata": { 658 | "collapsed": true 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "def show_top10(classifier, vectorizer, categories):\n", 663 | " feature_names = np.asarray(vectorizer.get_feature_names())\n", 664 | " for i, category in enumerate(categories):\n", 665 | " top10 = np.argsort(classifier.coef_[i])[-10:]\n", 666 | " print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": 108, 672 | "metadata": { 673 | "collapsed": false 674 | }, 675 | "outputs": [ 676 | { 677 | "name": "stdout", 678 | "output_type": "stream", 679 | "text": [ 680 | "alt.atheism: mangoe rushdie jaeger atheists cobb wingate islamic atheist keith atheism\n", 681 | "comp.graphics: animation cview tiff polygon pov 3do graphics 3d image graphic\n", 682 | "comp.os.ms-windows.misc: nt winqvt download ini file ax win3 driver cica windows\n", 683 | "comp.sys.ibm.pc.hardware: jumper scsi monitor fastmicro irq vlb 486 pc ide gateway\n", 684 | "comp.sys.mac.hardware: se lciii iisi lc centris duo quadra apple powerbook mac\n", 685 | "comp.windows.x: expo xpert xlib window lcs server xterm x11r5 widget motif\n", 686 | "misc.forsale: camera include distribution wanted condition sell ship forsale offer sale\n", 687 | "rec.autos: chevrolet engine truck auto convertible warning dealer toyota automotive car\n", 688 | "rec.motorcycles: harley kawasaki dog helmet rider bmw ride motorcycle bike dod\n", 689 | "rec.sport.baseball: braves giants tigers stadium cub yankee pitch sox phillies baseball\n", 690 | "rec.sport.hockey: cup bruins goal coach espn play team playoff nhl hockey\n", 691 | "sci.crypt: encrypt nsa crypto security wiretap pgp tap encryption key clipper\n", 692 | "sci.electronics: ee explode power scope voltage 256k electronic electronics 8051 circuit\n", 693 | "sci.med: pitt krillean patient medical cancer treatment photography disease msg doctor\n", 694 | "sci.space: sci dietz prb rocket shuttle planet launch moon orbit space\n", 695 | "soc.religion.christian: marry fisher geneva hell prayer christian christ church rutgers athos\n", 696 | "talk.politics.guns: batf fbi feustel cathy atf waco handgun weapon 
firearm gun\n", 697 | "talk.politics.mideast: argic holocaust adl armenia serdar arab turkish armenian israel israeli\n", 698 | "talk.politics.misc: teel drug president drieux gay optilink clinton tax cramer kaldis\n", 699 | "talk.religion.misc: thyagi psyrobtw 666 christian koresh biblical hudson weiss morality beast\n" 700 | ] 701 | } 702 | ], 703 | "source": [ 704 | "show_top10(clf, vec, newsgroups_train.target_names)" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "## 8. Parameter tuning" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 114, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [], 721 | "source": [ 722 | "parameters = {'C':[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2], \"dual\":[True,False]}" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 115, 728 | "metadata": { 729 | "collapsed": true 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "clf = LinearSVC()" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 116, 739 | "metadata": { 740 | "collapsed": true 741 | }, 742 | "outputs": [], 743 | "source": [ 744 | "grid = GridSearchCV(clf, parameters)" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 118, 750 | "metadata": { 751 | "collapsed": false 752 | }, 753 | "outputs": [ 754 | { 755 | "data": { 756 | "text/plain": [ 757 | "GridSearchCV(cv=None, error_score='raise',\n", 758 | " estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n", 759 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n", 760 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", 761 | " verbose=0),\n", 762 | " fit_params=None, iid=True, n_jobs=1,\n", 763 | " param_grid={'C': [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2], 'dual': [True, False]},\n", 764 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 765 | " scoring=None, verbose=0)" 766 | ] 767 | }, 768 | "execution_count": 118, 769 | "metadata": {}, 770 | "output_type": "execute_result" 771 | } 772 | ], 773 | "source": [ 774 | "grid.fit(x_train_vec, y_train)" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 119, 780 | "metadata": { 781 | "collapsed": false 782 | }, 783 | "outputs": [ 784 | { 785 | "data": { 786 | "text/plain": [ 787 | "{'C': 1.5, 'dual': True}" 788 | ] 789 | }, 790 | "execution_count": 119, 791 | "metadata": {}, 792 | "output_type": "execute_result" 793 | } 794 | ], 795 | "source": [ 796 | "grid.best_params_" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 122, 802 | "metadata": { 803 | "collapsed": false 804 | }, 805 | "outputs": [ 806 | { 807 | "data": { 808 | "text/plain": [ 809 | "0.9164751635142302" 810 | ] 811 | }, 812 | "execution_count": 122, 813 | "metadata": {}, 814 | "output_type": "execute_result" 815 | } 816 | ], 817 | "source": [ 818 | "grid.best_score_" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "## 9. 
Into production" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "See video 4.5 ;-)" 833 | ] 834 | } 835 | ], 836 | "metadata": { 837 | "kernelspec": { 838 | "display_name": "Python 3", 839 | "language": "python", 840 | "name": "python3" 841 | }, 842 | "language_info": { 843 | "codemirror_mode": { 844 | "name": "ipython", 845 | "version": 3 846 | }, 847 | "file_extension": ".py", 848 | "mimetype": "text/x-python", 849 | "name": "python", 850 | "nbconvert_exporter": "python", 851 | "pygments_lexer": "ipython3", 852 | "version": "3.6.0" 853 | } 854 | }, 855 | "nbformat": 4, 856 | "nbformat_minor": 2 857 | } 858 | --------------------------------------------------------------------------------
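A minimal follow-up sketch of scoring the tuned classifier on the held-out test split: the notebook above tunes C and dual with GridSearchCV and then defers deployment to video 4.5, so this step is only illustrative. It assumes the grid, vec, x_cleaned_test, and y_test objects from section4_video3_basic_classifier.ipynb are still in scope; it is not part of the original course materials.

from sklearn.metrics import accuracy_score, f1_score

# GridSearchCV refits the best estimator on the full training data by default
# (refit=True), so grid.best_estimator_ can be used directly for prediction.
best_clf = grid.best_estimator_

# Reuse the TF-IDF vectorizer fitted on the cleaned training texts.
x_test_vec = vec.transform(x_cleaned_test)
y_pred = best_clf.predict(x_test_vec)

print("tuned parameters:", grid.best_params_)
print("test accuracy:", accuracy_score(y_test, y_pred))
print("test micro-F1:", f1_score(y_test, y_pred, average="micro"))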