├── LICENSE ├── README.md ├── Section5 ├── Reviews.rar ├── plot.html └── section5_video3_video4_training_visualizing_wordembedding.ipynb ├── section1 └── video3 │ └── section1_video3_install_corpora.ipynb ├── section2 ├── video 2 │ └── section_2_video_2_cleaning.ipynb ├── video 3 │ └── section_2_video_3_tokenizing.ipynb └── video 4 │ └── section_2_video_4_ngrams.ipynb ├── section3 ├── ner_dataset.rar ├── section3_video3_pretrained_models.ipynb └── section3_video4_training_ner.ipynb └── section4 └── section4_video3_basic_classifier.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Text Mining with Machine Learning and Python [Video] 5 | This is the code repository for [Text Mining with Machine Learning and Python [Video]](https://www.packtpub.com/application-development/text-mining-machine-learning-and-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781789137361), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish. 6 | ## About the Video Course 7 | Text is one of the most actively researched and widely spread types of data in the Data Science field today. New advances in machine learning and deep learning techniques now make it possible to build fantastic data products on text sources. New exciting text data sources pop up all the time. You'll build your own toolbox of know-how, packages, and working code snippets so you can perform your own text mining analyses. 8 | 9 | You'll start by understanding the fundamentals of modern text mining and move on to some exciting processes involved in it. You'll learn how machine learning is used to extract meaningful information from text and the different processes involved in it. You will learn to read and process text features. Then you'll learn how to extract information from text and work on pre-trained models, while also delving into text classification, and entity extraction and classification. 
 You will explore the process of word embedding by working on Skip-grams, CBOW, and X2Vec with some additional and important text mining processes. 10 | 11 | Starting from the basics of preprocessing text features, we’ll take a look at how we can extract relevant features from text and classify documents through Machine Learning. Since Word Embeddings have become indispensable in today’s NLP world, we’ll dive deeper into their inner workings and have a go at training our own embedding models. 12 | 13 | By the end of the course, you will have learned the various aspects of text mining with ML, gained a high-level understanding of the components involved in a current-day NLP pipeline along with a set of working code to build further upon, and begun your journey as an effective text miner. 14 | 15 | 16 | 17 | 
 ## What You Will Learn 18 | 19 | 
27 | 28 | ## Instructions and Navigation 29 | ### Assumed Knowledge 30 | To fully benefit from the coverage included in this course, you will need:
 31 | 32 | ● Working experience with Python and Jupyter Notebooks 33 | 34 | ● Initial experience with data analytics in Python 35 | 36 | ● A first encounter with Machine Learning (scikit-learn experience is a plus) 37 | 38 | 39 | ### Technical Requirements 40 | This course has the following software requirements:
      
41 | 42 | ● Anaconda distribution of latest Python 3 43 | 44 | ● Separate conda env with Python 3 installed 45 | 46 | ○ available to set up once Anaconda is installed 47 | 48 | ● Jupyter notebook 49 | 50 | ○ available to activate once Anaconda is installed 51 | 52 | ● Extra packages: 53 | 54 | ○ NLTK (pip install nltk==3.2.2) 55 | 56 | ○ Spacy (pip install spacy==2.0.3) 57 | 58 | ○ Gensim (pip install gensim==3.3.0) 59 | 60 | ○ Scikit-learn (pip install scikit-learn==0.19.1) 61 | 62 | ○ Tensorflow (for CPU) (pip install tensorflow==1.4.0) 63 | 64 | ○ Keras (pip install keras==2.1.3) 65 | 66 | ○ python-crfsuite (pip install python-crfsuite==0.9.5) 67 | 68 | This course has been tested on the following system configuration: 69 | 70 | ● OS: Windows 10 71 | 72 | ● Processor: Quad Core 2.8 Ghz 73 | 74 | ● Memory: 16GB 75 | 76 | ● Hard Disk Space: 3GB 77 | 78 | 79 | 80 | ## Related Products 81 | * [Hands-On Machine Learning with Python and Scikit-Learn [Video]](https://www.packtpub.com/big-data-and-business-intelligence/hands-machine-learning-python-and-scikit-learn-video?utm_source=github&utm_medium=repository&utm_campaign=9781788991056) 82 | 83 | * [Machine Learning with scikit-learn and Tensorflow [Video]](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-scikit-learn-and-tensorflow-video?utm_source=github&utm_medium=repository&utm_campaign=9781788629928) 84 | 85 | * [Kali Linux Advanced Wireless Penetration Testing [Video]](https://www.packtpub.com/networking-and-servers/kali-linux-advanced-wireless-penetration-testing-video?utm_source=github&utm_medium=repository&utm_campaign=9781788832342) 86 | 87 | -------------------------------------------------------------------------------- /Section5/Reviews.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Text-Mining-with-Machine-Learning-and-Python/31fbe17da4e984f9c3b5e6a590ec53df4d0b1c05/Section5/Reviews.rar -------------------------------------------------------------------------------- /Section5/section5_video3_video4_training_visualizing_wordembedding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 0. Imports" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "name": "stderr", 19 | "output_type": "stream", 20 | "text": [ 21 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n", 22 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "import gensim\n", 29 | "import spacy\n", 30 | "from tqdm import tqdm" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "tqdm.pandas(desc=\"Progress\")" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "nlp_en = spacy.load(\"en_core_web_md\")" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## 1. 
Train word embeddings" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "#### 1.1 Get data" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "pd_data = pd.read_csv(\"Reviews.csv\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 17, 83 | "metadata": { 84 | "collapsed": false 85 | }, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "
\n", 91 | "\n", 104 | "\n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | "
IdProductIdUserIdProfileNameHelpfulnessNumeratorHelpfulnessDenominatorScoreTimeSummaryTexttokens
01B001E4KFG0A3SGXH7AUHU8GWdelmartian1151303862400Good Quality Dog FoodI have bought several of the Vitality canned d...[I, have, bought, several, of, the, Vitality, ...
12B00813GRG4A1D87F6ZCVE5NKdll pa0011346976000Not as AdvertisedProduct arrived labeled as Jumbo Salted Peanut...[Product, arrived, labeled, as, Jumbo, Salted,...
23B000LQOCH0ABXLMWJIXXAINNatalia Corres \"Natalia Corres\"1141219017600\"Delight\" says it allThis is a confection that has been around a fe...[This, is, a, confection, that, has, been, aro...
\n", 166 | "
" 167 | ], 168 | "text/plain": [ 169 | " Id ProductId UserId ProfileName \\\n", 170 | "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n", 171 | "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n", 172 | "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n", 173 | "\n", 174 | " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n", 175 | "0 1 1 5 1303862400 \n", 176 | "1 0 0 1 1346976000 \n", 177 | "2 1 1 4 1219017600 \n", 178 | "\n", 179 | " Summary Text \\\n", 180 | "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n", 181 | "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n", 182 | "2 \"Delight\" says it all This is a confection that has been around a fe... \n", 183 | "\n", 184 | " tokens \n", 185 | "0 [I, have, bought, several, of, the, Vitality, ... \n", 186 | "1 [Product, arrived, labeled, as, Jumbo, Salted,... \n", 187 | "2 [This, is, a, confection, that, has, been, aro... " 188 | ] 189 | }, 190 | "execution_count": 17, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "pd_data.head(3)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "#### 1.2. Process data" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 6, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "def get_tokens(sentence):\n", 215 | " return [x.text for x in nlp_en(sentence)]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 7, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "name": "stderr", 227 | "output_type": "stream", 228 | "text": [ 229 | "Progress: 100%|██████████████████████████████████████████████████████████████| 568454/568454 [4:14:40<00:00, 37.20it/s]\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "pd_data[\"tokens\"] = pd_data[\"Text\"].progress_apply(get_tokens)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 11, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "pd_data.to_pickle(\"pd_data_tokenized.pickle\")" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 2, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "pd_data = pd.read_pickle(\"pd_data_tokenized.pickle\")" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "#### 1.3. Train word embeddings using word2vec" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 8, 269 | "metadata": { 270 | "collapsed": true 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "model_w2v = gensim.models.Word2Vec(pd_data[\"tokens\"].tolist(), min_count=5, window = 9, size = 100)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "#### 1.4. Train word embeddings using fasttext" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 9, 287 | "metadata": { 288 | "collapsed": true 289 | }, 290 | "outputs": [], 291 | "source": [ 292 | "model_ft = gensim.models.FastText(pd_data[\"tokens\"].tolist(), min_count=5, window = 9, size = 100)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "#### 1.5. 
Persistence" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": true 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "model_w2v.save(\"model_w2v.model\")\n", 311 | "model_ft.save(\"model_ft.model\")" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 2, 317 | "metadata": { 318 | "collapsed": true 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "model_w2v = gensim.models.Word2Vec.load(\"model_w2v.model\")\n", 323 | "model_ft = gensim.models.FastText.load(\"model_ft.model\")" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "#### 1.6. Similarity" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 7, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [ 340 | { 341 | "name": "stderr", 342 | "output_type": "stream", 343 | "text": [ 344 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 345 | " if __name__ == '__main__':\n" 346 | ] 347 | }, 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "[('fish', 0.8536328077316284),\n", 352 | " ('tuna', 0.7662709951400757),\n", 353 | " ('chicken', 0.7630202174186707),\n", 354 | " ('seafood', 0.7627329230308533),\n", 355 | " ('turkey', 0.7592297792434692)]" 356 | ] 357 | }, 358 | "execution_count": 7, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "model_w2v.most_similar(\"salmon\", topn=5)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 8, 370 | "metadata": { 371 | "collapsed": false 372 | }, 373 | "outputs": [ 374 | { 375 | "name": "stderr", 376 | "output_type": "stream", 377 | "text": [ 378 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 379 | " if __name__ == '__main__':\n" 380 | ] 381 | }, 382 | { 383 | "data": { 384 | "text/plain": [ 385 | "[('cheddar', 0.7746697068214417),\n", 386 | " ('mozzarella', 0.7572810649871826),\n", 387 | " ('parmesan', 0.7331867218017578),\n", 388 | " ('chedder', 0.7296013236045837),\n", 389 | " ('mayo', 0.7252874374389648)]" 390 | ] 391 | }, 392 | "execution_count": 8, 393 | "metadata": {}, 394 | "output_type": "execute_result" 395 | } 396 | ], 397 | "source": [ 398 | "model_w2v.most_similar(positive=['cheese'], topn=5)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "#### 1.7. 
Correlation" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 1, 411 | "metadata": { 412 | "collapsed": false 413 | }, 414 | "outputs": [ 415 | { 416 | "ename": "NameError", 417 | "evalue": "name 'model_w2v' is not defined", 418 | "output_type": "error", 419 | "traceback": [ 420 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 421 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 422 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmodel_w2v\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmost_similar\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpositive\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'pea'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'salsa'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnegative\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'tomato'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtopn\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 423 | "\u001b[0;31mNameError\u001b[0m: name 'model_w2v' is not defined" 424 | ] 425 | } 426 | ], 427 | "source": [ 428 | "model_w2v.most_similar(positive=['pea', 'salsa'], negative=['tomato'], topn=3)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 14, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [ 438 | { 439 | "name": "stderr", 440 | "output_type": "stream", 441 | "text": [ 442 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 443 | " if __name__ == '__main__':\n" 444 | ] 445 | }, 446 | { 447 | "data": { 448 | "text/plain": [ 449 | "[('tequila', 0.7341920137405396),\n", 450 | " ('lemonade', 0.7284362316131592),\n", 451 | " ('juice', 0.7281173467636108)]" 452 | ] 453 | }, 454 | "execution_count": 14, 455 | "metadata": {}, 456 | "output_type": "execute_result" 457 | } 458 | ], 459 | "source": [ 460 | "model_w2v.most_similar(positive=['lemon', 'water'], topn=3)" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 15, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [ 470 | { 471 | "name": "stderr", 472 | "output_type": "stream", 473 | "text": [ 474 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 475 | " if __name__ == '__main__':\n" 476 | ] 477 | }, 478 | { 479 | "data": { 480 | "text/plain": [ 481 | "[('bread', 0.7283815145492554),\n", 482 | " ('pizza', 0.7018527388572693),\n", 483 | " ('dough', 0.6836484670639038)]" 484 | ] 485 | }, 486 | "execution_count": 15, 487 | "metadata": {}, 488 | "output_type": "execute_result" 489 | } 490 | ], 491 | "source": [ 492 | "model_w2v.most_similar(positive=['salami', 'crust'], topn=3)" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 16, 498 | "metadata": { 499 | "collapsed": false 500 | }, 501 | "outputs": [ 502 | { 503 | "name": "stderr", 504 | "output_type": "stream", 505 | "text": [ 506 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() 
instead).\n", 507 | " if __name__ == '__main__':\n" 508 | ] 509 | }, 510 | { 511 | "data": { 512 | "text/plain": [ 513 | "[('hamburger', 0.814429521560669),\n", 514 | " ('ham', 0.795830488204956),\n", 515 | " ('sausage', 0.7887133359909058)]" 516 | ] 517 | }, 518 | "execution_count": 16, 519 | "metadata": {}, 520 | "output_type": "execute_result" 521 | } 522 | ], 523 | "source": [ 524 | "model_w2v.most_similar(positive=['beef', 'bun'], topn=3)" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": { 530 | "collapsed": true 531 | }, 532 | "source": [ 533 | "## 2. Visualise them" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 33, 539 | "metadata": { 540 | "collapsed": false 541 | }, 542 | "outputs": [], 543 | "source": [ 544 | "from sklearn.manifold import TSNE\n", 545 | "import matplotlib.pyplot as plt\n", 546 | "from bokeh.plotting import figure, output_file, show\n", 547 | "from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 27, 553 | "metadata": { 554 | "collapsed": false 555 | }, 556 | "outputs": [ 557 | { 558 | "name": "stdout", 559 | "output_type": "stream", 560 | "text": [ 561 | "Wall time: 2min 7s\n" 562 | ] 563 | } 564 | ], 565 | "source": [ 566 | "%%time\n", 567 | "model_w2v = gensim.models.Word2Vec(pd_data[\"tokens\"].tolist(), min_count=500, window = 9, size = 100)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 29, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "name": "stderr", 579 | "output_type": "stream", 580 | "text": [ 581 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:5: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n" 582 | ] 583 | } 584 | ], 585 | "source": [ 586 | "tokens = []\n", 587 | "labels = []\n", 588 | "\n", 589 | "for x in model_w2v.wv.vocab:\n", 590 | " tokens.append(model_w2v[x])\n", 591 | " labels.append(x)" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 30, 597 | "metadata": { 598 | "collapsed": false 599 | }, 600 | "outputs": [ 601 | { 602 | "name": "stdout", 603 | "output_type": "stream", 604 | "text": [ 605 | "Wall time: 2min 47s\n" 606 | ] 607 | } 608 | ], 609 | "source": [ 610 | "%%time\n", 611 | "tsne_model = TSNE(n_components=2, random_state=11)\n", 612 | "fitted = tsne_model.fit_transform(tokens)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 34, 618 | "metadata": { 619 | "collapsed": true 620 | }, 621 | "outputs": [], 622 | "source": [ 623 | "output_file(\"plot.html\")\n", 624 | " \n", 625 | "p = figure(plot_width=1000, plot_height=1000)\n", 626 | "\n", 627 | "lst = list(model_w2v.wv.vocab)\n", 628 | "\n", 629 | "\n", 630 | "\n", 631 | "p.circle(fitted[:, 0], fitted[:, 1], size=2, color=\"navy\", alpha=0.5)\n", 632 | "\n", 633 | "texts = lst\n", 634 | "\n", 635 | "\n", 636 | "source = ColumnDataSource(data=dict(x=fitted[:, 0], y=fitted[:, 1], text=texts))\n", 637 | "\n", 638 | "labels = LabelSet(x='x', y='y', text='text',\n", 639 | " x_offset=5, y_offset=5, source=source)\n", 640 | "p.add_layout(labels)\n", 641 | "\n", 642 | "\n", 643 | "\n", 644 | "show(p)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "collapsed": true 652 | }, 653 | "outputs": [], 654 | "source": [] 655 | } 656 | ], 657 | "metadata": { 658 | 
"kernelspec": { 659 | "display_name": "Python 3", 660 | "language": "python", 661 | "name": "python3" 662 | }, 663 | "language_info": { 664 | "codemirror_mode": { 665 | "name": "ipython", 666 | "version": 3 667 | }, 668 | "file_extension": ".py", 669 | "mimetype": "text/x-python", 670 | "name": "python", 671 | "nbconvert_exporter": "python", 672 | "pygments_lexer": "ipython3", 673 | "version": "3.6.0" 674 | } 675 | }, 676 | "nbformat": 4, 677 | "nbformat_minor": 2 678 | } 679 | -------------------------------------------------------------------------------- /section1/video3/section1_video3_install_corpora.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "nltk.download()" 23 | ] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.6.0" 43 | } 44 | }, 45 | "nbformat": 4, 46 | "nbformat_minor": 2 47 | } 48 | -------------------------------------------------------------------------------- /section2/video 2/section_2_video_2_cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Demo data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "sent1 = \"Feeling loved, even when I'm sick🍫☕️💓#likeforlike #chocolate #bf #iloveyou #aftereight #couplegoals\"" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Remove punctuation" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import string" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "translator = str.maketrans('', '', string.punctuation)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "Feeling loved even when Im sick🍫☕️💓likeforlike chocolate bf iloveyou aftereight couplegoals\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "sent_pun = sent1.translate(translator)\n", 67 | "print(sent_pun)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Remove unicode" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "import regex" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "Method 1" 93 | ] 
94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 6, 98 | "metadata": { 99 | "collapsed": false 100 | }, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | "Feeling loved even when Im sicklikeforlike chocolate bf iloveyou aftereight couplegoals\n" 107 | ] 108 | } 109 | ], 110 | "source": [ 111 | "sent_pun_uni = sent_pun.encode('ascii', 'ignore').decode(\"utf-8\")\n", 112 | "print(sent_pun_uni)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Method 2" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 7, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "Feeling loved even when Im sick ️ likeforlike chocolate bf iloveyou aftereight couplegoals\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "emoji_pattern = regex.compile(\"\"\"\\p{So}\\p{Sk}*\"\"\")\n", 139 | "sent_pun_uni = emoji_pattern.sub(r' ', sent_pun)\n", 140 | "print(sent_pun_uni)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### Remove URL" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 8, 153 | "metadata": { 154 | "collapsed": true 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "import re" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "subject = \"Omg, check out these fabulous shoes https://thiswebsitedoesntexistsodontbother.com/omgshoesss yes they can be yours\"" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 10, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "Omg, check out these fabulous shoes yes they can be yours\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "result = re.sub(r\"http\\S+\", \"\", subject)\n", 189 | "print(result)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### Remove Stopwords" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 11, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "from nltk.corpus import stopwords\n", 208 | "stop_en = stopwords.words(\"english\")" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "collapsed": true 216 | }, 217 | "outputs": [], 218 | "source": [ 219 | "subject = [\"i\",\"have\",\"a\",\"cat\",\"named\",\"mr\",\"whiskers\",\"he\",\"is\",\"a\",\"very\",\"hungry\",\"cat\"]" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "collapsed": false 227 | }, 228 | "outputs": [], 229 | "source": [ 230 | "subject = [x for x in subject if not x in stop_en]\n", 231 | "print(subject)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "collapsed": true 239 | }, 240 | "outputs": [], 241 | "source": [] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python 3", 247 | "language": "python", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 3 254 | }, 255 | 
"file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython3", 260 | "version": "3.6.0" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 2 265 | } 266 | -------------------------------------------------------------------------------- /section2/video 3/section_2_video_3_tokenizing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from pprint import pprint" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Tokenization" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Using NLTK" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import nltk" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "test_sentence = \"It's too cold outside, we'd be better watering our neighbour's plants tomorrow\"" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": { 54 | "collapsed": false 55 | }, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['It',\n", 61 | " \"'s\",\n", 62 | " 'too',\n", 63 | " 'cold',\n", 64 | " 'outside',\n", 65 | " ',',\n", 66 | " 'we',\n", 67 | " \"'d\",\n", 68 | " 'be',\n", 69 | " 'better',\n", 70 | " 'watering',\n", 71 | " 'our',\n", 72 | " 'neighbour',\n", 73 | " \"'s\",\n", 74 | " 'plants',\n", 75 | " 'tomorrow']" 76 | ] 77 | }, 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "output_type": "execute_result" 81 | } 82 | ], 83 | "source": [ 84 | "nltk.word_tokenize(test_sentence)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### Using Spacy" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 5, 97 | "metadata": { 98 | "collapsed": false 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "import spacy\n", 103 | "nlp_en = spacy.load(\"en_core_web_sm\")" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 6, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "doc = nlp_en(test_sentence)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 7, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "['It',\n", 128 | " \"'s\",\n", 129 | " 'too',\n", 130 | " 'cold',\n", 131 | " 'outside',\n", 132 | " ',',\n", 133 | " 'we',\n", 134 | " \"'d\",\n", 135 | " 'be',\n", 136 | " 'better',\n", 137 | " 'watering',\n", 138 | " 'our',\n", 139 | " 'neighbour',\n", 140 | " \"'s\",\n", 141 | " 'plants',\n", 142 | " 'tomorrow']" 143 | ] 144 | }, 145 | "execution_count": 7, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "[x.text for x in doc]" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "# POS tagging" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### Using NLTK" 166 | ] 167 | }, 168 | { 169 | "cell_type": 
"code", 170 | "execution_count": 8, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "data": { 177 | "text/plain": [ 178 | "[('It', 'PRP'),\n", 179 | " (\"'s\", 'VBZ'),\n", 180 | " ('too', 'RB'),\n", 181 | " ('cold', 'JJ'),\n", 182 | " ('outside', 'JJ'),\n", 183 | " (',', ','),\n", 184 | " ('we', 'PRP'),\n", 185 | " (\"'d\", 'MD'),\n", 186 | " ('be', 'VB'),\n", 187 | " ('better', 'RB'),\n", 188 | " ('watering', 'VBG'),\n", 189 | " ('our', 'PRP$'),\n", 190 | " ('neighbour', 'NN'),\n", 191 | " (\"'s\", 'POS'),\n", 192 | " ('plants', 'NNS'),\n", 193 | " ('tomorrow', 'NN')]" 194 | ] 195 | }, 196 | "execution_count": 8, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "tokens = nltk.word_tokenize(test_sentence)\n", 203 | "nltk.pos_tag(tokens)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### Using Spacy" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": { 217 | "collapsed": false 218 | }, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "[('It', 'PRON'),\n", 224 | " (\"'s\", 'VERB'),\n", 225 | " ('too', 'ADV'),\n", 226 | " ('cold', 'ADJ'),\n", 227 | " ('outside', 'ADV'),\n", 228 | " (',', 'PUNCT'),\n", 229 | " ('we', 'PRON'),\n", 230 | " (\"'d\", 'VERB'),\n", 231 | " ('be', 'VERB'),\n", 232 | " ('better', 'ADJ'),\n", 233 | " ('watering', 'VERB'),\n", 234 | " ('our', 'ADJ'),\n", 235 | " ('neighbour', 'NOUN'),\n", 236 | " (\"'s\", 'PART'),\n", 237 | " ('plants', 'NOUN'),\n", 238 | " ('tomorrow', 'NOUN')]" 239 | ] 240 | }, 241 | "execution_count": 9, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "[(x.text, x.pos_) for x in doc]" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "# Lemmatization" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Using NLTK" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 10, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "import nltk\n", 273 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 274 | "from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 11, 280 | "metadata": { 281 | "collapsed": true 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "lemmatizer = WordNetLemmatizer()" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 12, 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "tokens = nltk.word_tokenize(test_sentence)\n", 297 | "tags = nltk.pos_tag(tokens)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 13, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "It\n", 312 | "'s\n", 313 | "too\n", 314 | "cold\n", 315 | "outside\n", 316 | ",\n", 317 | "we\n", 318 | "'d\n", 319 | "be\n", 320 | "better\n", 321 | "water\n", 322 | "our\n", 323 | "neighbour\n", 324 | "'s\n", 325 | "plant\n", 326 | "tomorrow\n" 327 | ] 328 | } 329 | ], 330 | "source": [ 331 | "for i, token in enumerate(tokens):\n", 332 | " pos_tag = tags[i][1]\n", 333 | "\n", 334 | " if pos_tag.startswith(\"N\"):\n", 335 | " lemma = 
lemmatizer.lemmatize(token, pos=NOUN)\n", 336 | " elif pos_tag.startswith(\"V\"):\n", 337 | " lemma = lemmatizer.lemmatize(token, pos=VERB)\n", 338 | " elif pos_tag.startswith(\"J\"):\n", 339 | " lemma = lemmatizer.lemmatize(token, pos=ADJ)\n", 340 | " else:\n", 341 | " lemma = token\n", 342 | " \n", 343 | " print(lemma)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "### Using Spacy" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 14, 356 | "metadata": { 357 | "collapsed": false 358 | }, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "['-PRON-',\n", 364 | " 'have',\n", 365 | " 'too',\n", 366 | " 'cold',\n", 367 | " 'outside',\n", 368 | " ',',\n", 369 | " '-PRON-',\n", 370 | " 'would',\n", 371 | " 'be',\n", 372 | " 'well',\n", 373 | " 'water',\n", 374 | " '-PRON-',\n", 375 | " 'neighbour',\n", 376 | " 'have',\n", 377 | " 'plant',\n", 378 | " 'tomorrow']" 379 | ] 380 | }, 381 | "execution_count": 14, 382 | "metadata": {}, 383 | "output_type": "execute_result" 384 | } 385 | ], 386 | "source": [ 387 | "[x.lemma_ for x in doc]" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": { 394 | "collapsed": true 395 | }, 396 | "outputs": [], 397 | "source": [] 398 | } 399 | ], 400 | "metadata": { 401 | "kernelspec": { 402 | "display_name": "Python 3", 403 | "language": "python", 404 | "name": "python3" 405 | }, 406 | "language_info": { 407 | "codemirror_mode": { 408 | "name": "ipython", 409 | "version": 3 410 | }, 411 | "file_extension": ".py", 412 | "mimetype": "text/x-python", 413 | "name": "python", 414 | "nbconvert_exporter": "python", 415 | "pygments_lexer": "ipython3", 416 | "version": "3.6.0" 417 | } 418 | }, 419 | "nbformat": 4, 420 | "nbformat_minor": 2 421 | } 422 | -------------------------------------------------------------------------------- /section2/video 4/section_2_video_4_ngrams.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# No intelligence" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from nltk import ngrams" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "name": "stdout", 30 | "output_type": "stream", 31 | "text": [ 32 | "('oh', 'my')\n", 33 | "('my', 'god,')\n", 34 | "('god,', 'the')\n", 35 | "('the', 'chocolate')\n", 36 | "('chocolate', 'is')\n", 37 | "('is', '15%')\n", 38 | "('15%', 'of')\n", 39 | "('of', 'today')\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "sentence = \"oh my god, the chocolate is 15% of today\"\n", 45 | "n = 2\n", 46 | "bigrams = ngrams(sentence.split(), n)\n", 47 | "for grams in bigrams:\n", 48 | " print( grams)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "# Some intelligence" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 3, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "from nltk import RegexpParser" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 4, 72 | "metadata": { 73 | "collapsed": true 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "chunkbiGram = r\"\"\"NA: { }\n", 78 | " AN: { 
}\n", 79 | " NN: { }\n", 80 | " \"\"\"" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 5, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "chunkparserbigram = RegexpParser(chunkbiGram)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 6, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "example = [('oh', 'INTJ'),\n", 103 | " ('my', 'INTJ'),\n", 104 | " ('god', 'INTJ'),\n", 105 | " (',', 'PUNCT'),\n", 106 | " ('the', 'DET'),\n", 107 | " ('dark', 'ADJ'),\n", 108 | " ('chocolate', 'NOUN'),\n", 109 | " ('is', 'VERB'),\n", 110 | " ('15', 'NUM'),\n", 111 | " ('%', 'NOUN'),\n", 112 | " ('of', 'ADP'),\n", 113 | " ('today', 'NOUN')]" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 7, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "chunked = chunkparserbigram.parse(example)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 8, 130 | "metadata": { 131 | "collapsed": false, 132 | "scrolled": true 133 | }, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "found adjective + noun\n", 140 | "['dark', 'chocolate']\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "for subtree in chunked.subtrees():\n", 146 | " if subtree.label() == 'NA':\n", 147 | " print('found noun + adjective')\n", 148 | " print([leaf[0] for leaf in subtree.leaves()])\n", 149 | " elif subtree.label() == 'AN':\n", 150 | " print('found adjective + noun')\n", 151 | " print([leaf[0] for leaf in subtree.leaves()])\n", 152 | " elif subtree.label() == 'NN':\n", 153 | " print('found noun + noun')\n", 154 | " print([leaf[0] for leaf in subtree.leaves()])" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "# Intelligence: statistical approach" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Using NLTK" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 9, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "import itertools\n", 180 | "from nltk.corpus import genesis\n", 181 | "from nltk.collocations import BigramCollocationFinder\n", 182 | "from nltk.metrics import BigramAssocMeasures" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 10, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):\n", 194 | " bigram_finder = BigramCollocationFinder.from_words(words)\n", 195 | " bigrams = bigram_finder.nbest(score_fn, n)\n", 196 | " return bigrams" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 11, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "bigrams = bigram_word_feats(genesis.words('english-web.txt'), n=25)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 12, 213 | "metadata": { 214 | "collapsed": false, 215 | "scrolled": true 216 | }, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "[('Allon', 'Bacuth'),\n", 222 | " ('Ashteroth', 'Karnaim'),\n", 223 | " ('Baal', 'Hanan'),\n", 224 | " ('Beer', 'Lahai'),\n", 225 | " ('Ben', 'Ammi'),\n", 226 | " ('En', 'Mishpat'),\n", 227 | " ('Jegar', 
'Sahadutha'),\n", 228 | " ('Kiriath', 'Arba'),\n", 229 | " ('Lahai', 'Roi'),\n", 230 | " ('Most', 'High'),\n", 231 | " ('Salt', 'Sea'),\n", 232 | " ('Whoever', 'sheds'),\n", 233 | " ('appoint', 'overseers'),\n", 234 | " ('aromatic', 'resin'),\n", 235 | " ('cutting', 'instrument'),\n", 236 | " ('direct', 'descendants'),\n", 237 | " ('droves', 'apart'),\n", 238 | " ('during', 'mating'),\n", 239 | " ('falls', 'backward'),\n", 240 | " ('fig', 'leaves'),\n", 241 | " ('flaming', 'torch'),\n", 242 | " ('fresh', 'poplar'),\n", 243 | " ('fully', 'pay'),\n", 244 | " ('fury', 'turns'),\n", 245 | " ('gray', 'hairs')]" 246 | ] 247 | }, 248 | "execution_count": 12, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "bigrams" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Using Spacy" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 13, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "from nltk.corpus import inaugural" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 14, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [ 282 | { 283 | "name": "stderr", 284 | "output_type": "stream", 285 | "text": [ 286 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n", 287 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n" 288 | ] 289 | } 290 | ], 291 | "source": [ 292 | "from gensim.models.phrases import Phraser, Phrases" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 15, 298 | "metadata": { 299 | "collapsed": true 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "all_words = [inaugural.words(x) for x in inaugural.fileids()]" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 16, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "phrases = Phrases(all_words, min_count= 100, threshold= 10)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 17, 320 | "metadata": { 321 | "collapsed": true 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "bigram = Phraser(phrases)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 18, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "['Finest', 'people', 'of', 'the', 'United_States']" 339 | ] 340 | }, 341 | "execution_count": 18, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "bigram[[\"Finest\",\"people\",\"of\",\"the\",\"United\",\"States\"]]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [] 358 | } 359 | ], 360 | "metadata": { 361 | "kernelspec": { 362 | "display_name": "Python 3", 363 | "language": "python", 364 | "name": "python3" 365 | }, 366 | "language_info": { 367 | "codemirror_mode": { 368 | "name": "ipython", 369 | "version": 3 370 | }, 371 | "file_extension": ".py", 372 | "mimetype": "text/x-python", 373 | "name": "python", 374 | "nbconvert_exporter": "python", 375 | "pygments_lexer": "ipython3", 376 | "version": "3.6.0" 377 | } 378 | }, 379 | "nbformat": 4, 380 | "nbformat_minor": 2 381 | } 382 | 
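 The ngrams notebook above ends by using gensim's `Phrases`/`Phraser` to merge frequent word pairs such as "United States" into a single `United_States` token. As a minimal sketch, not part of the course files and assuming the gensim 3.x API pinned in the README, the same transform can be stacked to promote frequent trigrams as well; the corpus and the `min_count`/`threshold` values below are illustrative only. ```python from nltk.corpus import inaugural from gensim.models.phrases import Phrases, Phraser # First pass: learn frequent bigrams, as in the notebook above. sentences = [inaugural.words(f) for f in inaugural.fileids()] bigram = Phraser(Phrases(sentences, min_count=100, threshold=10)) # Second pass: run Phrases over the bigrammed corpus, so that a frequent # (bigram, word) pair can be merged into a trigram token. trigram = Phraser(Phrases(bigram[sentences], min_count=50, threshold=10)) tokens = ["the", "people", "of", "the", "United", "States", "of", "America"] # e.g. [..., 'United_States', 'of', 'America']; longer merges only appear # if the combination is frequent enough in the corpus. print(trigram[bigram[tokens]]) ``` Stacking passes this way is the usual pattern, since each `Phrases` model only ever joins two adjacent tokens at a time. 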
-------------------------------------------------------------------------------- /section3/ner_dataset.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Text-Mining-with-Machine-Learning-and-Python/31fbe17da4e984f9c3b5e6a590ec53df4d0b1c05/section3/ner_dataset.rar -------------------------------------------------------------------------------- /section3/section3_video3_pretrained_models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from pprint import pprint" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Using NLTK" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "from nltk import word_tokenize, pos_tag, ne_chunk" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### All good!" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 3, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "(S\n", 51 | " (PERSON Mark/NNP)\n", 52 | " is/VBZ\n", 53 | " working/VBG\n", 54 | " at/IN\n", 55 | " the/DT\n", 56 | " (LOCATION South/NNP Africa/NNP)\n", 57 | " offices/NNS\n", 58 | " at/IN\n", 59 | " (ORGANIZATION Google/NNP))\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "sentence = \"Mark is working at the South Africa offices at Google\"\n", 65 | "ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))\n", 66 | "print(ne_tree)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Not so good..." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "(S\n", 88 | " (GPE Donald/NNP)\n", 89 | " is/VBZ\n", 90 | " working/VBG\n", 91 | " at/IN\n", 92 | " the/DT\n", 93 | " (GPE Netherlands/NNP)\n", 94 | " offices/NNS\n", 95 | " of/IN\n", 96 | " (GPE Google/NNP))\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "sentence = \"Donald is working at the Netherlands offices of Google\"\n", 102 | "print(ne_chunk(pos_tag(word_tokenize(sentence))))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Include BILOU / IOB" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 5, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "from nltk.chunk import conlltags2tree, tree2conlltags" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 6, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "[('Mark', 'NNP', 'B-PERSON'),\n", 135 | " ('is', 'VBZ', 'O'),\n", 136 | " ('working', 'VBG', 'O'),\n", 137 | " ('at', 'IN', 'O'),\n", 138 | " ('the', 'DT', 'O'),\n", 139 | " ('South', 'NNP', 'B-LOCATION'),\n", 140 | " ('Africa', 'NNP', 'I-LOCATION'),\n", 141 | " ('offices', 'NNS', 'O'),\n", 142 | " ('at', 'IN', 'O'),\n", 143 | " ('Google', 'NNP', 'B-ORGANIZATION')]\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "iob_tagged = tree2conlltags(ne_tree)\n", 149 | "pprint (iob_tagged)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "# Using Spacy" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "import spacy\n", 168 | "nlp = spacy.load(\"en_core_web_md\")" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 8, 174 | "metadata": { 175 | "collapsed": false 176 | }, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "[('Mark', 'PERSON'), ('South Africa', 'GPE'), ('Google', 'ORG')]\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "doc = nlp(\"Mark is working at the South Africa offices at Google\")\n", 188 | "pprint([(x.text, x.label_) for x in doc.ents])" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 9, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "[('Donald', 'PERSON'), ('Netherlands', 'GPE'), ('Google', 'ORG')]\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "doc = nlp(\"Donald is working at the Netherlands offices of Google\")\n", 208 | "pprint([(x.text, x.label_) for x in doc.ents])" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### BILOU tags" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 10, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "[(Donald, 'B', 'PERSON'),\n", 230 | " (is, 'O', ''),\n", 231 | " (working, 'O', ''),\n", 232 | " (at, 'O', ''),\n", 233 | " (the, 'O', ''),\n", 234 | " 
(Netherlands, 'B', 'GPE'),\n", 235 | " (offices, 'O', ''),\n", 236 | " (of, 'O', ''),\n", 237 | " (Google, 'B', 'ORG')]\n" 238 | ] 239 | } 240 | ], 241 | "source": [ 242 | "pprint([(x, x.ent_iob_, x.ent_type_) for x in doc])" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "collapsed": true 250 | }, 251 | "outputs": [], 252 | "source": [] 253 | } 254 | ], 255 | "metadata": { 256 | "kernelspec": { 257 | "display_name": "Python 3", 258 | "language": "python", 259 | "name": "python3" 260 | }, 261 | "language_info": { 262 | "codemirror_mode": { 263 | "name": "ipython", 264 | "version": 3 265 | }, 266 | "file_extension": ".py", 267 | "mimetype": "text/x-python", 268 | "name": "python", 269 | "nbconvert_exporter": "python", 270 | "pygments_lexer": "ipython3", 271 | "version": "3.6.0" 272 | } 273 | }, 274 | "nbformat": 4, 275 | "nbformat_minor": 2 276 | } 277 | -------------------------------------------------------------------------------- /section3/section3_video4_training_ner.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import pandas as pd\n", 12 | "from pprint import pprint\n", 13 | "import random\n", 14 | "\n", 15 | "import spacy\n", 16 | "from spacy.gold import GoldParse\n", 17 | "\n", 18 | "import nltk\n", 19 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 20 | "from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ\n", 21 | "\n", 22 | "import pycrfsuite" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "# Let's detect natural disasters!" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## 0. Get the data" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/html": [ 56 | "
\n", 57 | "\n", 70 | "\n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | "
Sentence #WordPOSTag
0Sentence: 1ThousandsNNSO
1NaNofINO
2NaNdemonstratorsNNSO
3NaNhaveVBPO
4NaNmarchedVBNO
5NaNthroughINO
6NaNLondonNNPB-geo
7NaNtoTOO
8NaNprotestVBO
9NaNtheDTO
\n", 153 | "
" 154 | ], 155 | "text/plain": [ 156 | " Sentence # Word POS Tag\n", 157 | "0 Sentence: 1 Thousands NNS O\n", 158 | "1 NaN of IN O\n", 159 | "2 NaN demonstrators NNS O\n", 160 | "3 NaN have VBP O\n", 161 | "4 NaN marched VBN O\n", 162 | "5 NaN through IN O\n", 163 | "6 NaN London NNP B-geo\n", 164 | "7 NaN to TO O\n", 165 | "8 NaN protest VB O\n", 166 | "9 NaN the DT O" 167 | ] 168 | }, 169 | "execution_count": 2, 170 | "metadata": {}, 171 | "output_type": "execute_result" 172 | } 173 | ], 174 | "source": [ 175 | "df_dataset = pd.read_csv(\"ner_dataset.csv\", encoding=\"latin1\")\n", 176 | "df_dataset.head(10)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 3, 182 | "metadata": { 183 | "collapsed": false 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "last_sent_id= 0\n", 188 | "for i, row in df_dataset.iterrows(): \n", 189 | " if not pd.isnull(row[\"Sentence #\"]):\n", 190 | " last_sent_id = int(row[\"Sentence #\"][10:])\n", 191 | " row[\"Sentence #\"] = last_sent_id\n", 192 | " else:\n", 193 | " row[\"Sentence #\"] = last_sent_id" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "### Find those with 'nat' tag:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 4, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "Sentence # object\n", 214 | "Word object\n", 215 | "POS object\n", 216 | "Tag object\n", 217 | "dtype: object" 218 | ] 219 | }, 220 | "execution_count": 4, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "df_dataset.dtypes" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": { 233 | "collapsed": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "sent_id = df_dataset[df_dataset[\"Tag\"].str.contains(\"nat\")][\"Sentence #\"].unique()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 6, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "df_dataset_nat = df_dataset[df_dataset[\"Sentence #\"].isin(sent_id)]" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "### Remap tags" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 8, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "lst_tags = df_dataset_nat[\"Tag\"].unique().tolist()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 9, 272 | "metadata": { 273 | "collapsed": false 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "lst_tags.remove(\"I-nat\")\n", 278 | "lst_tags.remove(\"B-nat\")" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 10, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "dict_tags = {}\n", 290 | "for i in lst_tags:\n", 291 | " dict_tags[i] = \"O\"\n", 292 | " \n", 293 | "dict_tags[\"I-nat\"] = \"I-NAT\"\n", 294 | "dict_tags[\"B-nat\"] = \"B-NAT\"" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 11, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [ 304 | { 305 | "name": "stderr", 306 | "output_type": "stream", 307 | "text": [ 308 | "C:\\Users\\peter\\Anaconda3\\lib\\site-packages\\ipykernel\\__main__.py:1: SettingWithCopyWarning: \n", 309 | "A value is trying to 
be set on a copy of a slice from a DataFrame.\n", 310 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 311 | "\n", 312 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 313 | " if __name__ == '__main__':\n" 314 | ] 315 | } 316 | ], 317 | "source": [ 318 | "df_dataset_nat[\"Tag remapped\"] = df_dataset_nat[\"Tag\"].map(dict_tags)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 12, 324 | "metadata": { 325 | "collapsed": false 326 | }, 327 | "outputs": [ 328 | { 329 | "data": { 330 | "text/plain": [ 331 | "['O', 'B-NAT', 'I-NAT']" 332 | ] 333 | }, 334 | "execution_count": 12, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "df_dataset_nat[\"Tag remapped\"].unique().tolist()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "## 1. Using Spacy" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 13, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [ 358 | "LABEL = 'NAT'\n", 359 | "MAX_ITERATIONS = 50" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Training format" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 14, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "def join_space(values):\n", 378 | " return \" \".join(values).strip()" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 15, 384 | "metadata": { 385 | "collapsed": true 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "df_sentences_1 = df_dataset_nat.groupby(\"Sentence #\")[\"Word\"].apply(list).reset_index()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 16, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "df_sentences_2 = df_dataset_nat.groupby(\"Sentence #\")[\"Tag remapped\"].apply(list).reset_index()" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 17, 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "df_sentences = pd.merge(left=df_sentences_1, right = df_sentences_2, on = \"Sentence #\")" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 18, 417 | "metadata": { 418 | "collapsed": false 419 | }, 420 | "outputs": [ 421 | { 422 | "data": { 423 | "text/html": [ 424 | "
\n", 425 | "\n", 438 | "\n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | "
Sentence #WordTag remapped
0121[Officials, say, the, 27-year, old, man, from,...[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
1206[Humans, are, usually, infected, with, bird, f...[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
2227[One, of, the, 2008, Olympic, mascots, is, mod...[O, O, O, O, O, O, O, O, O, O, O, O, B-NAT, I-...
3229[Sam, Beattie, reports, from, Jing, Jing, 's, ...[O, O, O, O, B-NAT, I-NAT, O, O, O, O, O, O]
\n", 474 | "
" 475 | ], 476 | "text/plain": [ 477 | " Sentence # Word \\\n", 478 | "0 121 [Officials, say, the, 27-year, old, man, from,... \n", 479 | "1 206 [Humans, are, usually, infected, with, bird, f... \n", 480 | "2 227 [One, of, the, 2008, Olympic, mascots, is, mod... \n", 481 | "3 229 [Sam, Beattie, reports, from, Jing, Jing, 's, ... \n", 482 | "\n", 483 | " Tag remapped \n", 484 | "0 [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... \n", 485 | "1 [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... \n", 486 | "2 [O, O, O, O, O, O, O, O, O, O, O, O, B-NAT, I-... \n", 487 | "3 [O, O, O, O, B-NAT, I-NAT, O, O, O, O, O, O] " 488 | ] 489 | }, 490 | "execution_count": 18, 491 | "metadata": {}, 492 | "output_type": "execute_result" 493 | } 494 | ], 495 | "source": [ 496 | "df_sentences.head(4)" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 19, 502 | "metadata": { 503 | "collapsed": true 504 | }, 505 | "outputs": [], 506 | "source": [ 507 | "train_data = []\n", 508 | "\n", 509 | "for i, row in df_sentences.iterrows():\n", 510 | " raw_sent = \" \".join(row[\"Word\"]).replace(\" ,\", \",\")\n", 511 | " \n", 512 | " tags = list(zip(row[\"Word\"],row[\"Tag remapped\"]))\n", 513 | " advance = 0\n", 514 | "\n", 515 | " new_ents = []\n", 516 | "\n", 517 | " for i in range(len(tags)):\n", 518 | " tag = tags[i]\n", 519 | "\n", 520 | " word = tag[0]\n", 521 | " ent = tag[1]\n", 522 | "\n", 523 | " ent = ent.replace(\"B-\", \"\")\n", 524 | " ent = ent.replace(\"I-\", \"\")\n", 525 | " ent = ent.replace(\"L-\", \"\")\n", 526 | " ent = ent.replace(\"O-\", \"\")\n", 527 | " ent = ent.replace(\"U-\", \"\")\n", 528 | "\n", 529 | " ent_range = [advance, advance + len(word), ent]\n", 530 | "\n", 531 | " advance += len(word)\n", 532 | " if i < (len(tags) - 1):\n", 533 | " if tags[i + 1][0] != ',':\n", 534 | " advance += 1\n", 535 | "\n", 536 | " if not ent_range[2] == \"O\":\n", 537 | " new_ents.append(ent_range)\n", 538 | "\n", 539 | " new_ents_merged = []\n", 540 | "\n", 541 | " for j in range(len(new_ents)):\n", 542 | " if len(new_ents_merged) == 0:\n", 543 | " new_ents_merged.append(new_ents[j])\n", 544 | "\n", 545 | " if new_ents_merged[-1][2] == new_ents[j][2]:\n", 546 | " new_ents_merged[-1][1] = new_ents[j][1]\n", 547 | " else:\n", 548 | " new_ents_merged.append(new_ents[j])\n", 549 | "\n", 550 | " new_ents_merged_tuples = [tuple(item) for item in new_ents_merged]\n", 551 | " train_data.append((raw_sent, {\"entities\": new_ents_merged_tuples}))" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 20, 557 | "metadata": { 558 | "collapsed": false 559 | }, 560 | "outputs": [ 561 | { 562 | "name": "stdout", 563 | "output_type": "stream", 564 | "text": [ 565 | "[(\"Officials say the 27-year old man from Vietnam 's northern Ninh Binh \"\n", 566 | " 'province died late Thursday and tested positive for the H5N1 strain of bird '\n", 567 | " 'flu .',\n", 568 | " {'entities': [(125, 129, 'NAT')]}),\n", 569 | " ('Humans are usually infected with bird flu by direct contact with infected '\n", 570 | " 'poultry, but experts fear the H5N1 virus may mutate into a form easily '\n", 571 | " 'transmitted between people .',\n", 572 | " {'entities': [(104, 108, 'NAT')]})]\n" 573 | ] 574 | } 575 | ], 576 | "source": [ 577 | "pprint(train_data[:2])" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "### Split" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 21, 590 | "metadata": { 591 | "collapsed": true 
592 | }, 593 | "outputs": [], 594 | "source": [ 595 | "test_data = train_data[155:]\n", 596 | "train_data = train_data[:155]" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "### Train" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 22, 609 | "metadata": { 610 | "collapsed": false 611 | }, 612 | "outputs": [], 613 | "source": [ 614 | "nlp = spacy.load(\"en_core_web_sm\")" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 23, 620 | "metadata": { 621 | "collapsed": false 622 | }, 623 | "outputs": [], 624 | "source": [ 625 | "if 'ner' not in nlp.pipe_names:\n", 626 | " ner = nlp.create_pipe('ner')\n", 627 | " nlp.add_pipe(ner)\n", 628 | "\n", 629 | "else:\n", 630 | " ner = nlp.get_pipe('ner')" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 24, 636 | "metadata": { 637 | "collapsed": true 638 | }, 639 | "outputs": [], 640 | "source": [ 641 | "ner.add_label(LABEL)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 25, 647 | "metadata": { 648 | "collapsed": true 649 | }, 650 | "outputs": [], 651 | "source": [ 652 | "other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 26, 658 | "metadata": { 659 | "collapsed": false 660 | }, 661 | "outputs": [ 662 | { 663 | "name": "stdout", 664 | "output_type": "stream", 665 | "text": [ 666 | "{'ner': 1151.5631708588069}\n", 667 | "{'ner': 970.3055948485708}\n", 668 | "Wall time: 1min 32s\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "%%time\n", 674 | "with nlp.disable_pipes(*other_pipes): # only train NER\n", 675 | " optimizer = nlp.begin_training()\n", 676 | " for itn in range(2):\n", 677 | " random.shuffle(train_data)\n", 678 | " losses = {}\n", 679 | " for text, annotations in train_data:\n", 680 | " nlp.update([text], [annotations], sgd=optimizer, drop=0.35,losses=losses)\n", 681 | " print(losses)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 27, 687 | "metadata": { 688 | "collapsed": true 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "nlp.meta['name'] = \"en_core_web_sm_newlabel\"" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": 28, 698 | "metadata": { 699 | "collapsed": true 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "nlp.to_disk(\"models\")" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "### Test it out" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": 29, 716 | "metadata": { 717 | "collapsed": false 718 | }, 719 | "outputs": [], 720 | "source": [ 721 | "nlp2 = spacy.load(\"models\")" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": 30, 727 | "metadata": { 728 | "collapsed": false 729 | }, 730 | "outputs": [], 731 | "source": [ 732 | "y_true = []\n", 733 | "\n", 734 | "for i, test in enumerate(test_data):\n", 735 | " y_true.append([test[0][j[0]:j[1]] for j in test[1][\"entities\"]])" 736 | ] 737 | }, 738 | { 739 | "cell_type": "code", 740 | "execution_count": 31, 741 | "metadata": { 742 | "collapsed": true 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "y_predict = []\n", 747 | "for test in test_data:\n", 748 | " doc = nlp2(test[0])\n", 749 | " y_predict.append([ent.text for ent in doc.ents if ent.label_ == \"NAT\"])" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 32, 
755 | "metadata": { 756 | "collapsed": false 757 | }, 758 | "outputs": [ 759 | { 760 | "data": { 761 | "text/plain": [ 762 | "['Rita']" 763 | ] 764 | }, 765 | "execution_count": 32, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "y_predict[0]" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 33, 777 | "metadata": { 778 | "collapsed": true 779 | }, 780 | "outputs": [], 781 | "source": [ 782 | "def evaluate(y_predict, y_true):\n", 783 | " correct = 0\n", 784 | " for j, val in enumerate(y_predict):\n", 785 | " if val == y_true[j]:\n", 786 | " correct += 1\n", 787 | " \n", 788 | " return correct / len(y_predict)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 34, 794 | "metadata": { 795 | "collapsed": false 796 | }, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/plain": [ 801 | "0.64" 802 | ] 803 | }, 804 | "execution_count": 34, 805 | "metadata": {}, 806 | "output_type": "execute_result" 807 | } 808 | ], 809 | "source": [ 810 | "evaluate(y_predict=y_predict, y_true=y_true)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "## 3. Using PyCRF" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "### Training format" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 35, 830 | "metadata": { 831 | "collapsed": true 832 | }, 833 | "outputs": [], 834 | "source": [ 835 | "lemmatizer = WordNetLemmatizer()" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 36, 841 | "metadata": { 842 | "collapsed": false 843 | }, 844 | "outputs": [], 845 | "source": [ 846 | "train_data = []\n", 847 | "for index, row in df_sentences.iterrows():\n", 848 | " \n", 849 | " train_data_sentence = []\n", 850 | " \n", 851 | " raw_sent = row[\"Word\"]\n", 852 | " tokens = nltk.pos_tag(raw_sent)\n", 853 | "\n", 854 | " for i, val in enumerate(tokens):\n", 855 | " train_data_word = []\n", 856 | " \n", 857 | " word = raw_sent[i]\n", 858 | " label = row[\"Tag remapped\"][i]\n", 859 | " pos_tag = tokens[i][1]\n", 860 | "\n", 861 | " if pos_tag.startswith(\"N\"):\n", 862 | " lemma = lemmatizer.lemmatize(word.lower(), pos=NOUN)\n", 863 | " elif pos_tag.startswith(\"V\"):\n", 864 | " lemma = lemmatizer.lemmatize(word.lower(), pos=VERB)\n", 865 | " elif pos_tag.startswith(\"J\"):\n", 866 | " lemma = lemmatizer.lemmatize(word.lower(), pos=ADJ)\n", 867 | " else:\n", 868 | " lemma = word\n", 869 | " \n", 870 | " train_data_word.append(word)\n", 871 | " train_data_word.append(pos_tag)\n", 872 | " train_data_word.append(lemma)\n", 873 | " train_data_word.append(label)\n", 874 | " \n", 875 | " train_data_sentence.append(train_data_word)\n", 876 | " \n", 877 | " train_data.append(train_data_sentence)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "### Feature engineering" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 37, 890 | "metadata": { 891 | "collapsed": true 892 | }, 893 | "outputs": [], 894 | "source": [ 895 | "def word2features(sent, i, embed={}, use_gazetteers=False):\n", 896 | " word = sent[i][0]\n", 897 | " postag = sent[i][-3]\n", 898 | " lemma = sent[i][-2].lower()\n", 899 | " features = [\n", 900 | " 'bias',\n", 901 | " 'word.lower=' + word.lower(),\n", 902 | " 'word[-3:]=' + word[-3:],\n", 903 | " 'word[-2:]=' + word[-2:],\n", 904 | " 'word.isupper=%s' % 
word.isupper(),\n", 905 | " 'word.istitle=%s' % word.istitle(),\n", 906 | " 'word.isdigit=%s' % word.isdigit(),\n", 907 | " 'postag=' + postag,\n", 908 | " 'postag[:2]=' + postag[:2]\n", 909 | " ]\n", 910 | " if embed != {}:\n", 911 | " features.extend(['word.embed=%s' % embed.get(word, len(embed))])\n", 912 | " if use_gazetteers:\n", 913 | " features.extend(['word.measures=%s' % str(word.lower() in UNIT_GAZETTEER or lemma in UNIT_GAZETTEER),\n", 914 | " 'word.products=%s' % str(word.lower() in PRODUCTS_GAZETTEER or lemma in PRODUCTS_GAZETTEER)])\n", 915 | "\n", 916 | " if i > 0:\n", 917 | " word1 = sent[i - 1][0]\n", 918 | " postag1 = sent[i - 1][-3]\n", 919 | " lemma1 = sent[i - 1][-2].lower()\n", 920 | " features.extend([\n", 921 | " '-1:word.lower=' + word1.lower(),\n", 922 | " '-1:word.istitle=%s' % word1.istitle(),\n", 923 | " '-1:word.isupper=%s' % word1.isupper(),\n", 924 | " '-1:postag=' + postag1,\n", 925 | " '-1:postag[:2]=' + postag1[:2]\n", 926 | " ])\n", 927 | " if embed != {}:\n", 928 | " features.extend(['-1:word.embed=%s' % embed.get(word1, len(embed))])\n", 929 | " if use_gazetteers:\n", 930 | " features.extend(['-1:word.measures=%s' % str(word1.lower() in UNIT_GAZETTEER or lemma1 in UNIT_GAZETTEER),\n", 931 | " '-1:word.products=%s' % str(word1.lower() in PRODUCTS_GAZETTEER or lemma1 in PRODUCTS_GAZETTEER)])\n", 932 | "\n", 933 | " else:\n", 934 | " features.append('BOS')\n", 935 | "\n", 936 | " if i < len(sent) - 1:\n", 937 | " word1 = sent[i + 1][0]\n", 938 | " postag1 = sent[i + 1][-3]\n", 939 | " lemma1 = sent[i + 1][-2].lower()\n", 940 | " features.extend([\n", 941 | " '+1:word.lower=' + word1.lower(),\n", 942 | " '+1:word.istitle=%s' % word1.istitle(),\n", 943 | " '+1:word.isupper=%s' % word1.isupper(),\n", 944 | " '+1:postag=' + postag1,\n", 945 | " '+1:postag[:2]=' + postag1[:2]\n", 946 | " ])\n", 947 | " if use_gazetteers:\n", 948 | " features.extend(['+1:word.measures=%s' % str(word1.lower() in UNIT_GAZETTEER or lemma1 in UNIT_GAZETTEER),\n", 949 | " '+1:word.products=%s' % str(word1.lower() in PRODUCTS_GAZETTEER or lemma1 in PRODUCTS_GAZETTEER)])\n", 950 | "\n", 951 | " else:\n", 952 | " features.append('EOS')\n", 953 | "\n", 954 | " return features" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 38, 960 | "metadata": { 961 | "collapsed": true 962 | }, 963 | "outputs": [], 964 | "source": [ 965 | "def sent2features(sent, embed={}, use_gazetteers=False):\n", 966 | "\n", 967 | " return [word2features(sent, i, embed=embed, use_gazetteers=use_gazetteers) for i in range(len(sent))]" 968 | ] 969 | }, 970 | { 971 | "cell_type": "code", 972 | "execution_count": 39, 973 | "metadata": { 974 | "collapsed": true 975 | }, 976 | "outputs": [], 977 | "source": [ 978 | "train_data_formatted = [sent2features(x) for x in train_data]" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "### Labels" 986 | ] 987 | }, 988 | { 989 | "cell_type": "code", 990 | "execution_count": 40, 991 | "metadata": { 992 | "collapsed": true 993 | }, 994 | "outputs": [], 995 | "source": [ 996 | "y_data = df_sentences[\"Tag remapped\"].tolist()" 997 | ] 998 | }, 999 | { 1000 | "cell_type": "markdown", 1001 | "metadata": {}, 1002 | "source": [ 1003 | "### Split" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": 41, 1009 | "metadata": { 1010 | "collapsed": false 1011 | }, 1012 | "outputs": [], 1013 | "source": [ 1014 | "x_test = train_data_formatted[155:]\n", 1015 | "y_test = y_data[155:]\n", 
1016 | "\n", 1017 | "x_train = train_data_formatted[:155]\n", 1018 | "y_train = y_data[:155]" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "### Model training" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": 42, 1031 | "metadata": { 1032 | "collapsed": false 1033 | }, 1034 | "outputs": [], 1035 | "source": [ 1036 | "def train(X_train, y_train, model_name):\n", 1037 | " \"\"\" Trains a CRF on the given training data and saves the model. \"\"\"\n", 1038 | " print(\"Training\", model_name)\n", 1039 | " trainer = pycrfsuite.Trainer(verbose=False)\n", 1040 | "\n", 1041 | " for xseq, yseq in zip(X_train, y_train):\n", 1042 | " trainer.append(xseq, yseq)\n", 1043 | "\n", 1044 | " trainer.set_params({\n", 1045 | " 'c1': 0.1, # coefficient for L1 penalty\n", 1046 | " 'c2': 1e-3, # coefficient for L2 penalty\n", 1047 | " 'feature.possible_transitions': True\n", 1048 | " })\n", 1049 | "\n", 1050 | " trainer.train(model_name)" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 43, 1056 | "metadata": { 1057 | "collapsed": false 1058 | }, 1059 | "outputs": [ 1060 | { 1061 | "name": "stdout", 1062 | "output_type": "stream", 1063 | "text": [ 1064 | "Training pycrfmodel.model\n", 1065 | "Wall time: 6.36 s\n" 1066 | ] 1067 | } 1068 | ], 1069 | "source": [ 1070 | "%%time\n", 1071 | "train(x_train, y_train, 'pycrfmodel.model')" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": 44, 1077 | "metadata": { 1078 | "collapsed": false 1079 | }, 1080 | "outputs": [], 1081 | "source": [ 1082 | "def tag(X_test,model_name):\n", 1083 | " \"\"\" Labels test data with the model saved in model_name. \"\"\"\n", 1084 | " tagger = pycrfsuite.Tagger()\n", 1085 | " tagger.open(model_name)\n", 1086 | "\n", 1087 | " return [tagger.tag(seq) for seq in X_test]" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "code", 1092 | "execution_count": 45, 1093 | "metadata": { 1094 | "collapsed": false 1095 | }, 1096 | "outputs": [ 1097 | { 1098 | "data": { 1099 | "text/plain": [ 1100 | "['O',\n", 1101 | " 'O',\n", 1102 | " 'O',\n", 1103 | " 'B-NAT',\n", 1104 | " 'I-NAT',\n", 1105 | " 'I-NAT',\n", 1106 | " 'O',\n", 1107 | " 'O',\n", 1108 | " 'O',\n", 1109 | " 'O',\n", 1110 | " 'O',\n", 1111 | " 'O',\n", 1112 | " 'O',\n", 1113 | " 'O',\n", 1114 | " 'O',\n", 1115 | " 'O',\n", 1116 | " 'O',\n", 1117 | " 'O',\n", 1118 | " 'O',\n", 1119 | " 'O',\n", 1120 | " 'O',\n", 1121 | " 'O',\n", 1122 | " 'O',\n", 1123 | " 'O',\n", 1124 | " 'O']" 1125 | ] 1126 | }, 1127 | "execution_count": 45, 1128 | "metadata": {}, 1129 | "output_type": "execute_result" 1130 | } 1131 | ], 1132 | "source": [ 1133 | "tag(x_test, 'pycrfmodel.model')[0]" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 46, 1139 | "metadata": { 1140 | "collapsed": false 1141 | }, 1142 | "outputs": [], 1143 | "source": [ 1144 | "def evaluate(y_predict, y_true, ignore_bio = True):\n", 1145 | " correct = 0\n", 1146 | " total = 0\n", 1147 | " for i, y_pred in enumerate(y_predict):\n", 1148 | " for j, y in enumerate(y_pred):\n", 1149 | " if ignore_bio:\n", 1150 | " if y[2:] == y_true[i][j][2:]:\n", 1151 | " correct += 1\n", 1152 | " \n", 1153 | " else:\n", 1154 | " if y == y_true[i][j]:\n", 1155 | " correct += 1\n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " total += len(y_pred)\n", 1160 | " \n", 1161 | " return correct / total\n", 1162 | " " 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | 
"execution_count": 47, 1168 | "metadata": { 1169 | "collapsed": false 1170 | }, 1171 | "outputs": [ 1172 | { 1173 | "data": { 1174 | "text/plain": [ 1175 | "0.9840213049267643" 1176 | ] 1177 | }, 1178 | "execution_count": 47, 1179 | "metadata": {}, 1180 | "output_type": "execute_result" 1181 | } 1182 | ], 1183 | "source": [ 1184 | "evaluate(tag(x_test, 'pycrfmodel.model'), y_test, ignore_bio=True)" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": null, 1190 | "metadata": { 1191 | "collapsed": true 1192 | }, 1193 | "outputs": [], 1194 | "source": [] 1195 | } 1196 | ], 1197 | "metadata": { 1198 | "kernelspec": { 1199 | "display_name": "Python 3", 1200 | "language": "python", 1201 | "name": "python3" 1202 | }, 1203 | "language_info": { 1204 | "codemirror_mode": { 1205 | "name": "ipython", 1206 | "version": 3 1207 | }, 1208 | "file_extension": ".py", 1209 | "mimetype": "text/x-python", 1210 | "name": "python", 1211 | "nbconvert_exporter": "python", 1212 | "pygments_lexer": "ipython3", 1213 | "version": "3.6.0" 1214 | } 1215 | }, 1216 | "nbformat": 4, 1217 | "nbformat_minor": 2 1218 | } 1219 | -------------------------------------------------------------------------------- /section4/section4_video3_basic_classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 0. Imports" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 109, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from sklearn.datasets import fetch_20newsgroups\n", 19 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 20 | "from sklearn.svm import LinearSVC\n", 21 | "from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score\n", 22 | "from sklearn.model_selection import GridSearchCV" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 32, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import spacy\n", 34 | "from nltk.corpus import stopwords\n", 35 | "import string" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 106, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "from collections import Counter\n", 47 | "import matplotlib.pyplot as plt\n", 48 | "import seaborn as sns\n", 49 | "import numpy as np" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## 1. Get data" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "newsgroups_train = fetch_20newsgroups(subset='train')\n", 68 | "newsgroups_test = fetch_20newsgroups(subset='test')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## 2. 
Data processing" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 4, 81 | "metadata": { 82 | "collapsed": true 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "x_train = newsgroups_train.data\n", 87 | "y_train = newsgroups_train.target" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 60, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "x_test = newsgroups_test.data\n", 99 | "y_test = newsgroups_test.target" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 23, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "'From: shd2001@andy.bgsu.edu (Sherlette Dixon)\\nSubject: Christianity & Atheism: an update\\nOrganization: BGSU\\nLines: 32\\n\\nFirst, I would like to thank all who sent me their opinions on the matter\\nat hand. All advice was taken to heart, if not directly used. My friend\\nfound out about the matter quite accidently. After reading some of my\\nmail, I quit from the mail reader & went about my business. I must have\\ntrashed my mail improperly, because he got on the same terminal the next\\nday & saw my old messages. He thought they were responses to a post he\\nplaced in alt.atheism earlier that week, so he read some of them before\\nrealizing that they were for me. I got a message from him the next day; he\\napologized for reading my mail & said that he did not want to appear to be\\na snoop. He said that he would be willing to talk to me about his views &\\ndidn\\'t mind doing so, especially with a friend. So we did. I neither\\nchanged his mind nor did he change mine, as that was not the point. Now he\\nknows where I\\'m coming from & now I know where he\\'s coming from. And all\\nthat I can do is pray for him, as I\\'ve always done.\\n\\nI believe the reason that he & I \"click\" instead of \"bash\" heads is because\\nI see Christianity as a tool for revolution, & not a tool for maintaining\\nthe status quo. To be quite blunt, I have more of a reason to reject God\\nthan he does just by the fact that I am an African-American female. \\nChristianity & religion have been used as tools to separate my people from\\nthe true knowledge of our history & the wealth of our contributions to the\\nworld society. The \"kitchen of heaven\" was all we had to look forward to\\nduring the slave days, & this mentality & second-class status still exists\\ntoday. I, too, have rejected\\nan aspect of Christianity----that of the estabished church. Too much\\nhypocricy exists behind the walls of \"God\\'s house\" beginning with the\\nimages of a white Jesus to that of the members: praise God on Sunday &\\nraise hell beginning Monday. God-willing, I will find a church home where\\nI can feel comfortable & at-home, but I don\\'t see it happening anytime\\nsoon.\\n\\nSherlette \\n'" 113 | ] 114 | }, 115 | "execution_count": 23, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "x_train[120]" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "So most of the data processing:\n", 129 | "- stopwords\n", 130 | "- punctuation\n", 131 | "- punctuation chains\n", 132 | "- single character words\n", 133 | "- stuff like '\\n\\t\\t\\t\\t\\t\\t' and '\\n'" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "## 3. 
Document processing" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 8, 146 | "metadata": { 147 | "collapsed": true 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "nlp = spacy.load(\"en_core_web_md\")" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 25, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [ 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "Wall time: 17min 34s\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "%%time\n", 171 | "x_train_nlp = [[x.lemma_ for x in nlp(y)] for y in x_train]" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 61, 177 | "metadata": { 178 | "collapsed": false 179 | }, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "Wall time: 11min 31s\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "%%time\n", 191 | "x_test_nlp = [[x.lemma_ for x in nlp(y)] for y in x_test]" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## 2 bis. Data processing: take 2" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "### 2.1. Remove stopwords" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 28, 211 | "metadata": { 212 | "collapsed": true 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "stop_en = stopwords.words(\"english\")" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 29, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "x_cleaned_1 = []\n", 228 | "for x in x_train_nlp:\n", 229 | " x_cleaned_1.append([y for y in x if not y in stop_en])" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 62, 235 | "metadata": { 236 | "collapsed": true 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "x_cleaned_1_test = []\n", 241 | "for x in x_test_nlp:\n", 242 | " x_cleaned_1_test.append([y for y in x if not y in stop_en])" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### 2.2. 
Remove punct" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 33, 255 | "metadata": { 256 | "collapsed": true 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "x_cleaned_2 = []\n", 261 | "for x in x_cleaned_1:\n", 262 | " x_cleaned_2.append([y for y in x if not y in list(string.punctuation)])" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 63, 268 | "metadata": { 269 | "collapsed": true 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "x_cleaned_2_test = []\n", 274 | "for x in x_cleaned_1_test:\n", 275 | " x_cleaned_2_test.append([y for y in x if not y in list(string.punctuation)])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "### 2.3 Remove other useless stuff" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 35, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "useless = [\"-PRON-\"]" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 36, 299 | "metadata": { 300 | "collapsed": false 301 | }, 302 | "outputs": [], 303 | "source": [ 304 | "x_cleaned_3 = []\n", 305 | "for x in x_cleaned_2:\n", 306 | " x_cleaned_3.append([y for y in x if not y in useless])" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 64, 312 | "metadata": { 313 | "collapsed": true 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "x_cleaned_3_test = []\n", 318 | "for x in x_cleaned_2_test:\n", 319 | " x_cleaned_3_test.append([y for y in x if not y in useless])" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "### 2.4 Remove \\n and '--'" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 37, 332 | "metadata": { 333 | "collapsed": true 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "x_cleaned_4 = []\n", 338 | "for x in x_cleaned_3:\n", 339 | " x_cleaned_4.append([y for y in x if not (\"--\" in y or '\\n' in y) ])" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 65, 345 | "metadata": { 346 | "collapsed": true 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "x_cleaned_4_test = []\n", 351 | "for x in x_cleaned_3_test:\n", 352 | " x_cleaned_4_test.append([y for y in x if not (\"--\" in y or '\\n' in y) ])" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "### 2.5 Join together" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 39, 365 | "metadata": { 366 | "collapsed": false 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "x_cleaned = [\" \".join(y) for y in x_cleaned_4]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 66, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "x_cleaned_test = [\" \".join(y) for y in x_cleaned_4_test]" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 124, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. 
It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\"" 395 | ] 396 | }, 397 | "execution_count": 124, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 402 | "source": [ 403 | "x_train[0]" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 123, 409 | "metadata": { 410 | "collapsed": false 411 | }, 412 | "outputs": [ 413 | { 414 | "data": { 415 | "text/plain": [ 416 | "'lerxst@wam.umd.edu thing subject car nntp posting host rac3.wam.umd.edu organization university maryland college park line 15 wonder anyone enlighten car see day 2-door sport car look late 60s/ early 70 call bricklin door really small addition front bumper separate rest body know anyone tellme model name engine spec year production car make history whatev info funky look car please e mail thank il bring neighborhood lerxst'" 417 | ] 418 | }, 419 | "execution_count": 123, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "x_cleaned[0]" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "## 4. Splitting" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "Already taken care of :-)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 43, 445 | "metadata": { 446 | "collapsed": false 447 | }, 448 | "outputs": [], 449 | "source": [ 450 | "cnt = Counter(y_train)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 48, 456 | "metadata": { 457 | "collapsed": false 458 | }, 459 | "outputs": [ 460 | { 461 | "data": { 462 | "text/plain": [ 463 | "Counter({0: 480,\n", 464 | " 1: 584,\n", 465 | " 2: 591,\n", 466 | " 3: 590,\n", 467 | " 4: 578,\n", 468 | " 5: 593,\n", 469 | " 6: 585,\n", 470 | " 7: 594,\n", 471 | " 8: 598,\n", 472 | " 9: 597,\n", 473 | " 10: 600,\n", 474 | " 11: 595,\n", 475 | " 12: 591,\n", 476 | " 13: 594,\n", 477 | " 14: 593,\n", 478 | " 15: 599,\n", 479 | " 16: 546,\n", 480 | " 17: 564,\n", 481 | " 18: 465,\n", 482 | " 19: 377})" 483 | ] 484 | }, 485 | "execution_count": 48, 486 | "metadata": {}, 487 | "output_type": "execute_result" 488 | } 489 | ], 490 | "source": [ 491 | "cnt" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "## 5. 
Feature representation" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 55, 504 | "metadata": { 505 | "collapsed": false 506 | }, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/plain": [ 511 | "(11314, 130107)" 512 | ] 513 | }, 514 | "execution_count": 55, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "vec = TfidfVectorizer()\n", 521 | "x_train_vec = vec.fit_transform(x_train)\n", 522 | "x_train_vec.shape" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 75, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "(11314, 119777)" 536 | ] 537 | }, 538 | "execution_count": 75, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "vec = TfidfVectorizer()\n", 545 | "x_train_vec = vec.fit_transform(x_cleaned)\n", 546 | "x_train_vec.shape" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "## 6. Metric and algo" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 99, 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "clf = LinearSVC(C=1, multi_class='ovr', dual=True)" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 100, 570 | "metadata": { 571 | "collapsed": false 572 | }, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "Wall time: 2.24 s\n" 579 | ] 580 | }, 581 | { 582 | "data": { 583 | "text/plain": [ 584 | "LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,\n", 585 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n", 586 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", 587 | " verbose=0)" 588 | ] 589 | }, 590 | "execution_count": 100, 591 | "metadata": {}, 592 | "output_type": "execute_result" 593 | } 594 | ], 595 | "source": [ 596 | "%%time\n", 597 | "clf.fit(x_train_vec, y_train)" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "## 7. 
Validation" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 101, 610 | "metadata": { 611 | "collapsed": true 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "x_test_vec = vec.transform(x_cleaned_test)" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 102, 621 | "metadata": { 622 | "collapsed": true 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "y_predict = clf.predict(x_test_vec)" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 103, 632 | "metadata": { 633 | "collapsed": false 634 | }, 635 | "outputs": [ 636 | { 637 | "name": "stdout", 638 | "output_type": "stream", 639 | "text": [ 640 | "accuracy: 0.8538236856080722\n", 641 | "precision: 0.8538236856080722\n", 642 | "recall: 0.8538236856080722\n", 643 | "f1: 0.8538236856080722\n" 644 | ] 645 | } 646 | ], 647 | "source": [ 648 | "print(\"accuracy: \", accuracy_score(y_pred=y_predict, y_true=y_test))\n", 649 | "print(\"precision: \", precision_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))\n", 650 | "print(\"recall: \", recall_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))\n", 651 | "print(\"f1: \", f1_score(y_pred=y_predict, y_true=y_test, average= \"micro\"))" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 107, 657 | "metadata": { 658 | "collapsed": true 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "def show_top10(classifier, vectorizer, categories):\n", 663 | " feature_names = np.asarray(vectorizer.get_feature_names())\n", 664 | " for i, category in enumerate(categories):\n", 665 | " top10 = np.argsort(classifier.coef_[i])[-10:]\n", 666 | " print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": 108, 672 | "metadata": { 673 | "collapsed": false 674 | }, 675 | "outputs": [ 676 | { 677 | "name": "stdout", 678 | "output_type": "stream", 679 | "text": [ 680 | "alt.atheism: mangoe rushdie jaeger atheists cobb wingate islamic atheist keith atheism\n", 681 | "comp.graphics: animation cview tiff polygon pov 3do graphics 3d image graphic\n", 682 | "comp.os.ms-windows.misc: nt winqvt download ini file ax win3 driver cica windows\n", 683 | "comp.sys.ibm.pc.hardware: jumper scsi monitor fastmicro irq vlb 486 pc ide gateway\n", 684 | "comp.sys.mac.hardware: se lciii iisi lc centris duo quadra apple powerbook mac\n", 685 | "comp.windows.x: expo xpert xlib window lcs server xterm x11r5 widget motif\n", 686 | "misc.forsale: camera include distribution wanted condition sell ship forsale offer sale\n", 687 | "rec.autos: chevrolet engine truck auto convertible warning dealer toyota automotive car\n", 688 | "rec.motorcycles: harley kawasaki dog helmet rider bmw ride motorcycle bike dod\n", 689 | "rec.sport.baseball: braves giants tigers stadium cub yankee pitch sox phillies baseball\n", 690 | "rec.sport.hockey: cup bruins goal coach espn play team playoff nhl hockey\n", 691 | "sci.crypt: encrypt nsa crypto security wiretap pgp tap encryption key clipper\n", 692 | "sci.electronics: ee explode power scope voltage 256k electronic electronics 8051 circuit\n", 693 | "sci.med: pitt krillean patient medical cancer treatment photography disease msg doctor\n", 694 | "sci.space: sci dietz prb rocket shuttle planet launch moon orbit space\n", 695 | "soc.religion.christian: marry fisher geneva hell prayer christian christ church rutgers athos\n", 696 | "talk.politics.guns: batf fbi feustel cathy atf waco handgun weapon 
firearm gun\n", 697 | "talk.politics.mideast: argic holocaust adl armenia serdar arab turkish armenian israel israeli\n", 698 | "talk.politics.misc: teel drug president drieux gay optilink clinton tax cramer kaldis\n", 699 | "talk.religion.misc: thyagi psyrobtw 666 christian koresh biblical hudson weiss morality beast\n" 700 | ] 701 | } 702 | ], 703 | "source": [ 704 | "show_top10(clf, vec, newsgroups_train.target_names)" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "## 8. Parameter tuning" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 114, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [], 721 | "source": [ 722 | "parameters = {'C':[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2], \"dual\":[True,False]}" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 115, 728 | "metadata": { 729 | "collapsed": true 730 | }, 731 | "outputs": [], 732 | "source": [ 733 | "clf = LinearSVC()" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 116, 739 | "metadata": { 740 | "collapsed": true 741 | }, 742 | "outputs": [], 743 | "source": [ 744 | "grid = GridSearchCV(clf, parameters)" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 118, 750 | "metadata": { 751 | "collapsed": false 752 | }, 753 | "outputs": [ 754 | { 755 | "data": { 756 | "text/plain": [ 757 | "GridSearchCV(cv=None, error_score='raise',\n", 758 | " estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n", 759 | " intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n", 760 | " multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n", 761 | " verbose=0),\n", 762 | " fit_params=None, iid=True, n_jobs=1,\n", 763 | " param_grid={'C': [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2], 'dual': [True, False]},\n", 764 | " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", 765 | " scoring=None, verbose=0)" 766 | ] 767 | }, 768 | "execution_count": 118, 769 | "metadata": {}, 770 | "output_type": "execute_result" 771 | } 772 | ], 773 | "source": [ 774 | "grid.fit(x_train_vec, y_train)" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 119, 780 | "metadata": { 781 | "collapsed": false 782 | }, 783 | "outputs": [ 784 | { 785 | "data": { 786 | "text/plain": [ 787 | "{'C': 1.5, 'dual': True}" 788 | ] 789 | }, 790 | "execution_count": 119, 791 | "metadata": {}, 792 | "output_type": "execute_result" 793 | } 794 | ], 795 | "source": [ 796 | "grid.best_params_" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 122, 802 | "metadata": { 803 | "collapsed": false 804 | }, 805 | "outputs": [ 806 | { 807 | "data": { 808 | "text/plain": [ 809 | "0.9164751635142302" 810 | ] 811 | }, 812 | "execution_count": 122, 813 | "metadata": {}, 814 | "output_type": "execute_result" 815 | } 816 | ], 817 | "source": [ 818 | "grid.best_score_" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "## 9. 
Into production" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "See video 4.5 ;-)" 833 | ] 834 | } 835 | ], 836 | "metadata": { 837 | "kernelspec": { 838 | "display_name": "Python 3", 839 | "language": "python", 840 | "name": "python3" 841 | }, 842 | "language_info": { 843 | "codemirror_mode": { 844 | "name": "ipython", 845 | "version": 3 846 | }, 847 | "file_extension": ".py", 848 | "mimetype": "text/x-python", 849 | "name": "python", 850 | "nbconvert_exporter": "python", 851 | "pygments_lexer": "ipython3", 852 | "version": "3.6.0" 853 | } 854 | }, 855 | "nbformat": 4, 856 | "nbformat_minor": 2 857 | } 858 | --------------------------------------------------------------------------------
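A minimal follow-up sketch of scoring the tuned classifier on the held-out test split: the notebook above tunes C and dual with GridSearchCV and then defers deployment to video 4.5, so this step is only illustrative. It assumes the grid, vec, x_cleaned_test, and y_test objects from section4_video3_basic_classifier.ipynb are still in scope; it is not part of the original course materials.

from sklearn.metrics import accuracy_score, f1_score

# GridSearchCV refits the best estimator on the full training data by default
# (refit=True), so grid.best_estimator_ can be used directly for prediction.
best_clf = grid.best_estimator_

# Reuse the TF-IDF vectorizer fitted on the cleaned training texts.
x_test_vec = vec.transform(x_cleaned_test)
y_pred = best_clf.predict(x_test_vec)

print("tuned parameters:", grid.best_params_)
print("test accuracy:", accuracy_score(y_test, y_pred))
print("test micro-F1:", f1_score(y_test, y_pred, average="micro"))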