├── README.md ├── data └── tweets.csv ├── intro_to_python.ipynb ├── nlp_workshop1.ipynb └── nlp_workshop2.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # NLP Workshop 2 | 3 | Welcome to this repository. It contains the instructions and resources you need to get up to speed. 4 | 5 | 1. You will need Python and a few data science packages installed on your computer. I use the [Anaconda distribution of Python 3.6](https://www.continuum.io/downloads). 6 | 7 | 2. If you are comfortable with Git, clone this repository to your local computer; otherwise, you can simply download a zip archive by clicking on the green button at the top right of the page. 8 | 9 | 3. Make sure you are familiar with Python syntax. There is a [review Jupyter notebook](./intro_to_python.ipynb) in this repository. Make sure you can go into your terminal and run the command `jupyter notebook`. A notebook server should start, and you will be able to view, create, and save notebooks. 10 | 11 | 4. We will be using TextBlob. Make sure to run these commands in your terminal / shell. 12 | 13 | $ pip install -U textblob 14 | $ python -m textblob.download_corpora 15 | 16 | 5. We will be using the [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset provided by Kaggle. The associated data file is located in the [data folder](./data/tweets.csv). 17 | 18 | 6. You should be able to import the following packages without a problem. If you get an import error, install the missing package with `pip install` or `conda install`. 19 | 20 | >>> import pandas 21 | >>> import numpy 22 | >>> import textblob 23 | 24 | ---------- 25 | 26 | Organized by [K2 Data Science](http://www.k2datascience.com). 27 | -------------------------------------------------------------------------------- /intro_to_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction To Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This is a collection of various statements, features, etc. of Jupyter and the Python language. " 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 68, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "a = 10" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 69, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "10\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "print(a)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 70, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "import math" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 71, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "1.0\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "x = math.cos(2 * math.pi)\n", 75 | "print(x)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Import the whole module into the current namespace instead."
83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 72, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "1.0\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "from math import *\n", 102 | "x = cos(2 * pi)\n", 103 | "print(x)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Several ways to look at documentation for a module." 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 73, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']\n" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "print(dir(math))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 74, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "Help on built-in function cos in module math:\n", 144 | "\n", 145 | "cos(...)\n", 146 | " cos(x)\n", 147 | " \n", 148 | " Return the cosine of x (measured in radians).\n", 149 | "\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "help(math.cos)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### Variables" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 75, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "float" 175 | ] 176 | }, 177 | "execution_count": 75, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "x = 1.0\n", 184 | "type(x)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 76, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "int" 198 | ] 199 | }, 200 | "execution_count": 76, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "# dynamically typed\n", 207 | "x = 1\n", 208 | "type(x)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Operators" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 77, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "(3, -1, 2, 0.5)" 229 | ] 230 | }, 231 | "execution_count": 77, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "1 + 2, 1 - 2, 1 * 2, 1 / 2" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 78, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "text/plain": [ 250 | "1.0" 251 | ] 252 | }, 253 | "execution_count": 78, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | 
} 257 | ], 258 | "source": [ 259 | "# integer division of float numbers\n", 260 | "3.0 // 2.0" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 79, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "4" 274 | ] 275 | }, 276 | "execution_count": 79, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "# power operator\n", 283 | "2 ** 2" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 80, 289 | "metadata": { 290 | "collapsed": false 291 | }, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/plain": [ 296 | "False" 297 | ] 298 | }, 299 | "execution_count": 80, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "True and False" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 81, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/plain": [ 318 | "True" 319 | ] 320 | }, 321 | "execution_count": 81, 322 | "metadata": {}, 323 | "output_type": "execute_result" 324 | } 325 | ], 326 | "source": [ 327 | "not False" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 82, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "text/plain": [ 340 | "True" 341 | ] 342 | }, 343 | "execution_count": 82, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "True or False" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 83, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "(True, False, False, False, True, True)" 363 | ] 364 | }, 365 | "execution_count": 83, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "2 > 1, 2 < 1, 2 > 2, 2 < 2, 2 >= 2, 2 <= 2" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 84, 377 | "metadata": { 378 | "collapsed": false 379 | }, 380 | "outputs": [ 381 | { 382 | "data": { 383 | "text/plain": [ 384 | "True" 385 | ] 386 | }, 387 | "execution_count": 84, 388 | "metadata": {}, 389 | "output_type": "execute_result" 390 | } 391 | ], 392 | "source": [ 393 | "# equality\n", 394 | "[1,2] == [1,2]" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "### Strings" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 85, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "str" 415 | ] 416 | }, 417 | "execution_count": 85, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "s = \"Hello world\"\n", 424 | "type(s)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 86, 430 | "metadata": { 431 | "collapsed": false 432 | }, 433 | "outputs": [ 434 | { 435 | "data": { 436 | "text/plain": [ 437 | "11" 438 | ] 439 | }, 440 | "execution_count": 86, 441 | "metadata": {}, 442 | "output_type": "execute_result" 443 | } 444 | ], 445 | "source": [ 446 | "len(s)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 87, 452 | "metadata": { 453 | "collapsed": false 454 | }, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 
459 | "text": [ 460 | "Hello test\n" 461 | ] 462 | } 463 | ], 464 | "source": [ 465 | "s2 = s.replace(\"world\", \"test\")\n", 466 | "print(s2)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 88, 472 | "metadata": { 473 | "collapsed": false 474 | }, 475 | "outputs": [ 476 | { 477 | "data": { 478 | "text/plain": [ 479 | "'H'" 480 | ] 481 | }, 482 | "execution_count": 88, 483 | "metadata": {}, 484 | "output_type": "execute_result" 485 | } 486 | ], 487 | "source": [ 488 | "s[0]" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 89, 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [ 498 | { 499 | "data": { 500 | "text/plain": [ 501 | "'Hello'" 502 | ] 503 | }, 504 | "execution_count": 89, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | } 508 | ], 509 | "source": [ 510 | "s[0:5]" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 90, 516 | "metadata": { 517 | "collapsed": false 518 | }, 519 | "outputs": [ 520 | { 521 | "data": { 522 | "text/plain": [ 523 | "'world'" 524 | ] 525 | }, 526 | "execution_count": 90, 527 | "metadata": {}, 528 | "output_type": "execute_result" 529 | } 530 | ], 531 | "source": [ 532 | "s[6:]" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 91, 538 | "metadata": { 539 | "collapsed": false 540 | }, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "'Hello world'" 546 | ] 547 | }, 548 | "execution_count": 91, 549 | "metadata": {}, 550 | "output_type": "execute_result" 551 | } 552 | ], 553 | "source": [ 554 | "s[:]" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 92, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "'Hlowrd'" 568 | ] 569 | }, 570 | "execution_count": 92, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "# define step size of 2\n", 577 | "s[::2]" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 93, 583 | "metadata": { 584 | "collapsed": false 585 | }, 586 | "outputs": [ 587 | { 588 | "name": "stdout", 589 | "output_type": "stream", 590 | "text": [ 591 | "str1 str2 str3\n" 592 | ] 593 | } 594 | ], 595 | "source": [ 596 | "# automatically adds a space\n", 597 | "print(\"str1\", \"str2\", \"str3\")" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 94, 603 | "metadata": { 604 | "collapsed": false 605 | }, 606 | "outputs": [ 607 | { 608 | "name": "stdout", 609 | "output_type": "stream", 610 | "text": [ 611 | "value = 1.000000\n" 612 | ] 613 | } 614 | ], 615 | "source": [ 616 | "# C-style formatting\n", 617 | "print(\"value = %f\" % 1.0) " 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 95, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [ 627 | { 628 | "name": "stdout", 629 | "output_type": "stream", 630 | "text": [ 631 | "value1 = 3.1415, value2 = 1.5\n" 632 | ] 633 | } 634 | ], 635 | "source": [ 636 | "# alternative, more intuitive way of formatting a string \n", 637 | "s3 = 'value1 = {0}, value2 = {1}'.format(3.1415, 1.5)\n", 638 | "print(s3)" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "### Lists" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 96, 651 | "metadata": { 652 | "collapsed": false 653 | }, 654 | "outputs": [ 655 | { 656 | "name": 
"stdout", 657 | "output_type": "stream", 658 | "text": [ 659 | "\n", 660 | "[1, 2, 3, 4]\n" 661 | ] 662 | } 663 | ], 664 | "source": [ 665 | "l = [1,2,3,4]\n", 666 | "\n", 667 | "print(type(l))\n", 668 | "print(l)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 97, 674 | "metadata": { 675 | "collapsed": false 676 | }, 677 | "outputs": [ 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "[2, 3]\n", 683 | "[1, 3]\n" 684 | ] 685 | } 686 | ], 687 | "source": [ 688 | "print(l[1:3])\n", 689 | "print(l[::2])" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 98, 695 | "metadata": { 696 | "collapsed": false 697 | }, 698 | "outputs": [ 699 | { 700 | "data": { 701 | "text/plain": [ 702 | "1" 703 | ] 704 | }, 705 | "execution_count": 98, 706 | "metadata": {}, 707 | "output_type": "execute_result" 708 | } 709 | ], 710 | "source": [ 711 | "l[0]" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 99, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [ 721 | { 722 | "name": "stdout", 723 | "output_type": "stream", 724 | "text": [ 725 | "[1, 'a', 1.0, (1-1j)]\n" 726 | ] 727 | } 728 | ], 729 | "source": [ 730 | "# don't have to be the same type\n", 731 | "l = [1, 'a', 1.0, 1-1j]\n", 732 | "print(l)" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 100, 738 | "metadata": { 739 | "collapsed": false 740 | }, 741 | "outputs": [ 742 | { 743 | "data": { 744 | "text/plain": [ 745 | "[10, 12, 14, 16, 18, 20, 22, 24, 26, 28]" 746 | ] 747 | }, 748 | "execution_count": 100, 749 | "metadata": {}, 750 | "output_type": "execute_result" 751 | } 752 | ], 753 | "source": [ 754 | "start = 10\n", 755 | "stop = 30\n", 756 | "step = 2\n", 757 | "range(start, stop, step)\n", 758 | "\n", 759 | "# consume the iterator created by range\n", 760 | "list(range(start, stop, step))" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 101, 766 | "metadata": { 767 | "collapsed": false 768 | }, 769 | "outputs": [ 770 | { 771 | "name": "stdout", 772 | "output_type": "stream", 773 | "text": [ 774 | "['A', 'd', 'd']\n" 775 | ] 776 | } 777 | ], 778 | "source": [ 779 | "# create a new empty list\n", 780 | "l = []\n", 781 | "\n", 782 | "# add an elements using `append`\n", 783 | "l.append(\"A\")\n", 784 | "l.append(\"d\")\n", 785 | "l.append(\"d\")\n", 786 | "\n", 787 | "print(l)" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 102, 793 | "metadata": { 794 | "collapsed": false 795 | }, 796 | "outputs": [ 797 | { 798 | "name": "stdout", 799 | "output_type": "stream", 800 | "text": [ 801 | "['A', 'b', 'c']\n" 802 | ] 803 | } 804 | ], 805 | "source": [ 806 | "l[1:3] = [\"b\", \"c\"]\n", 807 | "print(l)" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 103, 813 | "metadata": { 814 | "collapsed": false 815 | }, 816 | "outputs": [ 817 | { 818 | "name": "stdout", 819 | "output_type": "stream", 820 | "text": [ 821 | "['i', 'n', 's', 'e', 'r', 'A', 'A', 't', 'A', 'b', 'c']\n" 822 | ] 823 | } 824 | ], 825 | "source": [ 826 | "l.insert(0, \"i\")\n", 827 | "l.insert(1, \"n\")\n", 828 | "l.insert(2, \"s\")\n", 829 | "l.insert(3, \"e\")\n", 830 | "l.insert(4, \"r\")\n", 831 | "l.insert(5, \"t\")\n", 832 | "l.insert(5, \"A\")\n", 833 | "l.insert(5, \"A\")\n", 834 | "\n", 835 | "\n", 836 | "\n", 837 | "print(l)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 104, 843 | "metadata": { 844 
| "collapsed": false 845 | }, 846 | "outputs": [ 847 | { 848 | "name": "stdout", 849 | "output_type": "stream", 850 | "text": [ 851 | "['i', 'n', 's', 'e', 'r', 'A', 't', 'A', 'b', 'c']\n" 852 | ] 853 | } 854 | ], 855 | "source": [ 856 | "l.remove(\"A\")\n", 857 | "print(l)" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 105, 863 | "metadata": { 864 | "collapsed": false 865 | }, 866 | "outputs": [ 867 | { 868 | "name": "stdout", 869 | "output_type": "stream", 870 | "text": [ 871 | "['i', 'n', 's', 'e', 'r', 'A', 'b', 'c']\n" 872 | ] 873 | } 874 | ], 875 | "source": [ 876 | "del l[7]\n", 877 | "del l[6]\n", 878 | "\n", 879 | "print(l)" 880 | ] 881 | }, 882 | { 883 | "cell_type": "markdown", 884 | "metadata": {}, 885 | "source": [ 886 | "### Tuples" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": 106, 892 | "metadata": { 893 | "collapsed": false 894 | }, 895 | "outputs": [ 896 | { 897 | "name": "stdout", 898 | "output_type": "stream", 899 | "text": [ 900 | "(10, 20) \n" 901 | ] 902 | } 903 | ], 904 | "source": [ 905 | "point = (10, 20)\n", 906 | "print(point, type(point))" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": 107, 912 | "metadata": { 913 | "collapsed": false 914 | }, 915 | "outputs": [ 916 | { 917 | "name": "stdout", 918 | "output_type": "stream", 919 | "text": [ 920 | "x = 10\n", 921 | "y = 20\n" 922 | ] 923 | } 924 | ], 925 | "source": [ 926 | "# unpacking\n", 927 | "x, y = point\n", 928 | "\n", 929 | "print(\"x =\", x)\n", 930 | "print(\"y =\", y)" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "### Dictionaries" 938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": 108, 943 | "metadata": { 944 | "collapsed": false 945 | }, 946 | "outputs": [ 947 | { 948 | "name": "stdout", 949 | "output_type": "stream", 950 | "text": [ 951 | "\n", 952 | "{'parameter1': 1.0, 'parameter2': 2.0, 'parameter3': 3.0}\n" 953 | ] 954 | } 955 | ], 956 | "source": [ 957 | "params = {\"parameter1\" : 1.0,\n", 958 | " \"parameter2\" : 2.0,\n", 959 | " \"parameter3\" : 3.0,}\n", 960 | "\n", 961 | "print(type(params))\n", 962 | "print(params)" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": 109, 968 | "metadata": { 969 | "collapsed": false 970 | }, 971 | "outputs": [ 972 | { 973 | "name": "stdout", 974 | "output_type": "stream", 975 | "text": [ 976 | "parameter1 = A\n", 977 | "parameter2 = B\n", 978 | "parameter3 = 3.0\n", 979 | "parameter4 = D\n" 980 | ] 981 | } 982 | ], 983 | "source": [ 984 | "params[\"parameter1\"] = \"A\"\n", 985 | "params[\"parameter2\"] = \"B\"\n", 986 | "\n", 987 | "# add a new entry\n", 988 | "params[\"parameter4\"] = \"D\"\n", 989 | "\n", 990 | "print(\"parameter1 = \" + str(params[\"parameter1\"]))\n", 991 | "print(\"parameter2 = \" + str(params[\"parameter2\"]))\n", 992 | "print(\"parameter3 = \" + str(params[\"parameter3\"]))\n", 993 | "print(\"parameter4 = \" + str(params[\"parameter4\"]))" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "### Control Flow" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": { 1007 | "collapsed": false 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "statement1 = False\n", 1012 | "statement2 = False\n", 1013 | "\n", 1014 | "if statement1:\n", 1015 | " print(\"statement1 is True\")\n", 1016 | "elif statement2:\n", 1017 | " 
print(\"statement2 is True\")\n", 1018 | "else:\n", 1019 | " print(\"statement1 and statement2 are False\")" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "markdown", 1024 | "metadata": {}, 1025 | "source": [ 1026 | "### Loops" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "execution_count": 110, 1032 | "metadata": { 1033 | "collapsed": false 1034 | }, 1035 | "outputs": [ 1036 | { 1037 | "name": "stdout", 1038 | "output_type": "stream", 1039 | "text": [ 1040 | "0\n", 1041 | "1\n", 1042 | "2\n", 1043 | "3\n" 1044 | ] 1045 | } 1046 | ], 1047 | "source": [ 1048 | "for x in range(4):\n", 1049 | " print(x)" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "execution_count": 111, 1055 | "metadata": { 1056 | "collapsed": false 1057 | }, 1058 | "outputs": [ 1059 | { 1060 | "name": "stdout", 1061 | "output_type": "stream", 1062 | "text": [ 1063 | "scientific\n", 1064 | "computing\n", 1065 | "with\n", 1066 | "python\n" 1067 | ] 1068 | } 1069 | ], 1070 | "source": [ 1071 | "for word in [\"scientific\", \"computing\", \"with\", \"python\"]:\n", 1072 | " print(word)" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": 112, 1078 | "metadata": { 1079 | "collapsed": false 1080 | }, 1081 | "outputs": [ 1082 | { 1083 | "name": "stdout", 1084 | "output_type": "stream", 1085 | "text": [ 1086 | "parameter1 = A\n", 1087 | "parameter2 = B\n", 1088 | "parameter3 = 3.0\n", 1089 | "parameter4 = D\n" 1090 | ] 1091 | } 1092 | ], 1093 | "source": [ 1094 | "for key, value in params.items():\n", 1095 | " print(key + \" = \" + str(value))" 1096 | ] 1097 | }, 1098 | { 1099 | "cell_type": "code", 1100 | "execution_count": 113, 1101 | "metadata": { 1102 | "collapsed": false 1103 | }, 1104 | "outputs": [ 1105 | { 1106 | "name": "stdout", 1107 | "output_type": "stream", 1108 | "text": [ 1109 | "0 -3\n", 1110 | "1 -2\n", 1111 | "2 -1\n", 1112 | "3 0\n", 1113 | "4 1\n", 1114 | "5 2\n" 1115 | ] 1116 | } 1117 | ], 1118 | "source": [ 1119 | "for idx, x in enumerate(range(-3,3)):\n", 1120 | " print(idx, x)" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 114, 1126 | "metadata": { 1127 | "collapsed": false 1128 | }, 1129 | "outputs": [ 1130 | { 1131 | "name": "stdout", 1132 | "output_type": "stream", 1133 | "text": [ 1134 | "[0, 1, 4, 9, 16]\n" 1135 | ] 1136 | } 1137 | ], 1138 | "source": [ 1139 | "# l1 = []\n", 1140 | "# for x in range(0,5):\n", 1141 | "# x = x**2\n", 1142 | "# l1.append(x) \n", 1143 | "\n", 1144 | "l1 = [x**2 for x in range(0,5)]\n", 1145 | "print(l1)" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": 115, 1151 | "metadata": { 1152 | "collapsed": false 1153 | }, 1154 | "outputs": [ 1155 | { 1156 | "name": "stdout", 1157 | "output_type": "stream", 1158 | "text": [ 1159 | "0\n", 1160 | "1\n", 1161 | "2\n", 1162 | "3\n", 1163 | "4\n", 1164 | "done\n" 1165 | ] 1166 | } 1167 | ], 1168 | "source": [ 1169 | "i = 0\n", 1170 | "while i < 5:\n", 1171 | " print(i)\n", 1172 | " i = i + 1\n", 1173 | "print(\"done\")" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "markdown", 1178 | "metadata": {}, 1179 | "source": [ 1180 | "### Functions" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "code", 1185 | "execution_count": 116, 1186 | "metadata": { 1187 | "collapsed": false 1188 | }, 1189 | "outputs": [], 1190 | "source": [ 1191 | "# include a docstring\n", 1192 | "def func(s):\n", 1193 | " \"\"\"\n", 1194 | " Print a string 's' and tell how many characters it has \n", 1195 | " \"\"\"\n", 1196 | " \n", 
1197 | " print(s + \" has \" + str(len(s)) + \" characters\")" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "execution_count": 117, 1203 | "metadata": { 1204 | "collapsed": false 1205 | }, 1206 | "outputs": [ 1207 | { 1208 | "name": "stdout", 1209 | "output_type": "stream", 1210 | "text": [ 1211 | "Help on function func in module __main__:\n", 1212 | "\n", 1213 | "func(s)\n", 1214 | " Print a string 's' and tell how many characters it has\n", 1215 | "\n" 1216 | ] 1217 | } 1218 | ], 1219 | "source": [ 1220 | "help(func)" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 118, 1226 | "metadata": { 1227 | "collapsed": false 1228 | }, 1229 | "outputs": [ 1230 | { 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "test has 4 characters\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "func(\"test\")" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "code", 1244 | "execution_count": 119, 1245 | "metadata": { 1246 | "collapsed": false 1247 | }, 1248 | "outputs": [], 1249 | "source": [ 1250 | "def square(x):\n", 1251 | " return x ** 2" 1252 | ] 1253 | }, 1254 | { 1255 | "cell_type": "code", 1256 | "execution_count": 120, 1257 | "metadata": { 1258 | "collapsed": false 1259 | }, 1260 | "outputs": [ 1261 | { 1262 | "data": { 1263 | "text/plain": [ 1264 | "25" 1265 | ] 1266 | }, 1267 | "execution_count": 120, 1268 | "metadata": {}, 1269 | "output_type": "execute_result" 1270 | } 1271 | ], 1272 | "source": [ 1273 | "square(5)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": 121, 1279 | "metadata": { 1280 | "collapsed": false 1281 | }, 1282 | "outputs": [], 1283 | "source": [ 1284 | "# multiple return values\n", 1285 | "def powers(x):\n", 1286 | " return x ** 2, x ** 3, x ** 4" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 122, 1292 | "metadata": { 1293 | "collapsed": false 1294 | }, 1295 | "outputs": [ 1296 | { 1297 | "data": { 1298 | "text/plain": [ 1299 | "(25, 125, 625)" 1300 | ] 1301 | }, 1302 | "execution_count": 122, 1303 | "metadata": {}, 1304 | "output_type": "execute_result" 1305 | } 1306 | ], 1307 | "source": [ 1308 | "powers(5)" 1309 | ] 1310 | }, 1311 | { 1312 | "cell_type": "code", 1313 | "execution_count": 123, 1314 | "metadata": { 1315 | "collapsed": false 1316 | }, 1317 | "outputs": [ 1318 | { 1319 | "name": "stdout", 1320 | "output_type": "stream", 1321 | "text": [ 1322 | "125\n" 1323 | ] 1324 | } 1325 | ], 1326 | "source": [ 1327 | "x2, x3, x4 = powers(5)\n", 1328 | "print(x3)" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "code", 1333 | "execution_count": 124, 1334 | "metadata": { 1335 | "collapsed": false 1336 | }, 1337 | "outputs": [ 1338 | { 1339 | "data": { 1340 | "text/plain": [ 1341 | "25" 1342 | ] 1343 | }, 1344 | "execution_count": 124, 1345 | "metadata": {}, 1346 | "output_type": "execute_result" 1347 | } 1348 | ], 1349 | "source": [ 1350 | "f1 = lambda x: x**2\n", 1351 | "f1(5)" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "code", 1356 | "execution_count": 125, 1357 | "metadata": { 1358 | "collapsed": false 1359 | }, 1360 | "outputs": [ 1361 | { 1362 | "data": { 1363 | "text/plain": [ 1364 | "" 1365 | ] 1366 | }, 1367 | "execution_count": 125, 1368 | "metadata": {}, 1369 | "output_type": "execute_result" 1370 | } 1371 | ], 1372 | "source": [ 1373 | "map(lambda x: x**2, range(-3,4))" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "code", 1378 | "execution_count": 126, 1379 | "metadata": { 1380 | "collapsed": false 
1381 | }, 1382 | "outputs": [ 1383 | { 1384 | "data": { 1385 | "text/plain": [ 1386 | "[9, 4, 1, 0, 1, 4, 9]" 1387 | ] 1388 | }, 1389 | "execution_count": 126, 1390 | "metadata": {}, 1391 | "output_type": "execute_result" 1392 | } 1393 | ], 1394 | "source": [ 1395 | "# convert iterator to list\n", 1396 | "list(map(lambda x: x**2, range(-3,4)))" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "markdown", 1401 | "metadata": {}, 1402 | "source": [ 1403 | "### Classes" 1404 | ] 1405 | }, 1406 | { 1407 | "cell_type": "code", 1408 | "execution_count": 128, 1409 | "metadata": { 1410 | "collapsed": false 1411 | }, 1412 | "outputs": [], 1413 | "source": [ 1414 | "class Point:\n", 1415 | " def __init__(self, x, y):\n", 1416 | " self.x = x\n", 1417 | " self.y = y\n", 1418 | " \n", 1419 | " def translate(self, dx, dy):\n", 1420 | " self.x += dx\n", 1421 | " self.y += dy\n", 1422 | " \n", 1423 | " def __str__(self):\n", 1424 | " return(\"Point at [%f, %f]\" % (self.x, self.y))" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "code", 1429 | "execution_count": 129, 1430 | "metadata": { 1431 | "collapsed": false 1432 | }, 1433 | "outputs": [ 1434 | { 1435 | "name": "stdout", 1436 | "output_type": "stream", 1437 | "text": [ 1438 | "Point at [0.000000, 0.000000]\n" 1439 | ] 1440 | } 1441 | ], 1442 | "source": [ 1443 | "p1 = Point(0, 0)\n", 1444 | "print(p1)" 1445 | ] 1446 | }, 1447 | { 1448 | "cell_type": "code", 1449 | "execution_count": 130, 1450 | "metadata": { 1451 | "collapsed": false 1452 | }, 1453 | "outputs": [ 1454 | { 1455 | "name": "stdout", 1456 | "output_type": "stream", 1457 | "text": [ 1458 | "Point at [0.250000, 1.500000]\n", 1459 | "Point at [1.000000, 1.000000]\n" 1460 | ] 1461 | } 1462 | ], 1463 | "source": [ 1464 | "p2 = Point(1, 1)\n", 1465 | "\n", 1466 | "p1.translate(0.25, 1.5)\n", 1467 | "\n", 1468 | "print(p1)\n", 1469 | "print(p2)" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "markdown", 1474 | "metadata": {}, 1475 | "source": [ 1476 | "### Exceptions" 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "code", 1481 | "execution_count": 131, 1482 | "metadata": { 1483 | "collapsed": false 1484 | }, 1485 | "outputs": [ 1486 | { 1487 | "name": "stdout", 1488 | "output_type": "stream", 1489 | "text": [ 1490 | "Caught an exception\n" 1491 | ] 1492 | } 1493 | ], 1494 | "source": [ 1495 | "try:\n", 1496 | " print(test)\n", 1497 | "except:\n", 1498 | " print(\"Caught an exception\")" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "code", 1503 | "execution_count": 132, 1504 | "metadata": { 1505 | "collapsed": false 1506 | }, 1507 | "outputs": [ 1508 | { 1509 | "name": "stdout", 1510 | "output_type": "stream", 1511 | "text": [ 1512 | "Caught an exception: name 'test' is not defined\n" 1513 | ] 1514 | } 1515 | ], 1516 | "source": [ 1517 | "try:\n", 1518 | " print(test)\n", 1519 | "except Exception as e:\n", 1520 | " print(\"Caught an exception: \" + str(e))" 1521 | ] 1522 | }, 1523 | { 1524 | "cell_type": "code", 1525 | "execution_count": null, 1526 | "metadata": { 1527 | "collapsed": true 1528 | }, 1529 | "outputs": [], 1530 | "source": [] 1531 | } 1532 | ], 1533 | "metadata": { 1534 | "anaconda-cloud": {}, 1535 | "kernelspec": { 1536 | "display_name": "Python [conda root]", 1537 | "language": "python", 1538 | "name": "conda-root-py" 1539 | }, 1540 | "language_info": { 1541 | "codemirror_mode": { 1542 | "name": "ipython", 1543 | "version": 3 1544 | }, 1545 | "file_extension": ".py", 1546 | "mimetype": "text/x-python", 1547 | "name": "python", 1548 | "nbconvert_exporter": "python", 1549 |
"pygments_lexer": "ipython3", 1550 | "version": "3.5.2" 1551 | } 1552 | }, 1553 | "nbformat": 4, 1554 | "nbformat_minor": 0 1555 | } 1556 | -------------------------------------------------------------------------------- /nlp_workshop1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "# TextBlob: An Introduction of Methods" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "## What is NLP?\n", 21 | "\n", 22 | "* Computer understanding and manipulation of human language\n", 23 | "* A way for computers to analyze, understand, and derive meaning from human language in a smart and useful way\n", 24 | "* Intersection of computer science, artificial intelligence, and computational linguistics\n", 25 | "\n", 26 | "NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statical inference. In general, the more data analyzed, the more accurate the model will be." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "deletable": true, 33 | "editable": true 34 | }, 35 | "source": [ 36 | "## Two Subfields of NLP\n", 37 | "\n", 38 | "There are two common subfields of natural language processing:\n", 39 | "\n", 40 | "* Natural Language Understanding (NLU)\n", 41 | " - A process used to convert human language into data with a form that encapsulates meaning and context in a computer-interpretable form.\n", 42 | " - This is a work in progress in data science\n", 43 | " - Understanding human language is difficult\n", 44 | "* Natural Language Generation (NLG)\n", 45 | " - Uses NLU to generate human language that appears natural and relevant.\n", 46 | " - Chat bots and software that automatically generates textual content use NLG.\n", 47 | " \n", 48 | "These subfields do not completely makeup with space of NLP:\n", 49 | "\n", 50 | "* NLU includes things like\n", 51 | " - relationship extraction\n", 52 | " - sentiment analysis\n", 53 | " - summarization\n", 54 | " - *semantic* parsing\n", 55 | "* NLP also includes (not part of NLU)\n", 56 | " - *syntactic* parsing\n", 57 | " - text categorization\n", 58 | " - part of speech tagging\n", 59 | " \n", 60 | "While some parts of NLP (e.g. POS tagging) are used in NLU, they are not strictly components of NLU." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": { 66 | "deletable": true, 67 | "editable": true 68 | }, 69 | "source": [ 70 | "## Challenges in NLP\n", 71 | "\n", 72 | "NLP has many challenges, and the field is not yet mature. Some of the challenges currently faced are\n", 73 | "\n", 74 | "* Ambiguity of language\n", 75 | " - syntactic ambiguity: some sentences can have multiple interpretations\n", 76 | " - words with multiple definitions (e.g. patient: to tolerate delays? a hospital patient?)\n", 77 | "* Context affects meaning\n", 78 | " - social context\n", 79 | " - time of day\n", 80 | " - content of previous sentences\n", 81 | "* Other\n", 82 | " - sarcasm, humor, slang, etc.\n", 83 | " \n", 84 | "Most or all of these are tied to NLU in one way or another. 
Further advancements in AI are needed to create general solutions that can handle the many forms of language encountered. For example, the form of language encountered in a novel is very different from what you would find in a social media feed (e.g. Tweets)." 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": { 90 | "deletable": true, 91 | "editable": true 92 | }, 93 | "source": [ 94 | "## Some Uses for NLP\n", 95 | "\n", 96 | "The uses for NLP grow as new and creative ideas arise, but some common uses are\n", 97 | "\n", 98 | "* automatic summarization\n", 99 | "* translation\n", 100 | "* named entity recognition\n", 101 | " - person\n", 102 | " - place\n", 103 | " - organization\n", 104 | " - object\n", 105 | " - etc.\n", 106 | "* relationship extraction\n", 107 | "* sentiment analysis\n", 108 | "* speech recognition\n", 109 | "* topic segmentation / text classification\n", 110 | "* grammar correction\n", 111 | "* chat bots\n", 112 | "* automatic tag, keyword, and content generation\n", 113 | "\n", 114 | "Speech recognition is one use that doesn't *require* NLU, but it can be made better with it. It doesn't require it because a machine can recognize various words and phrases, and then take certain actions without actually understanding anything about what was said.\n", 115 | "\n", 116 | "**Some specific use cases for NLP:**\n", 117 | "\n", 118 | "* Analyze social media and forums to gain insight into what customers are saying\n", 119 | " - identify new product opportunities,\n", 120 | " - problems with current products/services,\n", 121 | " - overall user/customer sentiment\n", 122 | "* Spam detection\n", 123 | "* Financial algorithmic trading\n", 124 | " - extract info from news that impacts trading decisions\n", 125 | "* Answering questions (e.g. chat bots)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": { 131 | "deletable": true, 132 | "editable": true 133 | }, 134 | "source": [ 135 | "# Techniques & Tools" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": { 141 | "deletable": true, 142 | "editable": true 143 | }, 144 | "source": [ 145 | "## Techniques\n", 146 | "\n", 147 | "* **Tokenization**: split text into sentences, words, and noun-phrases\n", 148 | "* **Tagging**: String -> tagged list of pairs `('word', 'POS')`\n", 149 | "\n", 150 | " Ex: `'This is a string'` -> `[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('string', 'NN')]`\n", 151 | " \n", 152 | " \n", 153 | "* **Parsing** (syntactic structure): String -> hierarchical structure with syntax tags\n", 154 | "\n", 155 | " Ex: `'This is a string'` -> `'This/DT/O/O is/VBZ/B-VP/O a/DT/B-NP/O string/NN/I-NP/O'`\n", 156 | "\n", 157 | "* **Information Extraction**: \n", 158 | " - named entity extraction: string in -> output text with labeled entities (person, company, location, etc.)\n", 159 | " - relationships between entities: string in -> output entity relationships\n", 160 | "* **n-grams**:\n", 161 | " - string in -> output list of n-tuples of successive words\n", 162 | " - n-grams are used as features in machine learning\n", 163 | " \n", 164 | " Ex (2-gram): `'This is a string'` -> `(['This', 'is'], ['is', 'a'], ['a', 'string'])`\n", 165 | " \n", 166 | "*These will be discussed in more detail as they come up.*" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "deletable": true, 173 | "editable": true 174 | }, 175 | "source": [ 176 | "## Tools\n", 177 | "\n", 178 | "There are many tools available for NLP.
Some popular choices are\n", 179 | "\n", 180 | "* Stanford's Core NLP Suite\n", 181 | "* Natural Language Toolkit (NLTK)\n", 182 | "* Apache OpenNLP\n", 183 | "* WordNet\n", 184 | "* **TextBlob**\n", 185 | "\n", 186 | "We will be working with TextBlob, which builds off of (and can integrate with) NLTK and WordNet." 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": { 192 | "deletable": true, 193 | "editable": true 194 | }, 195 | "source": [ 196 | "# TextBlob: An Introduction to Methods" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": { 202 | "deletable": true, 203 | "editable": true 204 | }, 205 | "source": [ 206 | "## Installation\n", 207 | "\n", 208 | "To install TextBlob, open a new Terminal and enter the following:\n", 209 | "\n", 210 | "```\n", 211 | "$ pip install -U textblob\n", 212 | "$ python -m textblob.download_corpora\n", 213 | "```" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "deletable": true, 220 | "editable": true 221 | }, 222 | "source": [ 223 | "## Getting Started\n", 224 | "\n", 225 | "From here on, you can follow along with the notebook and create new notes and try out code as you like." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 6, 231 | "metadata": { 232 | "collapsed": true, 233 | "deletable": true, 234 | "editable": true 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# import what we need\n", 239 | "import pandas as pd\n", 240 | "from pandas import DataFrame as DF, Series\n", 241 | "\n", 242 | "import numpy as np\n", 243 | "\n", 244 | "from textblob import TextBlob" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "metadata": { 251 | "collapsed": false, 252 | "deletable": true, 253 | "editable": true 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "# read the data; the CSV lives in the data folder (see the README)\n", 258 | "\n", 259 | "# use only the column called 'text'\n", 260 | "data = pd.read_csv('data/tweets.csv', usecols=['text'])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 8, 266 | "metadata": { 267 | "collapsed": false, 268 | "deletable": true, 269 | "editable": true 270 | }, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/html": [ 275 | "
\n", 276 | "\n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | "
text
0@VirginAmerica What @dhepburn said.
1@VirginAmerica plus you've added commercials t...
2@VirginAmerica I didn't today... Must mean I n...
\n", 298 | "
" 299 | ], 300 | "text/plain": [ 301 | " text\n", 302 | "0 @VirginAmerica What @dhepburn said.\n", 303 | "1 @VirginAmerica plus you've added commercials t...\n", 304 | "2 @VirginAmerica I didn't today... Must mean I n..." 305 | ] 306 | }, 307 | "execution_count": 8, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "data.head(3)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": { 319 | "deletable": true, 320 | "editable": true 321 | }, 322 | "source": [ 323 | "## Create a TextBlob object\n", 324 | "\n", 325 | "`TextBlob` objects are the foundation of everything we will be doing. They take a string as an input and create an object on which we can apply many of the TextBlob methods.\n", 326 | "\n", 327 | "Let's create a blob using a tweet in our data." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 4, 333 | "metadata": { 334 | "collapsed": true, 335 | "deletable": true, 336 | "editable": true 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# create a blob from the tweet at index 25\n", 341 | "tweet = data.text[25]\n", 342 | "blob = TextBlob(tweet)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": { 348 | "deletable": true, 349 | "editable": true 350 | }, 351 | "source": [ 352 | "# TextBlob Methods: Tokenization" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "deletable": true, 359 | "editable": true 360 | }, 361 | "source": [ 362 | "Tokenization allows us to split a string (a paragraph, a page, etc.) into various \"tokens\" that become useful in further processing and analysis. Tokenization also occurs on the back-end of some methods.\n", 363 | "\n", 364 | "Let's look at some tokenization options." 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": { 370 | "deletable": true, 371 | "editable": true 372 | }, 373 | "source": [ 374 | "## Sentences\n", 375 | "\n", 376 | "Using the `sentences` method we get a list of `Sentence` objects, each containing (in order) all of the sentences that make up the string passed to `TextBlob`." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 11, 382 | "metadata": { 383 | "collapsed": false, 384 | "deletable": true, 385 | "editable": true 386 | }, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "text/plain": [ 391 | "[Sentence(\"@VirginAmerica status match program.\"),\n", 392 | " Sentence(\"I applied and it's been three weeks.\"),\n", 393 | " Sentence(\"Called and emailed with no response.\")]" 394 | ] 395 | }, 396 | "execution_count": 11, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "# return list of Sentence objects\n", 403 | "blob.sentences" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": { 409 | "deletable": true, 410 | "editable": true 411 | }, 412 | "source": [ 413 | "Similar to `TextBlob` objects, we can use various methods with `Sentence` objects." 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 12, 419 | "metadata": { 420 | "collapsed": false, 421 | "deletable": true, 422 | "editable": true 423 | }, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/plain": [ 428 | "[('Called', 'VBN'),\n", 429 | " ('and', 'CC'),\n", 430 | " ('emailed', 'VBN'),\n", 431 | " ('with', 'IN'),\n", 432 | " ('no', 'DT'),\n", 433 | " ('response', 'NN')]" 434 | ] 435 | }, 436 | "execution_count": 12, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "# get the third sentence\n", 443 | "s = blob.sentences[2]\n", 444 | "# get tags from this sentence\n", 445 | "s.tags[:10]" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": { 451 | "deletable": true, 452 | "editable": true 453 | }, 454 | "source": [ 455 | "## Words\n", 456 | "\n", 457 | "Instead of a list of sentences, we can get a `WordList` object containing all of the individual words in our string." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 13, 463 | "metadata": { 464 | "collapsed": false, 465 | "deletable": true, 466 | "editable": true 467 | }, 468 | "outputs": [ 469 | { 470 | "data": { 471 | "text/plain": [ 472 | "WordList(['VirginAmerica', 'status', 'match', 'program', 'I', 'applied', 'and', 'it', \"'s\", 'been', 'three', 'weeks', 'Called', 'and', 'emailed', 'with', 'no', 'response'])" 473 | ] 474 | }, 475 | "execution_count": 13, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "# return WordList object (works like a standard list in Python)\n", 482 | "blob.words" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": { 488 | "deletable": true, 489 | "editable": true 490 | }, 491 | "source": [ 492 | "We can access words in a `WordList` just like a regular Python list:" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 14, 498 | "metadata": { 499 | "collapsed": false, 500 | "deletable": true, 501 | "editable": true 502 | }, 503 | "outputs": [ 504 | { 505 | "data": { 506 | "text/plain": [ 507 | "WordList(['it', \"'s\"])" 508 | ] 509 | }, 510 | "execution_count": 14, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "blob.words[7:9]" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": { 522 | "deletable": true, 523 | "editable": true 524 | }, 525 | "source": [ 526 | "**Notice**: TextBlob doesn't do the best job of handling contractions and possessive forms. Ex: \"it's\" is split into \"it\" and \"'s\"." 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": { 532 | "deletable": true, 533 | "editable": true 534 | }, 535 | "source": [ 536 | "## Word Counts\n", 537 | "\n", 538 | "We can get a dict that contains all the unique words in our string as keys, and counts for each as values."
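,\n", "\n", "Because `word_counts` is a `defaultdict`, merely reading a missing key inserts it (see the NOTE further down). A minimal sketch of a side-effect-free lookup using plain `dict.get`; the counts refer to the tweet above:\n", "\n", "```python\n", "blob.word_counts.get('and', 0)   # 2\n", "blob.word_counts.get('zzz', 0)   # 0, and 'zzz' is NOT added to the dict\n", "```"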
539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 9, 544 | "metadata": { 545 | "collapsed": false, 546 | "deletable": true, 547 | "editable": true 548 | }, 549 | "outputs": [ 550 | { 551 | "data": { 552 | "text/plain": [ 553 | "defaultdict(int,\n", 554 | " {'and': 2,\n", 555 | " 'applied': 1,\n", 556 | " 'been': 1,\n", 557 | " 'called': 1,\n", 558 | " 'emailed': 1,\n", 559 | " 'i': 1,\n", 560 | " 'it': 1,\n", 561 | " 'match': 1,\n", 562 | " 'no': 1,\n", 563 | " 'program': 1,\n", 564 | " 'response': 1,\n", 565 | " 's': 1,\n", 566 | " 'status': 1,\n", 567 | " 'three': 1,\n", 568 | " 'virginamerica': 1,\n", 569 | " 'weeks': 1,\n", 570 | " 'with': 1})" 571 | ] 572 | }, 573 | "execution_count": 9, 574 | "metadata": {}, 575 | "output_type": "execute_result" 576 | } 577 | ], 578 | "source": [ 579 | "# returns defaultdict with unique words as keys and counts as values.\n", 580 | "blob.word_counts" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 10, 586 | "metadata": { 587 | "collapsed": false, 588 | "deletable": true, 589 | "editable": true 590 | }, 591 | "outputs": [ 592 | { 593 | "name": "stdout", 594 | "output_type": "stream", 595 | "text": [ 596 | "2\n", 597 | "2\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "# we can get counts for individual words in two ways\n", 603 | "# 1. use the count method on a WordList\n", 604 | "print(blob.words.count('and'))\n", 605 | "# 2. access a key in the word_counts dict\n", 606 | "print(blob.word_counts['and'])" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": { 612 | "deletable": true, 613 | "editable": true 614 | }, 615 | "source": [ 616 | "**NOTE!**\n", 617 | "\n", 618 | "If you use `word_counts['some_word']` and that word is not originally in the defaultdict, it will be added with a count of zero:" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 11, 624 | "metadata": { 625 | "collapsed": false, 626 | "deletable": true, 627 | "editable": true 628 | }, 629 | "outputs": [ 630 | { 631 | "data": { 632 | "text/plain": [ 633 | "defaultdict(int, {'a': 1, 'of': 1, 'string': 1, 'words': 1})" 634 | ] 635 | }, 636 | "execution_count": 11, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "# example of above\n", 643 | "b = TextBlob('a string of words')\n", 644 | "b.word_counts" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 12, 650 | "metadata": { 651 | "collapsed": false, 652 | "deletable": true, 653 | "editable": true 654 | }, 655 | "outputs": [ 656 | { 657 | "data": { 658 | "text/plain": [ 659 | "0" 660 | ] 661 | }, 662 | "execution_count": 12, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "# get count of word not in dict\n", 669 | "b.word_counts['test']" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 13, 675 | "metadata": { 676 | "collapsed": false, 677 | "deletable": true, 678 | "editable": true 679 | }, 680 | "outputs": [ 681 | { 682 | "data": { 683 | "text/plain": [ 684 | "defaultdict(int, {'a': 1, 'of': 1, 'string': 1, 'test': 0, 'words': 1})" 685 | ] 686 | }, 687 | "execution_count": 13, 688 | "metadata": {}, 689 | "output_type": "execute_result" 690 | } 691 | ], 692 | "source": [ 693 | "# look at contents of dict again\n", 694 | "# notice that 'test' is now included\n", 695 | "b.word_counts" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": { 701 |
"deletable": true, 702 | "editable": true 703 | }, 704 | "source": [ 705 | "## Noun Phrases\n", 706 | "\n", 707 | "**Noun phrases:** a word or group of words that functions in a sentence as subject, object, or prepositional object.\n", 708 | "\n", 709 | "Examples of __noun phrases__ are underlined in the sentences below. The **head** noun appears in bold.\n", 710 | "\n", 711 | "* __The election-year **politics**__ are annoying for __many **people**__.\n", 712 | "* __Almost every **sentence**__ contains __at least one noun **phrase**__.\n", 713 | "* __Current economic **weakness**__ may be __a **result** of high energy prices__.\n", 714 | "\n", 715 | "Noun phrases can be identified by the possibility of pronoun substitution, as is illustrated in the examples below.\n", 716 | "\n", 717 | "a. __This **sentence**__ contains __two noun **phrases**__.
\n", 718 | "b. **It** contains **them**.\n", 719 | "\n", 720 | "We can get a `WordList` containing noun phrases using the `noun_phrase` method on a blob." 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": 9, 726 | "metadata": { 727 | "collapsed": false, 728 | "deletable": true, 729 | "editable": true 730 | }, 731 | "outputs": [ 732 | { 733 | "data": { 734 | "text/plain": [ 735 | "[Sentence(\"@VirginAmerica status match program.\"),\n", 736 | " Sentence(\"I applied and it's been three weeks.\"),\n", 737 | " Sentence(\"Called and emailed with no response.\")]" 738 | ] 739 | }, 740 | "execution_count": 9, 741 | "metadata": {}, 742 | "output_type": "execute_result" 743 | } 744 | ], 745 | "source": [ 746 | "blob.sentences" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 10, 752 | "metadata": { 753 | "collapsed": false, 754 | "deletable": true, 755 | "editable": true 756 | }, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/plain": [ 761 | "WordList(['virginamerica', 'pretty graphics', 'minimal iconography'])" 762 | ] 763 | }, 764 | "execution_count": 10, 765 | "metadata": {}, 766 | "output_type": "execute_result" 767 | } 768 | ], 769 | "source": [ 770 | "# return WordList with noun phrases for tweet at index 11\n", 771 | "TextBlob(data.text[11]).noun_phrases" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": { 777 | "deletable": true, 778 | "editable": true 779 | }, 780 | "source": [ 781 | "The algorithm used isn't perfect, but things rarely are in NLP." 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "
" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "# Practice Problems\n", 796 | "\n", 797 | "1. Create a TextBlob object called blob using tweet at index 41\n", 798 | "2. Print each sentence in blob on a separate line\n", 799 | "3. Get word counts in descending order (most frequent first)\n", 800 | "4. Come up with two ways to get the total word count for blob\n", 801 | "5. Get all noun-phrases in blob. What is wrong with the second “phrase” in the results?\n", 802 | "6. Select all entries in the data that contain more than 3 noun phrases\n", 803 | "7. **Extra:** Using a similar method as in 6, print one tweet that has exactly 3 sentences without creating a list\n" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": { 809 | "deletable": true, 810 | "editable": true 811 | }, 812 | "source": [ 813 | "
" 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": { 819 | "deletable": true, 820 | "editable": true 821 | }, 822 | "source": [ 823 | "# TextBlob Methods: POS & Morphology" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": { 829 | "deletable": true, 830 | "editable": true 831 | }, 832 | "source": [ 833 | "Here we will cover all of the following:\n", 834 | " \n", 835 | "* **part-of-speech (POS) tagging**: get list of tuples containing each word and it’s part of speech (e.g. noun)\n", 836 | "* **pluralization**: get the plural form of any singular words\n", 837 | "* **singularization**: get the singular form of any plural words\n", 838 | "* **lemmatization**: get the stripped/unmodified version of a word (e.g. singing -> sing)" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": { 844 | "deletable": true, 845 | "editable": true 846 | }, 847 | "source": [ 848 | "## part-of-speech (POS) tagging\n", 849 | "\n", 850 | "Using the `tags` method, we can get a list of doubles that contains every word in our string paired with its part of speech, as determined by the algorithm.\n", 851 | "\n", 852 | "POS tagging (also grammatical tagging) is useful for understanding context and grammar. Many words can belong to different parts of speech, depending on the context and words around them. POS tagging attempts to disambiguate a text by determining most likely parts of speech for each word based on the content." 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": 16, 858 | "metadata": { 859 | "collapsed": false, 860 | "deletable": true, 861 | "editable": true 862 | }, 863 | "outputs": [ 864 | { 865 | "data": { 866 | "text/plain": [ 867 | "[('@', 'NN'),\n", 868 | " ('VirginAmerica', 'NNP'),\n", 869 | " ('status', 'NN'),\n", 870 | " ('match', 'NN'),\n", 871 | " ('program', 'NN'),\n", 872 | " ('I', 'PRP'),\n", 873 | " ('applied', 'VBD'),\n", 874 | " ('and', 'CC'),\n", 875 | " ('it', 'PRP'),\n", 876 | " (\"'s\", 'VBZ'),\n", 877 | " ('been', 'VBN'),\n", 878 | " ('three', 'CD'),\n", 879 | " ('weeks', 'NNS'),\n", 880 | " ('Called', 'VBN'),\n", 881 | " ('and', 'CC'),\n", 882 | " ('emailed', 'VBN'),\n", 883 | " ('with', 'IN'),\n", 884 | " ('no', 'DT'),\n", 885 | " ('response', 'NN')]" 886 | ] 887 | }, 888 | "execution_count": 16, 889 | "metadata": {}, 890 | "output_type": "execute_result" 891 | } 892 | ], 893 | "source": [ 894 | "# return list of tuples containing words in a string and the part of speech that each belongs to\n", 895 | "blob.tags" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": { 901 | "deletable": true, 902 | "editable": true 903 | }, 904 | "source": [ 905 | "The tags each have a unique meaning. For example:\n", 906 | "* 'VBX': verb (X indicates type of verb)\n", 907 | "* 'DT': determiner\n", 908 | "\n", 909 | "A comprehensive table can be found at http://www.clips.ua.ac.be/pages/mbsp-tags" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "deletable": true, 916 | "editable": true 917 | }, 918 | "source": [ 919 | "## pluralization\n", 920 | "\n", 921 | "This is a relatively simple rule-based process that takes the singular form of a word and applies the correct pluralization to it.\n", 922 | "\n", 923 | "In TextBlob we can pluralize a single word (in the form of a `Word` obj.) or pluralize all words in a `WordList`." 
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 17,
929 | "metadata": {
930 | "collapsed": false,
931 | "deletable": true,
932 | "editable": true
933 | },
934 | "outputs": [
935 | {
936 | "data": {
937 | "text/plain": [
938 | "'companies'"
939 | ]
940 | },
941 | "execution_count": 17,
942 | "metadata": {},
943 | "output_type": "execute_result"
944 | }
945 | ],
946 | "source": [
947 | "# import\n",
948 | "from textblob import Word, WordList\n",
949 | "# create a Word object\n",
950 | "w = Word('company')\n",
951 | "# return the plural of a single word\n",
952 | "w.pluralize()"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": 18,
958 | "metadata": {
959 | "collapsed": false,
960 | "deletable": true,
961 | "editable": true
962 | },
963 | "outputs": [
964 | {
965 | "data": {
966 | "text/plain": [
967 | "WordList(['who', 'what', 'when', 'where', 'why'])"
968 | ]
969 | },
970 | "execution_count": 18,
971 | "metadata": {},
972 | "output_type": "execute_result"
973 | }
974 | ],
975 | "source": [
976 | "# Side note: we can also create WordList objects\n",
977 | "wl = WordList(['who','what','when','where','why'])\n",
978 | "wl"
979 | ]
980 | },
981 | {
982 | "cell_type": "markdown",
983 | "metadata": {
984 | "deletable": true,
985 | "editable": true
986 | },
987 | "source": [
988 | "## singularization\n",
989 | "\n",
990 | "The opposite of pluralization: take a word (or words) in plural form and singularize them."
991 | ]
992 | },
993 | {
994 | "cell_type": "code",
995 | "execution_count": 19,
996 | "metadata": {
997 | "collapsed": false,
998 | "deletable": true,
999 | "editable": true
1000 | },
1001 | "outputs": [
1002 | {
1003 | "data": {
1004 | "text/plain": [
1005 | "WordList(['agency', 'octopus', 'word'])"
1006 | ]
1007 | },
1008 | "execution_count": 19,
1009 | "metadata": {},
1010 | "output_type": "execute_result"
1011 | }
1012 | ],
1013 | "source": [
1014 | "wl = WordList(['agencies', 'octopi', 'words'])\n",
1015 | "wl.singularize()"
1016 | ]
1017 | },
1018 | {
1019 | "cell_type": "markdown",
1020 | "metadata": {
1021 | "deletable": true,
1022 | "editable": true
1023 | },
1024 | "source": [
1025 | "## lemmatization"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "markdown",
1030 | "metadata": {
1031 | "deletable": true,
1032 | "editable": true
1033 | },
1034 | "source": [
1035 | "Lemmatization takes a word that has been inflected or otherwise modified by the rules of the language, and returns its base (dictionary) form, called the lemma.\n",
1036 | "\n",
1037 | "The `lemmatize()` method has an optional parameter:\n",
1038 | "* pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.\n",
1039 | "* options: \n",
1040 | " - `'n'` for noun, \n",
1041 | " - `'v'` for verb, \n",
1042 | " - `'a'` for adjective, \n",
1043 | " - `'r'` for adverb.\n",
1044 | "\n",
1045 | "Note: adverbs don't usually work with the standard `lemmatize` method."
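]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the default behaviour first: with no argument, `lemmatize()` treats the word as a noun (the example word here is just an illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# no pos argument: the word is lemmatized as a noun\n",
"w = Word('cars')\n",
"w.lemmatize()  # expected: 'car'"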
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "code",
1050 | "execution_count": 20,
1051 | "metadata": {
1052 | "collapsed": false,
1053 | "deletable": true,
1054 | "editable": true
1055 | },
1056 | "outputs": [
1057 | {
1058 | "data": {
1059 | "text/plain": [
1060 | "'sing'"
1061 | ]
1062 | },
1063 | "execution_count": 20,
1064 | "metadata": {},
1065 | "output_type": "execute_result"
1066 | }
1067 | ],
1068 | "source": [
1069 | "w = Word('singing')\n",
1070 | "# for some words you have to pass the type\n",
1071 | "# in this case we pass 'v' for verb (not to be confused with POS tag formats)\n",
1072 | "w.lemmatize('v')"
1073 | ]
1074 | },
1075 | {
1076 | "cell_type": "code",
1077 | "execution_count": 21,
1078 | "metadata": {
1079 | "collapsed": false,
1080 | "deletable": true,
1081 | "editable": true
1082 | },
1083 | "outputs": [
1084 | {
1085 | "data": {
1086 | "text/plain": [
1087 | "'go'"
1088 | ]
1089 | },
1090 | "execution_count": 21,
1091 | "metadata": {},
1092 | "output_type": "execute_result"
1093 | }
1094 | ],
1095 | "source": [
1096 | "# irregular past-tense verb\n",
1097 | "w = Word('went')\n",
1098 | "w.lemmatize('v')"
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": 22,
1104 | "metadata": {
1105 | "collapsed": false,
1106 | "deletable": true,
1107 | "editable": true
1108 | },
1109 | "outputs": [
1110 | {
1111 | "data": {
1112 | "text/plain": [
1113 | "'kindly'"
1114 | ]
1115 | },
1116 | "execution_count": 22,
1117 | "metadata": {},
1118 | "output_type": "execute_result"
1119 | }
1120 | ],
1121 | "source": [
1122 | "# it doesn't always work: try an adverb\n",
1123 | "w = Word('kindly')\n",
1124 | "w.lemmatize('r')"
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "markdown",
1129 | "metadata": {
1130 | "deletable": true,
1131 | "editable": true
1132 | },
1133 | "source": [
1134 | "# Parsing & n-grams"
1135 | ]
1136 | },
1137 | {
1138 | "cell_type": "markdown",
1139 | "metadata": {
1140 | "deletable": true,
1141 | "editable": true
1142 | },
1143 | "source": [
1144 | "## Parsing"
1145 | ]
1146 | },
1147 | {
1148 | "cell_type": "markdown",
1149 | "metadata": {
1150 | "deletable": true,
1151 | "editable": true
1152 | },
1153 | "source": [
1154 | "Parsing gives us the syntactic structure of a string or sentence by tagging each word with labels that indicate its place in the hierarchy. 
See the tree in the PowerPoint slides for a visual example.\n",
1155 | "\n",
1156 | "Let's parse the sentence shown in the tree:"
1157 | ]
1158 | },
1159 | {
1160 | "cell_type": "code",
1161 | "execution_count": 23,
1162 | "metadata": {
1163 | "collapsed": false,
1164 | "deletable": true,
1165 | "editable": true
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "'John/NNP/B-NP/O loves/VBZ/B-VP/O Mary/NNP/B-NP/O'"
1172 | ]
1173 | },
1174 | "execution_count": 23,
1175 | "metadata": {},
1176 | "output_type": "execute_result"
1177 | }
1178 | ],
1179 | "source": [
1180 | "# return a string containing each word in the text along with its parts of speech hierarchy\n",
1181 | "b = TextBlob('John loves Mary')\n",
1182 | "b.parse()"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "markdown",
1187 | "metadata": {
1188 | "deletable": true,
1189 | "editable": true
1190 | },
1191 | "source": [
1192 | "`John/NNP/B-NP/O` gives the position in the hierarchy of the text for the word \"`John`\" in our sentence, working from the word up to the top of the hierarchy.\n",
1193 | "\n",
1194 | "In this case (for the word `John`):\n",
1195 | "* `NNP` indicates it is a \"noun, proper singular\"\n",
1196 | "* the `NP` in `B-NP` indicates it is part of a noun phrase\n",
1197 | "* the `B-` in `B-NP` marks the beginning of that chunk: the word is inside the chunk, and the preceding word (if any) belongs to a different chunk\n",
1198 | "* the final `O` (\"not part of chunk\") fills a fourth slot that marks prepositional noun phrases; `O` here means the word is not inside one.\n",
1199 | "\n",
1200 | "Details can be read on the page that gives the detailed parts of speech (link posted under POS tagging).\n",
1201 | "\n",
1202 | "Parsing and syntactic structure is a complex subject, and is not covered in depth here."
1203 | ]
1204 | },
1205 | {
1206 | "cell_type": "markdown",
1207 | "metadata": {
1208 | "deletable": true,
1209 | "editable": true
1210 | },
1211 | "source": [
1212 | "## n-grams"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {
1218 | "deletable": true,
1219 | "editable": true
1220 | },
1221 | "source": [
1222 | "**n**-grams are groups of n successive words. Most often n-grams are created by sliding a window one word at a time through a text, though there are variants (skip-grams) that skip over words.\n",
1223 | "\n",
1224 | "The usefulness of n-grams comes in with machine learning, where each n-gram is used as a feature for learning (a short sketch below shows the idea). These will be used more in the next workshop, but for now let's look at getting n-grams from a text using TextBlob:"
1225 | ]
1226 | },
1227 | {
1228 | "cell_type": "markdown",
1229 | "metadata": {
1230 | "deletable": true,
1231 | "editable": true
1232 | },
1233 | "source": [
1234 | "TextBlob has an `ngrams` method that takes an optional argument `n`, the size of the n-grams to generate (the default is 3).\n",
1235 | "\n",
1236 | "The method returns a list of `WordList` objects."
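]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at TextBlob's output below, here is the promised sketch of the feature idea, in plain Python (no TextBlob) with a made-up sentence: slide a window over the tokens, collect the bigrams, and count them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# a made-up example text\n",
"tokens = 'the cat sat on the mat and the cat slept'.split()\n",
"# pair each token with its successor to form bigrams (n=2)\n",
"bigrams = list(zip(tokens, tokens[1:]))\n",
"# the counts can then serve as simple features for a learning algorithm\n",
"Counter(bigrams).most_common(3)"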
1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 24, 1242 | "metadata": { 1243 | "collapsed": false, 1244 | "deletable": true, 1245 | "editable": true 1246 | }, 1247 | "outputs": [ 1248 | { 1249 | "data": { 1250 | "text/plain": [ 1251 | "[WordList(['VirginAmerica', 'status', 'match']),\n", 1252 | " WordList(['status', 'match', 'program']),\n", 1253 | " WordList(['match', 'program', 'I']),\n", 1254 | " WordList(['program', 'I', 'applied']),\n", 1255 | " WordList(['I', 'applied', 'and'])]" 1256 | ] 1257 | }, 1258 | "execution_count": 24, 1259 | "metadata": {}, 1260 | "output_type": "execute_result" 1261 | } 1262 | ], 1263 | "source": [ 1264 | "# return list of n-grams (default n=3)\n", 1265 | "# get only first 5 n-grams\n", 1266 | "blob.ngrams()[:5]" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": 25, 1272 | "metadata": { 1273 | "collapsed": false, 1274 | "deletable": true, 1275 | "editable": true 1276 | }, 1277 | "outputs": [ 1278 | { 1279 | "data": { 1280 | "text/plain": [ 1281 | "[WordList(['VirginAmerica', 'status']),\n", 1282 | " WordList(['status', 'match']),\n", 1283 | " WordList(['match', 'program']),\n", 1284 | " WordList(['program', 'I']),\n", 1285 | " WordList(['I', 'applied'])]" 1286 | ] 1287 | }, 1288 | "execution_count": 25, 1289 | "metadata": {}, 1290 | "output_type": "execute_result" 1291 | } 1292 | ], 1293 | "source": [ 1294 | "# get another set with n = 2\n", 1295 | "blob.ngrams(n=2)[:5]" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "code", 1300 | "execution_count": null, 1301 | "metadata": { 1302 | "collapsed": true 1303 | }, 1304 | "outputs": [], 1305 | "source": [] 1306 | }, 1307 | { 1308 | "cell_type": "markdown", 1309 | "metadata": {}, 1310 | "source": [ 1311 | "# Practice Problems\n", 1312 | "\n", 1313 | "1. Create and parse blob (using index 25) and print the first 10 pieces on separate lines\n", 1314 | "2. Singularize all words in blob\n", 1315 | "3. Pluralize the words ['gallery', 'mouse', 'man']\n", 1316 | "4. Lemmatize the words ['categories', 'mice', 'better', 'found']\n", 1317 | "5. Print the first 5 unique POS tags in blob\n", 1318 | "6. Given the n-grams in the last cell in the notebook, reconstruct the original sentence\n", 1319 | "7. 
**Extra:** List all words in blob that are plural (with the index of each word)"
1320 | ]
1321 | },
1322 | {
1323 | "cell_type": "markdown",
1324 | "metadata": {},
1325 | "source": [
1326 | "### For practice problem 6 in the second hour"
1327 | ]
1328 | },
1329 | {
1330 | "cell_type": "code",
1331 | "execution_count": 1,
1332 | "metadata": {
1333 | "collapsed": true
1334 | },
1335 | "outputs": [],
1336 | "source": [
1337 | "ngrams = [['The', 'quick', 'brown', 'fox'],\n",
1338 | " ['quick', 'brown', 'fox', 'jumps'],\n",
1339 | " ['brown', 'fox', 'jumps', 'over'],\n",
1340 | " ['fox', 'jumps', 'over', 'the'],\n",
1341 | " ['jumps', 'over', 'the', 'lazy'],\n",
1342 | " ['over', 'the', 'lazy', 'dog'],\n",
1343 | " ['the', 'lazy', 'dog', 'and'],\n",
1344 | " ['lazy', 'dog', 'and', 'the'],\n",
1345 | " ['dog', 'and', 'the', 'cow'],\n",
1346 | " ['and', 'the', 'cow', 'jumped'],\n",
1347 | " ['the', 'cow', 'jumped', 'over'],\n",
1348 | " ['cow', 'jumped', 'over', 'the'],\n",
1349 | " ['jumped', 'over', 'the', 'moon']]"
1350 | ]
1351 | }
1352 | ],
1353 | "metadata": {
1354 | "anaconda-cloud": {},
1355 | "kernelspec": {
1356 | "display_name": "Python [default]",
1357 | "language": "python",
1358 | "name": "python3"
1359 | },
1360 | "language_info": {
1361 | "codemirror_mode": {
1362 | "name": "ipython",
1363 | "version": 3
1364 | },
1365 | "file_extension": ".py",
1366 | "mimetype": "text/x-python",
1367 | "name": "python",
1368 | "nbconvert_exporter": "python",
1369 | "pygments_lexer": "ipython3",
1370 | "version": "3.5.2"
1371 | }
1372 | },
1373 | "nbformat": 4,
1374 | "nbformat_minor": 2
1375 | }
1376 |
-------------------------------------------------------------------------------- /nlp_workshop2.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# TextBlob: Sentiment Analysis & Classifiers"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## What is Sentiment Analysis?\n",
15 | "\n",
16 | "Sentiment analysis is a method in NLP used to classify the emotion (or tone) and subjectivity of human language. At the most common and basic level, the goal is to classify a text as positive, negative, or neutral in tone, and to determine how subjective it is. The aspect of subjectivity will only very briefly be noted in this workshop.\n",
17 | "\n",
18 | "At a more complex level, sentiment analysis is a technique used to classify the specific emotions in human language, such as angry, happy, sad, excited, etc. So instead of simply learning/classifying three classes (positive, negative, neutral), the goal is to distinguish among many specific classes."
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Why Use Sentiment Analysis?\n",
26 | "\n",
27 | "The actual usefulness of sentiment analysis depends on the industry using it, but the most common reasons to use it involve scraping lots of data (e.g. Twitter feeds or Reddit comments) to determine how customers/users feel about a particular brand, product, or service. \n",
28 | "\n",
29 | "There is also a use for sentiment analysis when analyzing financial securities (the stock market): if a large proportion of people shift in sentiment about a particular market or stock, that is going to affect the prices of the securities involved."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Specific Uses\n",
37 | "\n",
38 | "* Insight into opinions on specific political policies\n",
39 | "* Brand monitoring (how is a brand perceived?)\n",
40 | "* Identify good and bad aspects of a product or its ads\n",
41 | "* Impact of changes in sentiment on securities markets\n",
42 | "* Will likely be used one day with virtual assistants and other AI\n",
43 | "* Hotels can use it to know how they can improve their property and service"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | " "
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "## Getting Started"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 1,
63 | "metadata": {
64 | "collapsed": true,
65 | "deletable": true,
66 | "editable": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "# import what we need\n",
71 | "import pandas as pd\n",
72 | "from pandas import DataFrame as DF, Series\n",
73 | "\n",
74 | "import numpy as np\n",
75 | "\n",
76 | "import matplotlib.pyplot as plt\n",
77 | "%matplotlib inline\n",
78 | "\n",
79 | "from textblob import TextBlob"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 2,
85 | "metadata": {
86 | "collapsed": true,
87 | "deletable": true,
88 | "editable": true
89 | },
90 | "outputs": [],
91 | "source": [
92 | "# read data\n",
93 | "cols = ['airline_sentiment','airline_sentiment_confidence',\n",
94 | " 'airline','name','text']\n",
95 | "data = pd.read_csv('tweets.csv', usecols=cols)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Below are the first 5 rows of our data. We will only be using the first two features and the last feature."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 3,
108 | "metadata": {
109 | "collapsed": false,
110 | "deletable": true,
111 | "editable": true
112 | },
113 | "outputs": [
114 | {
115 | "data": {
116 | "text/html": [
117 | "
\n", 118 | "\n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | "
airline_sentimentairline_sentiment_confidenceairlinenametext
0neutral1.0000Virgin Americacairdin@VirginAmerica What @dhepburn said.
1positive0.3486Virgin Americajnardino@VirginAmerica plus you've added commercials t...
2neutral0.6837Virgin Americayvonnalynn@VirginAmerica I didn't today... Must mean I n...
3negative1.0000Virgin Americajnardino@VirginAmerica it's really aggressive to blast...
4negative1.0000Virgin Americajnardino@VirginAmerica and it's a really big bad thing...
\n", 172 | "
" 173 | ], 174 | "text/plain": [ 175 | " airline_sentiment airline_sentiment_confidence airline name \\\n", 176 | "0 neutral 1.0000 Virgin America cairdin \n", 177 | "1 positive 0.3486 Virgin America jnardino \n", 178 | "2 neutral 0.6837 Virgin America yvonnalynn \n", 179 | "3 negative 1.0000 Virgin America jnardino \n", 180 | "4 negative 1.0000 Virgin America jnardino \n", 181 | "\n", 182 | " text \n", 183 | "0 @VirginAmerica What @dhepburn said. \n", 184 | "1 @VirginAmerica plus you've added commercials t... \n", 185 | "2 @VirginAmerica I didn't today... Must mean I n... \n", 186 | "3 @VirginAmerica it's really aggressive to blast... \n", 187 | "4 @VirginAmerica and it's a really big bad thing... " 188 | ] 189 | }, 190 | "execution_count": 3, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "data.head()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "# Polarity & Subjectivity Using TextBlob `sentiment`" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "deletable": true, 210 | "editable": true 211 | }, 212 | "source": [ 213 | "## Basic Sentiment Analysis" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "deletable": true, 220 | "editable": true 221 | }, 222 | "source": [ 223 | "### Using the TextBlob `sentiment` method" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": { 229 | "deletable": true, 230 | "editable": true 231 | }, 232 | "source": [ 233 | "TextBlob has a `sentiment` method that can be used on any `TextBlob` object. It returns two values:\n", 234 | "* polarity: value in range [-1, 1], indicating how negative or positive the text is (close to 0.0 is neutral).\n", 235 | "* subjectivity: value in range [0, 1], indicating how subjective the text is (1 is very subjective)\n", 236 | "\n", 237 | "This method is very basic, and there is a lot to be desired, but it can still be helpful if you don't have opportunity to train a classifier, and just need some rough results." 
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 4,
243 | "metadata": {
244 | "collapsed": false,
245 | "deletable": true,
246 | "editable": true
247 | },
248 | "outputs": [
249 | {
250 | "name": "stdout",
251 | "output_type": "stream",
252 | "text": [
253 | "The food is on the table \n",
254 | "(p=0.0, s=0.0) \n",
255 | "\n",
256 | "The food is green \n",
257 | "(p=-0.2, s=0.3) \n",
258 | "\n",
259 | "I don't like the food \n",
260 | "(p=0.0, s=0.0) \n",
261 | "\n",
262 | "I do not like the food \n",
263 | "(p=0.0, s=0.0) \n",
264 | "\n",
265 | "I like the food \n",
266 | "(p=0.0, s=0.0) \n",
267 | "\n",
268 | "I don't love the food \n",
269 | "(p=0.5, s=0.6) \n",
270 | "\n",
271 | "I do not love the food \n",
272 | "(p=-0.25, s=0.6) \n",
273 | "\n",
274 | "I hate the food \n",
275 | "(p=-0.8, s=0.9) \n",
276 | "\n",
277 | "I love the food \n",
278 | "(p=0.5, s=0.6) \n",
279 | "\n",
280 | "The food is delicious \n",
281 | "(p=1.0, s=1.0) \n",
282 | "\n"
283 | ]
284 | }
285 | ],
286 | "source": [
287 | "lines = [\"The food is on the table\", \"The food is green\", \"I don't like the food\",\n",
288 | " \"I do not like the food\", \"I like the food\", \"I don't love the food\", \"I do not love the food\",\n",
289 | " \"I hate the food\", \"I love the food\", \"The food is delicious\"]\n",
290 | "\n",
291 | "# analyze the sentences\n",
292 | "sentiments = [b.sentiment for b in [TextBlob(l) for l in lines]]\n",
293 | "for l,s in zip(lines, sentiments):\n",
294 | " print('{} \\n(p={}, s={})'.format(l, s[0], s[1]), '\\n')"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {
300 | "deletable": true,
301 | "editable": true
302 | },
303 | "source": [
304 | "As seen above, this method doesn't recognize negative contractions (e.g. don't), and it has trouble with ambiguous words that can take on multiple meanings (e.g. like, which is also used for comparison).\n",
305 | "\n",
306 | "Let's see how it does with some real tweets."
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "## Using The `sentiment` Method on Tweets"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "We will get a subset of our data that contains only the first 10 rows that have a confidence level greater than 0.6. We are uninterested in entries with a high level of uncertainty, since keeping low-confidence observations would reduce the certainty of the evaluations we make later."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 5,
326 | "metadata": {
327 | "collapsed": false,
328 | "deletable": true,
329 | "editable": true
330 | },
331 | "outputs": [],
332 | "source": [
333 | "# get subset of tweets where confidence is > 0.6\n",
334 | "subset = data[data.airline_sentiment_confidence > 0.6]\\\n",
335 | " .head(10).copy().reset_index(drop=True)\n",
336 | "tweets = subset.text"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 6,
342 | "metadata": {
343 | "collapsed": false,
344 | "deletable": true,
345 | "editable": true
346 | },
347 | "outputs": [
348 | {
349 | "data": {
350 | "text/html": [
351 | "
\n", 352 | "\n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | "
airline_sentimentairline_sentiment_confidenceairlinenametext
0neutral1.0000Virgin Americacairdin@VirginAmerica What @dhepburn said.
1neutral0.6837Virgin Americayvonnalynn@VirginAmerica I didn't today... Must mean I n...
2negative1.0000Virgin Americajnardino@VirginAmerica it's really aggressive to blast...
3negative1.0000Virgin Americajnardino@VirginAmerica and it's a really big bad thing...
4negative1.0000Virgin Americajnardino@VirginAmerica seriously would pay $30 a fligh...
5positive0.6745Virgin Americacjmcginnis@VirginAmerica yes, nearly every time I fly VX...
6neutral0.6340Virgin Americapilot@VirginAmerica Really missed a prime opportuni...
7positive0.6559Virgin Americadhepburn@virginamerica Well, I didn't…but NOW I DO! :-D
8positive1.0000Virgin AmericaYupitsTate@VirginAmerica it was amazing, and arrived an ...
9neutral0.6769Virgin Americaidk_but_youtube@VirginAmerica did you know that suicide is th...
\n", 446 | "
" 447 | ], 448 | "text/plain": [ 449 | " airline_sentiment airline_sentiment_confidence airline \\\n", 450 | "0 neutral 1.0000 Virgin America \n", 451 | "1 neutral 0.6837 Virgin America \n", 452 | "2 negative 1.0000 Virgin America \n", 453 | "3 negative 1.0000 Virgin America \n", 454 | "4 negative 1.0000 Virgin America \n", 455 | "5 positive 0.6745 Virgin America \n", 456 | "6 neutral 0.6340 Virgin America \n", 457 | "7 positive 0.6559 Virgin America \n", 458 | "8 positive 1.0000 Virgin America \n", 459 | "9 neutral 0.6769 Virgin America \n", 460 | "\n", 461 | " name text \n", 462 | "0 cairdin @VirginAmerica What @dhepburn said. \n", 463 | "1 yvonnalynn @VirginAmerica I didn't today... Must mean I n... \n", 464 | "2 jnardino @VirginAmerica it's really aggressive to blast... \n", 465 | "3 jnardino @VirginAmerica and it's a really big bad thing... \n", 466 | "4 jnardino @VirginAmerica seriously would pay $30 a fligh... \n", 467 | "5 cjmcginnis @VirginAmerica yes, nearly every time I fly VX... \n", 468 | "6 pilot @VirginAmerica Really missed a prime opportuni... \n", 469 | "7 dhepburn @virginamerica Well, I didn't…but NOW I DO! :-D \n", 470 | "8 YupitsTate @VirginAmerica it was amazing, and arrived an ... \n", 471 | "9 idk_but_youtube @VirginAmerica did you know that suicide is th... " 472 | ] 473 | }, 474 | "execution_count": 6, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "subset" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "### Compare the `sentiment` predictions with each line in `subset`\n", 488 | "\n", 489 | "We want to get a sense of how each tweet is being classified" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 7, 495 | "metadata": { 496 | "collapsed": false, 497 | "deletable": true, 498 | "editable": true 499 | }, 500 | "outputs": [ 501 | { 502 | "name": "stdout", 503 | "output_type": "stream", 504 | "text": [ 505 | "@VirginAmerica What @dhepburn said. \n", 506 | " 0.0 (target: neutral) \n", 507 | "\n", 508 | "@VirginAmerica I didn't today... Must mean I need to take another trip! \n", 509 | " -0.390625 (target: neutral) \n", 510 | "\n", 511 | "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse \n", 512 | " 0.0062500000000000056 (target: negative) \n", 513 | "\n", 514 | "@VirginAmerica and it's a really big bad thing about it \n", 515 | " -0.3499999999999999 (target: negative) \n", 516 | "\n", 517 | "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\n", 518 | "it's really the only bad thing about flying VA \n", 519 | " -0.2083333333333333 (target: negative) \n", 520 | "\n", 521 | "@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) \n", 522 | " 0.4666666666666666 (target: positive) \n", 523 | "\n", 524 | "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP \n", 525 | " 0.2 (target: neutral) \n", 526 | "\n", 527 | "@virginamerica Well, I didn't…but NOW I DO! :-D \n", 528 | " 1.0 (target: positive) \n", 529 | "\n", 530 | "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me. 
\n", 531 | " 0.4666666666666666 (target: positive) \n", 532 | "\n", 533 | "@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24 \n", 534 | " 0.0 (target: neutral) \n", 535 | "\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "# print the tweets and predicted polarity line-by-line\n", 541 | "for i,t in enumerate(tweets):\n", 542 | " s = TextBlob(t).sentiment\n", 543 | " target = subset.airline_sentiment[i]\n", 544 | " print(t, '\\n', '{} (target: {}) \\n'.format(s[0], target))" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": { 550 | "deletable": true, 551 | "editable": true 552 | }, 553 | "source": [ 554 | "This basic sentiment analyzer missed the mark on 3/10 tweets (2 neutral and 1 negative). That's not too bad, but these results are nothing to celebrate. The perfmance declines quite a bit with larger texts.\n", 555 | "\n", 556 | "Looking at the two tweets the `sentiment` method estimated incorrectly:\n", 557 | "\n", 558 | "**@VirginAmerica I didn't today... Must mean I need to take another trip!**\n", 559 | "This one is interpreted by the computer as negative, and perhaps it's correct. This one is full of ambiguity without any context, and that is probably why the target value in the set is neutral.\n", 560 | "\n", 561 | "**@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse**\n", 562 | "This one is " 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "### Analyze polarity of each word in the last sentence above to see what's happening" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": 8, 575 | "metadata": { 576 | "collapsed": false, 577 | "deletable": true, 578 | "editable": true 579 | }, 580 | "outputs": [ 581 | { 582 | "name": "stdout", 583 | "output_type": "stream", 584 | "text": [ 585 | "VirginAmerica 0.0 \n", 586 | "\n", 587 | "it 0.0 \n", 588 | "\n", 589 | "'s 0.0 \n", 590 | "\n", 591 | "really 0.2 \n", 592 | "\n", 593 | "aggressive 0.0 \n", 594 | "\n", 595 | "to 0.0 \n", 596 | "\n", 597 | "blast 0.0 \n", 598 | "\n", 599 | "obnoxious 0.0 \n", 600 | "\n", 601 | "entertainment 0.0 \n", 602 | "\n", 603 | "in 0.0 \n", 604 | "\n", 605 | "your 0.0 \n", 606 | "\n", 607 | "guests 0.0 \n", 608 | "\n", 609 | "faces 0.0 \n", 610 | "\n", 611 | "amp 0.0 \n", 612 | "\n", 613 | "they 0.0 \n", 614 | "\n", 615 | "have 0.0 \n", 616 | "\n", 617 | "little -0.1875 \n", 618 | "\n", 619 | "recourse 0.0 \n", 620 | "\n" 621 | ] 622 | } 623 | ], 624 | "source": [ 625 | "words = TextBlob(tweets[2]).words\n", 626 | "for w in words: print(w, TextBlob(w).sentiment[0], '\\n')" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": { 632 | "deletable": true, 633 | "editable": true 634 | }, 635 | "source": [ 636 | "We can see that the `sentiment` method does not consider the words \"obnoxious\" or \"aggressive\" to be negative, which is a glaring problem for our analysis. This method is clearly limited and we need a better method." 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": { 642 | "deletable": true, 643 | "editable": true 644 | }, 645 | "source": [ 646 | "# Naive Bayes Classifier for Sentiment Anlaysis" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "Here we will use a Naive Bayes Classifier (included with TextBlob) to create a better sentiment analyzer. 
We will only train on a small portion of our data since it takes a while to train. However, even with a small amount of training data we can get better results than the `sentiment` method.\n",
654 | "\n",
655 | "There are other classifiers included with TextBlob, but this one is easy to use and gives good performance.\n",
656 | "\n",
657 | "We will start with three goals:\n",
658 | "* learn to train and test/evaluate this classifier using a subset of our data\n",
659 | "* compare the performance to the original sentiment method\n",
660 | "* look at the features the classifier is extracting from the text"
661 | ]
662 | },
663 | {
664 | "cell_type": "markdown",
665 | "metadata": {},
666 | "source": [
667 | "### Create train and test sets\n",
668 | "\n",
669 | "* train the model on the first set\n",
670 | "* test/evaluate it on the other"
671 | ]
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "The set below named `reduced` is reduced in dimensionality (keeping only the features/columns we care about).\n",
678 | "\n",
679 | "The `train` and `test` sets are created using something called a list comprehension. If you don't know what that is, it's okay, and you can look it up later. What is important to know is that the Naive Bayes classifier takes data in the form of a list of pairs (2-tuples), where each pair is one observation `(text, label)` and `label` is the class label that belongs to the text."
680 | ]
681 | },
682 | {
683 | "cell_type": "code",
684 | "execution_count": 9,
685 | "metadata": {
686 | "collapsed": false,
687 | "deletable": true,
688 | "editable": true
689 | },
690 | "outputs": [],
691 | "source": [
692 | "# get reduced set (.loc keeps only the columns we care about; the older .ix accessor is deprecated)\n",
693 | "reduced = data.loc[:, ['airline_sentiment','text']].copy()\n",
694 | "reduced.rename(columns={'airline_sentiment': 'target'}, inplace=True)\n",
695 | "\n",
696 | "# now create train and test sets for the first 500 tweets\n",
697 | "# for the TextBlob classifier we need a list of pairs (string, target)\n",
698 | "train = [(s, t) for s,t in zip(reduced.iloc[:350].text, reduced.iloc[:350].target)]\n",
699 | "test = [(s, t) for s,t in zip(reduced.iloc[350:500].text, reduced.iloc[350:500].target)]"
700 | ]
701 | },
702 | {
703 | "cell_type": "markdown",
704 | "metadata": {},
705 | "source": [
706 | "### Train and evaluate"
707 | ]
708 | },
709 | {
710 | "cell_type": "code",
711 | "execution_count": 10,
712 | "metadata": {
713 | "collapsed": false,
714 | "deletable": true,
715 | "editable": true
716 | },
717 | "outputs": [
718 | {
719 | "data": {
720 | "text/plain": [
721 | "0.6066666666666667"
722 | ]
723 | },
724 | "execution_count": 10,
725 | "metadata": {},
726 | "output_type": "execute_result"
727 | }
728 | ],
729 | "source": [
730 | "# import the classifier\n",
731 | "from textblob.classifiers import NaiveBayesClassifier\n",
732 | "\n",
733 | "# train\n",
734 | "cl = NaiveBayesClassifier(train)\n",
735 | "# evaluate\n",
736 | "cl.accuracy(test)"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 11,
742 | "metadata": {
743 | "collapsed": false,
744 | "deletable": true,
745 | "editable": true
746 | },
747 | "outputs": [
748 | {
749 | "data": {
750 | "text/plain": [
751 | "negative 9178\n",
752 | "neutral 3099\n",
753 | "positive 2363\n",
754 | "Name: target, dtype: int64"
755 | ]
756 | },
757 | "execution_count": 11,
758 | "metadata": {},
759 | "output_type": "execute_result"
760 | }
761 | ],
762 | "source": [
763 | "# a quick look at the distribution of class labels\n",
764 | 
"reduced.target.value_counts()" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": { 770 | "deletable": true, 771 | "editable": true 772 | }, 773 | "source": [ 774 | "The classes in the test set are pretty much balanced, but the classes in the entire reduced set are not balanced." 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": { 780 | "deletable": true, 781 | "editable": true 782 | }, 783 | "source": [ 784 | "Let's compare the 61% classifier accuracy to the performance of the `sentiment` method." 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "## When accuracy isn't good enough" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": { 797 | "deletable": true, 798 | "editable": true 799 | }, 800 | "source": [ 801 | "**Need better scoring method for multi-class predictions**\n", 802 | "\n", 803 | "Regular accuracy is simply the ratio of number of correct predictions to total number of predictions made. This pays no attention to how many classes there are, or how well each one is predicted.\n", 804 | "\n", 805 | "**When it’s not good enough**\n", 806 | "* there are more than two classes (in our case there are 3)\n", 807 | "* there is an imbalance (at least one class with far fewer instances than another)\n", 808 | "\n", 809 | "If there is a strong imbalance (and this does happen) where there are two classes one only happens 5% of the time, if all we do is predict everything to be the majority class, then we will automatically get 95% accuracy. That's meaningless in such a case.\n", 810 | "\n", 811 | "**Precision and Recall are two useful metrics in these cases**\n", 812 | "\n", 813 | "Precision = TP / (TP + FP) : how often predictions of a specific class are correct\n", 814 | "\n", 815 | "TP : True Positive
\n", 816 | "FP : False Positive\n", 817 | "\n", 818 | "Recall = TP / (TP + FN) : how often specific classes are identified (not missed)\n", 819 | "\n", 820 | "FN : False Negative\n", 821 | "\n", 822 | "**Precision & Recall**\n", 823 | "\n", 824 | "Precision = $\\frac{TP}{TP + FP}$\n", 825 | "\n", 826 | "Recall = $\\frac{TP}{TP + FN}$" 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": 13, 832 | "metadata": { 833 | "collapsed": false, 834 | "deletable": true, 835 | "editable": true 836 | }, 837 | "outputs": [], 838 | "source": [ 839 | "# create a score function that will give precision and recall values for each class\n", 840 | "def score(true, predicted):\n", 841 | " eq = np.equal\n", 842 | " \n", 843 | " t = np.array(true)\n", 844 | " p = np.array(predicted)\n", 845 | " \n", 846 | " tp = np.array([eq((t == c)*(p == c), 1).sum() for c in np.unique(t)])\n", 847 | " fp = np.array([eq((t != c)*(p == c), 1).sum() for c in np.unique(t)])\n", 848 | " fn = np.array([eq((t == c)*(p != c), 1).sum() for c in np.unique(t)])\n", 849 | "\n", 850 | " precision = tp/(tp + fp)\n", 851 | " recall = tp/(tp + fn)\n", 852 | " \n", 853 | " return (np.unique(t), precision, recall)" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": { 859 | "deletable": true, 860 | "editable": true 861 | }, 862 | "source": [ 863 | "### Evaluate classifier on larger set\n", 864 | "\\* **skip this; takes too long** \\*\n", 865 | "\n", 866 | "**With train/test split**" 867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": 15, 872 | "metadata": { 873 | "collapsed": true, 874 | "deletable": true, 875 | "editable": true 876 | }, 877 | "outputs": [], 878 | "source": [ 879 | "# create new train and test sets\n", 880 | "# for the TextBlob classifier we need a list of doubles (string, target)\n", 881 | "\n", 882 | "# train = [(s, t) for s,t in zip(reduced.iloc[:1500].text, reduced.iloc[:1500].target)]\n", 883 | "# test = [(s, t) for s,t in zip(reduced.iloc[1500:2000].text, reduced.iloc[1500:2000].target)]" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 16, 889 | "metadata": { 890 | "collapsed": false, 891 | "deletable": true, 892 | "editable": true 893 | }, 894 | "outputs": [ 895 | { 896 | "data": { 897 | "text/plain": [ 898 | "0.786" 899 | ] 900 | }, 901 | "execution_count": 16, 902 | "metadata": {}, 903 | "output_type": "execute_result" 904 | } 905 | ], 906 | "source": [ 907 | "# train\n", 908 | "# cl = NaiveBayesClassifier(train)\n", 909 | "\n", 910 | "# evaluate\n", 911 | "# cl.accuracy(test)\n", 912 | "# 0.786" 913 | ] 914 | }, 915 | { 916 | "cell_type": "code", 917 | "execution_count": null, 918 | "metadata": { 919 | "collapsed": true 920 | }, 921 | "outputs": [], 922 | "source": [] 923 | }, 924 | { 925 | "cell_type": "markdown", 926 | "metadata": {}, 927 | "source": [ 928 | "# Practice Problems\n", 929 | "\n", 930 | "1. Create a pandas series of polarity values predicted for all entries in the reduced set using the sentiment method\n", 931 | "2. Create a column in the reduced set with class labels mapped from the polarity values in (1.) using the following rules:\n", 932 | " - polarity < - 0.1 : ‘negative’\n", 933 | " - polarity > 0.1 : ‘positive’\n", 934 | " - else : ‘neutral’\n", 935 | "3. Compute the accuracy of the predicted labels from (2.) for the same range as the test set [350:500]\n", 936 | "4. 
Update the score function to print a clean table of scores (hint: use pandas) with\n",
937 | " - rows for precision and recall\n",
938 | " - columns for class labels\n"
939 | ]
940 | },
941 | {
942 | "cell_type": "markdown",
943 | "metadata": {},
944 | "source": [
945 | "# Naive Bayes Classifier: Digging Deeper"
946 | ]
947 | },
948 | {
949 | "cell_type": "markdown",
950 | "metadata": {},
951 | "source": [
952 | "## Making Predictions\n",
953 | "\n",
954 | "`NaiveBayesClassifier` has a `classify` method that takes text (a single string) as an argument. This means that we can either classify some string that we choose to type by hand, or classify tweets from our test set individually."
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": 24,
960 | "metadata": {
961 | "collapsed": false
962 | },
963 | "outputs": [
964 | {
965 | "data": {
966 | "text/plain": [
967 | "'positive'"
968 | ]
969 | },
970 | "execution_count": 24,
971 | "metadata": {},
972 | "output_type": "execute_result"
973 | }
974 | ],
975 | "source": [
976 | "cl.classify('I love this airline')"
977 | ]
978 | },
979 | {
980 | "cell_type": "markdown",
981 | "metadata": {},
982 | "source": [
983 | "### Getting class probabilities"
984 | ]
985 | },
986 | {
987 | "cell_type": "code",
988 | "execution_count": 26,
989 | "metadata": {
990 | "collapsed": false
991 | },
992 | "outputs": [
993 | {
994 | "data": {
995 | "text/plain": [
996 | "'positive'"
997 | ]
998 | },
999 | "execution_count": 26,
1000 | "metadata": {},
1001 | "output_type": "execute_result"
1002 | }
1003 | ],
1004 | "source": [
1005 | "probs = cl.prob_classify('I love this airline')\n",
1006 | "probs.max()"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 27,
1012 | "metadata": {
1013 | "collapsed": false
1014 | },
1015 | "outputs": [
1016 | {
1017 | "data": {
1018 | "text/plain": [
1019 | "0.8788493380472053"
1020 | ]
1021 | },
1022 | "execution_count": 27,
1023 | "metadata": {},
1024 | "output_type": "execute_result"
1025 | }
1026 | ],
1027 | "source": [
1028 | "probs.prob('positive')"
1029 | ]
1030 | },
1031 | {
1032 | "cell_type": "code",
1033 | "execution_count": 28,
1034 | "metadata": {
1035 | "collapsed": false
1036 | },
1037 | "outputs": [
1038 | {
1039 | "data": {
1040 | "text/plain": [
1041 | "0.01575421132591375"
1042 | ]
1043 | },
1044 | "execution_count": 28,
1045 | "metadata": {},
1046 | "output_type": "execute_result"
1047 | }
1048 | ],
1049 | "source": [
1050 | "probs.prob('negative')"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "markdown",
1055 | "metadata": {},
1056 | "source": [
1057 | "The above can be useful if you want to modify how something is classified by setting a threshold. For example, you may want to classify something as positive only if the probability exceeds 0.9, instead of it simply having the highest probability."
1058 | ]
1059 | },
1060 | {
1061 | "cell_type": "markdown",
1062 | "metadata": {
1063 | "collapsed": true,
1064 | "deletable": true,
1065 | "editable": true
1066 | },
1067 | "source": [
1068 | "## Informative Features\n",
1069 | "\n",
1070 | "The method below gives us some insight into how the classifier is making decisions. For example, we can see that if a string contains the word \"great\", there are 9.7:1 odds that the string is positive rather than negative. All of the features are taken into account for a given string, so a string containing \"great\" will not automatically be classified as positive."
1071 | ]
1072 | },
1073 | {
1074 | "cell_type": "code",
1075 | "execution_count": 17,
1076 | "metadata": {
1077 | "collapsed": false,
1078 | "deletable": true,
1079 | "editable": true
1080 | },
1081 | "outputs": [
1082 | {
1083 | "name": "stdout",
1084 | "output_type": "stream",
1085 | "text": [
1086 | "Most Informative Features\n",
1087 | " contains(no) = True negati : neutra = 9.7 : 1.0\n",
1088 | " contains(great) = True positi : negati = 9.7 : 1.0\n",
1089 | " contains(Thanks) = True positi : negati = 8.7 : 1.0\n",
1090 | " contains(love) = True positi : negati = 8.7 : 1.0\n",
1091 | " contains(thanks) = True positi : negati = 6.9 : 1.0\n",
1092 | " contains(site) = True negati : positi = 6.5 : 1.0\n",
1093 | " contains(not) = True negati : positi = 6.0 : 1.0\n",
1094 | " contains(amazing) = True positi : negati = 6.0 : 1.0\n",
1095 | " contains(Thank) = True positi : negati = 6.0 : 1.0\n",
1096 | " contains(website) = True negati : neutra = 5.5 : 1.0\n"
1097 | ]
1098 | }
1099 | ],
1100 | "source": [
1101 | "cl.show_informative_features(10)"
1102 | ]
1103 | },
1104 | {
1105 | "cell_type": "markdown",
1106 | "metadata": {
1107 | "deletable": true,
1108 | "editable": true
1109 | },
1110 | "source": [
1111 | "**How to interpret this:**\n",
1112 | "* We are given rows that have `contains(feature) = True/False` and a comparison of two class labels, with a ratio that indicates how much more likely one is than the other\n",
1113 | "* The printed results are in descending order of importance\n",
1114 | "* Ex: `contains(no) = True` gives the ratio 9.7 : 1.0, showing that such a string is about 9.7 times more likely to be negative than neutral\n",
1115 | "* The default features for the Naive Bayes classifier are the individual words found in the data"
1116 | ]
1117 | },
1118 | {
1119 | "cell_type": "markdown",
1120 | "metadata": {},
1121 | "source": [
1122 | "## Extracting Features\n",
1123 | "\n",
1124 | "The classifier's `extract_features` method serves one purpose: take a string and return a dictionary of all the features the classifier knows (individual words by default) and whether or not each one appears in the string. It is essentially a binary feature vector."
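]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the call, assuming the classifier `cl` trained above (the full dictionary for a real tweet is shown in the next cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# extract_features maps a string to {contains(word): True/False}\n",
"feats = cl.extract_features('I love this airline')\n",
"# peek at a couple of entries instead of printing the whole dictionary\n",
"feats.get('contains(love)'), feats.get('contains(no)')"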
1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 22, 1130 | "metadata": { 1131 | "collapsed": false, 1132 | "deletable": true, 1133 | "editable": true 1134 | }, 1135 | "outputs": [ 1136 | { 1137 | "data": { 1138 | "text/plain": [ 1139 | "{'contains(while)': False,\n", 1140 | " 'contains(schedule)': False,\n", 1141 | " 'contains(week)': False,\n", 1142 | " 'contains(hard)': False,\n", 1143 | " 'contains(sorry)': False,\n", 1144 | " 'contains(t.co/zSuZTNAIJq)': False,\n", 1145 | " 'contains(views)': False,\n", 1146 | " 'contains(add)': False,\n", 1147 | " 'contains(issue)': False,\n", 1148 | " 'contains(quick)': False,\n", 1149 | " 'contains(Andrews)': False,\n", 1150 | " 'contains(Follow)': False,\n", 1151 | " 'contains(enter)': False,\n", 1152 | " 'contains(Many)': False,\n", 1153 | " 'contains(t.co/UT5GrRwAaA)': False,\n", 1154 | " 'contains(Holla)': False,\n", 1155 | " 'contains(Same)': False,\n", 1156 | " 'contains(cake)': False,\n", 1157 | " 'contains(t.co/gLXFwP6nQH)': False,\n", 1158 | " 'contains(NewsVP)': False,\n", 1159 | " 'contains(24hrs)': False,\n", 1160 | " 'contains(reimburse)': False,\n", 1161 | " 'contains(makes)': False,\n", 1162 | " 'contains(back-end)': False,\n", 1163 | " 'contains(PrincessHalf)': False,\n", 1164 | " 'contains(pros)': False,\n", 1165 | " 'contains(if)': False,\n", 1166 | " 'contains(wish)': False,\n", 1167 | " 'contains(t.co/XZ6qeG3nef)': False,\n", 1168 | " 'contains(bked)': False,\n", 1169 | " 'contains(account)': False,\n", 1170 | " 'contains(Lister)': False,\n", 1171 | " 'contains(keeps)': False,\n", 1172 | " 'contains(brand)': False,\n", 1173 | " 'contains(jump)': False,\n", 1174 | " 'contains(deals)': False,\n", 1175 | " 'contains(Handily)': False,\n", 1176 | " 'contains(has)': False,\n", 1177 | " 'contains(charging)': False,\n", 1178 | " 'contains(Debbie)': False,\n", 1179 | " 'contains(ressie)': False,\n", 1180 | " 'contains(time)': False,\n", 1181 | " 'contains(t.co/UKdjjijroW)': False,\n", 1182 | " 'contains(downtown)': False,\n", 1183 | " 'contains(t.co/yPo7nYpRZl)': False,\n", 1184 | " 'contains(2015)': False,\n", 1185 | " 'contains(interesting)': False,\n", 1186 | " 'contains(gon)': False,\n", 1187 | " 'contains(answer)': False,\n", 1188 | " 'contains(DFW)': False,\n", 1189 | " 'contains(GMA)': False,\n", 1190 | " 'contains(redirected)': False,\n", 1191 | " 'contains(first)': False,\n", 1192 | " 'contains(net)': False,\n", 1193 | " 'contains(You’ve)': False,\n", 1194 | " 'contains(100)': False,\n", 1195 | " 'contains(last)': False,\n", 1196 | " 'contains(sec)': False,\n", 1197 | " 'contains(rain)': False,\n", 1198 | " 'contains(b/c)': False,\n", 1199 | " 'contains(having)': False,\n", 1200 | " 'contains(SEA)': False,\n", 1201 | " 'contains(Like)': False,\n", 1202 | " 'contains(VirginAmerica)': False,\n", 1203 | " 'contains(💗🇬🇧💗🇺🇸💗)': False,\n", 1204 | " 'contains(taxes)': False,\n", 1205 | " 'contains(Such)': False,\n", 1206 | " 'contains(disappointing)': False,\n", 1207 | " 'contains(t.co/SLLYIBE2vQ)': False,\n", 1208 | " 'contains(come)': False,\n", 1209 | " 'contains(wanted)': False,\n", 1210 | " 'contains(might)': False,\n", 1211 | " 'contains(back)': False,\n", 1212 | " 'contains(JFK)': False,\n", 1213 | " 'contains(bin)': False,\n", 1214 | " 'contains(check-in)': False,\n", 1215 | " 'contains(wan)': False,\n", 1216 | " 'contains(AvalonHollywood)': False,\n", 1217 | " 'contains(KETR)': False,\n", 1218 | " 'contains(blast)': False,\n", 1219 | " 'contains(spotify)': False,\n", 1220 | " 
'contains(financial)': False,\n", 1221 | " 'contains(rockstars)': False,\n", 1222 | " 'contains(2/27)': False,\n", 1223 | " 'contains(Flightly)': False,\n", 1224 | " 'contains(received)': False,\n", 1225 | " 'contains(application)': False,\n", 1226 | " 'contains(If)': False,\n", 1227 | " 'contains(call/email)': False,\n", 1228 | " 'contains(BOS-FLL)': False,\n", 1229 | " 'contains(from)': False,\n", 1230 | " 'contains(hours)': False,\n", 1231 | " 'contains(59)': False,\n", 1232 | " 'contains(delighted)': False,\n", 1233 | " 'contains(Upgrade)': False,\n", 1234 | " 'contains(Not)': False,\n", 1235 | " 'contains(yet)': False,\n", 1236 | " 'contains(Baggage)': False,\n", 1237 | " 'contains(r)': False,\n", 1238 | " 'contains(Applied)': False,\n", 1239 | " 'contains(50)': False,\n", 1240 | " 'contains(complimentary)': False,\n", 1241 | " 'contains(be)': False,\n", 1242 | " 'contains(because)': False,\n", 1243 | " 'contains(Checkin)': False,\n", 1244 | " 'contains(only)': False,\n", 1245 | " 'contains(indicates)': False,\n", 1246 | " 'contains(landing)': False,\n", 1247 | " 'contains(refunding)': False,\n", 1248 | " 'contains(fares)': False,\n", 1249 | " 'contains(Really)': False,\n", 1250 | " 'contains(Reuters)': False,\n", 1251 | " 'contains(Row)': False,\n", 1252 | " 'contains(Every)': False,\n", 1253 | " 'contains(eye)': False,\n", 1254 | " 'contains(midnight)': False,\n", 1255 | " 'contains(congrats)': False,\n", 1256 | " 'contains(Oscars)': False,\n", 1257 | " 'contains(What)': False,\n", 1258 | " 'contains(user)': False,\n", 1259 | " 'contains(manage)': False,\n", 1260 | " 'contains(disruption)': False,\n", 1261 | " 'contains(BTW)': False,\n", 1262 | " 'contains(hiring)': False,\n", 1263 | " 'contains(Middle)': False,\n", 1264 | " 'contains(putting)': False,\n", 1265 | " 'contains(paying)': False,\n", 1266 | " 'contains(rescheduled)': False,\n", 1267 | " 'contains(RNP)': False,\n", 1268 | " 'contains(change)': False,\n", 1269 | " 'contains(hold)': False,\n", 1270 | " 'contains(was)': False,\n", 1271 | " 'contains(soft)': False,\n", 1272 | " 'contains(please)': False,\n", 1273 | " 'contains(ATWOnline)': False,\n", 1274 | " 'contains(t.co/hy0VrfhjHt)': False,\n", 1275 | " 'contains(non)': False,\n", 1276 | " 'contains(longer)': False,\n", 1277 | " 'contains(2-8)': False,\n", 1278 | " 'contains(leading)': False,\n", 1279 | " 'contains(faces)': False,\n", 1280 | " 'contains(continues)': False,\n", 1281 | " 'contains(response)': False,\n", 1282 | " 'contains(You)': False,\n", 1283 | " 'contains(emails)': False,\n", 1284 | " 'contains(exhausted)': False,\n", 1285 | " 'contains(Cancelled)': False,\n", 1286 | " 'contains(our)': False,\n", 1287 | " 'contains(TOMORROW)': False,\n", 1288 | " 'contains(second)': False,\n", 1289 | " 'contains(stylesheets)': False,\n", 1290 | " 'contains(Q4)': False,\n", 1291 | " 'contains(DO)': False,\n", 1292 | " 'contains(register)': False,\n", 1293 | " 'contains(Bandie)': False,\n", 1294 | " 'contains(Use)': False,\n", 1295 | " 'contains(When)': False,\n", 1296 | " 'contains(NO)': False,\n", 1297 | " 'contains(flyer)': False,\n", 1298 | " 'contains(board)': False,\n", 1299 | " \"contains('ve)\": False,\n", 1300 | " 'contains(neverflyvirginforbusiness)': False,\n", 1301 | " 'contains(SilverStatus)': False,\n", 1302 | " 'contains(broken)': False,\n", 1303 | " 'contains(butt)': False,\n", 1304 | " 'contains(Very)': False,\n", 1305 | " 'contains(posted)': False,\n", 1306 | " 'contains(minutes)': False,\n", 1307 | " 'contains(FiDiFamilies)': False,\n", 1308 | " 
'contains(missed)': False,\n", 1309 | " 'contains(DC)': False,\n", 1310 | " 'contains(permanently)': False,\n", 1311 | " 'contains(March)': False,\n", 1312 | " 'contains(sooner)': False,\n", 1313 | " 'contains(looking)': False,\n", 1314 | " 'contains(match)': False,\n", 1315 | " 'contains(completely)': False,\n", 1316 | " 'contains(Hands)': False,\n", 1317 | " 'contains(Hey)': False,\n", 1318 | " 'contains(assistance)': False,\n", 1319 | " 'contains(airplanemodewason)': False,\n", 1320 | " 'contains(get)': False,\n", 1321 | " 'contains(sorted)': False,\n", 1322 | " 'contains(blew)': False,\n", 1323 | " 'contains(somehow)': False,\n", 1324 | " 'contains(Boo)': False,\n", 1325 | " 'contains(cabin)': False,\n", 1326 | " 'contains(you)': False,\n", 1327 | " 'contains(doctor)': False,\n", 1328 | " 'contains(4:50)': False,\n", 1329 | " 'contains(rescheduling)': False,\n", 1330 | " 'contains(SFO-FLL)': False,\n", 1331 | " 'contains(told)': False,\n", 1332 | " 'contains(VAbeatsJblue)': False,\n", 1333 | " 'contains(half)': False,\n", 1334 | " 'contains(ugh)': False,\n", 1335 | " 'contains(does)': False,\n", 1336 | " 'contains(dog)': False,\n", 1337 | " 'contains(picture)': False,\n", 1338 | " 'contains(few)': False,\n", 1339 | " 'contains(distribution)': False,\n", 1340 | " 'contains(passenger)': False,\n", 1341 | " 'contains(advise)': False,\n", 1342 | " 'contains(begrudgingly)': False,\n", 1343 | " 'contains(roasted)': False,\n", 1344 | " 'contains(avail)': False,\n", 1345 | " 'contains(soon)': False,\n", 1346 | " 'contains(U)': False,\n", 1347 | " 'contains(ever)': False,\n", 1348 | " 'contains(virginmedia)': False,\n", 1349 | " 'contains(NYC-JFK)': False,\n", 1350 | " 'contains(behind)': False,\n", 1351 | " 'contains(way)': False,\n", 1352 | " 'contains(JKF)': False,\n", 1353 | " 'contains(EWR)': False,\n", 1354 | " 'contains(comfort)': False,\n", 1355 | " 'contains(2A)': False,\n", 1356 | " 'contains(recourse)': False,\n", 1357 | " 'contains(offer)': False,\n", 1358 | " 'contains(plz)': False,\n", 1359 | " 'contains(FLL)': False,\n", 1360 | " 'contains(View)': False,\n", 1361 | " 'contains(can’t)': False,\n", 1362 | " 'contains(why)': False,\n", 1363 | " 'contains(mountains)': False,\n", 1364 | " 'contains(globe)': False,\n", 1365 | " 'contains(rockstar)': False,\n", 1366 | " 'contains(possible)': False,\n", 1367 | " 'contains(LadyGaga)': False,\n", 1368 | " 'contains(dirty)': False,\n", 1369 | " 'contains(fabulous)': False,\n", 1370 | " 'contains(entertainment)': False,\n", 1371 | " 'contains(purchased)': False,\n", 1372 | " 'contains(landed)': False,\n", 1373 | " 'contains(YOU)': False,\n", 1374 | " 'contains(Dulles_Airport)': False,\n", 1375 | " 'contains(April)': False,\n", 1376 | " 'contains(as)': False,\n", 1377 | " 'contains(IAD)': False,\n", 1378 | " 'contains(mind)': False,\n", 1379 | " 'contains(being)': False,\n", 1380 | " 'contains(SuuperG)': False,\n", 1381 | " 'contains(gt)': False,\n", 1382 | " 'contains(Flighted)': False,\n", 1383 | " 'contains(RT)': False,\n", 1384 | " 'contains(status)': False,\n", 1385 | " 'contains(FCmostinnovative)': False,\n", 1386 | " 'contains(current)': False,\n", 1387 | " 'contains(vendor)': False,\n", 1388 | " 'contains(happening)': False,\n", 1389 | " 'contains(hated)': False,\n", 1390 | " 'contains(shrinerack)': False,\n", 1391 | " 'contains(iced)': False,\n", 1392 | " 'contains(Takes)': False,\n", 1393 | " 'contains(guiltypleasures)': False,\n", 1394 | " 'contains(anything)': False,\n", 1395 | " 'contains(May)': False,\n", 1396 | " 
'contains(giving)': False,\n", 1397 | " 'contains(refreshed)': False,\n", 1398 | " 'contains(subsequent)': False,\n", 1399 | " 'contains(weather)': False,\n", 1400 | " 'contains(built)': False,\n", 1401 | " 'contains(checking)': False,\n", 1402 | " 'contains(heard)': False,\n", 1403 | " 'contains(carrieunderwood)': False,\n", 1404 | " 'contains(Call)': False,\n", 1405 | " 'contains(facing)': False,\n", 1406 | " 'contains(1st)': False,\n", 1407 | " 'contains(front-end)': False,\n", 1408 | " 'contains(even)': False,\n", 1409 | " 'contains(button)': False,\n", 1410 | " 'contains(peeps)': False,\n", 1411 | " 'contains(Get)': False,\n", 1412 | " 'contains(new)': False,\n", 1413 | " 'contains(mobile)': False,\n", 1414 | " 'contains(thank)': False,\n", 1415 | " 'contains(moodlight)': False,\n", 1416 | " 'contains(mentioned)': False,\n", 1417 | " 'contains(turbulence)': False,\n", 1418 | " 'contains(cause)': False,\n", 1419 | " 'contains(eat)': False,\n", 1420 | " 'contains(We)': False,\n", 1421 | " 'contains(Gold)': False,\n", 1422 | " 'contains(reset)': False,\n", 1423 | " 'contains(Doom)': False,\n", 1424 | " 'contains(precipitation)': False,\n", 1425 | " 'contains(getting)': False,\n", 1426 | " 'contains(want)': False,\n", 1427 | " 'contains(least)': False,\n", 1428 | " 'contains(90s)': False,\n", 1429 | " 'contains(Why)': False,\n", 1430 | " 'contains(cool)': False,\n", 1431 | " 'contains(t.co/PYalebgkJt)': False,\n", 1432 | " 'contains(recent)': False,\n", 1433 | " 'contains(past)': False,\n", 1434 | " 'contains(t.co/APtZpuROp4)': False,\n", 1435 | " 'contains(apologies)': False,\n", 1436 | " 'contains(89)': False,\n", 1437 | " 'contains(SFO/LAX)': False,\n", 1438 | " 'contains(says)': False,\n", 1439 | " 'contains(t.co/2npXB6oBMr)': False,\n", 1440 | " 'contains(concerned)': False,\n", 1441 | " 'contains(dropped)': False,\n", 1442 | " 'contains(earlier)': False,\n", 1443 | " 'contains(wondering)': False,\n", 1444 | " 'contains(Wifey)': False,\n", 1445 | " 'contains(expectations)': False,\n", 1446 | " 'contains(That)': False,\n", 1447 | " 'contains(sanity)': False,\n", 1448 | " 'contains(Got)': False,\n", 1449 | " 'contains(Funny)': False,\n", 1450 | " 'contains(see.Very)': False,\n", 1451 | " 'contains(dhepburn)': False,\n", 1452 | " 'contains(910)': False,\n", 1453 | " 'contains(Gon)': False,\n", 1454 | " 'contains(follow)': False,\n", 1455 | " 'contains(white)': False,\n", 1456 | " 'contains(Having)': False,\n", 1457 | " 'contains(chat)': False,\n", 1458 | " 'contains(t.co/RHKaMx9VF5)': False,\n", 1459 | " 'contains(trust)': False,\n", 1460 | " 'contains(hour)': False,\n", 1461 | " 'contains(Valley)': False,\n", 1462 | " 'contains(imagine)': False,\n", 1463 | " 'contains(points)': False,\n", 1464 | " 'contains(luv)': False,\n", 1465 | " 'contains(gentleman)': False,\n", 1466 | " 'contains(rep)': False,\n", 1467 | " 'contains(After)': False,\n", 1468 | " 'contains(👏)': False,\n", 1469 | " 'contains(city’)': False,\n", 1470 | " 'contains(andchexmix)': False,\n", 1471 | " 'contains(more)': False,\n", 1472 | " 'contains(Angeles)': False,\n", 1473 | " 'contains(winds)': False,\n", 1474 | " 'contains(PLEASE)': False,\n", 1475 | " 'contains(month)': False,\n", 1476 | " 'contains(bill)': False,\n", 1477 | " 'contains(needs)': False,\n", 1478 | " 'contains(supposed)': False,\n", 1479 | " 'contains(kicked)': False,\n", 1480 | " 'contains(revue)': False,\n", 1481 | " 'contains(red)': False,\n", 1482 | " 'contains(882)': False,\n", 1483 | " 'contains(pairings)': False,\n", 1484 | " 'contains(883)': 
False,\n", 1485 | " 'contains(surgery)': False,\n", 1486 | " 'contains(girls)': False,\n", 1487 | " 'contains(CarrieUnderwood)': False,\n", 1488 | " 'contains(momma)': False,\n", 1489 | " 'contains(this)': True,\n", 1490 | " 'contains(Worst)': False,\n", 1491 | " 'contains(guests)': False,\n", 1492 | " 'contains(when)': False,\n", 1493 | " 'contains(SSal)': False,\n", 1494 | " 'contains(X)': False,\n", 1495 | " 'contains(hope)': False,\n", 1496 | " 'contains(YOUR)': False,\n", 1497 | " 'contains(ChrysiChrysic)': False,\n", 1498 | " 'contains(Include)': False,\n", 1499 | " 'contains(Still)': False,\n", 1500 | " 'contains(represents)': False,\n", 1501 | " 'contains(Oscars2015)': False,\n", 1502 | " 'contains(MeetTheFleet)': False,\n", 1503 | " 'contains(reply)': False,\n", 1504 | " 'contains(desk)': False,\n", 1505 | " 'contains(spend)': False,\n", 1506 | " 'contains(Thank)': False,\n", 1507 | " 'contains(due)': False,\n", 1508 | " 'contains(And)': False,\n", 1509 | " 'contains(p)': False,\n", 1510 | " 'contains(problem)': False,\n", 1511 | " 'contains(paperwork)': False,\n", 1512 | " 'contains(section)': False,\n", 1513 | " 'contains(shows)': False,\n", 1514 | " 'contains(😂)': False,\n", 1515 | " 'contains(pilots)': False,\n", 1516 | " 'contains(VirginAtlantic)': False,\n", 1517 | " 'contains(Elevate)': False,\n", 1518 | " 'contains(minimal)': False,\n", 1519 | " 'contains(doing)': False,\n", 1520 | " 'contains(severely)': False,\n", 1521 | " 'contains(day)': False,\n", 1522 | " 'contains(Was)': False,\n", 1523 | " 'contains(disappointed)': False,\n", 1524 | " 'contains(degrees)': False,\n", 1525 | " 'contains(gave)': False,\n", 1526 | " 'contains(MayweatherPacquiao)': False,\n", 1527 | " 'contains(JetBlue)': False,\n", 1528 | " 'contains(bubbly)': False,\n", 1529 | " 'contains(Arms)': False,\n", 1530 | " 'contains(watching)': False,\n", 1531 | " 'contains(VX358)': False,\n", 1532 | " 'contains(really)': False,\n", 1533 | " 'contains(anytime)': False,\n", 1534 | " 'contains(2/24)': False,\n", 1535 | " 'contains(SFOtoBOS)': False,\n", 1536 | " 'contains(TODAY)': False,\n", 1537 | " 'contains(in🇺🇸2y)': False,\n", 1538 | " 'contains(team)': False,\n", 1539 | " 'contains(gusty)': False,\n", 1540 | " 'contains(amazing)': False,\n", 1541 | " 'contains(line)': False,\n", 1542 | " 'contains(Deals)': False,\n", 1543 | " 'contains(10)': False,\n", 1544 | " 'contains(song)': False,\n", 1545 | " 'contains(unexpected)': False,\n", 1546 | " 'contains(lame)': False,\n", 1547 | " 'contains(food)': False,\n", 1548 | " 'contains(me)': True,\n", 1549 | " 'contains(done)': False,\n", 1550 | " 'contains(Race)': False,\n", 1551 | " 'contains(along)': False,\n", 1552 | " 'contains(pre-check)': False,\n", 1553 | " 'contains(Airline)': False,\n", 1554 | " 'contains(BestCrew)': False,\n", 1555 | " 'contains(weRin)': False,\n", 1556 | " 'contains(appointments)': False,\n", 1557 | " 'contains(emailed)': False,\n", 1558 | " 'contains(stranded)': False,\n", 1559 | " 'contains(said)': False,\n", 1560 | " 'contains(😃)': False,\n", 1561 | " 'contains(uncomfortable)': False,\n", 1562 | " 'contains(DM)': False,\n", 1563 | " 'contains(Lady)': False,\n", 1564 | " 'contains(Another)': False,\n", 1565 | " 'contains(round)': False,\n", 1566 | " 'contains(lost)': False,\n", 1567 | " 'contains(mention)': False,\n", 1568 | " 'contains(Monday)': False,\n", 1569 | " 'contains(t.co/vC6Keulg2J)': False,\n", 1570 | " 'contains(early)': False,\n", 1571 | " 'contains(neverflyvirgin)': False,\n", 1572 | " 'contains(forward)': False,\n", 
1573 | " 'contains(price)': False,\n", 1574 | " 'contains(Awesome)': False,\n", 1575 | " 'contains(😢)': False,\n", 1576 | " 'contains(Travelzoo)': False,\n", 1577 | " 'contains(worm”)': False,\n", 1578 | " 'contains(check)': False,\n", 1579 | " 'contains(🍷👍💺✈️)': False,\n", 1580 | " 'contains(Dallas-Austin)': False,\n", 1581 | " 'contains(monday)': False,\n", 1582 | " 'contains(Terrible)': False,\n", 1583 | " 'contains(find)': False,\n", 1584 | " 'contains(dislike)': False,\n", 1585 | " 'contains(boy)': False,\n", 1586 | " 'contains(BOS-LAS)': False,\n", 1587 | " 'contains(shaker)': False,\n", 1588 | " 'contains(updates)': False,\n", 1589 | " 'contains(no)': True,\n", 1590 | " 'contains(sneaky)': False,\n", 1591 | " 'contains(one)': False,\n", 1592 | " 'contains(OSCARS2105)': False,\n", 1593 | " 'contains(virgin)': False,\n", 1594 | " 'contains(yesterday)': False,\n", 1595 | " 'contains(inquired)': False,\n", 1596 | " 'contains(t.co/KEK5pDMGiF)': False,\n", 1597 | " 'contains(t.co/wU3LbCNcr9)': False,\n", 1598 | " 'contains(Palm)': False,\n", 1599 | " 'contains(position)': False,\n", 1600 | " 'contains(business)': False,\n", 1601 | " 'contains(rise)': False,\n", 1602 | " 'contains(better)': False,\n", 1603 | " 'contains(direct)': False,\n", 1604 | " 'contains(AmericanAir)': False,\n", 1605 | " 'contains(t.co/PxdEL1nq3l)': False,\n", 1606 | " 'contains(550)': False,\n", 1607 | " 'contains(secure)': False,\n", 1608 | " 'contains(asap)': False,\n", 1609 | " 'contains(missing)': False,\n", 1610 | " 'contains(t.co/DnStITRzWy)': False,\n", 1611 | " 'contains(tickets)': False,\n", 1612 | " 'contains(t.co/F2LFULCbQ7)': False,\n", 1613 | " 'contains(2014)': False,\n", 1614 | " 'contains(kitty)': False,\n", 1615 | " 'contains(itinerary)': False,\n", 1616 | " 'contains(innovation)': False,\n", 1617 | " 'contains(styling)': False,\n", 1618 | " 'contains(buy)': False,\n", 1619 | " 'contains(noair)': False,\n", 1620 | " 'contains(either)': False,\n", 1621 | " \"contains('ll)\": False,\n", 1622 | " 'contains(into)': False,\n", 1623 | " 'contains(selecting)': False,\n", 1624 | " 'contains(tomorrow)': False,\n", 1625 | " 'contains(Shame)': False,\n", 1626 | " 'contains(Bags)': False,\n", 1627 | " 'contains(playing)': False,\n", 1628 | " 'contains(769)': False,\n", 1629 | " 'contains(policy)': False,\n", 1630 | " 'contains(happy)': False,\n", 1631 | " 'contains(BOS)': False,\n", 1632 | " 'contains(pay)': False,\n", 1633 | " 'contains(CheapFlights)': False,\n", 1634 | " 'contains(shown)': False,\n", 1635 | " 'contains(10:50AM)': False,\n", 1636 | " 'contains(ladygaga)': False,\n", 1637 | " 'contains(Comps)': False,\n", 1638 | " 'contains(days)': False,\n", 1639 | " 'contains(smh)': False,\n", 1640 | " 'contains(Austin)': False,\n", 1641 | " 'contains(First)': False,\n", 1642 | " 'contains(biztravel)': False,\n", 1643 | " 'contains(😥)': False,\n", 1644 | " 'contains(attendant)': False,\n", 1645 | " 'contains(husband)': False,\n", 1646 | " 'contains(nonstop)': False,\n", 1647 | " 'contains(process)': False,\n", 1648 | " 'contains(name)': False,\n", 1649 | " 'contains(I’m)': False,\n", 1650 | " 'contains(2:10pm)': False,\n", 1651 | " 'contains(jessicajaymes)': False,\n", 1652 | " 'contains(confirmation)': False,\n", 1653 | " 'contains(adding)': False,\n", 1654 | " 'contains(city)': False,\n", 1655 | " 'contains(Had)': False,\n", 1656 | " 'contains(tech)': False,\n", 1657 | " 'contains(good)': False,\n", 1658 | " 'contains(seems)': False,\n", 1659 | " 'contains(t.co/tvB5zbzVhg)': False,\n", 1660 | " 
'contains(taking)': True,\n", 1661 | " 'contains(Cool)': False,\n", 1662 | " 'contains(confirmed)': False,\n", 1663 | " 'contains(mean)': False,\n", 1664 | " 'contains(someone)': False,\n", 1665 | " 'contains(spending)': False,\n", 1666 | " 'contains(lax)': False,\n", 1667 | " 'contains(Trying)': False,\n", 1668 | " 'contains(entered)': False,\n", 1669 | " 'contains(had)': False,\n", 1670 | " 'contains(assets)': False,\n", 1671 | " 'contains(t.co/rGYwJBbhm4)': False,\n", 1672 | " 'contains(0769)': False,\n", 1673 | " 'contains(remove)': False,\n", 1674 | " 'contains(LAS)': False,\n", 1675 | " 'contains(hipster)': False,\n", 1676 | " 'contains(been)': False,\n", 1677 | " 'contains(No)': False,\n", 1678 | " 'contains(guy)': False,\n", 1679 | " 'contains(7D)': False,\n", 1680 | " 'contains(Budapest)': False,\n", 1681 | " 'contains(applied)': False,\n", 1682 | " 'contains(hotel)': False,\n", 1683 | " 'contains(so)': False,\n", 1684 | " 'contains(seriously)': False,\n", 1685 | " 'contains(99)': False,\n", 1686 | " 'contains(around)': False,\n", 1687 | " 'contains(FreyaBevan_Fund)': False,\n", 1688 | " 'contains(become)': False,\n", 1689 | " 'contains(leaving)': False,\n", 1690 | " 'contains(promised)': False,\n", 1691 | " 'contains(Dulles)': False,\n", 1692 | " 'contains(4Q)': False,\n", 1693 | " 'contains(sounds)': False,\n", 1694 | " 'contains(big)': False,\n", 1695 | " 'contains(compatible)': False,\n", 1696 | " 'contains(pretty)': False,\n", 1697 | " 'contains(drink)': False,\n", 1698 | " 'contains(destroyed)': False,\n", 1699 | " 'contains(uphold)': False,\n", 1700 | " 'contains(t.co/SSUVWwkyHH)': False,\n", 1701 | " 'contains(suck)': False,\n", 1702 | " 'contains(hrs)': False,\n", 1703 | " 'contains(working)': False,\n", 1704 | " 'contains(vegan)': False,\n", 1705 | " 'contains(using)': False,\n", 1706 | " 'contains(Keep)': False,\n", 1707 | " \"contains('s)\": False,\n", 1708 | " 'contains(incubator)': False,\n", 1709 | " 'contains(access)': False,\n", 1710 | " 'contains(heyyyy)': False,\n", 1711 | " 'contains(able)': False,\n", 1712 | " 'contains(side)': False,\n", 1713 | " 'contains(two)': False,\n", 1714 | " 'contains(i)': False,\n", 1715 | " 'contains(Can)': False,\n", 1716 | " 'contains(ur)': False,\n", 1717 | " 'contains(dude)': False,\n", 1718 | " 'contains(t.co/pX8hQOKS3R)': False,\n", 1719 | " 'contains(birthday)': False,\n", 1720 | " 'contains(Congrats)': False,\n", 1721 | " 'contains(💜✈)': False,\n", 1722 | " 'contains(Springs)': False,\n", 1723 | " 'contains(iol)': False,\n", 1724 | " 'contains(most)': False,\n", 1725 | " 'contains(Sad)': False,\n", 1726 | " 'contains(advantage)': False,\n", 1727 | " 'contains(both)': False,\n", 1728 | " 'contains(expected)': False,\n", 1729 | " 'contains(6)': False,\n", 1730 | " 'contains(things)': False,\n", 1731 | " 'contains(flight🍸)': False,\n", 1732 | " 'contains(Grand)': False,\n", 1733 | " 'contains(biz)': False,\n", 1734 | " 'contains(would)': False,\n", 1735 | " 'contains(absolutely)': False,\n", 1736 | " 'contains(t.co/H952rDKTqy”)': False,\n", 1737 | " 'contains(evening)': False,\n", 1738 | " 'contains(paid)': False,\n", 1739 | " 'contains(914-329-0185)': False,\n", 1740 | " 'contains(bound)': False,\n", 1741 | " 'contains(Silicon)': False,\n", 1742 | " 'contains(G)': False,\n", 1743 | " 'contains(damaged)': False,\n", 1744 | " 'contains(adore)': False,\n", 1745 | " 'contains(fl1289)': False,\n", 1746 | " 'contains(is)': True,\n", 1747 | " 'contains(flying)': False,\n", 1748 | " 'contains(customer)': False,\n", 1749 | " 
'contains(Handle)': False,\n", 1750 | " 'contains(Waited)': False,\n", 1751 | " 'contains(Booking)': False,\n", 1752 | " 'contains(On)': False,\n", 1753 | " 'contains(Gaga)': False,\n", 1754 | " 'contains(apparently)': False,\n", 1755 | " 'contains(seat)': False,\n", 1756 | " 'contains(four)': False,\n", 1757 | " 'contains(brought)': False,\n", 1758 | " 'contains(scared)': False,\n", 1759 | " 'contains(shame)': False,\n", 1760 | " 'contains(elevate)': False,\n", 1761 | " 'contains(sunset)': False,\n", 1762 | " 'contains(t.co/1AGR9knCpf)': False,\n", 1763 | " 'contains(Help😍)': False,\n", 1764 | " 'contains(wow)': False,\n", 1765 | " 'contains(excited)': False,\n", 1766 | " 'contains(👸)': False,\n", 1767 | " 'contains(bet)': False,\n", 1768 | " 'contains(should)': False,\n", 1769 | " 'contains(guys)': False,\n", 1770 | " 'contains(normal)': False,\n", 1771 | " 'contains(Whenever)': False,\n", 1772 | " 'contains(member😒)': False,\n", 1773 | " 'contains(❤️)': False,\n", 1774 | " 'contains(AM)': False,\n", 1775 | " 'contains(Problems)': False,\n", 1776 | " 'contains(flight)': True,\n", 1777 | " 'contains(use)': False,\n", 1778 | " 'contains(iconography)': False,\n", 1779 | " 'contains(horrible)': False,\n", 1780 | " 'contains(which)': False,\n", 1781 | " 'contains(wing)': False,\n", 1782 | " 'contains(headed)': False,\n", 1783 | " 'contains(TSA)': False,\n", 1784 | " 'contains(88.9)': False,\n", 1785 | " 'contains(Without)': False,\n", 1786 | " 'contains(upgrade)': False,\n", 1787 | " 'contains(down)': False,\n", 1788 | " 'contains(couple)': False,\n", 1789 | " 'contains(full)': False,\n", 1790 | " 'contains(3)': False,\n", 1791 | " 'contains(w)': False,\n", 1792 | " 'contains(‘select)': False,\n", 1793 | " 'contains(RenttheRunway)': False,\n", 1794 | " 'contains(27)': False,\n", 1795 | " 'contains(Prince)': False,\n", 1796 | " 'contains(support)': False,\n", 1797 | " 'contains(reallytallchris)': False,\n", 1798 | " 'contains(NOW)': False,\n", 1799 | " 'contains(messages)': False,\n", 1800 | " 'contains(diehardvirgin)': False,\n", 1801 | " 'contains(went)': False,\n", 1802 | " 'contains(class)': False,\n", 1803 | " 'contains(Status)': False,\n", 1804 | " 'contains(soooo)': False,\n", 1805 | " 'contains(what)': False,\n", 1806 | " 'contains(💕💕)': False,\n", 1807 | " 'contains(money)': False,\n", 1808 | " 'contains(open)': False,\n", 1809 | " 'contains(going)': False,\n", 1810 | " 'contains(t.co/enIQg0buzj)': False,\n", 1811 | " 'contains(work)': False,\n", 1812 | " 'contains(thx)': False,\n", 1813 | " 'contains(airlines)': False,\n", 1814 | " 'contains(☺️👍)': False,\n", 1815 | " 'contains(sorrynotsorry)': False,\n", 1816 | " 'contains(tribute)': False,\n", 1817 | " 'contains(Creates)': False,\n", 1818 | " 'contains(mechanical)': False,\n", 1819 | " 'contains(tacky)': False,\n", 1820 | " 'contains(luggage)': False,\n", 1821 | " 'contains(beyond)': False,\n", 1822 | " 'contains(EVER)': False,\n", 1823 | " 'contains(arrived)': False,\n", 1824 | " 'contains(fare)': False,\n", 1825 | " 'contains(Los)': False,\n", 1826 | " 'contains(drivers)': False,\n", 1827 | " 'contains(achieves)': False,\n", 1828 | " 'contains(refund)': False,\n", 1829 | " 'contains(free)': False,\n", 1830 | " 'contains(silver)': False,\n", 1831 | " 'contains(Will)': False,\n", 1832 | " 'contains(Well)': False,\n", 1833 | " 'contains(nearly)': False,\n", 1834 | " 'contains(temperature)': False,\n", 1835 | " 'contains(na)': False,\n", 1836 | " 'contains(track)': False,\n", 1837 | " 'contains(recline)': False,\n", 1838 | " 
'contains(yall)': False,\n", 1839 | " 'contains(glad)': False,\n", 1840 | " 'contains(code)': False,\n", 1841 | " 'contains(wine)': False,\n", 1842 | " 'contains(Good)': False,\n", 1843 | " 'contains(feet)': False,\n", 1844 | " 'contains(Dallas)': False,\n", 1845 | " \"contains(didn't…but)\": False,\n", 1846 | " 'contains(1230)': False,\n", 1847 | " 'contains(job)': False,\n", 1848 | " 'contains(standby)': False,\n", 1849 | " 'contains(by)': False,\n", 1850 | " 'contains(gate)': False,\n", 1851 | " 'contains(Quick)': False,\n", 1852 | " 'contains(easy)': False,\n", 1853 | " 'contains(inflight)': False,\n", 1854 | " 'contains(SJC)': False,\n", 1855 | " 'contains(outstanding)': False,\n", 1856 | " 'contains(afford)': False,\n", 1857 | " 'contains(u)': False,\n", 1858 | " 'contains(provided)': False,\n", 1859 | " 'contains(start)': False,\n", 1860 | " 'contains(the)': False,\n", 1861 | " 'contains(help)': False,\n", 1862 | " 'contains(Soon)': False,\n", 1863 | " \"contains('re)\": False,\n", 1864 | " 'contains(worstflightever)': False,\n", 1865 | " 'contains(nicely)': False,\n", 1866 | " 'contains(skies)': False,\n", 1867 | " 'contains(touchdown)': False,\n", 1868 | " 'contains(FastCompany)': False,\n", 1869 | " 'contains(salt)': False,\n", 1870 | " 'contains(Because)': False,\n", 1871 | " 'contains(during)': False,\n", 1872 | " 'contains(shared)': False,\n", 1873 | " 'contains(desktop)': False,\n", 1874 | " 'contains(safety)': False,\n", 1875 | " 'contains(are)': False,\n", 1876 | " 'contains(t.co/5B2agFd8c4)': False,\n", 1877 | " 'contains(🙉)': False,\n", 1878 | " 'contains(welcome)': False,\n", 1879 | " 'contains(over)': False,\n", 1880 | " 'contains(super)': False,\n", 1881 | " 'contains(Site)': False,\n", 1882 | " 'contains(Easily)': False,\n", 1883 | " 'contains(reservation)': False,\n", 1884 | " 'contains(Thx)': False,\n", 1885 | " 'contains(❄️❄️❄️)': False,\n", 1886 | " 'contains(HELP)': False,\n", 1887 | " 'contains(banned)': False,\n", 1888 | " 'contains(until)': False,\n", 1889 | " 'contains(feel)': False,\n", 1890 | " 'contains(of)': False,\n", 1891 | " 'contains(Mostly)': False,\n", 1892 | " 'contains(passengers)': False,\n", 1893 | " 'contains(across)': False,\n", 1894 | " 'contains(t.co/VPqEm31XUQ)': False,\n", 1895 | " 'contains(😘)': False,\n", 1896 | " 'contains(infant)': False,\n", 1897 | " 'contains(Friday)': False,\n", 1898 | " 'contains(prefer)': False,\n", 1899 | " 'contains(t.co/oA2dRfAoQ2)': False,\n", 1900 | " 'contains(t.co/UJfS9Zi6kd)': False,\n", 1901 | " 'contains(united)': False,\n", 1902 | " 'contains(dfw-lax)': False,\n", 1903 | " 'contains(Hi)': False,\n", 1904 | " 'contains(lt)': False,\n", 1905 | " 'contains(Love/gratitude.mpower)': False,\n", 1906 | " 'contains(email)': False,\n", 1907 | " 'contains(30)': False,\n", 1908 | " 'contains(2nd)': False,\n", 1909 | " 'contains(airport)': False,\n", 1910 | " 'contains(Sentinel)': False,\n", 1911 | " 'contains(upgrades)': False,\n", 1912 | " 'contains(receive)': False,\n", 1913 | " 'contains(cross-browser)': False,\n", 1914 | " 'contains(met)': False,\n", 1915 | " 'contains(😍👌)': False,\n", 1916 | " 'contains(ball)': False,\n", 1917 | " 'contains(F)': False,\n", 1918 | " 'contains(thing)': False,\n", 1919 | " 'contains(expensive)': False,\n", 1920 | " 'contains(In)': False,\n", 1921 | " 'contains(screen)': False,\n", 1922 | " 'contains(DREAM)': False,\n", 1923 | " 'contains(24)': False,\n", 1924 | " 'contains(Greetingz)': False,\n", 1925 | " 'contains(order)': False,\n", 1926 | " 'contains(delay)': False,\n", 1927 
| " 'contains(large)': False,\n", 1928 | " 'contains(Time)': False,\n", 1929 | " 'contains(min)': False,\n", 1930 | " 'contains(9am)': False,\n", 1931 | " 'contains(did)': False,\n", 1932 | " 'contains(an)': False,\n", 1933 | " 'contains(student)': False,\n", 1934 | " 'contains(nomorevirgin)': False,\n", 1935 | " 'contains(they)': False,\n", 1936 | " 'contains(Seats)': False,\n", 1937 | " 'contains(classics)': False,\n", 1938 | " 'contains(ordered)': False,\n", 1939 | " 'contains(always)': False,\n", 1940 | " 'contains(wonked)': False,\n", 1941 | " 'contains(t.co/tZZJhuIbCH)': False,\n", 1942 | " 'contains(JPERHI)': False,\n", 1943 | " 'contains(tonite)': False,\n", 1944 | " 'contains(browsers)': False,\n", 1945 | " 'contains(area)': False,\n", 1946 | " 'contains(LOVE)': False,\n", 1947 | " 'contains(Business)': False,\n", 1948 | " 'contains(market)': False,\n", 1949 | " 'contains(interested)': False,\n", 1950 | " 'contains(tossed)': False,\n", 1951 | " 'contains(york)': False,\n", 1952 | " 'contains(people)': False,\n", 1953 | " 'contains(see)': False,\n", 1954 | " 'contains(report)': False,\n", 1955 | " 'contains(customerservice)': False,\n", 1956 | " 'contains(show)': False,\n", 1957 | " 'contains(crew)': False,\n", 1958 | " 'contains(America)': False,\n", 1959 | " 'contains(yes)': False,\n", 1960 | " 'contains(Love)': False,\n", 1961 | " 'contains(VA370)': False,\n", 1962 | " 'contains(dancing)': False,\n", 1963 | " 'contains(end)': False,\n", 1964 | " 'contains(338)': False,\n", 1965 | " 'contains(800)': False,\n", 1966 | " 'contains(713)': False,\n", 1967 | " 'contains(t.co/CnctL7G1ef)': False,\n", 1968 | " 'contains(or)': False,\n", 1969 | " 'contains(Martin)': False,\n", 1970 | " 'contains(flights)': False,\n", 1971 | " 'contains(Points)': False,\n", 1972 | " 'contains(match.Got)': False,\n", 1973 | " 'contains(links)': False,\n", 1974 | " 'contains(friends)': False,\n", 1975 | " 'contains(booked)': False,\n", 1976 | " 'contains(seatbelt)': False,\n", 1977 | " 'contains(SFO/EWR)': False,\n", 1978 | " 'contains(DCA)': False,\n", 1979 | " 'contains(route)': False,\n", 1980 | " 'contains(😁)': False,\n", 1981 | " 'contains(page)': False,\n", 1982 | " 'contains(greyed)': False,\n", 1983 | " 'contains(commercials)': False,\n", 1984 | " 'contains(premium)': False,\n", 1985 | " 'contains(beautiful)': False,\n", 1986 | " 'contains(bucks)': False,\n", 1987 | " 'contains(cold)': False,\n", 1988 | " 'contains(site)': False,\n", 1989 | " 'contains(PDX)': False,\n", 1990 | " 'contains(their)': False,\n", 1991 | " 'contains(right)': False,\n", 1992 | " 'contains(positive)': False,\n", 1993 | " 'contains(Have)': False,\n", 1994 | " 'contains(freddieawards)': False,\n", 1995 | " 'contains(where)': True,\n", 1996 | " 'contains(self-service)': False,\n", 1997 | " 'contains(video)': False,\n", 1998 | " 'contains(winning)': False,\n", 1999 | " 'contains(choppy)': False,\n", 2000 | " 'contains(amp)': False,\n", 2001 | " 'contains(t.co/EwwGi97gdx)': False,\n", 2002 | " 'contains(phone)': False,\n", 2003 | " 'contains(weeks)': False,\n", 2004 | " 'contains(claim)': False,\n", 2005 | " 'contains(three)': False,\n", 2006 | " 'contains(much)': False,\n", 2007 | " 'contains(anyone)': False,\n", 2008 | " 'contains(have)': True,\n", 2009 | " 'contains(Ca)': False,\n", 2010 | " 'contains(before)': False,\n", 2011 | " 'contains(Hats)': False,\n", 2012 | " 'contains(Baldwin)': False,\n", 2013 | " 'contains(One)': False,\n", 2014 | " 'contains(benefits)': False,\n", 2015 | " 'contains(oscars2015)': False,\n", 2016 
| " 'contains(kids)': False,\n", 2017 | " \"contains('d)\": False,\n", 2018 | " 'contains(how)': False,\n", 2019 | " 'contains(w/2)': False,\n", 2020 | " 'contains(ROCK)': False,\n", 2021 | " 'contains(every)': False,\n", 2022 | " 'contains(experience)': False,\n", 2023 | " 'contains(cross)': False,\n", 2024 | " 'contains(very)': False,\n", 2025 | " 'contains(prime)': False,\n", 2026 | " 'contains(win)': False,\n", 2027 | " 'contains(Who)': False,\n", 2028 | " 'contains(t.co/GsB2J3c4gM)': False,\n", 2029 | " 'contains(making)': False,\n", 2030 | " 'contains(away)': False,\n", 2031 | " 'contains(snow)': False,\n", 2032 | " 'contains(🌞✈)': False,\n", 2033 | " 'contains(Investor)': False,\n", 2034 | " 'contains(any)': False,\n", 2035 | " 'contains(center)': False,\n", 2036 | " 'contains(worse)': False,\n", 2037 | " 'contains(OscarsCountdown)': False,\n", 2038 | " 'contains(413)': False,\n", 2039 | " 'contains(but)': False,\n", 2040 | " 'contains(out)': False,\n", 2041 | " 'contains(Sign)': False,\n", 2042 | " 'contains(connecting)': False,\n", 2043 | " 'contains(assist)': False,\n", 2044 | " 'contains(think)': False,\n", 2045 | " 'contains(designated)': False,\n", 2046 | " 'contains(VX)': False,\n", 2047 | " 'contains(💗)': False,\n", 2048 | " 'contains(w/the)': False,\n", 2049 | " 'contains(know)': False,\n", 2050 | " 'contains(understand)': False,\n", 2051 | " 'contains(lot)': False,\n", 2052 | " 'contains(selected👎)': False,\n", 2053 | " 'contains(7AM)': False,\n", 2054 | " 'contains(delays)': False,\n", 2055 | " 'contains(like)': False,\n", 2056 | " 'contains(t.co/vhp2GtDWPk)': False,\n", 2057 | " 'contains(Must)': False,\n", 2058 | " 'contains(less)': False,\n", 2059 | " 'contains(barely)': False,\n", 2060 | " 'contains(9)': False,\n", 2061 | " 'contains(t.co/aqZWecOkk2)': False,\n", 2062 | " 'contains(hand)': False,\n", 2063 | " 'contains(intern)': False,\n", 2064 | " 'contains(together)': False,\n", 2065 | " 'contains(will)': False,\n", 2066 | " 'contains(Flight)': False,\n", 2067 | " 'contains(spruce)': False,\n", 2068 | " 'contains(website)': False,\n", 2069 | " 'contains(iPad)': False,\n", 2070 | " 'contains(Lots)': False,\n", 2071 | " 'contains(great)': False,\n", 2072 | " 'contains(on.Easy)': False,\n", 2073 | " 'contains(backtowinter)': False,\n", 2074 | " 'contains(say)': False,\n", 2075 | " 'contains(Is)': False,\n", 2076 | " 'contains(charged)': False,\n", 2077 | " 'contains(Lofty)': False,\n", 2078 | " 'contains(story)': False,\n", 2079 | " 'contains(gone)': False,\n", 2080 | " 'contains(program)': False,\n", 2081 | " 'contains(fail)': False,\n", 2082 | " 'contains(Arab)': False,\n", 2083 | " 'contains(tracking)': False,\n", 2084 | " 'contains(give)': False,\n", 2085 | " 'contains(714)': False,\n", 2086 | " 'contains(calling)': False,\n", 2087 | " 'contains(pleasecomeback)': False,\n", 2088 | " 'contains(films)': False,\n", 2089 | " 'contains(t.co/Dw5nf0ibtr)': False,\n", 2090 | " 'contains(SouthwestAir)': False,\n", 2091 | " 'contains(todays)': False,\n", 2092 | " 'contains(after)': False,\n", 2093 | " 'contains(long)': False,\n", 2094 | " 'contains(Are)': False,\n", 2095 | " 'contains(urgent)': False,\n", 2096 | " 'contains(moose)': False,\n", 2097 | " 'contains(go)': False,\n", 2098 | " 'contains(trying)': False,\n", 2099 | " 'contains(shift)': False,\n", 2100 | " 'contains(watch)': False,\n", 2101 | " 'contains(at)': False,\n", 2102 | " 'contains(among)': False,\n", 2103 | " 'contains(iPhone)': False,\n", 2104 | " 'contains(who)': False,\n", 2105 | " 'contains(MCO)': 
False,\n", 2106 | " 'contains(DTW)': False,\n", 2107 | " 'contains(Nice)': False,\n", 2108 | " 'contains(hi)': False,\n", 2109 | " 'contains(failing)': False,\n", 2110 | " 'contains(Men)': False,\n", 2111 | " 'contains(TTINAC11)': False,\n", 2112 | " 'contains(step)': False,\n", 2113 | " 'contains(graphics)': False,\n", 2114 | " 'contains(faster)': False,\n", 2115 | " 'contains(inconvenience)': False,\n", 2116 | " 'contains(helping)': False,\n", 2117 | " 'contains(web)': False,\n", 2118 | " 'contains(beats)': False,\n", 2119 | " 'contains(hahaha)': False,\n", 2120 | " 'contains(276)': False,\n", 2121 | " 'contains(PHL)': False,\n", 2122 | " 'contains(year)': False,\n", 2123 | " 'contains(fav)': False,\n", 2124 | " 'contains(demo)': False,\n", 2125 | " 'contains(drinks)': False,\n", 2126 | " 'contains(number)': False,\n", 2127 | " 'contains(sendambien)': False,\n", 2128 | " 'contains(LAX)': False,\n", 2129 | " 'contains(info)': False,\n", 2130 | " 'contains(Sat)': False,\n", 2131 | " 'contains(morning)': False,\n", 2132 | " 'contains(depart)': False,\n", 2133 | " 'contains(looks)': False,\n", 2134 | " 'contains(SoundOfMusic)': False,\n", 2135 | " 'contains(Or)': False,\n", 2136 | " 'contains(scanned)': False,\n", 2137 | " 'contains(best)': False,\n", 2138 | " 'contains(redcarpet)': False,\n", 2139 | " ...}" 2140 | ] 2141 | }, 2142 | "execution_count": 22, 2143 | "metadata": {}, 2144 | "output_type": "execute_result" 2145 | } 2146 | ], 2147 | "source": [ 2148 | "cl.extract_features('I have no idea where this flight is taking me')" 2149 | ] 2150 | }, 2151 | { 2152 | "cell_type": "markdown", 2153 | "metadata": {}, 2154 | "source": [ 2155 | "## Classifying From Within a TextBlob\n", 2156 | "\n", 2157 | "We can perform classification on the contents of a TextBlob object using an existing classifier (like the one we created earlier (named cl). The usefulness of this might seem questionable, since you can just pass a normal string to the classifier. 
2151 | { 2152 | "cell_type": "markdown", 2153 | "metadata": {}, 2154 | "source": [ 2155 | "## Classifying From Within a TextBlob\n", 2156 | "\n", 2157 | "We can perform classification on the contents of a TextBlob object using an existing classifier (like the one we created earlier, named cl). The usefulness of this might seem questionable, since you can just pass a normal string to the classifier. However, sometimes you will be doing other work with some text in the form of a blob, and then, when you need to perform classification, you won't have to go back and get the raw string.\n", 2158 | "\n", 2159 | "Using a classifier in a `TextBlob` is as easy as passing the classifier as an argument when you create the blob.\n", 2160 | "\n", 2161 | "**Note:** The classifier must be one that you have already trained.\n", 2162 | "\n", 2163 | "Let's look at a couple of examples:" 2164 | ] 2165 | },
2166 | { 2167 | "cell_type": "code", 2168 | "execution_count": 18, 2169 | "metadata": { 2170 | "collapsed": false, 2171 | "deletable": true, 2172 | "editable": true 2173 | }, 2174 | "outputs": [ 2175 | { 2176 | "data": { 2177 | "text/plain": [ 2178 | "'positive'" 2179 | ] 2180 | }, 2181 | "execution_count": 18, 2182 | "metadata": {}, 2183 | "output_type": "execute_result" 2184 | } 2185 | ], 2186 | "source": [ 2187 | "b = TextBlob('I loved the flight', classifier=cl)\n", 2188 | "b.classify()" 2189 | ] 2190 | },
2191 | { 2192 | "cell_type": "code", 2193 | "execution_count": 19, 2194 | "metadata": { 2195 | "collapsed": false, 2196 | "deletable": true, 2197 | "editable": true 2198 | }, 2199 | "outputs": [ 2200 | { 2201 | "data": { 2202 | "text/plain": [ 2203 | "'neutral'" 2204 | ] 2205 | }, 2206 | "execution_count": 19, 2207 | "metadata": {}, 2208 | "output_type": "execute_result" 2209 | } 2210 | ], 2211 | "source": [ 2212 | "b = TextBlob('I hated the flight', classifier=cl)\n", 2213 | "b.classify()" 2214 | ] 2215 | },
2216 | { 2217 | "cell_type": "markdown", 2218 | "metadata": { 2219 | "deletable": true, 2220 | "editable": true 2221 | }, 2222 | "source": [ 2223 | "Our classifier probably never encountered the words \"hate\" or \"hated\" during training. We can update our model to improve classification." 2224 | ] 2225 | },
2226 | { 2227 | "cell_type": "markdown", 2228 | "metadata": {}, 2229 | "source": [ 2230 | "## Update Existing Classifiers With New Data\n", 2231 | "\n", 2232 | "Our classifier obviously failed us when we tried to classify the string \"I hated the flight.\"\n", 2233 | "We have the option of easily updating our classifier with new data, so let's do that now." 2234 | ] 2235 | },
2236 | { 2237 | "cell_type": "code", 2238 | "execution_count": 20, 2239 | "metadata": { 2240 | "collapsed": false, 2241 | "deletable": true, 2242 | "editable": true 2243 | }, 2244 | "outputs": [ 2245 | { 2246 | "data": { 2247 | "text/plain": [ 2248 | "True" 2249 | ] 2250 | }, 2251 | "execution_count": 20, 2252 | "metadata": {}, 2253 | "output_type": "execute_result" 2254 | } 2255 | ], 2256 | "source": [ 2257 | "# new data is also a list of (text, label) tuples\n", 2258 | "# be sure the class labels are correct\n", 2259 | "updates = [('I hated flying', 'negative'), ('I hate flying', 'negative'),\n", 2260 | " ('I hate this airline', 'negative'), ('I hated the seats', 'negative')]\n", 2261 | "cl.update(updates) # this is unfortunately slow" 2262 | ] 2263 | },
2264 | { 2265 | "cell_type": "markdown", 2266 | "metadata": {}, 2267 | "source": [ 2268 | "You can ignore the output `True`.\n", 2269 | "\n", 2270 | "**Note:** If you get the error `too many values to unpack (expected 2)`, try re-running the cell where we created the train/test sets, then re-create and re-train the classifier from scratch." 2271 | ] 2272 | },
2273 | { 2274 | "cell_type": "markdown", 2275 | "metadata": {}, 2276 | "source": [ 2277 | "Now that we have updated our classifier with new data, let's see how our original sentence is classified." 2278 | ] 2279 | },
2280 | { 2281 | "cell_type": "code", 2282 | "execution_count": 21, 2283 | "metadata": { 2284 | "collapsed": false, 2285 | "deletable": true, 2286 | "editable": true 2287 | }, 2288 | "outputs": [ 2289 | { 2290 | "data": { 2291 | "text/plain": [ 2292 | "'negative'" 2293 | ] 2294 | }, 2295 | "execution_count": 21, 2296 | "metadata": {}, 2297 | "output_type": "execute_result" 2298 | } 2299 | ], 2300 | "source": [ 2301 | "# let's see how it does now using 'I hated the flight'\n", 2302 | "b = TextBlob('I hated the flight', classifier=cl) # cl has now been updated\n", 2303 | "b.classify()" 2304 | ] 2305 | },
2306 | { 2307 | "cell_type": "markdown", 2308 | "metadata": {}, 2309 | "source": [ 2310 | "And now we have the correct classification of `'negative'`.\n", 2311 | "\n", 2312 | "If you do not get the correct class, try running the update cell once more." 2313 | ] 2314 | },
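{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A classifier can also tell us how confident it is. The `prob_classify` method returns a probability distribution over the labels instead of a single label. Here is a quick sketch; the exact probabilities you see will depend on your training data and on the update above:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "collapsed": false
 },
 "outputs": [],
 "source": [
  "# inspect the full label probability distribution, not just the top label\n",
  "prob_dist = cl.prob_classify('I hated the flight')\n",
  "print(prob_dist.max())                       # the most likely label\n",
  "print(round(prob_dist.prob('negative'), 2))  # probability of 'negative'\n",
  "print(round(prob_dist.prob('positive'), 2))  # probability of 'positive'"
 ]
},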
2315 | { 2316 | "cell_type": "markdown", 2317 | "metadata": { 2318 | "deletable": true, 2319 | "editable": true 2320 | }, 2321 | "source": [ 2322 | "## Other Classifiers\n", 2323 | "\n", 2324 | "TextBlob has a number of built-in classifiers, all of which can be found in the documentation at the link below." 2325 | ] 2326 | },
2327 | { 2328 | "cell_type": "markdown", 2329 | "metadata": { 2330 | "deletable": true, 2331 | "editable": true 2332 | }, 2333 | "source": [ 2334 | "http://textblob.readthedocs.io/en/dev/api_reference.html#api-classifiers" 2335 | ] 2336 | },
2337 | { 2338 | "cell_type": "code", 2339 | "execution_count": null, 2340 | "metadata": { 2341 | "collapsed": true 2342 | }, 2343 | "outputs": [], 2344 | "source": [] 2345 | },
2346 | { 2347 | "cell_type": "markdown", 2348 | "metadata": {}, 2349 | "source": [ 2350 | "# Practice Problems\n", 2351 | "\n", 2352 | "1. Train a decision tree classifier on the first 350 tweets in the reduced set (the training set from earlier), call it something other than cl, and print/examine the tree structure using the pseudocode() method (hint: wrap the call in print())\n", 2353 | "2. Compute the accuracy on the test set [350:500] and compare it to the Naive Bayes accuracy\n", 2354 | "3. Compare the precision and recall scores for the two classifiers. Does the decision tree perform better on any of the classes? (hint: remember that these classify one item at a time)\n", 2355 | "4. Create a new “balanced” training set of 50 observations from each class and update the current Naive Bayes classifier (cl)\n", 2356 | "5. Score the updated classifier. Have the precision and recall scores improved? How about the accuracy?\n" 2357 | ] 2358 | },
2359 | { 2360 | "cell_type": "code", 2361 | "execution_count": null, 2362 | "metadata": { 2363 | "collapsed": true 2364 | }, 2365 | "outputs": [], 2366 | "source": [] 2367 | } 2368 | ],
2369 | "metadata": { 2370 | "anaconda-cloud": {}, 2371 | "kernelspec": { 2372 | "display_name": "Python [default]", 2373 | "language": "python", 2374 | "name": "python3" 2375 | }, 2376 | "language_info": { 2377 | "codemirror_mode": { 2378 | "name": "ipython", 2379 | "version": 3 2380 | }, 2381 | "file_extension": ".py", 2382 | "mimetype": "text/x-python", 2383 | "name": "python", 2384 | "nbconvert_exporter": "python", 2385 | "pygments_lexer": "ipython3", 2386 | "version": "3.5.2" 2387 | } 2388 | }, 2389 | "nbformat": 4, 2390 | "nbformat_minor": 2 2391 | } 2392 | --------------------------------------------------------------------------------