├── README.md ├── data └── tweets.csv ├── intro_to_python.ipynb ├── nlp_workshop1.ipynb └── nlp_workshop2.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # NLP Workshop 2 | 3 | Welcome to this repository. It contains the instructions and resources you need to get up to speed. 4 | 5 | 1. You will need Python and a few data science packages installed on your computer. I use the [Anaconda distribution of Python 3.6](https://www.continuum.io/downloads). 6 | 7 | 2. If you are comfortable with Git, clone this repository to your local computer; otherwise, you can simply download a zip archive by clicking on the green button at the top right of the page. 8 | 9 | 3. Make sure you are familiar with Python syntax. There is a [review Jupyter notebook](./intro_to_python.ipynb) in this repository. Make sure you can go into your terminal and run the command `jupyter notebook`. A notebook server should start, and you will be able to view, create, and save notebooks. 10 | 11 | 4. We will be using TextBlob. Make sure to run these commands in your terminal / shell. 12 | 13 | $ pip install -U textblob 14 | $ python -m textblob.download_corpora 15 | 16 | 5. We will be using the [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset provided by Kaggle. The associated data file is located in the [data folder](./data/tweets.csv). 17 | 18 | 6. You should be able to import the following packages without a problem. If you get an import error, install the missing package with `pip install` or `conda install`. 19 | 20 | >>> import pandas 21 | >>> import numpy 22 | >>> import textblob 23 | 24 | ---------- 25 | 26 | Organized by [K2 Data Science](http://www.k2datascience.com). 27 | -------------------------------------------------------------------------------- /intro_to_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction To Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This is a collection of various statements, features, etc. of Jupyter and the Python language. " 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 68, 20 | "metadata": { 21 | "collapsed": false 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "a = 10" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 69, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "10\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "print(a)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 70, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "import math" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 71, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "1.0\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "x = math.cos(2 * math.pi)\n", 75 | "print(x)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Import the whole module into the current namespace instead."
83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 72, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "1.0\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "from math import *\n", 102 | "x = cos(2 * pi)\n", 103 | "print(x)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Several ways to look at documentation for a module." 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 73, 116 | "metadata": { 117 | "collapsed": false 118 | }, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']\n" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "print(dir(math))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 74, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "Help on built-in function cos in module math:\n", 144 | "\n", 145 | "cos(...)\n", 146 | " cos(x)\n", 147 | " \n", 148 | " Return the cosine of x (measured in radians).\n", 149 | "\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "help(math.cos)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### Variables" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 75, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "float" 175 | ] 176 | }, 177 | "execution_count": 75, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "x = 1.0\n", 184 | "type(x)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 76, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "int" 198 | ] 199 | }, 200 | "execution_count": 76, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "# dynamically typed\n", 207 | "x = 1\n", 208 | "type(x)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Operators" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 77, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "(3, -1, 2, 0.5)" 229 | ] 230 | }, 231 | "execution_count": 77, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "1 + 2, 1 - 2, 1 * 2, 1 / 2" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 78, 243 | "metadata": { 244 | "collapsed": false 245 | }, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "text/plain": [ 250 | "1.0" 251 | ] 252 | }, 253 | "execution_count": 78, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | 
} 257 | ], 258 | "source": [ 259 | "# integer division of float numbers\n", 260 | "3.0 // 2.0" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 79, 266 | "metadata": { 267 | "collapsed": false 268 | }, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "4" 274 | ] 275 | }, 276 | "execution_count": 79, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "# power operator\n", 283 | "2 ** 2" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 80, 289 | "metadata": { 290 | "collapsed": false 291 | }, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/plain": [ 296 | "False" 297 | ] 298 | }, 299 | "execution_count": 80, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "True and False" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 81, 311 | "metadata": { 312 | "collapsed": false 313 | }, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/plain": [ 318 | "True" 319 | ] 320 | }, 321 | "execution_count": 81, 322 | "metadata": {}, 323 | "output_type": "execute_result" 324 | } 325 | ], 326 | "source": [ 327 | "not False" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 82, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "text/plain": [ 340 | "True" 341 | ] 342 | }, 343 | "execution_count": 82, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "True or False" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 83, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "(True, False, False, False, True, True)" 363 | ] 364 | }, 365 | "execution_count": 83, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "2 > 1, 2 < 1, 2 > 2, 2 < 2, 2 >= 2, 2 <= 2" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 84, 377 | "metadata": { 378 | "collapsed": false 379 | }, 380 | "outputs": [ 381 | { 382 | "data": { 383 | "text/plain": [ 384 | "True" 385 | ] 386 | }, 387 | "execution_count": 84, 388 | "metadata": {}, 389 | "output_type": "execute_result" 390 | } 391 | ], 392 | "source": [ 393 | "# equality\n", 394 | "[1,2] == [1,2]" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "### Strings" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 85, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "str" 415 | ] 416 | }, 417 | "execution_count": 85, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "s = \"Hello world\"\n", 424 | "type(s)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 86, 430 | "metadata": { 431 | "collapsed": false 432 | }, 433 | "outputs": [ 434 | { 435 | "data": { 436 | "text/plain": [ 437 | "11" 438 | ] 439 | }, 440 | "execution_count": 86, 441 | "metadata": {}, 442 | "output_type": "execute_result" 443 | } 444 | ], 445 | "source": [ 446 | "len(s)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 87, 452 | "metadata": { 453 | "collapsed": false 454 | }, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 
459 | "text": [ 460 | "Hello test\n" 461 | ] 462 | } 463 | ], 464 | "source": [ 465 | "s2 = s.replace(\"world\", \"test\")\n", 466 | "print(s2)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 88, 472 | "metadata": { 473 | "collapsed": false 474 | }, 475 | "outputs": [ 476 | { 477 | "data": { 478 | "text/plain": [ 479 | "'H'" 480 | ] 481 | }, 482 | "execution_count": 88, 483 | "metadata": {}, 484 | "output_type": "execute_result" 485 | } 486 | ], 487 | "source": [ 488 | "s[0]" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 89, 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [ 498 | { 499 | "data": { 500 | "text/plain": [ 501 | "'Hello'" 502 | ] 503 | }, 504 | "execution_count": 89, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | } 508 | ], 509 | "source": [ 510 | "s[0:5]" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 90, 516 | "metadata": { 517 | "collapsed": false 518 | }, 519 | "outputs": [ 520 | { 521 | "data": { 522 | "text/plain": [ 523 | "'world'" 524 | ] 525 | }, 526 | "execution_count": 90, 527 | "metadata": {}, 528 | "output_type": "execute_result" 529 | } 530 | ], 531 | "source": [ 532 | "s[6:]" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 91, 538 | "metadata": { 539 | "collapsed": false 540 | }, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "'Hello world'" 546 | ] 547 | }, 548 | "execution_count": 91, 549 | "metadata": {}, 550 | "output_type": "execute_result" 551 | } 552 | ], 553 | "source": [ 554 | "s[:]" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 92, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "'Hlowrd'" 568 | ] 569 | }, 570 | "execution_count": 92, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "# define step size of 2\n", 577 | "s[::2]" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 93, 583 | "metadata": { 584 | "collapsed": false 585 | }, 586 | "outputs": [ 587 | { 588 | "name": "stdout", 589 | "output_type": "stream", 590 | "text": [ 591 | "str1 str2 str3\n" 592 | ] 593 | } 594 | ], 595 | "source": [ 596 | "# automatically adds a space\n", 597 | "print(\"str1\", \"str2\", \"str3\")" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 94, 603 | "metadata": { 604 | "collapsed": false 605 | }, 606 | "outputs": [ 607 | { 608 | "name": "stdout", 609 | "output_type": "stream", 610 | "text": [ 611 | "value = 1.000000\n" 612 | ] 613 | } 614 | ], 615 | "source": [ 616 | "# C-style formatting\n", 617 | "print(\"value = %f\" % 1.0) " 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 95, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [ 627 | { 628 | "name": "stdout", 629 | "output_type": "stream", 630 | "text": [ 631 | "value1 = 3.1415, value2 = 1.5\n" 632 | ] 633 | } 634 | ], 635 | "source": [ 636 | "# alternative, more intuitive way of formatting a string \n", 637 | "s3 = 'value1 = {0}, value2 = {1}'.format(3.1415, 1.5)\n", 638 | "print(s3)" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "### Lists" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 96, 651 | "metadata": { 652 | "collapsed": false 653 | }, 654 | "outputs": [ 655 | { 656 | "name": 
"stdout", 657 | "output_type": "stream", 658 | "text": [ 659 | "\n", 660 | "[1, 2, 3, 4]\n" 661 | ] 662 | } 663 | ], 664 | "source": [ 665 | "l = [1,2,3,4]\n", 666 | "\n", 667 | "print(type(l))\n", 668 | "print(l)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 97, 674 | "metadata": { 675 | "collapsed": false 676 | }, 677 | "outputs": [ 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "[2, 3]\n", 683 | "[1, 3]\n" 684 | ] 685 | } 686 | ], 687 | "source": [ 688 | "print(l[1:3])\n", 689 | "print(l[::2])" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 98, 695 | "metadata": { 696 | "collapsed": false 697 | }, 698 | "outputs": [ 699 | { 700 | "data": { 701 | "text/plain": [ 702 | "1" 703 | ] 704 | }, 705 | "execution_count": 98, 706 | "metadata": {}, 707 | "output_type": "execute_result" 708 | } 709 | ], 710 | "source": [ 711 | "l[0]" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 99, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [ 721 | { 722 | "name": "stdout", 723 | "output_type": "stream", 724 | "text": [ 725 | "[1, 'a', 1.0, (1-1j)]\n" 726 | ] 727 | } 728 | ], 729 | "source": [ 730 | "# don't have to be the same type\n", 731 | "l = [1, 'a', 1.0, 1-1j]\n", 732 | "print(l)" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 100, 738 | "metadata": { 739 | "collapsed": false 740 | }, 741 | "outputs": [ 742 | { 743 | "data": { 744 | "text/plain": [ 745 | "[10, 12, 14, 16, 18, 20, 22, 24, 26, 28]" 746 | ] 747 | }, 748 | "execution_count": 100, 749 | "metadata": {}, 750 | "output_type": "execute_result" 751 | } 752 | ], 753 | "source": [ 754 | "start = 10\n", 755 | "stop = 30\n", 756 | "step = 2\n", 757 | "range(start, stop, step)\n", 758 | "\n", 759 | "# consume the iterator created by range\n", 760 | "list(range(start, stop, step))" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 101, 766 | "metadata": { 767 | "collapsed": false 768 | }, 769 | "outputs": [ 770 | { 771 | "name": "stdout", 772 | "output_type": "stream", 773 | "text": [ 774 | "['A', 'd', 'd']\n" 775 | ] 776 | } 777 | ], 778 | "source": [ 779 | "# create a new empty list\n", 780 | "l = []\n", 781 | "\n", 782 | "# add an elements using `append`\n", 783 | "l.append(\"A\")\n", 784 | "l.append(\"d\")\n", 785 | "l.append(\"d\")\n", 786 | "\n", 787 | "print(l)" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 102, 793 | "metadata": { 794 | "collapsed": false 795 | }, 796 | "outputs": [ 797 | { 798 | "name": "stdout", 799 | "output_type": "stream", 800 | "text": [ 801 | "['A', 'b', 'c']\n" 802 | ] 803 | } 804 | ], 805 | "source": [ 806 | "l[1:3] = [\"b\", \"c\"]\n", 807 | "print(l)" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 103, 813 | "metadata": { 814 | "collapsed": false 815 | }, 816 | "outputs": [ 817 | { 818 | "name": "stdout", 819 | "output_type": "stream", 820 | "text": [ 821 | "['i', 'n', 's', 'e', 'r', 'A', 'A', 't', 'A', 'b', 'c']\n" 822 | ] 823 | } 824 | ], 825 | "source": [ 826 | "l.insert(0, \"i\")\n", 827 | "l.insert(1, \"n\")\n", 828 | "l.insert(2, \"s\")\n", 829 | "l.insert(3, \"e\")\n", 830 | "l.insert(4, \"r\")\n", 831 | "l.insert(5, \"t\")\n", 832 | "l.insert(5, \"A\")\n", 833 | "l.insert(5, \"A\")\n", 834 | "\n", 835 | "\n", 836 | "\n", 837 | "print(l)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 104, 843 | "metadata": { 844 
| "collapsed": false 845 | }, 846 | "outputs": [ 847 | { 848 | "name": "stdout", 849 | "output_type": "stream", 850 | "text": [ 851 | "['i', 'n', 's', 'e', 'r', 'A', 't', 'A', 'b', 'c']\n" 852 | ] 853 | } 854 | ], 855 | "source": [ 856 | "l.remove(\"A\")\n", 857 | "print(l)" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 105, 863 | "metadata": { 864 | "collapsed": false 865 | }, 866 | "outputs": [ 867 | { 868 | "name": "stdout", 869 | "output_type": "stream", 870 | "text": [ 871 | "['i', 'n', 's', 'e', 'r', 'A', 'b', 'c']\n" 872 | ] 873 | } 874 | ], 875 | "source": [ 876 | "del l[7]\n", 877 | "del l[6]\n", 878 | "\n", 879 | "print(l)" 880 | ] 881 | }, 882 | { 883 | "cell_type": "markdown", 884 | "metadata": {}, 885 | "source": [ 886 | "### Tuples" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": 106, 892 | "metadata": { 893 | "collapsed": false 894 | }, 895 | "outputs": [ 896 | { 897 | "name": "stdout", 898 | "output_type": "stream", 899 | "text": [ 900 | "(10, 20) \n" 901 | ] 902 | } 903 | ], 904 | "source": [ 905 | "point = (10, 20)\n", 906 | "print(point, type(point))" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": 107, 912 | "metadata": { 913 | "collapsed": false 914 | }, 915 | "outputs": [ 916 | { 917 | "name": "stdout", 918 | "output_type": "stream", 919 | "text": [ 920 | "x = 10\n", 921 | "y = 20\n" 922 | ] 923 | } 924 | ], 925 | "source": [ 926 | "# unpacking\n", 927 | "x, y = point\n", 928 | "\n", 929 | "print(\"x =\", x)\n", 930 | "print(\"y =\", y)" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "### Dictionaries" 938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": 108, 943 | "metadata": { 944 | "collapsed": false 945 | }, 946 | "outputs": [ 947 | { 948 | "name": "stdout", 949 | "output_type": "stream", 950 | "text": [ 951 | "\n", 952 | "{'parameter1': 1.0, 'parameter2': 2.0, 'parameter3': 3.0}\n" 953 | ] 954 | } 955 | ], 956 | "source": [ 957 | "params = {\"parameter1\" : 1.0,\n", 958 | " \"parameter2\" : 2.0,\n", 959 | " \"parameter3\" : 3.0,}\n", 960 | "\n", 961 | "print(type(params))\n", 962 | "print(params)" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": 109, 968 | "metadata": { 969 | "collapsed": false 970 | }, 971 | "outputs": [ 972 | { 973 | "name": "stdout", 974 | "output_type": "stream", 975 | "text": [ 976 | "parameter1 = A\n", 977 | "parameter2 = B\n", 978 | "parameter3 = 3.0\n", 979 | "parameter4 = D\n" 980 | ] 981 | } 982 | ], 983 | "source": [ 984 | "params[\"parameter1\"] = \"A\"\n", 985 | "params[\"parameter2\"] = \"B\"\n", 986 | "\n", 987 | "# add a new entry\n", 988 | "params[\"parameter4\"] = \"D\"\n", 989 | "\n", 990 | "print(\"parameter1 = \" + str(params[\"parameter1\"]))\n", 991 | "print(\"parameter2 = \" + str(params[\"parameter2\"]))\n", 992 | "print(\"parameter3 = \" + str(params[\"parameter3\"]))\n", 993 | "print(\"parameter4 = \" + str(params[\"parameter4\"]))" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "### Control Flow" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": { 1007 | "collapsed": false 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "statement1 = False\n", 1012 | "statement2 = False\n", 1013 | "\n", 1014 | "if statement1:\n", 1015 | " print(\"statement1 is True\")\n", 1016 | "elif statement2:\n", 1017 | " 
print(\"statement2 is True\")\n", 1018 | "else:\n", 1019 | " print(\"statement1 and statement2 are False\")" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "markdown", 1024 | "metadata": {}, 1025 | "source": [ 1026 | "### Loops" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "execution_count": 110, 1032 | "metadata": { 1033 | "collapsed": false 1034 | }, 1035 | "outputs": [ 1036 | { 1037 | "name": "stdout", 1038 | "output_type": "stream", 1039 | "text": [ 1040 | "0\n", 1041 | "1\n", 1042 | "2\n", 1043 | "3\n" 1044 | ] 1045 | } 1046 | ], 1047 | "source": [ 1048 | "for x in range(4):\n", 1049 | " print(x)" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "execution_count": 111, 1055 | "metadata": { 1056 | "collapsed": false 1057 | }, 1058 | "outputs": [ 1059 | { 1060 | "name": "stdout", 1061 | "output_type": "stream", 1062 | "text": [ 1063 | "scientific\n", 1064 | "computing\n", 1065 | "with\n", 1066 | "python\n" 1067 | ] 1068 | } 1069 | ], 1070 | "source": [ 1071 | "for word in [\"scientific\", \"computing\", \"with\", \"python\"]:\n", 1072 | " print(word)" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": 112, 1078 | "metadata": { 1079 | "collapsed": false 1080 | }, 1081 | "outputs": [ 1082 | { 1083 | "name": "stdout", 1084 | "output_type": "stream", 1085 | "text": [ 1086 | "parameter1 = A\n", 1087 | "parameter2 = B\n", 1088 | "parameter3 = 3.0\n", 1089 | "parameter4 = D\n" 1090 | ] 1091 | } 1092 | ], 1093 | "source": [ 1094 | "for key, value in params.items():\n", 1095 | " print(key + \" = \" + str(value))" 1096 | ] 1097 | }, 1098 | { 1099 | "cell_type": "code", 1100 | "execution_count": 113, 1101 | "metadata": { 1102 | "collapsed": false 1103 | }, 1104 | "outputs": [ 1105 | { 1106 | "name": "stdout", 1107 | "output_type": "stream", 1108 | "text": [ 1109 | "0 -3\n", 1110 | "1 -2\n", 1111 | "2 -1\n", 1112 | "3 0\n", 1113 | "4 1\n", 1114 | "5 2\n" 1115 | ] 1116 | } 1117 | ], 1118 | "source": [ 1119 | "for idx, x in enumerate(range(-3,3)):\n", 1120 | " print(idx, x)" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 114, 1126 | "metadata": { 1127 | "collapsed": false 1128 | }, 1129 | "outputs": [ 1130 | { 1131 | "name": "stdout", 1132 | "output_type": "stream", 1133 | "text": [ 1134 | "[0, 1, 4, 9, 16]\n" 1135 | ] 1136 | } 1137 | ], 1138 | "source": [ 1139 | "# l1 = []\n", 1140 | "# for x in range(0,5):\n", 1141 | "# x = x**2\n", 1142 | "# l1.append(x) \n", 1143 | "\n", 1144 | "l1 = [x**2 for x in range(0,5)]\n", 1145 | "print(l1)" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": 115, 1151 | "metadata": { 1152 | "collapsed": false 1153 | }, 1154 | "outputs": [ 1155 | { 1156 | "name": "stdout", 1157 | "output_type": "stream", 1158 | "text": [ 1159 | "0\n", 1160 | "1\n", 1161 | "2\n", 1162 | "3\n", 1163 | "4\n", 1164 | "done\n" 1165 | ] 1166 | } 1167 | ], 1168 | "source": [ 1169 | "i = 0\n", 1170 | "while i < 5:\n", 1171 | " print(i)\n", 1172 | " i = i + 1\n", 1173 | "print(\"done\")" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "markdown", 1178 | "metadata": {}, 1179 | "source": [ 1180 | "### Functions" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "code", 1185 | "execution_count": 116, 1186 | "metadata": { 1187 | "collapsed": false 1188 | }, 1189 | "outputs": [], 1190 | "source": [ 1191 | "# include a docstring\n", 1192 | "def func(s):\n", 1193 | " \"\"\"\n", 1194 | " Print a string 's' and tell how many characters it has \n", 1195 | " \"\"\"\n", 1196 | " \n", 
1197 | " print(s + \" has \" + str(len(s)) + \" characters\")" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "code", 1202 | "execution_count": 117, 1203 | "metadata": { 1204 | "collapsed": false 1205 | }, 1206 | "outputs": [ 1207 | { 1208 | "name": "stdout", 1209 | "output_type": "stream", 1210 | "text": [ 1211 | "Help on function func in module __main__:\n", 1212 | "\n", 1213 | "func(s)\n", 1214 | " Print a string 's' and tell how many characters it has\n", 1215 | "\n" 1216 | ] 1217 | } 1218 | ], 1219 | "source": [ 1220 | "help(func)" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 118, 1226 | "metadata": { 1227 | "collapsed": false 1228 | }, 1229 | "outputs": [ 1230 | { 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "test has 4 characters\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "func(\"test\")" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "code", 1244 | "execution_count": 119, 1245 | "metadata": { 1246 | "collapsed": false 1247 | }, 1248 | "outputs": [], 1249 | "source": [ 1250 | "def square(x):\n", 1251 | " return x ** 2" 1252 | ] 1253 | }, 1254 | { 1255 | "cell_type": "code", 1256 | "execution_count": 120, 1257 | "metadata": { 1258 | "collapsed": false 1259 | }, 1260 | "outputs": [ 1261 | { 1262 | "data": { 1263 | "text/plain": [ 1264 | "25" 1265 | ] 1266 | }, 1267 | "execution_count": 120, 1268 | "metadata": {}, 1269 | "output_type": "execute_result" 1270 | } 1271 | ], 1272 | "source": [ 1273 | "square(5)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": 121, 1279 | "metadata": { 1280 | "collapsed": false 1281 | }, 1282 | "outputs": [], 1283 | "source": [ 1284 | "# multiple return values\n", 1285 | "def powers(x):\n", 1286 | " return x ** 2, x ** 3, x ** 4" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": 122, 1292 | "metadata": { 1293 | "collapsed": false 1294 | }, 1295 | "outputs": [ 1296 | { 1297 | "data": { 1298 | "text/plain": [ 1299 | "(25, 125, 625)" 1300 | ] 1301 | }, 1302 | "execution_count": 122, 1303 | "metadata": {}, 1304 | "output_type": "execute_result" 1305 | } 1306 | ], 1307 | "source": [ 1308 | "powers(5)" 1309 | ] 1310 | }, 1311 | { 1312 | "cell_type": "code", 1313 | "execution_count": 123, 1314 | "metadata": { 1315 | "collapsed": false 1316 | }, 1317 | "outputs": [ 1318 | { 1319 | "name": "stdout", 1320 | "output_type": "stream", 1321 | "text": [ 1322 | "125\n" 1323 | ] 1324 | } 1325 | ], 1326 | "source": [ 1327 | "x2, x3, x4 = powers(5)\n", 1328 | "print(x3)" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "code", 1333 | "execution_count": 124, 1334 | "metadata": { 1335 | "collapsed": false 1336 | }, 1337 | "outputs": [ 1338 | { 1339 | "data": { 1340 | "text/plain": [ 1341 | "25" 1342 | ] 1343 | }, 1344 | "execution_count": 124, 1345 | "metadata": {}, 1346 | "output_type": "execute_result" 1347 | } 1348 | ], 1349 | "source": [ 1350 | "f1 = lambda x: x**2\n", 1351 | "f1(5)" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "code", 1356 | "execution_count": 125, 1357 | "metadata": { 1358 | "collapsed": false 1359 | }, 1360 | "outputs": [ 1361 | { 1362 | "data": { 1363 | "text/plain": [ 1364 | "" 1365 | ] 1366 | }, 1367 | "execution_count": 125, 1368 | "metadata": {}, 1369 | "output_type": "execute_result" 1370 | } 1371 | ], 1372 | "source": [ 1373 | "map(lambda x: x**2, range(-3,4))" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "code", 1378 | "execution_count": 126, 1379 | "metadata": { 1380 | "collapsed": false 
1381 | }, 1382 | "outputs": [ 1383 | { 1384 | "data": { 1385 | "text/plain": [ 1386 | "[9, 4, 1, 0, 1, 4, 9]" 1387 | ] 1388 | }, 1389 | "execution_count": 126, 1390 | "metadata": {}, 1391 | "output_type": "execute_result" 1392 | } 1393 | ], 1394 | "source": [ 1395 | "# convert iterator to list\n", 1396 | "list(map(lambda x: x**2, range(-3,4)))" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "markdown", 1401 | "metadata": {}, 1402 | "source": [ 1403 | "### Classes" 1404 | ] 1405 | }, 1406 | { 1407 | "cell_type": "code", 1408 | "execution_count": 128, 1409 | "metadata": { 1410 | "collapsed": false 1411 | }, 1412 | "outputs": [], 1413 | "source": [ 1414 | "class Point:\n", 1415 | " def __init__(self, x, y):\n", 1416 | " self.x = x\n", 1417 | " self.y = y\n", 1418 | " \n", 1419 | " def translate(self, dx, dy):\n", 1420 | " self.x += dx\n", 1421 | " self.y += dy\n", 1422 | " \n", 1423 | " def __str__(self):\n", 1424 | " return(\"Point at [%f, %f]\" % (self.x, self.y))" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "code", 1429 | "execution_count": 129, 1430 | "metadata": { 1431 | "collapsed": false 1432 | }, 1433 | "outputs": [ 1434 | { 1435 | "name": "stdout", 1436 | "output_type": "stream", 1437 | "text": [ 1438 | "Point at [0.000000, 0.000000]\n" 1439 | ] 1440 | } 1441 | ], 1442 | "source": [ 1443 | "p1 = Point(0, 0)\n", 1444 | "print(p1)" 1445 | ] 1446 | }, 1447 | { 1448 | "cell_type": "code", 1449 | "execution_count": 130, 1450 | "metadata": { 1451 | "collapsed": false 1452 | }, 1453 | "outputs": [ 1454 | { 1455 | "name": "stdout", 1456 | "output_type": "stream", 1457 | "text": [ 1458 | "Point at [0.250000, 1.500000]\n", 1459 | "Point at [1.000000, 1.000000]\n" 1460 | ] 1461 | } 1462 | ], 1463 | "source": [ 1464 | "p2 = Point(1, 1)\n", 1465 | "\n", 1466 | "p1.translate(0.25, 1.5)\n", 1467 | "\n", 1468 | "print(p1)\n", 1469 | "print(p2)" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "markdown", 1474 | "metadata": {}, 1475 | "source": [ 1476 | "### Exceptions" 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "code", 1481 | "execution_count": 131, 1482 | "metadata": { 1483 | "collapsed": false 1484 | }, 1485 | "outputs": [ 1486 | { 1487 | "name": "stdout", 1488 | "output_type": "stream", 1489 | "text": [ 1490 | "Caught an exception\n" 1491 | ] 1492 | } 1493 | ], 1494 | "source": [ 1495 | "try:\n", 1496 | " print(test)\n", 1497 | "except:\n", 1498 | " print(\"Caught an exception\")" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "code", 1503 | "execution_count": 132, 1504 | "metadata": { 1505 | "collapsed": false 1506 | }, 1507 | "outputs": [ 1508 | { 1509 | "name": "stdout", 1510 | "output_type": "stream", 1511 | "text": [ 1512 | "Caught an exception: name 'test' is not defined\n" 1513 | ] 1514 | } 1515 | ], 1516 | "source": [ 1517 | "try:\n", 1518 | " print(test)\n", 1519 | "except Exception as e:\n", 1520 | " print(\"Caught an exception: \" + str(e))" 1521 | ] 1522 | }, 1523 | { 1524 | "cell_type": "code", 1525 | "execution_count": null, 1526 | "metadata": { 1527 | "collapsed": true 1528 | }, 1529 | "outputs": [], 1530 | "source": [] 1531 | } 1532 | ], 1533 | "metadata": { 1534 | "anaconda-cloud": {}, 1535 | "kernelspec": { 1536 | "display_name": "Python [conda root]", 1537 | "language": "python", 1538 | "name": "conda-root-py" 1539 | }, 1540 | "language_info": { 1541 | "codemirror_mode": { 1542 | "name": "ipython", 1543 | "version": 3 1544 | }, 1545 | "file_extension": ".py", 1546 | "mimetype": "text/x-python", 1547 | "name": "python", 1548 | "nbconvert_exporter": "python", 1549 |
"pygments_lexer": "ipython3", 1550 | "version": "3.5.2" 1551 | } 1552 | }, 1553 | "nbformat": 4, 1554 | "nbformat_minor": 0 1555 | } 1556 | -------------------------------------------------------------------------------- /nlp_workshop1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "# TextBlob: An Introduction of Methods" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "## What is NLP?\n", 21 | "\n", 22 | "* Computer understanding and manipulation of human language\n", 23 | "* A way for computers to analyze, understand, and derive meaning from human language in a smart and useful way\n", 24 | "* Intersection of computer science, artificial intelligence, and computational linguistics\n", 25 | "\n", 26 | "NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statical inference. In general, the more data analyzed, the more accurate the model will be." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "deletable": true, 33 | "editable": true 34 | }, 35 | "source": [ 36 | "## Two Subfields of NLP\n", 37 | "\n", 38 | "There are two common subfields of natural language processing:\n", 39 | "\n", 40 | "* Natural Language Understanding (NLU)\n", 41 | " - A process used to convert human language into data with a form that encapsulates meaning and context in a computer-interpretable form.\n", 42 | " - This is a work in progress in data science\n", 43 | " - Understanding human language is difficult\n", 44 | "* Natural Language Generation (NLG)\n", 45 | " - Uses NLU to generate human language that appears natural and relevant.\n", 46 | " - Chat bots and software that automatically generates textual content use NLG.\n", 47 | " \n", 48 | "These subfields do not completely makeup with space of NLP:\n", 49 | "\n", 50 | "* NLU includes things like\n", 51 | " - relationship extraction\n", 52 | " - sentiment analysis\n", 53 | " - summarization\n", 54 | " - *semantic* parsing\n", 55 | "* NLP also includes (not part of NLU)\n", 56 | " - *syntactic* parsing\n", 57 | " - text categorization\n", 58 | " - part of speech tagging\n", 59 | " \n", 60 | "While some parts of NLP (e.g. POS tagging) are used in NLU, they are not strictly components of NLU." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": { 66 | "deletable": true, 67 | "editable": true 68 | }, 69 | "source": [ 70 | "## Challenges in NLP\n", 71 | "\n", 72 | "NLP has many challenges, and the field is not yet mature. Some of the challenges currently faced are\n", 73 | "\n", 74 | "* Ambiguity of language\n", 75 | " - syntactic ambiguity: some sentences can have multiple interpretations\n", 76 | " - words with multiple definitions (e.g. patient: to tolerate delays? a hospital patient?)\n", 77 | "* Context affects meaning\n", 78 | " - social context\n", 79 | " - time of day\n", 80 | " - content of previous sentences\n", 81 | "* Other\n", 82 | " - sarcasm, humor, slang, etc.\n", 83 | " \n", 84 | "Most or all of these are tied to NLU in one way or another. 
Further advancements in AI are needed to create general solutions that can handle the many forms of language encountered. For example, the form of language encountered in a novel is very different from what you would find in a social media feed (e.g. Tweets)." 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": { 90 | "deletable": true, 91 | "editable": true 92 | }, 93 | "source": [ 94 | "## Some Uses for NLP\n", 95 | "\n", 96 | "The uses for NLP grow as new and creative ideas arise, but some common uses are\n", 97 | "\n", 98 | "* automatic summarization\n", 99 | "* translation\n", 100 | "* named entity recognition\n", 101 | " - person\n", 102 | " - place\n", 103 | " - organization\n", 104 | " - object\n", 105 | " - etc.\n", 106 | "* relationship extraction\n", 107 | "* sentiment analysis\n", 108 | "* speech recognition\n", 109 | "* topic segmentation / text classification\n", 110 | "* grammar correction\n", 111 | "* chat bots\n", 112 | "* automatic tag, keyword, and content generation\n", 113 | "\n", 114 | "Speech recognition is one use that doesn't *require* NLU, but it can be made better with it. It doesn't require it because a machine can recognize various words and phrases, and then take certain actions without actually understanding anything about what was said.\n", 115 | "\n", 116 | "**Some specific use cases for NLP:**\n", 117 | "\n", 118 | "* Analyze social media and forums to gain insight into what customers are saying\n", 119 | " - identify new product opportunities,\n", 120 | " - problems with current products/services,\n", 121 | " - overall user/customer sentiment\n", 122 | "* Spam detection\n", 123 | "* Financial algorithmic trading\n", 124 | " - extract info from news that impacts trading decisions\n", 125 | "* Answering questions (e.g. chat bots)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": { 131 | "deletable": true, 132 | "editable": true 133 | }, 134 | "source": [ 135 | "# Techniques & Tools" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": { 141 | "deletable": true, 142 | "editable": true 143 | }, 144 | "source": [ 145 | "## Techniques\n", 146 | "\n", 147 | "* **Tokenization**: split text into sentences, words, and noun-phrases\n", 148 | "* **Tagging**: String -> tagged list of pairs `('word', 'POS')`\n", 149 | "\n", 150 | " Ex: `'This is a string'` -> `[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('string', 'NN')]`\n", 151 | " \n", 152 | " \n", 153 | "* **Parsing** (syntactic structure): String -> hierarchical structure with syntax tags\n", 154 | "\n", 155 | " Ex: `'This is a string'` -> `'This/DT/O/O is/VBZ/B-VP/O a/DT/B-NP/O string/NN/I-NP/O'`\n", 156 | "\n", 157 | "* **Information Extraction**: \n", 158 | " - named entity extraction: string in -> output text with labeled entities (person, company, location, etc.)\n", 159 | " - relationships between entities: string in -> output entity relationships\n", 160 | "* **n-grams**:\n", 161 | " - string in -> output list of n-tuples of successive words\n", 162 | " - n-grams are used as features in machine learning\n", 163 | " \n", 164 | " Ex (2-gram): `'This is a string'` -> `(['This', 'is'], ['is', 'a'], ['a', 'string'])`\n", 165 | " \n", 166 | "*These will be discussed in more detail as they come up.*" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "deletable": true, 173 | "editable": true 174 | }, 175 | "source": [ 176 | "## Tools\n", 177 | "\n", 178 | "There are many tools available for NLP.
Some popular choices are\n", 179 | "\n", 180 | "* Stanford's Core NLP Suite\n", 181 | "* Natural Language Toolkit (NLTK)\n", 182 | "* Apache OpenNLP\n", 183 | "* WordNet\n", 184 | "* **TextBlob**\n", 185 | "\n", 186 | "We will be working with TextBlob, which builds off of (and can integrate with) NLTK and WordNet." 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": { 192 | "deletable": true, 193 | "editable": true 194 | }, 195 | "source": [ 196 | "# TextBlob: An Introduction to Methods" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": { 202 | "deletable": true, 203 | "editable": true 204 | }, 205 | "source": [ 206 | "## Installation\n", 207 | "\n", 208 | "To install TextBlob, open a new Terminal and enter the following:\n", 209 | "\n", 210 | "```\n", 211 | "$ pip install -U textblob\n", 212 | "$ python -m textblob.download_corpora\n", 213 | "```" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "deletable": true, 220 | "editable": true 221 | }, 222 | "source": [ 223 | "## Getting Started\n", 224 | "\n", 225 | "From here on, you can follow along with the notebook and create new notes and try out code as you like." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 6, 231 | "metadata": { 232 | "collapsed": true, 233 | "deletable": true, 234 | "editable": true 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# import what we need\n", 239 | "import pandas as pd\n", 240 | "from pandas import DataFrame as DF, Series\n", 241 | "\n", 242 | "import numpy as np\n", 243 | "\n", 244 | "from textblob import TextBlob" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "metadata": { 251 | "collapsed": false, 252 | "deletable": true, 253 | "editable": true 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "# read the data; the CSV lives in the data folder (see the README)\n", 258 | "\n", 259 | "# use only the column called 'text'\n", 260 | "data = pd.read_csv('data/tweets.csv', usecols=['text'])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 8, 266 | "metadata": { 267 | "collapsed": false, 268 | "deletable": true, 269 | "editable": true 270 | }, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/html": [ 275 | "
\n", 276 | "\n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | "
text
0@VirginAmerica What @dhepburn said.
1@VirginAmerica plus you've added commercials t...
2@VirginAmerica I didn't today... Must mean I n...
\n", 298 | "
" 299 | ], 300 | "text/plain": [ 301 | " text\n", 302 | "0 @VirginAmerica What @dhepburn said.\n", 303 | "1 @VirginAmerica plus you've added commercials t...\n", 304 | "2 @VirginAmerica I didn't today... Must mean I n..." 305 | ] 306 | }, 307 | "execution_count": 8, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "data.head(3)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": { 319 | "deletable": true, 320 | "editable": true 321 | }, 322 | "source": [ 323 | "## Create a TextBlob object\n", 324 | "\n", 325 | "`TextBlob` objects are the foundation of everything we will be doing. They take a string as an input and create an object on which we can apply many of the TextBlob methods.\n", 326 | "\n", 327 | "Let's create a blob using a tweet in our data." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 4, 333 | "metadata": { 334 | "collapsed": true, 335 | "deletable": true, 336 | "editable": true 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "# create a blob from the tweet at index 25\n", 341 | "tweet = data.text[25]\n", 342 | "blob = TextBlob(tweet)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": { 348 | "deletable": true, 349 | "editable": true 350 | }, 351 | "source": [ 352 | "# TextBlob Methods: Tokenization" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "deletable": true, 359 | "editable": true 360 | }, 361 | "source": [ 362 | "Tokenization allows us to split a string (a paragraph, a page, etc.) into various \"tokens\" that become useful in further processing and analysis. Tokenization also occurs on the back-end of some methods.\n", 363 | "\n", 364 | "Let's look at some tokenization options." 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": { 370 | "deletable": true, 371 | "editable": true 372 | }, 373 | "source": [ 374 | "## Sentences\n", 375 | "\n", 376 | "Using the `sentences` method we get a list of `Sentence` objects, each containing (in order) all of the sentences that make up the string passed to `TextBlob`." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 11, 382 | "metadata": { 383 | "collapsed": false, 384 | "deletable": true, 385 | "editable": true 386 | }, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "text/plain": [ 391 | "[Sentence(\"@VirginAmerica status match program.\"),\n", 392 | " Sentence(\"I applied and it's been three weeks.\"),\n", 393 | " Sentence(\"Called and emailed with no response.\")]" 394 | ] 395 | }, 396 | "execution_count": 11, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "# return list of Sentence objects\n", 403 | "blob.sentences" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": { 409 | "deletable": true, 410 | "editable": true 411 | }, 412 | "source": [ 413 | "Similar to `TextBlob` objects, we can use various methods with `Sentence` objects." 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 12, 419 | "metadata": { 420 | "collapsed": false, 421 | "deletable": true, 422 | "editable": true 423 | }, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/plain": [ 428 | "[('Called', 'VBN'),\n", 429 | " ('and', 'CC'),\n", 430 | " ('emailed', 'VBN'),\n", 431 | " ('with', 'IN'),\n", 432 | " ('no', 'DT'),\n", 433 | " ('response', 'NN')]" 434 | ] 435 | }, 436 | "execution_count": 12, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "# get the third sentence\n", 443 | "s = blob.sentences[2]\n", 444 | "# get tags from this sentence\n", 445 | "s.tags[:10]" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": { 451 | "deletable": true, 452 | "editable": true 453 | }, 454 | "source": [ 455 | "## Words\n", 456 | "\n", 457 | "Instead of a list of sentences, we can get a `WordList` object containing all of the individual words in our string." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 13, 463 | "metadata": { 464 | "collapsed": false, 465 | "deletable": true, 466 | "editable": true 467 | }, 468 | "outputs": [ 469 | { 470 | "data": { 471 | "text/plain": [ 472 | "WordList(['VirginAmerica', 'status', 'match', 'program', 'I', 'applied', 'and', 'it', \"'s\", 'been', 'three', 'weeks', 'Called', 'and', 'emailed', 'with', 'no', 'response'])" 473 | ] 474 | }, 475 | "execution_count": 13, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "# return WordList object (works like a standard list in Python)\n", 482 | "blob.words" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": { 488 | "deletable": true, 489 | "editable": true 490 | }, 491 | "source": [ 492 | "We can access words in a `WordList` just like a regular Python list:" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 14, 498 | "metadata": { 499 | "collapsed": false, 500 | "deletable": true, 501 | "editable": true 502 | }, 503 | "outputs": [ 504 | { 505 | "data": { 506 | "text/plain": [ 507 | "WordList(['it', \"'s\"])" 508 | ] 509 | }, 510 | "execution_count": 14, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "blob.words[7:9]" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": { 522 | "deletable": true, 523 | "editable": true 524 | }, 525 | "source": [ 526 | "**Notice**: TextBlob doesn't do the best job of handling contractions and possessive forms. Ex: \"it's\" is split into \"it\" and \"'s\"." 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": { 532 | "deletable": true, 533 | "editable": true 534 | }, 535 | "source": [ 536 | "## Word Counts\n", 537 | "\n", 538 | "We can get a dict that contains all the unique words in our string as keys, and counts for each as values."
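,\n", "\n", "Because `word_counts` is a `defaultdict`, merely reading a missing key inserts it (see the NOTE further down). A minimal sketch of a side-effect-free lookup using plain `dict.get`; the counts refer to the tweet above:\n", "\n", "```python\n", "blob.word_counts.get('and', 0)   # 2\n", "blob.word_counts.get('zzz', 0)   # 0, and 'zzz' is NOT added to the dict\n", "```"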
539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 9, 544 | "metadata": { 545 | "collapsed": false, 546 | "deletable": true, 547 | "editable": true 548 | }, 549 | "outputs": [ 550 | { 551 | "data": { 552 | "text/plain": [ 553 | "defaultdict(int,\n", 554 | " {'and': 2,\n", 555 | " 'applied': 1,\n", 556 | " 'been': 1,\n", 557 | " 'called': 1,\n", 558 | " 'emailed': 1,\n", 559 | " 'i': 1,\n", 560 | " 'it': 1,\n", 561 | " 'match': 1,\n", 562 | " 'no': 1,\n", 563 | " 'program': 1,\n", 564 | " 'response': 1,\n", 565 | " 's': 1,\n", 566 | " 'status': 1,\n", 567 | " 'three': 1,\n", 568 | " 'virginamerica': 1,\n", 569 | " 'weeks': 1,\n", 570 | " 'with': 1})" 571 | ] 572 | }, 573 | "execution_count": 9, 574 | "metadata": {}, 575 | "output_type": "execute_result" 576 | } 577 | ], 578 | "source": [ 579 | "# returns defaultdict with unique words as keys and counts as values.\n", 580 | "blob.word_counts" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 10, 586 | "metadata": { 587 | "collapsed": false, 588 | "deletable": true, 589 | "editable": true 590 | }, 591 | "outputs": [ 592 | { 593 | "name": "stdout", 594 | "output_type": "stream", 595 | "text": [ 596 | "2\n", 597 | "2\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "# we can get counts for individual words in two ways\n", 603 | "# 1. use the count method on a WordList\n", 604 | "print(blob.words.count('and'))\n", 605 | "# 2. access a key in the word_counts dict\n", 606 | "print(blob.word_counts['and'])" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": { 612 | "deletable": true, 613 | "editable": true 614 | }, 615 | "source": [ 616 | "**NOTE!**\n", 617 | "\n", 618 | "If you use `word_counts['some_word']` and that word is not originally in the defaultdict, it will be added with a count of zero:" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 11, 624 | "metadata": { 625 | "collapsed": false, 626 | "deletable": true, 627 | "editable": true 628 | }, 629 | "outputs": [ 630 | { 631 | "data": { 632 | "text/plain": [ 633 | "defaultdict(int, {'a': 1, 'of': 1, 'string': 1, 'words': 1})" 634 | ] 635 | }, 636 | "execution_count": 11, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "# example of above\n", 643 | "b = TextBlob('a string of words')\n", 644 | "b.word_counts" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 12, 650 | "metadata": { 651 | "collapsed": false, 652 | "deletable": true, 653 | "editable": true 654 | }, 655 | "outputs": [ 656 | { 657 | "data": { 658 | "text/plain": [ 659 | "0" 660 | ] 661 | }, 662 | "execution_count": 12, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "# get count of word not in dict\n", 669 | "b.word_counts['test']" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 13, 675 | "metadata": { 676 | "collapsed": false, 677 | "deletable": true, 678 | "editable": true 679 | }, 680 | "outputs": [ 681 | { 682 | "data": { 683 | "text/plain": [ 684 | "defaultdict(int, {'a': 1, 'of': 1, 'string': 1, 'test': 0, 'words': 1})" 685 | ] 686 | }, 687 | "execution_count": 13, 688 | "metadata": {}, 689 | "output_type": "execute_result" 690 | } 691 | ], 692 | "source": [ 693 | "# look at contents of dict again\n", 694 | "# notice that 'test' is now included\n", 695 | "b.word_counts" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": { 701 |
"deletable": true, 702 | "editable": true 703 | }, 704 | "source": [ 705 | "## Noun Phrases\n", 706 | "\n", 707 | "**Noun phrases:** a word or group of words that functions in a sentence as subject, object, or prepositional object.\n", 708 | "\n", 709 | "Examples of __noun phrases__ are underlined in the sentences below. The **head** noun appears in bold.\n", 710 | "\n", 711 | "* __The election-year **politics**__ are annoying for __many **people**__.\n", 712 | "* __Almost every **sentence**__ contains __at least one noun **phrase**__.\n", 713 | "* __Current economic **weakness**__ may be __a **result** of high energy prices__.\n", 714 | "\n", 715 | "Noun phrases can be identified by the possibility of pronoun substitution, as is illustrated in the examples below.\n", 716 | "\n", 717 | "a. __This **sentence**__ contains __two noun **phrases**__.
\n", 718 | "b. **It** contains **them**.\n", 719 | "\n", 720 | "We can get a `WordList` containing noun phrases using the `noun_phrase` method on a blob." 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": 9, 726 | "metadata": { 727 | "collapsed": false, 728 | "deletable": true, 729 | "editable": true 730 | }, 731 | "outputs": [ 732 | { 733 | "data": { 734 | "text/plain": [ 735 | "[Sentence(\"@VirginAmerica status match program.\"),\n", 736 | " Sentence(\"I applied and it's been three weeks.\"),\n", 737 | " Sentence(\"Called and emailed with no response.\")]" 738 | ] 739 | }, 740 | "execution_count": 9, 741 | "metadata": {}, 742 | "output_type": "execute_result" 743 | } 744 | ], 745 | "source": [ 746 | "blob.sentences" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 10, 752 | "metadata": { 753 | "collapsed": false, 754 | "deletable": true, 755 | "editable": true 756 | }, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/plain": [ 761 | "WordList(['virginamerica', 'pretty graphics', 'minimal iconography'])" 762 | ] 763 | }, 764 | "execution_count": 10, 765 | "metadata": {}, 766 | "output_type": "execute_result" 767 | } 768 | ], 769 | "source": [ 770 | "# return WordList with noun phrases for tweet at index 11\n", 771 | "TextBlob(data.text[11]).noun_phrases" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": { 777 | "deletable": true, 778 | "editable": true 779 | }, 780 | "source": [ 781 | "The algorithm used isn't perfect, but things rarely are in NLP." 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "
" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "# Practice Problems\n", 796 | "\n", 797 | "1. Create a TextBlob object called blob using tweet at index 41\n", 798 | "2. Print each sentence in blob on a separate line\n", 799 | "3. Get word counts in descending order (most frequent first)\n", 800 | "4. Come up with two ways to get the total word count for blob\n", 801 | "5. Get all noun-phrases in blob. What is wrong with the second “phrase” in the results?\n", 802 | "6. Select all entries in the data that contain more than 3 noun phrases\n", 803 | "7. **Extra:** Using a similar method as in 6, print one tweet that has exactly 3 sentences without creating a list\n" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": { 809 | "deletable": true, 810 | "editable": true 811 | }, 812 | "source": [ 813 | "
" 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": { 819 | "deletable": true, 820 | "editable": true 821 | }, 822 | "source": [ 823 | "# TextBlob Methods: POS & Morphology" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": { 829 | "deletable": true, 830 | "editable": true 831 | }, 832 | "source": [ 833 | "Here we will cover all of the following:\n", 834 | " \n", 835 | "* **part-of-speech (POS) tagging**: get list of tuples containing each word and it’s part of speech (e.g. noun)\n", 836 | "* **pluralization**: get the plural form of any singular words\n", 837 | "* **singularization**: get the singular form of any plural words\n", 838 | "* **lemmatization**: get the stripped/unmodified version of a word (e.g. singing -> sing)" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": { 844 | "deletable": true, 845 | "editable": true 846 | }, 847 | "source": [ 848 | "## part-of-speech (POS) tagging\n", 849 | "\n", 850 | "Using the `tags` method, we can get a list of doubles that contains every word in our string paired with its part of speech, as determined by the algorithm.\n", 851 | "\n", 852 | "POS tagging (also grammatical tagging) is useful for understanding context and grammar. Many words can belong to different parts of speech, depending on the context and words around them. POS tagging attempts to disambiguate a text by determining most likely parts of speech for each word based on the content." 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": 16, 858 | "metadata": { 859 | "collapsed": false, 860 | "deletable": true, 861 | "editable": true 862 | }, 863 | "outputs": [ 864 | { 865 | "data": { 866 | "text/plain": [ 867 | "[('@', 'NN'),\n", 868 | " ('VirginAmerica', 'NNP'),\n", 869 | " ('status', 'NN'),\n", 870 | " ('match', 'NN'),\n", 871 | " ('program', 'NN'),\n", 872 | " ('I', 'PRP'),\n", 873 | " ('applied', 'VBD'),\n", 874 | " ('and', 'CC'),\n", 875 | " ('it', 'PRP'),\n", 876 | " (\"'s\", 'VBZ'),\n", 877 | " ('been', 'VBN'),\n", 878 | " ('three', 'CD'),\n", 879 | " ('weeks', 'NNS'),\n", 880 | " ('Called', 'VBN'),\n", 881 | " ('and', 'CC'),\n", 882 | " ('emailed', 'VBN'),\n", 883 | " ('with', 'IN'),\n", 884 | " ('no', 'DT'),\n", 885 | " ('response', 'NN')]" 886 | ] 887 | }, 888 | "execution_count": 16, 889 | "metadata": {}, 890 | "output_type": "execute_result" 891 | } 892 | ], 893 | "source": [ 894 | "# return list of tuples containing words in a string and the part of speech that each belongs to\n", 895 | "blob.tags" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": { 901 | "deletable": true, 902 | "editable": true 903 | }, 904 | "source": [ 905 | "The tags each have a unique meaning. For example:\n", 906 | "* 'VBX': verb (X indicates type of verb)\n", 907 | "* 'DT': determiner\n", 908 | "\n", 909 | "A comprehensive table can be found at http://www.clips.ua.ac.be/pages/mbsp-tags" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "deletable": true, 916 | "editable": true 917 | }, 918 | "source": [ 919 | "## pluralization\n", 920 | "\n", 921 | "This is a relatively simple rule-based process that takes the singular form of a word and applies the correct pluralization to it.\n", 922 | "\n", 923 | "In TextBlob we can pluralize a single word (in the form of a `Word` obj.) or pluralize all words in a `WordList`." 
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 17,
929 | "metadata": {
930 | "collapsed": false,
931 | "deletable": true,
932 | "editable": true
933 | },
934 | "outputs": [
935 | {
936 | "data": {
937 | "text/plain": [
938 | "'companies'"
939 | ]
940 | },
941 | "execution_count": 17,
942 | "metadata": {},
943 | "output_type": "execute_result"
944 | }
945 | ],
946 | "source": [
947 | "# import\n",
948 | "from textblob import Word, WordList\n",
949 | "# create a Word object\n",
950 | "w = Word('company')\n",
951 | "# return the plural of a single word\n",
952 | "w.pluralize()"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": 18,
958 | "metadata": {
959 | "collapsed": false,
960 | "deletable": true,
961 | "editable": true
962 | },
963 | "outputs": [
964 | {
965 | "data": {
966 | "text/plain": [
967 | "WordList(['who', 'what', 'when', 'where', 'why'])"
968 | ]
969 | },
970 | "execution_count": 18,
971 | "metadata": {},
972 | "output_type": "execute_result"
973 | }
974 | ],
975 | "source": [
976 | "# Side note: we can also create WordList objects\n",
977 | "wl = WordList(['who','what','when','where','why'])\n",
978 | "wl"
979 | ]
980 | },
981 | {
982 | "cell_type": "markdown",
983 | "metadata": {
984 | "deletable": true,
985 | "editable": true
986 | },
987 | "source": [
988 | "## singularization\n",
989 | "\n",
990 | "The opposite of pluralization: take a word (or words) in plural form and singularize them."
991 | ]
992 | },
993 | {
994 | "cell_type": "code",
995 | "execution_count": 19,
996 | "metadata": {
997 | "collapsed": false,
998 | "deletable": true,
999 | "editable": true
1000 | },
1001 | "outputs": [
1002 | {
1003 | "data": {
1004 | "text/plain": [
1005 | "WordList(['agency', 'octopus', 'word'])"
1006 | ]
1007 | },
1008 | "execution_count": 19,
1009 | "metadata": {},
1010 | "output_type": "execute_result"
1011 | }
1012 | ],
1013 | "source": [
1014 | "wl = WordList(['agencies', 'octopi', 'words'])\n",
1015 | "wl.singularize()"
1016 | ]
1017 | },
1018 | {
1019 | "cell_type": "markdown",
1020 | "metadata": {
1021 | "deletable": true,
1022 | "editable": true
1023 | },
1024 | "source": [
1025 | "## lemmatization"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "markdown",
1030 | "metadata": {
1031 | "deletable": true,
1032 | "editable": true
1033 | },
1034 | "source": [
1035 | "Lemmatization takes a word that has been inflected or otherwise modified by the rules of the language, and returns its base (dictionary) form, called the lemma.\n",
1036 | "\n",
1037 | "The `lemmatize()` method has an optional parameter:\n",
1038 | "* pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.\n",
1039 | "* options: \n",
1040 | " - `'n'` for noun, \n",
1041 | " - `'v'` for verb, \n",
1042 | " - `'a'` for adjective, \n",
1043 | " - `'r'` for adverb.\n",
1044 | "\n",
1045 | "Note: adverbs don't usually work with the standard `lemmatize` method."
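]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the default behaviour first: with no argument, `lemmatize()` treats the word as a noun (the example word here is just an illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# no pos argument: the word is lemmatized as a noun\n",
"w = Word('cars')\n",
"w.lemmatize()  # expected: 'car'"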
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "code",
1050 | "execution_count": 20,
1051 | "metadata": {
1052 | "collapsed": false,
1053 | "deletable": true,
1054 | "editable": true
1055 | },
1056 | "outputs": [
1057 | {
1058 | "data": {
1059 | "text/plain": [
1060 | "'sing'"
1061 | ]
1062 | },
1063 | "execution_count": 20,
1064 | "metadata": {},
1065 | "output_type": "execute_result"
1066 | }
1067 | ],
1068 | "source": [
1069 | "w = Word('singing')\n",
1070 | "# for some words you have to pass the type\n",
1071 | "# in this case we pass 'v' for verb (not to be confused with POS tag formats)\n",
1072 | "w.lemmatize('v')"
1073 | ]
1074 | },
1075 | {
1076 | "cell_type": "code",
1077 | "execution_count": 21,
1078 | "metadata": {
1079 | "collapsed": false,
1080 | "deletable": true,
1081 | "editable": true
1082 | },
1083 | "outputs": [
1084 | {
1085 | "data": {
1086 | "text/plain": [
1087 | "'go'"
1088 | ]
1089 | },
1090 | "execution_count": 21,
1091 | "metadata": {},
1092 | "output_type": "execute_result"
1093 | }
1094 | ],
1095 | "source": [
1096 | "# irregular past-tense verb\n",
1097 | "w = Word('went')\n",
1098 | "w.lemmatize('v')"
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": 22,
1104 | "metadata": {
1105 | "collapsed": false,
1106 | "deletable": true,
1107 | "editable": true
1108 | },
1109 | "outputs": [
1110 | {
1111 | "data": {
1112 | "text/plain": [
1113 | "'kindly'"
1114 | ]
1115 | },
1116 | "execution_count": 22,
1117 | "metadata": {},
1118 | "output_type": "execute_result"
1119 | }
1120 | ],
1121 | "source": [
1122 | "# it doesn't always work: try an adverb\n",
1123 | "w = Word('kindly')\n",
1124 | "w.lemmatize('r')"
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "markdown",
1129 | "metadata": {
1130 | "deletable": true,
1131 | "editable": true
1132 | },
1133 | "source": [
1134 | "# Parsing & n-grams"
1135 | ]
1136 | },
1137 | {
1138 | "cell_type": "markdown",
1139 | "metadata": {
1140 | "deletable": true,
1141 | "editable": true
1142 | },
1143 | "source": [
1144 | "## Parsing"
1145 | ]
1146 | },
1147 | {
1148 | "cell_type": "markdown",
1149 | "metadata": {
1150 | "deletable": true,
1151 | "editable": true
1152 | },
1153 | "source": [
1154 | "Parsing gives us the syntactic structure of a string or sentence by tagging each word with labels that indicate its place in the hierarchy. 
See the tree in the PowerPoint slides for a visual example.\n",
1155 | "\n",
1156 | "Let's parse the sentence shown in the tree:"
1157 | ]
1158 | },
1159 | {
1160 | "cell_type": "code",
1161 | "execution_count": 23,
1162 | "metadata": {
1163 | "collapsed": false,
1164 | "deletable": true,
1165 | "editable": true
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "'John/NNP/B-NP/O loves/VBZ/B-VP/O Mary/NNP/B-NP/O'"
1172 | ]
1173 | },
1174 | "execution_count": 23,
1175 | "metadata": {},
1176 | "output_type": "execute_result"
1177 | }
1178 | ],
1179 | "source": [
1180 | "# return a string containing each word in the text along with its parts of speech hierarchy\n",
1181 | "b = TextBlob('John loves Mary')\n",
1182 | "b.parse()"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "markdown",
1187 | "metadata": {
1188 | "deletable": true,
1189 | "editable": true
1190 | },
1191 | "source": [
1192 | "`John/NNP/B-NP/O` gives the position in the hierarchy of the text for the word \"`John`\" in our sentence, working from the word up to the top of the hierarchy.\n",
1193 | "\n",
1194 | "In this case (for the word `John`):\n",
1195 | "* `NNP` indicates it is a \"noun, proper singular\"\n",
1196 | "* the `NP` in `B-NP` indicates it is part of a noun phrase\n",
1197 | "* the `B-` in `B-NP` marks the beginning of that chunk: the word is inside the chunk, and the preceding word (if any) belongs to a different chunk\n",
1198 | "* the final `O` (\"not part of chunk\") fills a fourth slot that marks prepositional noun phrases; `O` here means the word is not inside one.\n",
1199 | "\n",
1200 | "Details can be read on the page that gives the detailed parts of speech (link posted under POS tagging).\n",
1201 | "\n",
1202 | "Parsing and syntactic structure is a complex subject, and is not covered in depth here."
1203 | ]
1204 | },
1205 | {
1206 | "cell_type": "markdown",
1207 | "metadata": {
1208 | "deletable": true,
1209 | "editable": true
1210 | },
1211 | "source": [
1212 | "## n-grams"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {
1218 | "deletable": true,
1219 | "editable": true
1220 | },
1221 | "source": [
1222 | "**n**-grams are groups of n successive words. Most often n-grams are created by sliding a window one word at a time through a text, though there are variants (skip-grams) that skip over words.\n",
1223 | "\n",
1224 | "The usefulness of n-grams comes in with machine learning, where each n-gram is used as a feature for learning (a short sketch below shows the idea). These will be used more in the next workshop, but for now let's look at getting n-grams from a text using TextBlob:"
1225 | ]
1226 | },
1227 | {
1228 | "cell_type": "markdown",
1229 | "metadata": {
1230 | "deletable": true,
1231 | "editable": true
1232 | },
1233 | "source": [
1234 | "TextBlob has an `ngrams` method that takes an optional argument `n`, the size of the n-grams to generate (the default is 3).\n",
1235 | "\n",
1236 | "The method returns a list of `WordList` objects."
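]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at TextBlob's output below, here is the promised sketch of the feature idea, in plain Python (no TextBlob) with a made-up sentence: slide a window over the tokens, collect the bigrams, and count them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# a made-up example text\n",
"tokens = 'the cat sat on the mat and the cat slept'.split()\n",
"# pair each token with its successor to form bigrams (n=2)\n",
"bigrams = list(zip(tokens, tokens[1:]))\n",
"# the counts can then serve as simple features for a learning algorithm\n",
"Counter(bigrams).most_common(3)"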
1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 24, 1242 | "metadata": { 1243 | "collapsed": false, 1244 | "deletable": true, 1245 | "editable": true 1246 | }, 1247 | "outputs": [ 1248 | { 1249 | "data": { 1250 | "text/plain": [ 1251 | "[WordList(['VirginAmerica', 'status', 'match']),\n", 1252 | " WordList(['status', 'match', 'program']),\n", 1253 | " WordList(['match', 'program', 'I']),\n", 1254 | " WordList(['program', 'I', 'applied']),\n", 1255 | " WordList(['I', 'applied', 'and'])]" 1256 | ] 1257 | }, 1258 | "execution_count": 24, 1259 | "metadata": {}, 1260 | "output_type": "execute_result" 1261 | } 1262 | ], 1263 | "source": [ 1264 | "# return list of n-grams (default n=3)\n", 1265 | "# get only first 5 n-grams\n", 1266 | "blob.ngrams()[:5]" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": 25, 1272 | "metadata": { 1273 | "collapsed": false, 1274 | "deletable": true, 1275 | "editable": true 1276 | }, 1277 | "outputs": [ 1278 | { 1279 | "data": { 1280 | "text/plain": [ 1281 | "[WordList(['VirginAmerica', 'status']),\n", 1282 | " WordList(['status', 'match']),\n", 1283 | " WordList(['match', 'program']),\n", 1284 | " WordList(['program', 'I']),\n", 1285 | " WordList(['I', 'applied'])]" 1286 | ] 1287 | }, 1288 | "execution_count": 25, 1289 | "metadata": {}, 1290 | "output_type": "execute_result" 1291 | } 1292 | ], 1293 | "source": [ 1294 | "# get another set with n = 2\n", 1295 | "blob.ngrams(n=2)[:5]" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "code", 1300 | "execution_count": null, 1301 | "metadata": { 1302 | "collapsed": true 1303 | }, 1304 | "outputs": [], 1305 | "source": [] 1306 | }, 1307 | { 1308 | "cell_type": "markdown", 1309 | "metadata": {}, 1310 | "source": [ 1311 | "# Practice Problems\n", 1312 | "\n", 1313 | "1. Create and parse blob (using index 25) and print the first 10 pieces on separate lines\n", 1314 | "2. Singularize all words in blob\n", 1315 | "3. Pluralize the words ['gallery', 'mouse', 'man']\n", 1316 | "4. Lemmatize the words ['categories', 'mice', 'better', 'found']\n", 1317 | "5. Print the first 5 unique POS tags in blob\n", 1318 | "6. Given the n-grams in the last cell in the notebook, reconstruct the original sentence\n", 1319 | "7. 
**Extra:** List all words in blob that are plural (with the index of each word)"
1320 | ]
1321 | },
1322 | {
1323 | "cell_type": "markdown",
1324 | "metadata": {},
1325 | "source": [
1326 | "### For practice problem 6 in the second hour"
1327 | ]
1328 | },
1329 | {
1330 | "cell_type": "code",
1331 | "execution_count": 1,
1332 | "metadata": {
1333 | "collapsed": true
1334 | },
1335 | "outputs": [],
1336 | "source": [
1337 | "ngrams = [['The', 'quick', 'brown', 'fox'],\n",
1338 | " ['quick', 'brown', 'fox', 'jumps'],\n",
1339 | " ['brown', 'fox', 'jumps', 'over'],\n",
1340 | " ['fox', 'jumps', 'over', 'the'],\n",
1341 | " ['jumps', 'over', 'the', 'lazy'],\n",
1342 | " ['over', 'the', 'lazy', 'dog'],\n",
1343 | " ['the', 'lazy', 'dog', 'and'],\n",
1344 | " ['lazy', 'dog', 'and', 'the'],\n",
1345 | " ['dog', 'and', 'the', 'cow'],\n",
1346 | " ['and', 'the', 'cow', 'jumped'],\n",
1347 | " ['the', 'cow', 'jumped', 'over'],\n",
1348 | " ['cow', 'jumped', 'over', 'the'],\n",
1349 | " ['jumped', 'over', 'the', 'moon']]"
1350 | ]
1351 | }
1352 | ],
1353 | "metadata": {
1354 | "anaconda-cloud": {},
1355 | "kernelspec": {
1356 | "display_name": "Python [default]",
1357 | "language": "python",
1358 | "name": "python3"
1359 | },
1360 | "language_info": {
1361 | "codemirror_mode": {
1362 | "name": "ipython",
1363 | "version": 3
1364 | },
1365 | "file_extension": ".py",
1366 | "mimetype": "text/x-python",
1367 | "name": "python",
1368 | "nbconvert_exporter": "python",
1369 | "pygments_lexer": "ipython3",
1370 | "version": "3.5.2"
1371 | }
1372 | },
1373 | "nbformat": 4,
1374 | "nbformat_minor": 2
1375 | }
1376 |
-------------------------------------------------------------------------------- /nlp_workshop2.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# TextBlob: Sentiment Analysis & Classifiers"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## What is Sentiment Analysis?\n",
15 | "\n",
16 | "Sentiment analysis is a method in NLP used to classify the emotion (or tone) and subjectivity of human language. At the most common and basic level, the goal is to classify a text as positive, negative, or neutral in tone, and to determine how subjective it is. The aspect of subjectivity will only very briefly be noted in this workshop.\n",
17 | "\n",
18 | "At a more complex level, sentiment analysis is a technique used to classify the specific emotions in human language, such as angry, happy, sad, excited, etc. So instead of simply learning/classifying three classes (positive, negative, neutral), the goal is to distinguish among many specific classes."
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Why Use Sentiment Analysis?\n",
26 | "\n",
27 | "The actual usefulness of sentiment analysis depends on the industry using it, but the most common reasons to use it involve scraping lots of data (e.g. Twitter feeds or Reddit comments) to determine how customers/users feel about a particular brand, product, or service. \n",
28 | "\n",
29 | "There is also a use for sentiment analysis when analyzing financial securities (the stock market): if a large proportion of people shift in sentiment about a particular market or stock, that is going to affect the prices of the securities involved."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Specific Uses\n",
37 | "\n",
38 | "* Insight into opinions on specific political policies\n",
39 | "* Brand monitoring (how is a brand perceived?)\n",
40 | "* Identify good and bad aspects of a product or its ads\n",
41 | "* Impact of changes in sentiment on securities markets\n",
42 | "* Will likely be used one day with virtual assistants and other AI\n",
43 | "* Hotels can use it to know how they can improve their property and service"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | " "
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "## Getting Started"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 1,
63 | "metadata": {
64 | "collapsed": true,
65 | "deletable": true,
66 | "editable": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "# import what we need\n",
71 | "import pandas as pd\n",
72 | "from pandas import DataFrame as DF, Series\n",
73 | "\n",
74 | "import numpy as np\n",
75 | "\n",
76 | "import matplotlib.pyplot as plt\n",
77 | "%matplotlib inline\n",
78 | "\n",
79 | "from textblob import TextBlob"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 2,
85 | "metadata": {
86 | "collapsed": true,
87 | "deletable": true,
88 | "editable": true
89 | },
90 | "outputs": [],
91 | "source": [
92 | "# read data\n",
93 | "cols = ['airline_sentiment','airline_sentiment_confidence',\n",
94 | " 'airline','name','text']\n",
95 | "data = pd.read_csv('tweets.csv', usecols=cols)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Below are the first 5 rows of our data. We will only be using the first two features and the last feature."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 3,
108 | "metadata": {
109 | "collapsed": false,
110 | "deletable": true,
111 | "editable": true
112 | },
113 | "outputs": [
114 | {
115 | "data": {
116 | "text/html": [
117 | "
\n", 118 | "\n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | "
airline_sentimentairline_sentiment_confidenceairlinenametext
0neutral1.0000Virgin Americacairdin@VirginAmerica What @dhepburn said.
1positive0.3486Virgin Americajnardino@VirginAmerica plus you've added commercials t...
2neutral0.6837Virgin Americayvonnalynn@VirginAmerica I didn't today... Must mean I n...
3negative1.0000Virgin Americajnardino@VirginAmerica it's really aggressive to blast...
4negative1.0000Virgin Americajnardino@VirginAmerica and it's a really big bad thing...
\n", 172 | "
" 173 | ], 174 | "text/plain": [ 175 | " airline_sentiment airline_sentiment_confidence airline name \\\n", 176 | "0 neutral 1.0000 Virgin America cairdin \n", 177 | "1 positive 0.3486 Virgin America jnardino \n", 178 | "2 neutral 0.6837 Virgin America yvonnalynn \n", 179 | "3 negative 1.0000 Virgin America jnardino \n", 180 | "4 negative 1.0000 Virgin America jnardino \n", 181 | "\n", 182 | " text \n", 183 | "0 @VirginAmerica What @dhepburn said. \n", 184 | "1 @VirginAmerica plus you've added commercials t... \n", 185 | "2 @VirginAmerica I didn't today... Must mean I n... \n", 186 | "3 @VirginAmerica it's really aggressive to blast... \n", 187 | "4 @VirginAmerica and it's a really big bad thing... " 188 | ] 189 | }, 190 | "execution_count": 3, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "data.head()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "# Polarity & Subjectivity Using TextBlob `sentiment`" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "deletable": true, 210 | "editable": true 211 | }, 212 | "source": [ 213 | "## Basic Sentiment Analysis" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "deletable": true, 220 | "editable": true 221 | }, 222 | "source": [ 223 | "### Using the TextBlob `sentiment` method" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": { 229 | "deletable": true, 230 | "editable": true 231 | }, 232 | "source": [ 233 | "TextBlob has a `sentiment` method that can be used on any `TextBlob` object. It returns two values:\n", 234 | "* polarity: value in range [-1, 1], indicating how negative or positive the text is (close to 0.0 is neutral).\n", 235 | "* subjectivity: value in range [0, 1], indicating how subjective the text is (1 is very subjective)\n", 236 | "\n", 237 | "This method is very basic, and there is a lot to be desired, but it can still be helpful if you don't have opportunity to train a classifier, and just need some rough results." 
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 4,
243 | "metadata": {
244 | "collapsed": false,
245 | "deletable": true,
246 | "editable": true
247 | },
248 | "outputs": [
249 | {
250 | "name": "stdout",
251 | "output_type": "stream",
252 | "text": [
253 | "The food is on the table \n",
254 | "(p=0.0, s=0.0) \n",
255 | "\n",
256 | "The food is green \n",
257 | "(p=-0.2, s=0.3) \n",
258 | "\n",
259 | "I don't like the food \n",
260 | "(p=0.0, s=0.0) \n",
261 | "\n",
262 | "I do not like the food \n",
263 | "(p=0.0, s=0.0) \n",
264 | "\n",
265 | "I like the food \n",
266 | "(p=0.0, s=0.0) \n",
267 | "\n",
268 | "I don't love the food \n",
269 | "(p=0.5, s=0.6) \n",
270 | "\n",
271 | "I do not love the food \n",
272 | "(p=-0.25, s=0.6) \n",
273 | "\n",
274 | "I hate the food \n",
275 | "(p=-0.8, s=0.9) \n",
276 | "\n",
277 | "I love the food \n",
278 | "(p=0.5, s=0.6) \n",
279 | "\n",
280 | "The food is delicious \n",
281 | "(p=1.0, s=1.0) \n",
282 | "\n"
283 | ]
284 | }
285 | ],
286 | "source": [
287 | "lines = [\"The food is on the table\", \"The food is green\", \"I don't like the food\",\n",
288 | " \"I do not like the food\", \"I like the food\", \"I don't love the food\", \"I do not love the food\",\n",
289 | " \"I hate the food\", \"I love the food\", \"The food is delicious\"]\n",
290 | "\n",
291 | "# analyze the sentences\n",
292 | "sentiments = [b.sentiment for b in [TextBlob(l) for l in lines]]\n",
293 | "for l,s in zip(lines, sentiments):\n",
294 | " print('{} \\n(p={}, s={})'.format(l, s[0], s[1]), '\\n')"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {
300 | "deletable": true,
301 | "editable": true
302 | },
303 | "source": [
304 | "As seen above, this method doesn't recognize negative contractions (e.g. don't), and it has trouble with ambiguous words that can take on multiple meanings (e.g. like, which is also used for comparison).\n",
305 | "\n",
306 | "Let's see how it does with some real tweets."
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "## Using The `sentiment` Method on Tweets"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "We will get a subset of our data that contains only the first 10 rows that have a confidence level greater than 0.6. We are uninterested in entries with a high level of uncertainty, since keeping low-confidence observations would reduce the certainty of the evaluations we make later."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 5,
326 | "metadata": {
327 | "collapsed": false,
328 | "deletable": true,
329 | "editable": true
330 | },
331 | "outputs": [],
332 | "source": [
333 | "# get subset of tweets where confidence is > 0.6\n",
334 | "subset = data[data.airline_sentiment_confidence > 0.6]\\\n",
335 | " .head(10).copy().reset_index(drop=True)\n",
336 | "tweets = subset.text"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 6,
342 | "metadata": {
343 | "collapsed": false,
344 | "deletable": true,
345 | "editable": true
346 | },
347 | "outputs": [
348 | {
349 | "data": {
350 | "text/html": [
351 | "
\n", 352 | "\n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | "
airline_sentimentairline_sentiment_confidenceairlinenametext
0neutral1.0000Virgin Americacairdin@VirginAmerica What @dhepburn said.
1neutral0.6837Virgin Americayvonnalynn@VirginAmerica I didn't today... Must mean I n...
2negative1.0000Virgin Americajnardino@VirginAmerica it's really aggressive to blast...
3negative1.0000Virgin Americajnardino@VirginAmerica and it's a really big bad thing...
4negative1.0000Virgin Americajnardino@VirginAmerica seriously would pay $30 a fligh...
5positive0.6745Virgin Americacjmcginnis@VirginAmerica yes, nearly every time I fly VX...
6neutral0.6340Virgin Americapilot@VirginAmerica Really missed a prime opportuni...
7positive0.6559Virgin Americadhepburn@virginamerica Well, I didn't…but NOW I DO! :-D
8positive1.0000Virgin AmericaYupitsTate@VirginAmerica it was amazing, and arrived an ...
9neutral0.6769Virgin Americaidk_but_youtube@VirginAmerica did you know that suicide is th...
\n", 446 | "
" 447 | ], 448 | "text/plain": [ 449 | " airline_sentiment airline_sentiment_confidence airline \\\n", 450 | "0 neutral 1.0000 Virgin America \n", 451 | "1 neutral 0.6837 Virgin America \n", 452 | "2 negative 1.0000 Virgin America \n", 453 | "3 negative 1.0000 Virgin America \n", 454 | "4 negative 1.0000 Virgin America \n", 455 | "5 positive 0.6745 Virgin America \n", 456 | "6 neutral 0.6340 Virgin America \n", 457 | "7 positive 0.6559 Virgin America \n", 458 | "8 positive 1.0000 Virgin America \n", 459 | "9 neutral 0.6769 Virgin America \n", 460 | "\n", 461 | " name text \n", 462 | "0 cairdin @VirginAmerica What @dhepburn said. \n", 463 | "1 yvonnalynn @VirginAmerica I didn't today... Must mean I n... \n", 464 | "2 jnardino @VirginAmerica it's really aggressive to blast... \n", 465 | "3 jnardino @VirginAmerica and it's a really big bad thing... \n", 466 | "4 jnardino @VirginAmerica seriously would pay $30 a fligh... \n", 467 | "5 cjmcginnis @VirginAmerica yes, nearly every time I fly VX... \n", 468 | "6 pilot @VirginAmerica Really missed a prime opportuni... \n", 469 | "7 dhepburn @virginamerica Well, I didn't…but NOW I DO! :-D \n", 470 | "8 YupitsTate @VirginAmerica it was amazing, and arrived an ... \n", 471 | "9 idk_but_youtube @VirginAmerica did you know that suicide is th... " 472 | ] 473 | }, 474 | "execution_count": 6, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "subset" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "### Compare the `sentiment` predictions with each line in `subset`\n", 488 | "\n", 489 | "We want to get a sense of how each tweet is being classified" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 7, 495 | "metadata": { 496 | "collapsed": false, 497 | "deletable": true, 498 | "editable": true 499 | }, 500 | "outputs": [ 501 | { 502 | "name": "stdout", 503 | "output_type": "stream", 504 | "text": [ 505 | "@VirginAmerica What @dhepburn said. \n", 506 | " 0.0 (target: neutral) \n", 507 | "\n", 508 | "@VirginAmerica I didn't today... Must mean I need to take another trip! \n", 509 | " -0.390625 (target: neutral) \n", 510 | "\n", 511 | "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse \n", 512 | " 0.0062500000000000056 (target: negative) \n", 513 | "\n", 514 | "@VirginAmerica and it's a really big bad thing about it \n", 515 | " -0.3499999999999999 (target: negative) \n", 516 | "\n", 517 | "@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\n", 518 | "it's really the only bad thing about flying VA \n", 519 | " -0.2083333333333333 (target: negative) \n", 520 | "\n", 521 | "@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) \n", 522 | " 0.4666666666666666 (target: positive) \n", 523 | "\n", 524 | "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP \n", 525 | " 0.2 (target: neutral) \n", 526 | "\n", 527 | "@virginamerica Well, I didn't…but NOW I DO! :-D \n", 528 | " 1.0 (target: positive) \n", 529 | "\n", 530 | "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me. 
\n", 531 | " 0.4666666666666666 (target: positive) \n", 532 | "\n", 533 | "@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24 \n", 534 | " 0.0 (target: neutral) \n", 535 | "\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "# print the tweets and predicted polarity line-by-line\n", 541 | "for i,t in enumerate(tweets):\n", 542 | " s = TextBlob(t).sentiment\n", 543 | " target = subset.airline_sentiment[i]\n", 544 | " print(t, '\\n', '{} (target: {}) \\n'.format(s[0], target))" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": { 550 | "deletable": true, 551 | "editable": true 552 | }, 553 | "source": [ 554 | "This basic sentiment analyzer missed the mark on 3/10 tweets (2 neutral and 1 negative). That's not too bad, but these results are nothing to celebrate. The perfmance declines quite a bit with larger texts.\n", 555 | "\n", 556 | "Looking at the two tweets the `sentiment` method estimated incorrectly:\n", 557 | "\n", 558 | "**@VirginAmerica I didn't today... Must mean I need to take another trip!**\n", 559 | "This one is interpreted by the computer as negative, and perhaps it's correct. This one is full of ambiguity without any context, and that is probably why the target value in the set is neutral.\n", 560 | "\n", 561 | "**@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse**\n", 562 | "This one is " 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "### Analyze polarity of each word in the last sentence above to see what's happening" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": 8, 575 | "metadata": { 576 | "collapsed": false, 577 | "deletable": true, 578 | "editable": true 579 | }, 580 | "outputs": [ 581 | { 582 | "name": "stdout", 583 | "output_type": "stream", 584 | "text": [ 585 | "VirginAmerica 0.0 \n", 586 | "\n", 587 | "it 0.0 \n", 588 | "\n", 589 | "'s 0.0 \n", 590 | "\n", 591 | "really 0.2 \n", 592 | "\n", 593 | "aggressive 0.0 \n", 594 | "\n", 595 | "to 0.0 \n", 596 | "\n", 597 | "blast 0.0 \n", 598 | "\n", 599 | "obnoxious 0.0 \n", 600 | "\n", 601 | "entertainment 0.0 \n", 602 | "\n", 603 | "in 0.0 \n", 604 | "\n", 605 | "your 0.0 \n", 606 | "\n", 607 | "guests 0.0 \n", 608 | "\n", 609 | "faces 0.0 \n", 610 | "\n", 611 | "amp 0.0 \n", 612 | "\n", 613 | "they 0.0 \n", 614 | "\n", 615 | "have 0.0 \n", 616 | "\n", 617 | "little -0.1875 \n", 618 | "\n", 619 | "recourse 0.0 \n", 620 | "\n" 621 | ] 622 | } 623 | ], 624 | "source": [ 625 | "words = TextBlob(tweets[2]).words\n", 626 | "for w in words: print(w, TextBlob(w).sentiment[0], '\\n')" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": { 632 | "deletable": true, 633 | "editable": true 634 | }, 635 | "source": [ 636 | "We can see that the `sentiment` method does not consider the words \"obnoxious\" or \"aggressive\" to be negative, which is a glaring problem for our analysis. This method is clearly limited and we need a better method." 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": { 642 | "deletable": true, 643 | "editable": true 644 | }, 645 | "source": [ 646 | "# Naive Bayes Classifier for Sentiment Anlaysis" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "Here we will use a Naive Bayes Classifier (included with TextBlob) to create a better sentiment analyzer. 
We will only train on a small portion of our data since it takes a while to train. However, even with a small amount of training data we can get better results than the `sentiment` method.\n",
654 | "\n",
655 | "There are other classifiers included with TextBlob, but this one is easy to use and gives good performance.\n",
656 | "\n",
657 | "We will start with three goals:\n",
658 | "* learn to train and test/evaluate this classifier using a subset of our data\n",
659 | "* compare the performance to the original sentiment method\n",
660 | "* look at the features the classifier is extracting from the text"
661 | ]
662 | },
663 | {
664 | "cell_type": "markdown",
665 | "metadata": {},
666 | "source": [
667 | "### Create train and test sets\n",
668 | "\n",
669 | "* train the model on the first set\n",
670 | "* test/evaluate it on the other"
671 | ]
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "The set below named `reduced` is reduced in dimensionality (keeping only the features/columns we care about).\n",
678 | "\n",
679 | "The `train` and `test` sets are created using something called a list comprehension. If you don't know what that is, it's okay, and you can look it up later. What is important to know is that the Naive Bayes classifier takes data in the form of a list of pairs (2-tuples), where each pair is one observation `(text, label)` and `label` is the class label that belongs to the text."
680 | ]
681 | },
682 | {
683 | "cell_type": "code",
684 | "execution_count": 9,
685 | "metadata": {
686 | "collapsed": false,
687 | "deletable": true,
688 | "editable": true
689 | },
690 | "outputs": [],
691 | "source": [
692 | "# get reduced set (.loc keeps only the columns we care about; the older .ix accessor is deprecated)\n",
693 | "reduced = data.loc[:, ['airline_sentiment','text']].copy()\n",
694 | "reduced.rename(columns={'airline_sentiment': 'target'}, inplace=True)\n",
695 | "\n",
696 | "# now create train and test sets for the first 500 tweets\n",
697 | "# for the TextBlob classifier we need a list of pairs (string, target)\n",
698 | "train = [(s, t) for s,t in zip(reduced.iloc[:350].text, reduced.iloc[:350].target)]\n",
699 | "test = [(s, t) for s,t in zip(reduced.iloc[350:500].text, reduced.iloc[350:500].target)]"
700 | ]
701 | },
702 | {
703 | "cell_type": "markdown",
704 | "metadata": {},
705 | "source": [
706 | "### Train and evaluate"
707 | ]
708 | },
709 | {
710 | "cell_type": "code",
711 | "execution_count": 10,
712 | "metadata": {
713 | "collapsed": false,
714 | "deletable": true,
715 | "editable": true
716 | },
717 | "outputs": [
718 | {
719 | "data": {
720 | "text/plain": [
721 | "0.6066666666666667"
722 | ]
723 | },
724 | "execution_count": 10,
725 | "metadata": {},
726 | "output_type": "execute_result"
727 | }
728 | ],
729 | "source": [
730 | "# import the classifier\n",
731 | "from textblob.classifiers import NaiveBayesClassifier\n",
732 | "\n",
733 | "# train\n",
734 | "cl = NaiveBayesClassifier(train)\n",
735 | "# evaluate\n",
736 | "cl.accuracy(test)"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 11,
742 | "metadata": {
743 | "collapsed": false,
744 | "deletable": true,
745 | "editable": true
746 | },
747 | "outputs": [
748 | {
749 | "data": {
750 | "text/plain": [
751 | "negative 9178\n",
752 | "neutral 3099\n",
753 | "positive 2363\n",
754 | "Name: target, dtype: int64"
755 | ]
756 | },
757 | "execution_count": 11,
758 | "metadata": {},
759 | "output_type": "execute_result"
760 | }
761 | ],
762 | "source": [
763 | "# a quick look at the distribution of class labels\n",
764 | 
"reduced.target.value_counts()" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": { 770 | "deletable": true, 771 | "editable": true 772 | }, 773 | "source": [ 774 | "The classes in the test set are pretty much balanced, but the classes in the entire reduced set are not balanced." 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": { 780 | "deletable": true, 781 | "editable": true 782 | }, 783 | "source": [ 784 | "Let's compare the 61% classifier accuracy to the performance of the `sentiment` method." 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "## When accuracy isn't good enough" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": { 797 | "deletable": true, 798 | "editable": true 799 | }, 800 | "source": [ 801 | "**Need better scoring method for multi-class predictions**\n", 802 | "\n", 803 | "Regular accuracy is simply the ratio of number of correct predictions to total number of predictions made. This pays no attention to how many classes there are, or how well each one is predicted.\n", 804 | "\n", 805 | "**When it’s not good enough**\n", 806 | "* there are more than two classes (in our case there are 3)\n", 807 | "* there is an imbalance (at least one class with far fewer instances than another)\n", 808 | "\n", 809 | "If there is a strong imbalance (and this does happen) where there are two classes one only happens 5% of the time, if all we do is predict everything to be the majority class, then we will automatically get 95% accuracy. That's meaningless in such a case.\n", 810 | "\n", 811 | "**Precision and Recall are two useful metrics in these cases**\n", 812 | "\n", 813 | "Precision = TP / (TP + FP) : how often predictions of a specific class are correct\n", 814 | "\n", 815 | "TP : True Positive
\n", 816 | "FP : False Positive\n", 817 | "\n", 818 | "Recall = TP / (TP + FN) : how often specific classes are identified (not missed)\n", 819 | "\n", 820 | "FN : False Negative\n", 821 | "\n", 822 | "**Precision & Recall**\n", 823 | "\n", 824 | "Precision = $\\frac{TP}{TP + FP}$\n", 825 | "\n", 826 | "Recall = $\\frac{TP}{TP + FN}$" 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": 13, 832 | "metadata": { 833 | "collapsed": false, 834 | "deletable": true, 835 | "editable": true 836 | }, 837 | "outputs": [], 838 | "source": [ 839 | "# create a score function that will give precision and recall values for each class\n", 840 | "def score(true, predicted):\n", 841 | " eq = np.equal\n", 842 | " \n", 843 | " t = np.array(true)\n", 844 | " p = np.array(predicted)\n", 845 | " \n", 846 | " tp = np.array([eq((t == c)*(p == c), 1).sum() for c in np.unique(t)])\n", 847 | " fp = np.array([eq((t != c)*(p == c), 1).sum() for c in np.unique(t)])\n", 848 | " fn = np.array([eq((t == c)*(p != c), 1).sum() for c in np.unique(t)])\n", 849 | "\n", 850 | " precision = tp/(tp + fp)\n", 851 | " recall = tp/(tp + fn)\n", 852 | " \n", 853 | " return (np.unique(t), precision, recall)" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": { 859 | "deletable": true, 860 | "editable": true 861 | }, 862 | "source": [ 863 | "### Evaluate classifier on larger set\n", 864 | "\\* **skip this; takes too long** \\*\n", 865 | "\n", 866 | "**With train/test split**" 867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": 15, 872 | "metadata": { 873 | "collapsed": true, 874 | "deletable": true, 875 | "editable": true 876 | }, 877 | "outputs": [], 878 | "source": [ 879 | "# create new train and test sets\n", 880 | "# for the TextBlob classifier we need a list of doubles (string, target)\n", 881 | "\n", 882 | "# train = [(s, t) for s,t in zip(reduced.iloc[:1500].text, reduced.iloc[:1500].target)]\n", 883 | "# test = [(s, t) for s,t in zip(reduced.iloc[1500:2000].text, reduced.iloc[1500:2000].target)]" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 16, 889 | "metadata": { 890 | "collapsed": false, 891 | "deletable": true, 892 | "editable": true 893 | }, 894 | "outputs": [ 895 | { 896 | "data": { 897 | "text/plain": [ 898 | "0.786" 899 | ] 900 | }, 901 | "execution_count": 16, 902 | "metadata": {}, 903 | "output_type": "execute_result" 904 | } 905 | ], 906 | "source": [ 907 | "# train\n", 908 | "# cl = NaiveBayesClassifier(train)\n", 909 | "\n", 910 | "# evaluate\n", 911 | "# cl.accuracy(test)\n", 912 | "# 0.786" 913 | ] 914 | }, 915 | { 916 | "cell_type": "code", 917 | "execution_count": null, 918 | "metadata": { 919 | "collapsed": true 920 | }, 921 | "outputs": [], 922 | "source": [] 923 | }, 924 | { 925 | "cell_type": "markdown", 926 | "metadata": {}, 927 | "source": [ 928 | "# Practice Problems\n", 929 | "\n", 930 | "1. Create a pandas series of polarity values predicted for all entries in the reduced set using the sentiment method\n", 931 | "2. Create a column in the reduced set with class labels mapped from the polarity values in (1.) using the following rules:\n", 932 | " - polarity < - 0.1 : ‘negative’\n", 933 | " - polarity > 0.1 : ‘positive’\n", 934 | " - else : ‘neutral’\n", 935 | "3. Compute the accuracy of the predicted labels from (2.) for the same range as the test set [350:500]\n", 936 | "4. 
Update the score function to print a clean table of scores (hint: use pandas) with\n",
937 | " - rows for precision and recall\n",
938 | " - columns for class labels\n"
939 | ]
940 | },
941 | {
942 | "cell_type": "markdown",
943 | "metadata": {},
944 | "source": [
945 | "# Naive Bayes Classifier: Digging Deeper"
946 | ]
947 | },
948 | {
949 | "cell_type": "markdown",
950 | "metadata": {},
951 | "source": [
952 | "## Making Predictions\n",
953 | "\n",
954 | "`NaiveBayesClassifier` has a `classify` method that takes text (a single string) as an argument. This means that we can either classify some string that we choose to type by hand, or classify tweets from our test set individually."
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": 24,
960 | "metadata": {
961 | "collapsed": false
962 | },
963 | "outputs": [
964 | {
965 | "data": {
966 | "text/plain": [
967 | "'positive'"
968 | ]
969 | },
970 | "execution_count": 24,
971 | "metadata": {},
972 | "output_type": "execute_result"
973 | }
974 | ],
975 | "source": [
976 | "cl.classify('I love this airline')"
977 | ]
978 | },
979 | {
980 | "cell_type": "markdown",
981 | "metadata": {},
982 | "source": [
983 | "### Getting class probabilities"
984 | ]
985 | },
986 | {
987 | "cell_type": "code",
988 | "execution_count": 26,
989 | "metadata": {
990 | "collapsed": false
991 | },
992 | "outputs": [
993 | {
994 | "data": {
995 | "text/plain": [
996 | "'positive'"
997 | ]
998 | },
999 | "execution_count": 26,
1000 | "metadata": {},
1001 | "output_type": "execute_result"
1002 | }
1003 | ],
1004 | "source": [
1005 | "probs = cl.prob_classify('I love this airline')\n",
1006 | "probs.max()"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 27,
1012 | "metadata": {
1013 | "collapsed": false
1014 | },
1015 | "outputs": [
1016 | {
1017 | "data": {
1018 | "text/plain": [
1019 | "0.8788493380472053"
1020 | ]
1021 | },
1022 | "execution_count": 27,
1023 | "metadata": {},
1024 | "output_type": "execute_result"
1025 | }
1026 | ],
1027 | "source": [
1028 | "probs.prob('positive')"
1029 | ]
1030 | },
1031 | {
1032 | "cell_type": "code",
1033 | "execution_count": 28,
1034 | "metadata": {
1035 | "collapsed": false
1036 | },
1037 | "outputs": [
1038 | {
1039 | "data": {
1040 | "text/plain": [
1041 | "0.01575421132591375"
1042 | ]
1043 | },
1044 | "execution_count": 28,
1045 | "metadata": {},
1046 | "output_type": "execute_result"
1047 | }
1048 | ],
1049 | "source": [
1050 | "probs.prob('negative')"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "markdown",
1055 | "metadata": {},
1056 | "source": [
1057 | "The above can be useful if you want to modify how something is classified by setting a threshold. For example, you may want to classify something as positive only if the probability exceeds 0.9, instead of it simply having the highest probability."
1058 | ]
1059 | },
1060 | {
1061 | "cell_type": "markdown",
1062 | "metadata": {
1063 | "collapsed": true,
1064 | "deletable": true,
1065 | "editable": true
1066 | },
1067 | "source": [
1068 | "## Informative Features\n",
1069 | "\n",
1070 | "The method below gives us some insight into how the classifier is making decisions. For example, we can see that if a string contains the word \"great\", there are 9.7:1 odds that the string is positive rather than negative. All of the features are taken into account for a given string, so a string containing \"great\" will not automatically be classified as positive."
1071 | ]
1072 | },
1073 | {
1074 | "cell_type": "code",
1075 | "execution_count": 17,
1076 | "metadata": {
1077 | "collapsed": false,
1078 | "deletable": true,
1079 | "editable": true
1080 | },
1081 | "outputs": [
1082 | {
1083 | "name": "stdout",
1084 | "output_type": "stream",
1085 | "text": [
1086 | "Most Informative Features\n",
1087 | " contains(no) = True negati : neutra = 9.7 : 1.0\n",
1088 | " contains(great) = True positi : negati = 9.7 : 1.0\n",
1089 | " contains(Thanks) = True positi : negati = 8.7 : 1.0\n",
1090 | " contains(love) = True positi : negati = 8.7 : 1.0\n",
1091 | " contains(thanks) = True positi : negati = 6.9 : 1.0\n",
1092 | " contains(site) = True negati : positi = 6.5 : 1.0\n",
1093 | " contains(not) = True negati : positi = 6.0 : 1.0\n",
1094 | " contains(amazing) = True positi : negati = 6.0 : 1.0\n",
1095 | " contains(Thank) = True positi : negati = 6.0 : 1.0\n",
1096 | " contains(website) = True negati : neutra = 5.5 : 1.0\n"
1097 | ]
1098 | }
1099 | ],
1100 | "source": [
1101 | "cl.show_informative_features(10)"
1102 | ]
1103 | },
1104 | {
1105 | "cell_type": "markdown",
1106 | "metadata": {
1107 | "deletable": true,
1108 | "editable": true
1109 | },
1110 | "source": [
1111 | "**How to interpret this:**\n",
1112 | "* We are given rows that have `contains(feature) = True/False` and a comparison of two class labels, with a ratio that indicates how much more likely one is than the other\n",
1113 | "* The printed results are in descending order of importance\n",
1114 | "* Ex: `contains(no) = True` gives the ratio 9.7 : 1.0, showing that such a string is about 9.7 times more likely to be negative than neutral\n",
1115 | "* The default features for the Naive Bayes classifier are the individual words found in the data"
1116 | ]
1117 | },
1118 | {
1119 | "cell_type": "markdown",
1120 | "metadata": {},
1121 | "source": [
1122 | "## Extracting Features\n",
1123 | "\n",
1124 | "The classifier's `extract_features` method serves one purpose: take a string and return a dictionary of all the features the classifier knows (individual words by default) and whether or not each one appears in the string. It is essentially a binary feature vector."
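]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the call, assuming the classifier `cl` trained above (the full dictionary for a real tweet is shown in the next cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# extract_features maps a string to {contains(word): True/False}\n",
"feats = cl.extract_features('I love this airline')\n",
"# peek at a couple of entries instead of printing the whole dictionary\n",
"feats.get('contains(love)'), feats.get('contains(no)')"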
1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 22, 1130 | "metadata": { 1131 | "collapsed": false, 1132 | "deletable": true, 1133 | "editable": true 1134 | }, 1135 | "outputs": [ 1136 | { 1137 | "data": { 1138 | "text/plain": [ 1139 | "{'contains(while)': False,\n", 1140 | " 'contains(schedule)': False,\n", 1141 | " 'contains(week)': False,\n", 1142 | " 'contains(hard)': False,\n", 1143 | " 'contains(sorry)': False,\n", 1144 | " 'contains(t.co/zSuZTNAIJq)': False,\n", 1145 | " 'contains(views)': False,\n", 1146 | " 'contains(add)': False,\n", 1147 | " 'contains(issue)': False,\n", 1148 | " 'contains(quick)': False,\n", 1149 | " 'contains(Andrews)': False,\n", 1150 | " 'contains(Follow)': False,\n", 1151 | " 'contains(enter)': False,\n", 1152 | " 'contains(Many)': False,\n", 1153 | " 'contains(t.co/UT5GrRwAaA)': False,\n", 1154 | " 'contains(Holla)': False,\n", 1155 | " 'contains(Same)': False,\n", 1156 | " 'contains(cake)': False,\n", 1157 | " 'contains(t.co/gLXFwP6nQH)': False,\n", 1158 | " 'contains(NewsVP)': False,\n", 1159 | " 'contains(24hrs)': False,\n", 1160 | " 'contains(reimburse)': False,\n", 1161 | " 'contains(makes)': False,\n", 1162 | " 'contains(back-end)': False,\n", 1163 | " 'contains(PrincessHalf)': False,\n", 1164 | " 'contains(pros)': False,\n", 1165 | " 'contains(if)': False,\n", 1166 | " 'contains(wish)': False,\n", 1167 | " 'contains(t.co/XZ6qeG3nef)': False,\n", 1168 | " 'contains(bked)': False,\n", 1169 | " 'contains(account)': False,\n", 1170 | " 'contains(Lister)': False,\n", 1171 | " 'contains(keeps)': False,\n", 1172 | " 'contains(brand)': False,\n", 1173 | " 'contains(jump)': False,\n", 1174 | " 'contains(deals)': False,\n", 1175 | " 'contains(Handily)': False,\n", 1176 | " 'contains(has)': False,\n", 1177 | " 'contains(charging)': False,\n", 1178 | " 'contains(Debbie)': False,\n", 1179 | " 'contains(ressie)': False,\n", 1180 | " 'contains(time)': False,\n", 1181 | " 'contains(t.co/UKdjjijroW)': False,\n", 1182 | " 'contains(downtown)': False,\n", 1183 | " 'contains(t.co/yPo7nYpRZl)': False,\n", 1184 | " 'contains(2015)': False,\n", 1185 | " 'contains(interesting)': False,\n", 1186 | " 'contains(gon)': False,\n", 1187 | " 'contains(answer)': False,\n", 1188 | " 'contains(DFW)': False,\n", 1189 | " 'contains(GMA)': False,\n", 1190 | " 'contains(redirected)': False,\n", 1191 | " 'contains(first)': False,\n", 1192 | " 'contains(net)': False,\n", 1193 | " 'contains(You’ve)': False,\n", 1194 | " 'contains(100)': False,\n", 1195 | " 'contains(last)': False,\n", 1196 | " 'contains(sec)': False,\n", 1197 | " 'contains(rain)': False,\n", 1198 | " 'contains(b/c)': False,\n", 1199 | " 'contains(having)': False,\n", 1200 | " 'contains(SEA)': False,\n", 1201 | " 'contains(Like)': False,\n", 1202 | " 'contains(VirginAmerica)': False,\n", 1203 | " 'contains(💗🇬🇧💗🇺🇸💗)': False,\n", 1204 | " 'contains(taxes)': False,\n", 1205 | " 'contains(Such)': False,\n", 1206 | " 'contains(disappointing)': False,\n", 1207 | " 'contains(t.co/SLLYIBE2vQ)': False,\n", 1208 | " 'contains(come)': False,\n", 1209 | " 'contains(wanted)': False,\n", 1210 | " 'contains(might)': False,\n", 1211 | " 'contains(back)': False,\n", 1212 | " 'contains(JFK)': False,\n", 1213 | " 'contains(bin)': False,\n", 1214 | " 'contains(check-in)': False,\n", 1215 | " 'contains(wan)': False,\n", 1216 | " 'contains(AvalonHollywood)': False,\n", 1217 | " 'contains(KETR)': False,\n", 1218 | " 'contains(blast)': False,\n", 1219 | " 'contains(spotify)': False,\n", 1220 | " 
'contains(financial)': False,\n", 1221 | " 'contains(rockstars)': False,\n", 1222 | " 'contains(2/27)': False,\n", 1223 | " 'contains(Flightly)': False,\n", 1224 | " 'contains(received)': False,\n", 1225 | " 'contains(application)': False,\n", 1226 | " 'contains(If)': False,\n", 1227 | " 'contains(call/email)': False,\n", 1228 | " 'contains(BOS-FLL)': False,\n", 1229 | " 'contains(from)': False,\n", 1230 | " 'contains(hours)': False,\n", 1231 | " 'contains(59)': False,\n", 1232 | " 'contains(delighted)': False,\n", 1233 | " 'contains(Upgrade)': False,\n", 1234 | " 'contains(Not)': False,\n", 1235 | " 'contains(yet)': False,\n", 1236 | " 'contains(Baggage)': False,\n", 1237 | " 'contains(r)': False,\n", 1238 | " 'contains(Applied)': False,\n", 1239 | " 'contains(50)': False,\n", 1240 | " 'contains(complimentary)': False,\n", 1241 | " 'contains(be)': False,\n", 1242 | " 'contains(because)': False,\n", 1243 | " 'contains(Checkin)': False,\n", 1244 | " 'contains(only)': False,\n", 1245 | " 'contains(indicates)': False,\n", 1246 | " 'contains(landing)': False,\n", 1247 | " 'contains(refunding)': False,\n", 1248 | " 'contains(fares)': False,\n", 1249 | " 'contains(Really)': False,\n", 1250 | " 'contains(Reuters)': False,\n", 1251 | " 'contains(Row)': False,\n", 1252 | " 'contains(Every)': False,\n", 1253 | " 'contains(eye)': False,\n", 1254 | " 'contains(midnight)': False,\n", 1255 | " 'contains(congrats)': False,\n", 1256 | " 'contains(Oscars)': False,\n", 1257 | " 'contains(What)': False,\n", 1258 | " 'contains(user)': False,\n", 1259 | " 'contains(manage)': False,\n", 1260 | " 'contains(disruption)': False,\n", 1261 | " 'contains(BTW)': False,\n", 1262 | " 'contains(hiring)': False,\n", 1263 | " 'contains(Middle)': False,\n", 1264 | " 'contains(putting)': False,\n", 1265 | " 'contains(paying)': False,\n", 1266 | " 'contains(rescheduled)': False,\n", 1267 | " 'contains(RNP)': False,\n", 1268 | " 'contains(change)': False,\n", 1269 | " 'contains(hold)': False,\n", 1270 | " 'contains(was)': False,\n", 1271 | " 'contains(soft)': False,\n", 1272 | " 'contains(please)': False,\n", 1273 | " 'contains(ATWOnline)': False,\n", 1274 | " 'contains(t.co/hy0VrfhjHt)': False,\n", 1275 | " 'contains(non)': False,\n", 1276 | " 'contains(longer)': False,\n", 1277 | " 'contains(2-8)': False,\n", 1278 | " 'contains(leading)': False,\n", 1279 | " 'contains(faces)': False,\n", 1280 | " 'contains(continues)': False,\n", 1281 | " 'contains(response)': False,\n", 1282 | " 'contains(You)': False,\n", 1283 | " 'contains(emails)': False,\n", 1284 | " 'contains(exhausted)': False,\n", 1285 | " 'contains(Cancelled)': False,\n", 1286 | " 'contains(our)': False,\n", 1287 | " 'contains(TOMORROW)': False,\n", 1288 | " 'contains(second)': False,\n", 1289 | " 'contains(stylesheets)': False,\n", 1290 | " 'contains(Q4)': False,\n", 1291 | " 'contains(DO)': False,\n", 1292 | " 'contains(register)': False,\n", 1293 | " 'contains(Bandie)': False,\n", 1294 | " 'contains(Use)': False,\n", 1295 | " 'contains(When)': False,\n", 1296 | " 'contains(NO)': False,\n", 1297 | " 'contains(flyer)': False,\n", 1298 | " 'contains(board)': False,\n", 1299 | " \"contains('ve)\": False,\n", 1300 | " 'contains(neverflyvirginforbusiness)': False,\n", 1301 | " 'contains(SilverStatus)': False,\n", 1302 | " 'contains(broken)': False,\n", 1303 | " 'contains(butt)': False,\n", 1304 | " 'contains(Very)': False,\n", 1305 | " 'contains(posted)': False,\n", 1306 | " 'contains(minutes)': False,\n", 1307 | " 'contains(FiDiFamilies)': False,\n", 1308 | " 
'contains(missed)': False,\n", 1309 | " 'contains(DC)': False,\n", 1310 | " 'contains(permanently)': False,\n", 1311 | " 'contains(March)': False,\n", 1312 | " 'contains(sooner)': False,\n", 1313 | " 'contains(looking)': False,\n", 1314 | " 'contains(match)': False,\n", 1315 | " 'contains(completely)': False,\n", 1316 | " 'contains(Hands)': False,\n", 1317 | " 'contains(Hey)': False,\n", 1318 | " 'contains(assistance)': False,\n", 1319 | " 'contains(airplanemodewason)': False,\n", 1320 | " 'contains(get)': False,\n", 1321 | " 'contains(sorted)': False,\n", 1322 | " 'contains(blew)': False,\n", 1323 | " 'contains(somehow)': False,\n", 1324 | " 'contains(Boo)': False,\n", 1325 | " 'contains(cabin)': False,\n", 1326 | " 'contains(you)': False,\n", 1327 | " 'contains(doctor)': False,\n", 1328 | " 'contains(4:50)': False,\n", 1329 | " 'contains(rescheduling)': False,\n", 1330 | " 'contains(SFO-FLL)': False,\n", 1331 | " 'contains(told)': False,\n", 1332 | " 'contains(VAbeatsJblue)': False,\n", 1333 | " 'contains(half)': False,\n", 1334 | " 'contains(ugh)': False,\n", 1335 | " 'contains(does)': False,\n", 1336 | " 'contains(dog)': False,\n", 1337 | " 'contains(picture)': False,\n", 1338 | " 'contains(few)': False,\n", 1339 | " 'contains(distribution)': False,\n", 1340 | " 'contains(passenger)': False,\n", 1341 | " 'contains(advise)': False,\n", 1342 | " 'contains(begrudgingly)': False,\n", 1343 | " 'contains(roasted)': False,\n", 1344 | " 'contains(avail)': False,\n", 1345 | " 'contains(soon)': False,\n", 1346 | " 'contains(U)': False,\n", 1347 | " 'contains(ever)': False,\n", 1348 | " 'contains(virginmedia)': False,\n", 1349 | " 'contains(NYC-JFK)': False,\n", 1350 | " 'contains(behind)': False,\n", 1351 | " 'contains(way)': False,\n", 1352 | " 'contains(JKF)': False,\n", 1353 | " 'contains(EWR)': False,\n", 1354 | " 'contains(comfort)': False,\n", 1355 | " 'contains(2A)': False,\n", 1356 | " 'contains(recourse)': False,\n", 1357 | " 'contains(offer)': False,\n", 1358 | " 'contains(plz)': False,\n", 1359 | " 'contains(FLL)': False,\n", 1360 | " 'contains(View)': False,\n", 1361 | " 'contains(can’t)': False,\n", 1362 | " 'contains(why)': False,\n", 1363 | " 'contains(mountains)': False,\n", 1364 | " 'contains(globe)': False,\n", 1365 | " 'contains(rockstar)': False,\n", 1366 | " 'contains(possible)': False,\n", 1367 | " 'contains(LadyGaga)': False,\n", 1368 | " 'contains(dirty)': False,\n", 1369 | " 'contains(fabulous)': False,\n", 1370 | " 'contains(entertainment)': False,\n", 1371 | " 'contains(purchased)': False,\n", 1372 | " 'contains(landed)': False,\n", 1373 | " 'contains(YOU)': False,\n", 1374 | " 'contains(Dulles_Airport)': False,\n", 1375 | " 'contains(April)': False,\n", 1376 | " 'contains(as)': False,\n", 1377 | " 'contains(IAD)': False,\n", 1378 | " 'contains(mind)': False,\n", 1379 | " 'contains(being)': False,\n", 1380 | " 'contains(SuuperG)': False,\n", 1381 | " 'contains(gt)': False,\n", 1382 | " 'contains(Flighted)': False,\n", 1383 | " 'contains(RT)': False,\n", 1384 | " 'contains(status)': False,\n", 1385 | " 'contains(FCmostinnovative)': False,\n", 1386 | " 'contains(current)': False,\n", 1387 | " 'contains(vendor)': False,\n", 1388 | " 'contains(happening)': False,\n", 1389 | " 'contains(hated)': False,\n", 1390 | " 'contains(shrinerack)': False,\n", 1391 | " 'contains(iced)': False,\n", 1392 | " 'contains(Takes)': False,\n", 1393 | " 'contains(guiltypleasures)': False,\n", 1394 | " 'contains(anything)': False,\n", 1395 | " 'contains(May)': False,\n", 1396 | " 
'contains(giving)': False,\n", 1397 | " 'contains(refreshed)': False,\n", 1398 | " 'contains(subsequent)': False,\n", 1399 | " 'contains(weather)': False,\n", 1400 | " 'contains(built)': False,\n", 1401 | " 'contains(checking)': False,\n", 1402 | " 'contains(heard)': False,\n", 1403 | " 'contains(carrieunderwood)': False,\n", 1404 | " 'contains(Call)': False,\n", 1405 | " 'contains(facing)': False,\n", 1406 | " 'contains(1st)': False,\n", 1407 | " 'contains(front-end)': False,\n", 1408 | " 'contains(even)': False,\n", 1409 | " 'contains(button)': False,\n", 1410 | " 'contains(peeps)': False,\n", 1411 | " 'contains(Get)': False,\n", 1412 | " 'contains(new)': False,\n", 1413 | " 'contains(mobile)': False,\n", 1414 | " 'contains(thank)': False,\n", 1415 | " 'contains(moodlight)': False,\n", 1416 | " 'contains(mentioned)': False,\n", 1417 | " 'contains(turbulence)': False,\n", 1418 | " 'contains(cause)': False,\n", 1419 | " 'contains(eat)': False,\n", 1420 | " 'contains(We)': False,\n", 1421 | " 'contains(Gold)': False,\n", 1422 | " 'contains(reset)': False,\n", 1423 | " 'contains(Doom)': False,\n", 1424 | " 'contains(precipitation)': False,\n", 1425 | " 'contains(getting)': False,\n", 1426 | " 'contains(want)': False,\n", 1427 | " 'contains(least)': False,\n", 1428 | " 'contains(90s)': False,\n", 1429 | " 'contains(Why)': False,\n", 1430 | " 'contains(cool)': False,\n", 1431 | " 'contains(t.co/PYalebgkJt)': False,\n", 1432 | " 'contains(recent)': False,\n", 1433 | " 'contains(past)': False,\n", 1434 | " 'contains(t.co/APtZpuROp4)': False,\n", 1435 | " 'contains(apologies)': False,\n", 1436 | " 'contains(89)': False,\n", 1437 | " 'contains(SFO/LAX)': False,\n", 1438 | " 'contains(says)': False,\n", 1439 | " 'contains(t.co/2npXB6oBMr)': False,\n", 1440 | " 'contains(concerned)': False,\n", 1441 | " 'contains(dropped)': False,\n", 1442 | " 'contains(earlier)': False,\n", 1443 | " 'contains(wondering)': False,\n", 1444 | " 'contains(Wifey)': False,\n", 1445 | " 'contains(expectations)': False,\n", 1446 | " 'contains(That)': False,\n", 1447 | " 'contains(sanity)': False,\n", 1448 | " 'contains(Got)': False,\n", 1449 | " 'contains(Funny)': False,\n", 1450 | " 'contains(see.Very)': False,\n", 1451 | " 'contains(dhepburn)': False,\n", 1452 | " 'contains(910)': False,\n", 1453 | " 'contains(Gon)': False,\n", 1454 | " 'contains(follow)': False,\n", 1455 | " 'contains(white)': False,\n", 1456 | " 'contains(Having)': False,\n", 1457 | " 'contains(chat)': False,\n", 1458 | " 'contains(t.co/RHKaMx9VF5)': False,\n", 1459 | " 'contains(trust)': False,\n", 1460 | " 'contains(hour)': False,\n", 1461 | " 'contains(Valley)': False,\n", 1462 | " 'contains(imagine)': False,\n", 1463 | " 'contains(points)': False,\n", 1464 | " 'contains(luv)': False,\n", 1465 | " 'contains(gentleman)': False,\n", 1466 | " 'contains(rep)': False,\n", 1467 | " 'contains(After)': False,\n", 1468 | " 'contains(👏)': False,\n", 1469 | " 'contains(city’)': False,\n", 1470 | " 'contains(andchexmix)': False,\n", 1471 | " 'contains(more)': False,\n", 1472 | " 'contains(Angeles)': False,\n", 1473 | " 'contains(winds)': False,\n", 1474 | " 'contains(PLEASE)': False,\n", 1475 | " 'contains(month)': False,\n", 1476 | " 'contains(bill)': False,\n", 1477 | " 'contains(needs)': False,\n", 1478 | " 'contains(supposed)': False,\n", 1479 | " 'contains(kicked)': False,\n", 1480 | " 'contains(revue)': False,\n", 1481 | " 'contains(red)': False,\n", 1482 | " 'contains(882)': False,\n", 1483 | " 'contains(pairings)': False,\n", 1484 | " 'contains(883)': 
False,\n", 1485 | " 'contains(surgery)': False,\n", 1486 | " 'contains(girls)': False,\n", 1487 | " 'contains(CarrieUnderwood)': False,\n", 1488 | " 'contains(momma)': False,\n", 1489 | " 'contains(this)': True,\n", 1490 | " 'contains(Worst)': False,\n", 1491 | " 'contains(guests)': False,\n", 1492 | " 'contains(when)': False,\n", 1493 | " 'contains(SSal)': False,\n", 1494 | " 'contains(X)': False,\n", 1495 | " 'contains(hope)': False,\n", 1496 | " 'contains(YOUR)': False,\n", 1497 | " 'contains(ChrysiChrysic)': False,\n", 1498 | " 'contains(Include)': False,\n", 1499 | " 'contains(Still)': False,\n", 1500 | " 'contains(represents)': False,\n", 1501 | " 'contains(Oscars2015)': False,\n", 1502 | " 'contains(MeetTheFleet)': False,\n", 1503 | " 'contains(reply)': False,\n", 1504 | " 'contains(desk)': False,\n", 1505 | " 'contains(spend)': False,\n", 1506 | " 'contains(Thank)': False,\n", 1507 | " 'contains(due)': False,\n", 1508 | " 'contains(And)': False,\n", 1509 | " 'contains(p)': False,\n", 1510 | " 'contains(problem)': False,\n", 1511 | " 'contains(paperwork)': False,\n", 1512 | " 'contains(section)': False,\n", 1513 | " 'contains(shows)': False,\n", 1514 | " 'contains(😂)': False,\n", 1515 | " 'contains(pilots)': False,\n", 1516 | " 'contains(VirginAtlantic)': False,\n", 1517 | " 'contains(Elevate)': False,\n", 1518 | " 'contains(minimal)': False,\n", 1519 | " 'contains(doing)': False,\n", 1520 | " 'contains(severely)': False,\n", 1521 | " 'contains(day)': False,\n", 1522 | " 'contains(Was)': False,\n", 1523 | " 'contains(disappointed)': False,\n", 1524 | " 'contains(degrees)': False,\n", 1525 | " 'contains(gave)': False,\n", 1526 | " 'contains(MayweatherPacquiao)': False,\n", 1527 | " 'contains(JetBlue)': False,\n", 1528 | " 'contains(bubbly)': False,\n", 1529 | " 'contains(Arms)': False,\n", 1530 | " 'contains(watching)': False,\n", 1531 | " 'contains(VX358)': False,\n", 1532 | " 'contains(really)': False,\n", 1533 | " 'contains(anytime)': False,\n", 1534 | " 'contains(2/24)': False,\n", 1535 | " 'contains(SFOtoBOS)': False,\n", 1536 | " 'contains(TODAY)': False,\n", 1537 | " 'contains(in🇺🇸2y)': False,\n", 1538 | " 'contains(team)': False,\n", 1539 | " 'contains(gusty)': False,\n", 1540 | " 'contains(amazing)': False,\n", 1541 | " 'contains(line)': False,\n", 1542 | " 'contains(Deals)': False,\n", 1543 | " 'contains(10)': False,\n", 1544 | " 'contains(song)': False,\n", 1545 | " 'contains(unexpected)': False,\n", 1546 | " 'contains(lame)': False,\n", 1547 | " 'contains(food)': False,\n", 1548 | " 'contains(me)': True,\n", 1549 | " 'contains(done)': False,\n", 1550 | " 'contains(Race)': False,\n", 1551 | " 'contains(along)': False,\n", 1552 | " 'contains(pre-check)': False,\n", 1553 | " 'contains(Airline)': False,\n", 1554 | " 'contains(BestCrew)': False,\n", 1555 | " 'contains(weRin)': False,\n", 1556 | " 'contains(appointments)': False,\n", 1557 | " 'contains(emailed)': False,\n", 1558 | " 'contains(stranded)': False,\n", 1559 | " 'contains(said)': False,\n", 1560 | " 'contains(😃)': False,\n", 1561 | " 'contains(uncomfortable)': False,\n", 1562 | " 'contains(DM)': False,\n", 1563 | " 'contains(Lady)': False,\n", 1564 | " 'contains(Another)': False,\n", 1565 | " 'contains(round)': False,\n", 1566 | " 'contains(lost)': False,\n", 1567 | " 'contains(mention)': False,\n", 1568 | " 'contains(Monday)': False,\n", 1569 | " 'contains(t.co/vC6Keulg2J)': False,\n", 1570 | " 'contains(early)': False,\n", 1571 | " 'contains(neverflyvirgin)': False,\n", 1572 | " 'contains(forward)': False,\n", 
1573 | " 'contains(price)': False,\n", 1574 | " 'contains(Awesome)': False,\n", 1575 | " 'contains(😢)': False,\n", 1576 | " 'contains(Travelzoo)': False,\n", 1577 | " 'contains(worm”)': False,\n", 1578 | " 'contains(check)': False,\n", 1579 | " 'contains(🍷👍💺✈️)': False,\n", 1580 | " 'contains(Dallas-Austin)': False,\n", 1581 | " 'contains(monday)': False,\n", 1582 | " 'contains(Terrible)': False,\n", 1583 | " 'contains(find)': False,\n", 1584 | " 'contains(dislike)': False,\n", 1585 | " 'contains(boy)': False,\n", 1586 | " 'contains(BOS-LAS)': False,\n", 1587 | " 'contains(shaker)': False,\n", 1588 | " 'contains(updates)': False,\n", 1589 | " 'contains(no)': True,\n", 1590 | " 'contains(sneaky)': False,\n", 1591 | " 'contains(one)': False,\n", 1592 | " 'contains(OSCARS2105)': False,\n", 1593 | " 'contains(virgin)': False,\n", 1594 | " 'contains(yesterday)': False,\n", 1595 | " 'contains(inquired)': False,\n", 1596 | " 'contains(t.co/KEK5pDMGiF)': False,\n", 1597 | " 'contains(t.co/wU3LbCNcr9)': False,\n", 1598 | " 'contains(Palm)': False,\n", 1599 | " 'contains(position)': False,\n", 1600 | " 'contains(business)': False,\n", 1601 | " 'contains(rise)': False,\n", 1602 | " 'contains(better)': False,\n", 1603 | " 'contains(direct)': False,\n", 1604 | " 'contains(AmericanAir)': False,\n", 1605 | " 'contains(t.co/PxdEL1nq3l)': False,\n", 1606 | " 'contains(550)': False,\n", 1607 | " 'contains(secure)': False,\n", 1608 | " 'contains(asap)': False,\n", 1609 | " 'contains(missing)': False,\n", 1610 | " 'contains(t.co/DnStITRzWy)': False,\n", 1611 | " 'contains(tickets)': False,\n", 1612 | " 'contains(t.co/F2LFULCbQ7)': False,\n", 1613 | " 'contains(2014)': False,\n", 1614 | " 'contains(kitty)': False,\n", 1615 | " 'contains(itinerary)': False,\n", 1616 | " 'contains(innovation)': False,\n", 1617 | " 'contains(styling)': False,\n", 1618 | " 'contains(buy)': False,\n", 1619 | " 'contains(noair)': False,\n", 1620 | " 'contains(either)': False,\n", 1621 | " \"contains('ll)\": False,\n", 1622 | " 'contains(into)': False,\n", 1623 | " 'contains(selecting)': False,\n", 1624 | " 'contains(tomorrow)': False,\n", 1625 | " 'contains(Shame)': False,\n", 1626 | " 'contains(Bags)': False,\n", 1627 | " 'contains(playing)': False,\n", 1628 | " 'contains(769)': False,\n", 1629 | " 'contains(policy)': False,\n", 1630 | " 'contains(happy)': False,\n", 1631 | " 'contains(BOS)': False,\n", 1632 | " 'contains(pay)': False,\n", 1633 | " 'contains(CheapFlights)': False,\n", 1634 | " 'contains(shown)': False,\n", 1635 | " 'contains(10:50AM)': False,\n", 1636 | " 'contains(ladygaga)': False,\n", 1637 | " 'contains(Comps)': False,\n", 1638 | " 'contains(days)': False,\n", 1639 | " 'contains(smh)': False,\n", 1640 | " 'contains(Austin)': False,\n", 1641 | " 'contains(First)': False,\n", 1642 | " 'contains(biztravel)': False,\n", 1643 | " 'contains(😥)': False,\n", 1644 | " 'contains(attendant)': False,\n", 1645 | " 'contains(husband)': False,\n", 1646 | " 'contains(nonstop)': False,\n", 1647 | " 'contains(process)': False,\n", 1648 | " 'contains(name)': False,\n", 1649 | " 'contains(I’m)': False,\n", 1650 | " 'contains(2:10pm)': False,\n", 1651 | " 'contains(jessicajaymes)': False,\n", 1652 | " 'contains(confirmation)': False,\n", 1653 | " 'contains(adding)': False,\n", 1654 | " 'contains(city)': False,\n", 1655 | " 'contains(Had)': False,\n", 1656 | " 'contains(tech)': False,\n", 1657 | " 'contains(good)': False,\n", 1658 | " 'contains(seems)': False,\n", 1659 | " 'contains(t.co/tvB5zbzVhg)': False,\n", 1660 | " 
'contains(taking)': True,\n", 1661 | " 'contains(Cool)': False,\n", 1662 | " 'contains(confirmed)': False,\n", 1663 | " 'contains(mean)': False,\n", 1664 | " 'contains(someone)': False,\n", 1665 | " 'contains(spending)': False,\n", 1666 | " 'contains(lax)': False,\n", 1667 | " 'contains(Trying)': False,\n", 1668 | " 'contains(entered)': False,\n", 1669 | " 'contains(had)': False,\n", 1670 | " 'contains(assets)': False,\n", 1671 | " 'contains(t.co/rGYwJBbhm4)': False,\n", 1672 | " 'contains(0769)': False,\n", 1673 | " 'contains(remove)': False,\n", 1674 | " 'contains(LAS)': False,\n", 1675 | " 'contains(hipster)': False,\n", 1676 | " 'contains(been)': False,\n", 1677 | " 'contains(No)': False,\n", 1678 | " 'contains(guy)': False,\n", 1679 | " 'contains(7D)': False,\n", 1680 | " 'contains(Budapest)': False,\n", 1681 | " 'contains(applied)': False,\n", 1682 | " 'contains(hotel)': False,\n", 1683 | " 'contains(so)': False,\n", 1684 | " 'contains(seriously)': False,\n", 1685 | " 'contains(99)': False,\n", 1686 | " 'contains(around)': False,\n", 1687 | " 'contains(FreyaBevan_Fund)': False,\n", 1688 | " 'contains(become)': False,\n", 1689 | " 'contains(leaving)': False,\n", 1690 | " 'contains(promised)': False,\n", 1691 | " 'contains(Dulles)': False,\n", 1692 | " 'contains(4Q)': False,\n", 1693 | " 'contains(sounds)': False,\n", 1694 | " 'contains(big)': False,\n", 1695 | " 'contains(compatible)': False,\n", 1696 | " 'contains(pretty)': False,\n", 1697 | " 'contains(drink)': False,\n", 1698 | " 'contains(destroyed)': False,\n", 1699 | " 'contains(uphold)': False,\n", 1700 | " 'contains(t.co/SSUVWwkyHH)': False,\n", 1701 | " 'contains(suck)': False,\n", 1702 | " 'contains(hrs)': False,\n", 1703 | " 'contains(working)': False,\n", 1704 | " 'contains(vegan)': False,\n", 1705 | " 'contains(using)': False,\n", 1706 | " 'contains(Keep)': False,\n", 1707 | " \"contains('s)\": False,\n", 1708 | " 'contains(incubator)': False,\n", 1709 | " 'contains(access)': False,\n", 1710 | " 'contains(heyyyy)': False,\n", 1711 | " 'contains(able)': False,\n", 1712 | " 'contains(side)': False,\n", 1713 | " 'contains(two)': False,\n", 1714 | " 'contains(i)': False,\n", 1715 | " 'contains(Can)': False,\n", 1716 | " 'contains(ur)': False,\n", 1717 | " 'contains(dude)': False,\n", 1718 | " 'contains(t.co/pX8hQOKS3R)': False,\n", 1719 | " 'contains(birthday)': False,\n", 1720 | " 'contains(Congrats)': False,\n", 1721 | " 'contains(💜✈)': False,\n", 1722 | " 'contains(Springs)': False,\n", 1723 | " 'contains(iol)': False,\n", 1724 | " 'contains(most)': False,\n", 1725 | " 'contains(Sad)': False,\n", 1726 | " 'contains(advantage)': False,\n", 1727 | " 'contains(both)': False,\n", 1728 | " 'contains(expected)': False,\n", 1729 | " 'contains(6)': False,\n", 1730 | " 'contains(things)': False,\n", 1731 | " 'contains(flight🍸)': False,\n", 1732 | " 'contains(Grand)': False,\n", 1733 | " 'contains(biz)': False,\n", 1734 | " 'contains(would)': False,\n", 1735 | " 'contains(absolutely)': False,\n", 1736 | " 'contains(t.co/H952rDKTqy”)': False,\n", 1737 | " 'contains(evening)': False,\n", 1738 | " 'contains(paid)': False,\n", 1739 | " 'contains(914-329-0185)': False,\n", 1740 | " 'contains(bound)': False,\n", 1741 | " 'contains(Silicon)': False,\n", 1742 | " 'contains(G)': False,\n", 1743 | " 'contains(damaged)': False,\n", 1744 | " 'contains(adore)': False,\n", 1745 | " 'contains(fl1289)': False,\n", 1746 | " 'contains(is)': True,\n", 1747 | " 'contains(flying)': False,\n", 1748 | " 'contains(customer)': False,\n", 1749 | " 
'contains(Handle)': False,\n", 1750 | " 'contains(Waited)': False,\n", 1751 | " 'contains(Booking)': False,\n", 1752 | " 'contains(On)': False,\n", 1753 | " 'contains(Gaga)': False,\n", 1754 | " 'contains(apparently)': False,\n", 1755 | " 'contains(seat)': False,\n", 1756 | " 'contains(four)': False,\n", 1757 | " 'contains(brought)': False,\n", 1758 | " 'contains(scared)': False,\n", 1759 | " 'contains(shame)': False,\n", 1760 | " 'contains(elevate)': False,\n", 1761 | " 'contains(sunset)': False,\n", 1762 | " 'contains(t.co/1AGR9knCpf)': False,\n", 1763 | " 'contains(Help😍)': False,\n", 1764 | " 'contains(wow)': False,\n", 1765 | " 'contains(excited)': False,\n", 1766 | " 'contains(👸)': False,\n", 1767 | " 'contains(bet)': False,\n", 1768 | " 'contains(should)': False,\n", 1769 | " 'contains(guys)': False,\n", 1770 | " 'contains(normal)': False,\n", 1771 | " 'contains(Whenever)': False,\n", 1772 | " 'contains(member😒)': False,\n", 1773 | " 'contains(❤️)': False,\n", 1774 | " 'contains(AM)': False,\n", 1775 | " 'contains(Problems)': False,\n", 1776 | " 'contains(flight)': True,\n", 1777 | " 'contains(use)': False,\n", 1778 | " 'contains(iconography)': False,\n", 1779 | " 'contains(horrible)': False,\n", 1780 | " 'contains(which)': False,\n", 1781 | " 'contains(wing)': False,\n", 1782 | " 'contains(headed)': False,\n", 1783 | " 'contains(TSA)': False,\n", 1784 | " 'contains(88.9)': False,\n", 1785 | " 'contains(Without)': False,\n", 1786 | " 'contains(upgrade)': False,\n", 1787 | " 'contains(down)': False,\n", 1788 | " 'contains(couple)': False,\n", 1789 | " 'contains(full)': False,\n", 1790 | " 'contains(3)': False,\n", 1791 | " 'contains(w)': False,\n", 1792 | " 'contains(‘select)': False,\n", 1793 | " 'contains(RenttheRunway)': False,\n", 1794 | " 'contains(27)': False,\n", 1795 | " 'contains(Prince)': False,\n", 1796 | " 'contains(support)': False,\n", 1797 | " 'contains(reallytallchris)': False,\n", 1798 | " 'contains(NOW)': False,\n", 1799 | " 'contains(messages)': False,\n", 1800 | " 'contains(diehardvirgin)': False,\n", 1801 | " 'contains(went)': False,\n", 1802 | " 'contains(class)': False,\n", 1803 | " 'contains(Status)': False,\n", 1804 | " 'contains(soooo)': False,\n", 1805 | " 'contains(what)': False,\n", 1806 | " 'contains(💕💕)': False,\n", 1807 | " 'contains(money)': False,\n", 1808 | " 'contains(open)': False,\n", 1809 | " 'contains(going)': False,\n", 1810 | " 'contains(t.co/enIQg0buzj)': False,\n", 1811 | " 'contains(work)': False,\n", 1812 | " 'contains(thx)': False,\n", 1813 | " 'contains(airlines)': False,\n", 1814 | " 'contains(☺️👍)': False,\n", 1815 | " 'contains(sorrynotsorry)': False,\n", 1816 | " 'contains(tribute)': False,\n", 1817 | " 'contains(Creates)': False,\n", 1818 | " 'contains(mechanical)': False,\n", 1819 | " 'contains(tacky)': False,\n", 1820 | " 'contains(luggage)': False,\n", 1821 | " 'contains(beyond)': False,\n", 1822 | " 'contains(EVER)': False,\n", 1823 | " 'contains(arrived)': False,\n", 1824 | " 'contains(fare)': False,\n", 1825 | " 'contains(Los)': False,\n", 1826 | " 'contains(drivers)': False,\n", 1827 | " 'contains(achieves)': False,\n", 1828 | " 'contains(refund)': False,\n", 1829 | " 'contains(free)': False,\n", 1830 | " 'contains(silver)': False,\n", 1831 | " 'contains(Will)': False,\n", 1832 | " 'contains(Well)': False,\n", 1833 | " 'contains(nearly)': False,\n", 1834 | " 'contains(temperature)': False,\n", 1835 | " 'contains(na)': False,\n", 1836 | " 'contains(track)': False,\n", 1837 | " 'contains(recline)': False,\n", 1838 | " 
'contains(yall)': False,\n", 1839 | " 'contains(glad)': False,\n", 1840 | " 'contains(code)': False,\n", 1841 | " 'contains(wine)': False,\n", 1842 | " 'contains(Good)': False,\n", 1843 | " 'contains(feet)': False,\n", 1844 | " 'contains(Dallas)': False,\n", 1845 | " \"contains(didn't…but)\": False,\n", 1846 | " 'contains(1230)': False,\n", 1847 | " 'contains(job)': False,\n", 1848 | " 'contains(standby)': False,\n", 1849 | " 'contains(by)': False,\n", 1850 | " 'contains(gate)': False,\n", 1851 | " 'contains(Quick)': False,\n", 1852 | " 'contains(easy)': False,\n", 1853 | " 'contains(inflight)': False,\n", 1854 | " 'contains(SJC)': False,\n", 1855 | " 'contains(outstanding)': False,\n", 1856 | " 'contains(afford)': False,\n", 1857 | " 'contains(u)': False,\n", 1858 | " 'contains(provided)': False,\n", 1859 | " 'contains(start)': False,\n", 1860 | " 'contains(the)': False,\n", 1861 | " 'contains(help)': False,\n", 1862 | " 'contains(Soon)': False,\n", 1863 | " \"contains('re)\": False,\n", 1864 | " 'contains(worstflightever)': False,\n", 1865 | " 'contains(nicely)': False,\n", 1866 | " 'contains(skies)': False,\n", 1867 | " 'contains(touchdown)': False,\n", 1868 | " 'contains(FastCompany)': False,\n", 1869 | " 'contains(salt)': False,\n", 1870 | " 'contains(Because)': False,\n", 1871 | " 'contains(during)': False,\n", 1872 | " 'contains(shared)': False,\n", 1873 | " 'contains(desktop)': False,\n", 1874 | " 'contains(safety)': False,\n", 1875 | " 'contains(are)': False,\n", 1876 | " 'contains(t.co/5B2agFd8c4)': False,\n", 1877 | " 'contains(🙉)': False,\n", 1878 | " 'contains(welcome)': False,\n", 1879 | " 'contains(over)': False,\n", 1880 | " 'contains(super)': False,\n", 1881 | " 'contains(Site)': False,\n", 1882 | " 'contains(Easily)': False,\n", 1883 | " 'contains(reservation)': False,\n", 1884 | " 'contains(Thx)': False,\n", 1885 | " 'contains(❄️❄️❄️)': False,\n", 1886 | " 'contains(HELP)': False,\n", 1887 | " 'contains(banned)': False,\n", 1888 | " 'contains(until)': False,\n", 1889 | " 'contains(feel)': False,\n", 1890 | " 'contains(of)': False,\n", 1891 | " 'contains(Mostly)': False,\n", 1892 | " 'contains(passengers)': False,\n", 1893 | " 'contains(across)': False,\n", 1894 | " 'contains(t.co/VPqEm31XUQ)': False,\n", 1895 | " 'contains(😘)': False,\n", 1896 | " 'contains(infant)': False,\n", 1897 | " 'contains(Friday)': False,\n", 1898 | " 'contains(prefer)': False,\n", 1899 | " 'contains(t.co/oA2dRfAoQ2)': False,\n", 1900 | " 'contains(t.co/UJfS9Zi6kd)': False,\n", 1901 | " 'contains(united)': False,\n", 1902 | " 'contains(dfw-lax)': False,\n", 1903 | " 'contains(Hi)': False,\n", 1904 | " 'contains(lt)': False,\n", 1905 | " 'contains(Love/gratitude.mpower)': False,\n", 1906 | " 'contains(email)': False,\n", 1907 | " 'contains(30)': False,\n", 1908 | " 'contains(2nd)': False,\n", 1909 | " 'contains(airport)': False,\n", 1910 | " 'contains(Sentinel)': False,\n", 1911 | " 'contains(upgrades)': False,\n", 1912 | " 'contains(receive)': False,\n", 1913 | " 'contains(cross-browser)': False,\n", 1914 | " 'contains(met)': False,\n", 1915 | " 'contains(😍👌)': False,\n", 1916 | " 'contains(ball)': False,\n", 1917 | " 'contains(F)': False,\n", 1918 | " 'contains(thing)': False,\n", 1919 | " 'contains(expensive)': False,\n", 1920 | " 'contains(In)': False,\n", 1921 | " 'contains(screen)': False,\n", 1922 | " 'contains(DREAM)': False,\n", 1923 | " 'contains(24)': False,\n", 1924 | " 'contains(Greetingz)': False,\n", 1925 | " 'contains(order)': False,\n", 1926 | " 'contains(delay)': False,\n", 1927 
| " 'contains(large)': False,\n", 1928 | " 'contains(Time)': False,\n", 1929 | " 'contains(min)': False,\n", 1930 | " 'contains(9am)': False,\n", 1931 | " 'contains(did)': False,\n", 1932 | " 'contains(an)': False,\n", 1933 | " 'contains(student)': False,\n", 1934 | " 'contains(nomorevirgin)': False,\n", 1935 | " 'contains(they)': False,\n", 1936 | " 'contains(Seats)': False,\n", 1937 | " 'contains(classics)': False,\n", 1938 | " 'contains(ordered)': False,\n", 1939 | " 'contains(always)': False,\n", 1940 | " 'contains(wonked)': False,\n", 1941 | " 'contains(t.co/tZZJhuIbCH)': False,\n", 1942 | " 'contains(JPERHI)': False,\n", 1943 | " 'contains(tonite)': False,\n", 1944 | " 'contains(browsers)': False,\n", 1945 | " 'contains(area)': False,\n", 1946 | " 'contains(LOVE)': False,\n", 1947 | " 'contains(Business)': False,\n", 1948 | " 'contains(market)': False,\n", 1949 | " 'contains(interested)': False,\n", 1950 | " 'contains(tossed)': False,\n", 1951 | " 'contains(york)': False,\n", 1952 | " 'contains(people)': False,\n", 1953 | " 'contains(see)': False,\n", 1954 | " 'contains(report)': False,\n", 1955 | " 'contains(customerservice)': False,\n", 1956 | " 'contains(show)': False,\n", 1957 | " 'contains(crew)': False,\n", 1958 | " 'contains(America)': False,\n", 1959 | " 'contains(yes)': False,\n", 1960 | " 'contains(Love)': False,\n", 1961 | " 'contains(VA370)': False,\n", 1962 | " 'contains(dancing)': False,\n", 1963 | " 'contains(end)': False,\n", 1964 | " 'contains(338)': False,\n", 1965 | " 'contains(800)': False,\n", 1966 | " 'contains(713)': False,\n", 1967 | " 'contains(t.co/CnctL7G1ef)': False,\n", 1968 | " 'contains(or)': False,\n", 1969 | " 'contains(Martin)': False,\n", 1970 | " 'contains(flights)': False,\n", 1971 | " 'contains(Points)': False,\n", 1972 | " 'contains(match.Got)': False,\n", 1973 | " 'contains(links)': False,\n", 1974 | " 'contains(friends)': False,\n", 1975 | " 'contains(booked)': False,\n", 1976 | " 'contains(seatbelt)': False,\n", 1977 | " 'contains(SFO/EWR)': False,\n", 1978 | " 'contains(DCA)': False,\n", 1979 | " 'contains(route)': False,\n", 1980 | " 'contains(😁)': False,\n", 1981 | " 'contains(page)': False,\n", 1982 | " 'contains(greyed)': False,\n", 1983 | " 'contains(commercials)': False,\n", 1984 | " 'contains(premium)': False,\n", 1985 | " 'contains(beautiful)': False,\n", 1986 | " 'contains(bucks)': False,\n", 1987 | " 'contains(cold)': False,\n", 1988 | " 'contains(site)': False,\n", 1989 | " 'contains(PDX)': False,\n", 1990 | " 'contains(their)': False,\n", 1991 | " 'contains(right)': False,\n", 1992 | " 'contains(positive)': False,\n", 1993 | " 'contains(Have)': False,\n", 1994 | " 'contains(freddieawards)': False,\n", 1995 | " 'contains(where)': True,\n", 1996 | " 'contains(self-service)': False,\n", 1997 | " 'contains(video)': False,\n", 1998 | " 'contains(winning)': False,\n", 1999 | " 'contains(choppy)': False,\n", 2000 | " 'contains(amp)': False,\n", 2001 | " 'contains(t.co/EwwGi97gdx)': False,\n", 2002 | " 'contains(phone)': False,\n", 2003 | " 'contains(weeks)': False,\n", 2004 | " 'contains(claim)': False,\n", 2005 | " 'contains(three)': False,\n", 2006 | " 'contains(much)': False,\n", 2007 | " 'contains(anyone)': False,\n", 2008 | " 'contains(have)': True,\n", 2009 | " 'contains(Ca)': False,\n", 2010 | " 'contains(before)': False,\n", 2011 | " 'contains(Hats)': False,\n", 2012 | " 'contains(Baldwin)': False,\n", 2013 | " 'contains(One)': False,\n", 2014 | " 'contains(benefits)': False,\n", 2015 | " 'contains(oscars2015)': False,\n", 2016 
| " 'contains(kids)': False,\n", 2017 | " \"contains('d)\": False,\n", 2018 | " 'contains(how)': False,\n", 2019 | " 'contains(w/2)': False,\n", 2020 | " 'contains(ROCK)': False,\n", 2021 | " 'contains(every)': False,\n", 2022 | " 'contains(experience)': False,\n", 2023 | " 'contains(cross)': False,\n", 2024 | " 'contains(very)': False,\n", 2025 | " 'contains(prime)': False,\n", 2026 | " 'contains(win)': False,\n", 2027 | " 'contains(Who)': False,\n", 2028 | " 'contains(t.co/GsB2J3c4gM)': False,\n", 2029 | " 'contains(making)': False,\n", 2030 | " 'contains(away)': False,\n", 2031 | " 'contains(snow)': False,\n", 2032 | " 'contains(🌞✈)': False,\n", 2033 | " 'contains(Investor)': False,\n", 2034 | " 'contains(any)': False,\n", 2035 | " 'contains(center)': False,\n", 2036 | " 'contains(worse)': False,\n", 2037 | " 'contains(OscarsCountdown)': False,\n", 2038 | " 'contains(413)': False,\n", 2039 | " 'contains(but)': False,\n", 2040 | " 'contains(out)': False,\n", 2041 | " 'contains(Sign)': False,\n", 2042 | " 'contains(connecting)': False,\n", 2043 | " 'contains(assist)': False,\n", 2044 | " 'contains(think)': False,\n", 2045 | " 'contains(designated)': False,\n", 2046 | " 'contains(VX)': False,\n", 2047 | " 'contains(💗)': False,\n", 2048 | " 'contains(w/the)': False,\n", 2049 | " 'contains(know)': False,\n", 2050 | " 'contains(understand)': False,\n", 2051 | " 'contains(lot)': False,\n", 2052 | " 'contains(selected👎)': False,\n", 2053 | " 'contains(7AM)': False,\n", 2054 | " 'contains(delays)': False,\n", 2055 | " 'contains(like)': False,\n", 2056 | " 'contains(t.co/vhp2GtDWPk)': False,\n", 2057 | " 'contains(Must)': False,\n", 2058 | " 'contains(less)': False,\n", 2059 | " 'contains(barely)': False,\n", 2060 | " 'contains(9)': False,\n", 2061 | " 'contains(t.co/aqZWecOkk2)': False,\n", 2062 | " 'contains(hand)': False,\n", 2063 | " 'contains(intern)': False,\n", 2064 | " 'contains(together)': False,\n", 2065 | " 'contains(will)': False,\n", 2066 | " 'contains(Flight)': False,\n", 2067 | " 'contains(spruce)': False,\n", 2068 | " 'contains(website)': False,\n", 2069 | " 'contains(iPad)': False,\n", 2070 | " 'contains(Lots)': False,\n", 2071 | " 'contains(great)': False,\n", 2072 | " 'contains(on.Easy)': False,\n", 2073 | " 'contains(backtowinter)': False,\n", 2074 | " 'contains(say)': False,\n", 2075 | " 'contains(Is)': False,\n", 2076 | " 'contains(charged)': False,\n", 2077 | " 'contains(Lofty)': False,\n", 2078 | " 'contains(story)': False,\n", 2079 | " 'contains(gone)': False,\n", 2080 | " 'contains(program)': False,\n", 2081 | " 'contains(fail)': False,\n", 2082 | " 'contains(Arab)': False,\n", 2083 | " 'contains(tracking)': False,\n", 2084 | " 'contains(give)': False,\n", 2085 | " 'contains(714)': False,\n", 2086 | " 'contains(calling)': False,\n", 2087 | " 'contains(pleasecomeback)': False,\n", 2088 | " 'contains(films)': False,\n", 2089 | " 'contains(t.co/Dw5nf0ibtr)': False,\n", 2090 | " 'contains(SouthwestAir)': False,\n", 2091 | " 'contains(todays)': False,\n", 2092 | " 'contains(after)': False,\n", 2093 | " 'contains(long)': False,\n", 2094 | " 'contains(Are)': False,\n", 2095 | " 'contains(urgent)': False,\n", 2096 | " 'contains(moose)': False,\n", 2097 | " 'contains(go)': False,\n", 2098 | " 'contains(trying)': False,\n", 2099 | " 'contains(shift)': False,\n", 2100 | " 'contains(watch)': False,\n", 2101 | " 'contains(at)': False,\n", 2102 | " 'contains(among)': False,\n", 2103 | " 'contains(iPhone)': False,\n", 2104 | " 'contains(who)': False,\n", 2105 | " 'contains(MCO)': 
False,\n", 2106 | " 'contains(DTW)': False,\n", 2107 | " 'contains(Nice)': False,\n", 2108 | " 'contains(hi)': False,\n", 2109 | " 'contains(failing)': False,\n", 2110 | " 'contains(Men)': False,\n", 2111 | " 'contains(TTINAC11)': False,\n", 2112 | " 'contains(step)': False,\n", 2113 | " 'contains(graphics)': False,\n", 2114 | " 'contains(faster)': False,\n", 2115 | " 'contains(inconvenience)': False,\n", 2116 | " 'contains(helping)': False,\n", 2117 | " 'contains(web)': False,\n", 2118 | " 'contains(beats)': False,\n", 2119 | " 'contains(hahaha)': False,\n", 2120 | " 'contains(276)': False,\n", 2121 | " 'contains(PHL)': False,\n", 2122 | " 'contains(year)': False,\n", 2123 | " 'contains(fav)': False,\n", 2124 | " 'contains(demo)': False,\n", 2125 | " 'contains(drinks)': False,\n", 2126 | " 'contains(number)': False,\n", 2127 | " 'contains(sendambien)': False,\n", 2128 | " 'contains(LAX)': False,\n", 2129 | " 'contains(info)': False,\n", 2130 | " 'contains(Sat)': False,\n", 2131 | " 'contains(morning)': False,\n", 2132 | " 'contains(depart)': False,\n", 2133 | " 'contains(looks)': False,\n", 2134 | " 'contains(SoundOfMusic)': False,\n", 2135 | " 'contains(Or)': False,\n", 2136 | " 'contains(scanned)': False,\n", 2137 | " 'contains(best)': False,\n", 2138 | " 'contains(redcarpet)': False,\n", 2139 | " ...}" 2140 | ] 2141 | }, 2142 | "execution_count": 22, 2143 | "metadata": {}, 2144 | "output_type": "execute_result" 2145 | } 2146 | ], 2147 | "source": [ 2148 | "cl.extract_features('I have no idea where this flight is taking me')" 2149 | ] 2150 | }, 2151 | { 2152 | "cell_type": "markdown", 2153 | "metadata": {}, 2154 | "source": [ 2155 | "## Classifying From Within a TextBlob\n", 2156 | "\n", 2157 | "We can perform classification on the contents of a TextBlob object using an existing classifier (like the one we created earlier (named cl). The usefulness of this might seem questionable, since you can just pass a normal string to the classifier. 
2151 | { 2152 | "cell_type": "markdown", 2153 | "metadata": {}, 2154 | "source": [ 2155 | "## Classifying From Within a TextBlob\n", 2156 | "\n", 2157 | "We can perform classification on the contents of a TextBlob object using an existing classifier (like the one we created earlier, named cl). The usefulness of this might seem questionable, since you can just pass a normal string to the classifier. However, sometimes you will be doing other work with some text in the form of a blob, and then, when you need to perform classification, you won't have to go back and get the raw string.\n", 2158 | "\n", 2159 | "Using a classifier in a `TextBlob` is as easy as passing the classifier as an argument when you create the blob.\n", 2160 | "\n", 2161 | "**Note:** The classifier must be one that you have already trained.\n", 2162 | "\n", 2163 | "Let's look at a couple of examples:" 2164 | ] 2165 | },
2166 | { 2167 | "cell_type": "code", 2168 | "execution_count": 18, 2169 | "metadata": { 2170 | "collapsed": false, 2171 | "deletable": true, 2172 | "editable": true 2173 | }, 2174 | "outputs": [ 2175 | { 2176 | "data": { 2177 | "text/plain": [ 2178 | "'positive'" 2179 | ] 2180 | }, 2181 | "execution_count": 18, 2182 | "metadata": {}, 2183 | "output_type": "execute_result" 2184 | } 2185 | ], 2186 | "source": [ 2187 | "b = TextBlob('I loved the flight', classifier=cl)\n", 2188 | "b.classify()" 2189 | ] 2190 | },
2191 | { 2192 | "cell_type": "code", 2193 | "execution_count": 19, 2194 | "metadata": { 2195 | "collapsed": false, 2196 | "deletable": true, 2197 | "editable": true 2198 | }, 2199 | "outputs": [ 2200 | { 2201 | "data": { 2202 | "text/plain": [ 2203 | "'neutral'" 2204 | ] 2205 | }, 2206 | "execution_count": 19, 2207 | "metadata": {}, 2208 | "output_type": "execute_result" 2209 | } 2210 | ], 2211 | "source": [ 2212 | "b = TextBlob('I hated the flight', classifier=cl)\n", 2213 | "b.classify()" 2214 | ] 2215 | },
2216 | { 2217 | "cell_type": "markdown", 2218 | "metadata": { 2219 | "deletable": true, 2220 | "editable": true 2221 | }, 2222 | "source": [ 2223 | "Our classifier probably never encountered the words \"hate\" or \"hated\" during training. We can update our model to improve classification." 2224 | ] 2225 | },
2226 | { 2227 | "cell_type": "markdown", 2228 | "metadata": {}, 2229 | "source": [ 2230 | "## Update Existing Classifiers With New Data\n", 2231 | "\n", 2232 | "Our classifier obviously failed us when we tried to classify the string \"I hated the flight.\"\n", 2233 | "We have the option of easily updating our classifier with new data, so let's do that now." 2234 | ] 2235 | },
2236 | { 2237 | "cell_type": "code", 2238 | "execution_count": 20, 2239 | "metadata": { 2240 | "collapsed": false, 2241 | "deletable": true, 2242 | "editable": true 2243 | }, 2244 | "outputs": [ 2245 | { 2246 | "data": { 2247 | "text/plain": [ 2248 | "True" 2249 | ] 2250 | }, 2251 | "execution_count": 20, 2252 | "metadata": {}, 2253 | "output_type": "execute_result" 2254 | } 2255 | ], 2256 | "source": [ 2257 | "# new data is also a list of (text, label) tuples\n", 2258 | "# be sure the class labels are correct\n", 2259 | "updates = [('I hated flying', 'negative'), ('I hate flying', 'negative'),\n", 2260 | " ('I hate this airline', 'negative'), ('I hated the seats', 'negative')]\n", 2261 | "cl.update(updates) # this is unfortunately slow" 2262 | ] 2263 | },
2264 | { 2265 | "cell_type": "markdown", 2266 | "metadata": {}, 2267 | "source": [ 2268 | "You can ignore the output `True`.\n", 2269 | "\n", 2270 | "**Note:** If you get the error `too many values to unpack (expected 2)`, try re-running the cell where we created the train/test sets, then re-create and re-train the classifier from scratch." 2271 | ] 2272 | },
2273 | { 2274 | "cell_type": "markdown", 2275 | "metadata": {}, 2276 | "source": [ 2277 | "Now that we have updated our classifier with new data, let's see how our original sentence is classified." 2278 | ] 2279 | },
2280 | { 2281 | "cell_type": "code", 2282 | "execution_count": 21, 2283 | "metadata": { 2284 | "collapsed": false, 2285 | "deletable": true, 2286 | "editable": true 2287 | }, 2288 | "outputs": [ 2289 | { 2290 | "data": { 2291 | "text/plain": [ 2292 | "'negative'" 2293 | ] 2294 | }, 2295 | "execution_count": 21, 2296 | "metadata": {}, 2297 | "output_type": "execute_result" 2298 | } 2299 | ], 2300 | "source": [ 2301 | "# let's see how it does now using 'I hated the flight'\n", 2302 | "b = TextBlob('I hated the flight', classifier=cl) # cl has now been updated\n", 2303 | "b.classify()" 2304 | ] 2305 | },
2306 | { 2307 | "cell_type": "markdown", 2308 | "metadata": {}, 2309 | "source": [ 2310 | "And now we have the correct classification of `'negative'`.\n", 2311 | "\n", 2312 | "If you do not get the correct class, try running the update cell once more." 2313 | ] 2314 | },
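{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A classifier can also tell us how confident it is. The `prob_classify` method returns a probability distribution over the labels instead of a single label. Here is a quick sketch; the exact probabilities you see will depend on your training data and on the update above:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "collapsed": false
 },
 "outputs": [],
 "source": [
  "# inspect the full label probability distribution, not just the top label\n",
  "prob_dist = cl.prob_classify('I hated the flight')\n",
  "print(prob_dist.max())                       # the most likely label\n",
  "print(round(prob_dist.prob('negative'), 2))  # probability of 'negative'\n",
  "print(round(prob_dist.prob('positive'), 2))  # probability of 'positive'"
 ]
},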
2315 | { 2316 | "cell_type": "markdown", 2317 | "metadata": { 2318 | "deletable": true, 2319 | "editable": true 2320 | }, 2321 | "source": [ 2322 | "## Other Classifiers\n", 2323 | "\n", 2324 | "TextBlob has a number of built-in classifiers, all of which can be found in the documentation at the link below." 2325 | ] 2326 | },
2327 | { 2328 | "cell_type": "markdown", 2329 | "metadata": { 2330 | "deletable": true, 2331 | "editable": true 2332 | }, 2333 | "source": [ 2334 | "http://textblob.readthedocs.io/en/dev/api_reference.html#api-classifiers" 2335 | ] 2336 | },
2337 | { 2338 | "cell_type": "code", 2339 | "execution_count": null, 2340 | "metadata": { 2341 | "collapsed": true 2342 | }, 2343 | "outputs": [], 2344 | "source": [] 2345 | },
2346 | { 2347 | "cell_type": "markdown", 2348 | "metadata": {}, 2349 | "source": [ 2350 | "# Practice Problems\n", 2351 | "\n", 2352 | "1. Train a decision tree classifier on the first 350 tweets in the reduced set (the training set from earlier), call it something other than cl, and print/examine the tree structure using the pseudocode() method (hint: wrap the call in print())\n", 2353 | "2. Compute the accuracy on the test set [350:500] and compare it to the Naive Bayes accuracy\n", 2354 | "3. Compare the precision and recall scores for the two classifiers. Does the decision tree perform better on any of the classes? (hint: remember that these classify one item at a time)\n", 2355 | "4. Create a new “balanced” training set of 50 observations from each class and update the current Naive Bayes classifier (cl)\n", 2356 | "5. Score the updated classifier. Have the precision and recall scores improved? How about the accuracy?\n" 2357 | ] 2358 | },
2359 | { 2360 | "cell_type": "code", 2361 | "execution_count": null, 2362 | "metadata": { 2363 | "collapsed": true 2364 | }, 2365 | "outputs": [], 2366 | "source": [] 2367 | } 2368 | ],
2369 | "metadata": { 2370 | "anaconda-cloud": {}, 2371 | "kernelspec": { 2372 | "display_name": "Python [default]", 2373 | "language": "python", 2374 | "name": "python3" 2375 | }, 2376 | "language_info": { 2377 | "codemirror_mode": { 2378 | "name": "ipython", 2379 | "version": 3 2380 | }, 2381 | "file_extension": ".py", 2382 | "mimetype": "text/x-python", 2383 | "name": "python", 2384 | "nbconvert_exporter": "python", 2385 | "pygments_lexer": "ipython3", 2386 | "version": "3.5.2" 2387 | } 2388 | }, 2389 | "nbformat": 4, 2390 | "nbformat_minor": 2 2391 | } 2392 | --------------------------------------------------------------------------------