├── .gitignore ├── Clustering_TopicModeling └── Cluster_TopicModeling.ipynb ├── Intro_to_TextAnalysis ├── Intro_to_TextAnalysis.ipynb ├── Intro_to_TextAnalysis_ANSWERS.ipynb └── King_James_Bible.txt ├── LICENSE ├── NLP_NLTK ├── NLP_NLTK.ipynb ├── NLP_NLTK_Answers.ipynb └── example.txt ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | .DS_Store 3 | booksummaries.txt 4 | -------------------------------------------------------------------------------- /Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Text Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Today's workshop will address concepts in text analysis. A fundamental understanding of Python is necessary. We will cover:\n", 15 | "\n", 16 | "1. term-document model\n", 17 | "2. regex\n", 18 | "3. POS tagging\n", 19 | "3. sentiment analysis\n", 20 | "4. topic modeling\n", 21 | "5. word2vec\n", 22 | "\n", 23 | "Python packages you will need:\n", 24 | "\n", 25 | "* NLTK ( `$ pip install nltk` )\n", 26 | "* TextBlob ( `$ pip install textblob` )\n", 27 | "* gensim ( `$ pip install gensim` )" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "source": [ 36 | "## Introduction\n", 37 | "\n", 38 | "We've spent a lot of time in Python dealing with text data, and that's because text data is everywhere. It is the primary form of communication between persons and persons, persons and computers, and computers and computers. The kind of inferential methods that we apply to text data, however, are different from those applied to tabular data. \n", 39 | "\n", 40 | "This is partly because documents are typically specified in a way that expresses both structure and content using text (i.e. the document object model).\n", 41 | "\n", 42 | "Largely, however, it's because text is difficult to turn into numbers in a way that preserves the information in the document. Today, we'll talk about dominant language models in NLP and the basics of how to implement it in Python.\n", 43 | "\n", 44 | "# Part 1: The term-document model and preprocessing text\n", 45 | "\n", 46 | "The term-document model is also sometimes referred to as \"bag-of-words\" by those who don't think very highly of it. The term document model looks at language as individual communicative efforts that contain one or more tokens. The kind and number of the tokens in a document tells you something about what is attempting to be communicated, and the order of those tokens is ignored.\n", 47 | "\n", 48 | "This is the primary method still used for most text analysis, although models utilizing word embeddings are beginning to take hold. We will discuss word embeddings briefly at the end.\n", 49 | "\n", 50 | "In order to actually turn our text into a bag of words, we'll have to do some preprocessing. This is a crucial step at the beginning of any NLP project, and much of this first section will involve it.\n", 51 | "\n", 52 | "To start with, let's import NLTK and load a document from their toy corpus." 
53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "import nltk\n", 64 | "nltk.download('webtext')\n", 65 | "document = nltk.corpus.webtext.open('grail.txt').read()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "Let's see what's in this document" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "collapsed": false 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "print(document[:1000])" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "It looks like we've gotten ourselves a bit of the script from *Monty Python and the Holy Grail*. Note that when we are looking at the text, part of the structure of the document is written in tokens. For example, stage directions have been placed in brackets, and the names of the person speaking are in all caps.\n", 91 | "\n", 92 | "## Regular expressions\n", 93 | "\n", 94 | "If we wanted to read out all of the stage directions for analysis, or just King Arthur's lines, doing so in base Python string processing will be very difficult. Instead, we are going to use regular expressions. Regular expressions are a method for string manipulation that match patterns instead of bytes." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "import re\n", 106 | "snippet = document.split(\"\\n\")[8]\n", 107 | "print(snippet)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "Let's use regex to see if 'coconuts' is in `snippet`." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "re.search(r'coconuts', snippet)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "It is! As you see, it gives us the indices, which we can also get using the `span` method." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": false 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "indices = re.search(r'coconuts', snippet).span()\n", 144 | "print(indices)\n", 145 | "\n", 146 | "print(snippet[indices[0]:indices[1]])" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Just like with `str.find`, we can search for plain text. But `re` also gives us the option for searching for patterns of bytes - like only alphabetic characters." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "collapsed": false 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "re.search(r'[a-z]', snippet)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "In this case, we've told `re` to search for the first sequence of bytes that is only composed of lowercase letters between `a` and `z`. We could get the letters at the end of each sentence by including a bang at the end of the pattern." 
172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": { 178 | "collapsed": false 179 | }, 180 | "outputs": [], 181 | "source": [ 182 | "re.search(r'[a-z]!', snippet)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "There are two things happening here:\n", 190 | "\n", 191 | "1. `[` and `]` do not mean 'bracket'; they are special characters which mean 'anything of this class'\n", 192 | "2. we've only matched one letter each\n", 193 | "\n", 194 | "`re` is flexible about how you specify the number of repetitions - you can match none, some, a range, or many repetitions of a sequence or character class.\n", 195 | "\n", 196 | "character | meaning\n", 197 | "----------|--------\n", 198 | "`{x}` | exactly x repetitions\n", 199 | "`{x,y}` | between x and y repetitions\n", 200 | "`?` | 0 or 1 repetition\n", 201 | "`*` | 0 or many repetitions\n", 202 | "`+` | 1 or many repetitions" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "Part of the power of regular expressions is their special characters. Common ones that you'll see are:\n", 210 | "\n", 211 | "character | meaning\n", 212 | "----------|--------\n", 213 | "`.` | match anything except a newline\n", 214 | "`^` | match the start of a line\n", 215 | "`$` | match the end of a line\n", 216 | "`\s` | match any whitespace or newline" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "What if we wanted to grab all of Arthur's speech without grabbing the name `ARTHUR` itself?\n", 224 | "\n", 225 | "If we wanted to do this using base string manipulation, we would need to do something like:\n", 226 | "\n", 227 | "```\n", 228 | "split the document into lines\n", 229 | "create a new list of just lines that start with ARTHUR\n", 230 | "create a newer list with ARTHUR removed from the front of each element\n", 231 | "```\n", 232 | "\n", 233 | "Regex gives us a way of doing this in one line, by using something called groups. Groups are pieces of a pattern that can be ignored, negated, or given names for later retrieval.\n", 234 | "\n", 235 | "character | meaning\n", 236 | "----------|--------\n", 237 | "`(x)` | match x\n", 238 | "`(?:x)` | match x but don't capture it\n", 239 | "`(?P<x>...)` | match something and give it the name x\n", 240 | "`(?=x)` | match only if string is followed by x\n", 241 | "`(?!x)` | match only if string is not followed by x" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "re.findall(r'(?:ARTHUR: )(.+)', document)[0:10]" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "Let's break this regex down.\n", 260 | "\n", 261 | "The first group `(?:ARTHUR: )` will match but not capture all instances of the string \"ARTHUR: \". All of Arthur's lines start with this string, but we only want what follows, so this regex allows us to filter out his name.\n", 262 | "\n", 263 | "The second group `(.+)` will match anything except a newline, for 1 or many repetitions. That means this will capture whatever follows the string \"ARTHUR: \" in that same line.\n", 264 | "\n", 265 | "Because we are using `findall`, the regex engine is capturing and returning the normal groups, but not the non-capturing group.
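To see the difference, compare a capturing first group with the non-capturing one on a single line of dialogue (a quick illustrative check on a made-up snippet, rather than the full `document`):\n\n```\n# non-capturing first group: findall returns only the dialogue\nre.findall(r'(?:ARTHUR: )(.+)', 'ARTHUR: It is I, Arthur, son of Uther Pendragon.')\n# ['It is I, Arthur, son of Uther Pendragon.']\n\n# capturing first group: findall returns a (name, dialogue) tuple per match\nre.findall(r'(ARTHUR: )(.+)', 'ARTHUR: It is I, Arthur, son of Uther Pendragon.')\n# [('ARTHUR: ', 'It is I, Arthur, son of Uther Pendragon.')]\n```\n\n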
For complicated, multi-piece regular expressions, you may need to pull groups out separately. You can do this with regex names." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": { 272 | "collapsed": false 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "p = re.compile(r'(?P<name>[A-Z ]+)(?:: )(?P<line>.+)')\n", 277 | "match = re.search(p, document)\n", 278 | "print(match)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "The first named group `(?P<name>[A-Z ]+)`, which we are coincidentally calling \"name\", will search for strings consisting of upper case characters and white space.\n", 286 | "\n", 287 | "The second group `(?:: )` will match \": \" but not capture it.\n", 288 | "\n", 289 | "The third group `(?P<line>.+)` is another named group, which we are calling \"line\". This will simply get all the characters after the second group, up to a newline.\n", 290 | "\n", 291 | "To get our names, we just call the `group` method." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "print(match.group('name'))\n", 303 | "print(match.group('line'))" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "## Challenge 1: Regex parsing" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "Using the regex pattern `p` above, print the `set` of unique character names in *Monty Python*:" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": { 324 | "collapsed": false 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "# YOUR CODE HERE\n" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "You should have 84 different characters.\n", 336 | "\n", 337 | "Now use the `set` you made above to gather all dialogue into a dictionary called `char_dict`, with the keys being the character name and the value being a list of that character's lines:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "# char_dict[\"ARTHUR\"] should give you a list of strings with his dialogue\n", 349 | "# YOUR CODE HERE\n" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "char_dict[\"ARTHUR\"]" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "## Tokenizing\n", 368 | "\n", 369 | "Let's grab Arthur's speech from above, and see what we can learn about Arthur from it." 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": { 376 | "collapsed": false 377 | }, 378 | "outputs": [], 379 | "source": [ 380 | "arthur = ' '.join(char_dict[\"ARTHUR\"])\n", 381 | "snippet = arthur[1000:1100]\n", 382 | "print(snippet)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "In our model for natural language, we're interested in words. The document is currently a continuous string of bytes, which isn't ideal.\n", 390 | "\n", 391 | "The practice of pulling apart a continuous string into units is called \"tokenizing\", and it creates \"tokens\".
NLTK, the canonical library for NLP in Python, has a couple of implementations for tokenizing a string into sentences, and sentences into words." 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "collapsed": false 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "nltk.download('punkt')\n", 403 | "from nltk import word_tokenize, sent_tokenize\n", 404 | "word_tokenize(snippet)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "Look at what happened to \"didn't\". It's been separated into \"did\" and \"n't\", which is in keeping with the way contractions work in English. While we know we could just use `snippet.split()` to split on white space, or write a complicated regex, word tokenizers allow for a more accurate representation of words based on additional rules." 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "Notice that word tokenizers also separate punctuation, so unlike if we had split on whitespace, word tokenizers won't treat `there!` and `there` as different words.\n", 419 | "\n", 420 | "At this point, we can start asking questions like: what are the most common words, how many unique words are there, and what words tend to occur together." 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "collapsed": false 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "tokens = word_tokenize(arthur)\n", 432 | "len(tokens), len(set(tokens))" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "So we can see right away that Arthur is using the same words a whole bunch - on average, each unique word is used four times. This is typical of natural language. \n", 440 | "\n", 441 | "> Not necessarily that exact value, but the fact that the number of unique words in any corpus increases much more slowly than the total number of words.\n", 442 | "\n", 443 | "> A corpus with 100M tokens, for example, probably only has 100,000 unique tokens in it.\n", 444 | "\n", 445 | "For more complicated metrics, it's easier to use NLTK's classes and methods." 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "collapsed": false 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "from nltk import collocations\n", 457 | "fd = collocations.FreqDist(tokens)\n", 458 | "fd.most_common()[:10]" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "Not so interesting. This is why a common step in text analysis is to remove noise. *However*, what you deem \"noise\" is not only very important but also dependent on the project at hand. For the purposes of today, we will discuss two common categories of strings often considered \"noise\". \n", 466 | "\n", 467 | "- Punctuation: While important for sentence analysis, punctuation will get in the way of word frequency and n-gram analyses. It will also affect any clustering or topic modeling.\n", 468 | "\n", 469 | "- Stopwords: Stopwords are the most frequent words in any given language. Words like \"the\", \"a\", \"that\", etc. are not considered semantically important, and would also skew any frequency or n-gram analysis."
470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "## Challenge 2: Removing noise" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "Write a function below that takes a string as an argument and returns a list of words without punctuation or stopwords.\n", 484 | "\n", 485 | "`punctuation` is a list of punctuation strings, and we have created the list `stop_words` for you.\n", 486 | "\n", 487 | "Hint: first you'll want to remove punctuation, then tokenize, then remove stop words. Make sure you account for upper and lower case!" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": { 494 | "collapsed": true 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "def rem_punc_stop(text_string):\n", 499 | " \n", 500 | " from string import punctuation\n", 501 | " from nltk.corpus import stopwords\n", 502 | " \n", 503 | " stop_words = stopwords.words(\"english\")\n", 504 | " \n", 505 | " #YOUR CODE HERE\n", 506 | " \n", 507 | " \n", 508 | " " 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "Now we can rerun our frequency analysis without the noise:" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": { 522 | "collapsed": false 523 | }, 524 | "outputs": [], 525 | "source": [ 526 | "tokens_reduced = rem_punc_stop(arthur)\n", 527 | "fd2 = collocations.FreqDist(tokens_reduced)\n", 528 | "fd2.most_common()[:10]" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "We can also look at collocations. In NLTK these can be calculated by either pointwise mutual information or likelihood ratio. More on those measures [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)." 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": { 542 | "collapsed": false 543 | }, 544 | "outputs": [], 545 | "source": [ 546 | "#pmi\n", 547 | "measures = collocations.BigramAssocMeasures()\n", 548 | "c = collocations.BigramCollocationFinder.from_words(tokens_reduced)\n", 549 | "c.nbest(measures.pmi, 10)" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": false 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "#likelihood\n", 561 | "c.nbest(measures.likelihood_ratio, 10)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "We see here that the collocation finder is pulling out some things that have face validity. When Arthur is talking about peasants, he calls them \"bloody\" more often than not. However, collocations like \"Brother Maynard\" and \"BLACK KNIGHT\" are less informative to us, because we know that they are proper names." 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "## Part of Speech Tagging" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "Many applications require text to be in the form of a list of sentences. 
NLTK's `sent_tokenize` should do the trick:" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": { 589 | "collapsed": false 590 | }, 591 | "outputs": [], 592 | "source": [ 593 | "sents = sent_tokenize(arthur)\n", 594 | "sents[0:10]" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "A common step in the NLP pipeline is tagging for part of speech, which can help begin to rectify our \"bag of words\" approach by retaining some idea of syntax. While training a POS tagger is a workshop in itself, NLTK also provides a trained tagger for us:" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": { 608 | "collapsed": false 609 | }, 610 | "outputs": [], 611 | "source": [ 612 | "nltk.download(\"averaged_perceptron_tagger\")\n", 613 | "from nltk import pos_tag\n", 614 | "\n", 615 | "toks_and_sents = [word_tokenize(s) for s in sent_tokenize(arthur)]\n", 616 | "tagged_sents = [pos_tag(s) for s in toks_and_sents]\n", 617 | "\n", 618 | "print()\n", 619 | "print(tagged_sents[4])" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "For the POS tagset NLTK uses, see [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)." 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "## Challenge 3: POS Frequency" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "Create a frequency distribution for Arthur's parts of speech using the `nltk.FreqDist()` method. You'll need to first create a list of just the POS tags." 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": { 647 | "collapsed": false 648 | }, 649 | "outputs": [], 650 | "source": [ 651 | "tags = []\n", 652 | "\n", 653 | "# YOUR CODE HERE" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [], 663 | "source": [ 664 | "tag_fd = nltk.FreqDist(tags)\n", 665 | "tag_fd.most_common()" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "## Stemming and Lemmatizing\n", 673 | "\n", 674 | "In NLP it is often the case that the specific form of a word is not as important as the idea to which it refers. For example, if you are trying to identify the topic of a document, counting 'running', 'runs', 'ran', and 'run' as four separate words is not useful. Reducing words to their stems is a process called stemming.\n", 675 | "\n", 676 | "A popular stemming implementation is the Snowball Stemmer, which is based on the [Porter Stemmer](http://snowball.tartarus.org/algorithms/porter/stemmer.html). Its algorithm looks at word forms and does things like drop final 's's, 'ed's, and 'ing's.\n", 677 | "\n", 678 | "Just like the tokenizers, we first have to create a stemmer object with the language we are using." 
679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": null, 684 | "metadata": { 685 | "collapsed": true 686 | }, 687 | "outputs": [], 688 | "source": [ 689 | "snowball = nltk.SnowballStemmer('english')" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "Now, we can try stemming some words" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": { 703 | "collapsed": false 704 | }, 705 | "outputs": [], 706 | "source": [ 707 | "snowball.stem('running')" 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": null, 713 | "metadata": { 714 | "collapsed": false 715 | }, 716 | "outputs": [], 717 | "source": [ 718 | "snowball.stem('eats')" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": null, 724 | "metadata": { 725 | "collapsed": false 726 | }, 727 | "outputs": [], 728 | "source": [ 729 | "snowball.stem('embarassed')" 730 | ] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "Snowball is a very fast algorithm, but it has a lot of edge cases. In some cases, words with the same stem are reduced to two different stems." 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": { 743 | "collapsed": false 744 | }, 745 | "outputs": [], 746 | "source": [ 747 | "snowball.stem('cylinder'), snowball.stem('cylindrical')" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "In other cases, two different words are reduced to the same stem.\n", 755 | "\n", 756 | "> This is sometimes referred to as a 'collision'" 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": null, 762 | "metadata": { 763 | "collapsed": false 764 | }, 765 | "outputs": [], 766 | "source": [ 767 | "snowball.stem('vacation'), snowball.stem('vacate')" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "A more accurate approach is to use an English word bank like WordNet to call dictionary lookups on word forms, in a process called lemmatization.\n", 775 | "\n", 776 | "Whereas stemming just algorithmically cuts off the ends of words, lemmatization takes into account the grammatical and morphological properties of the word. More on the two [here](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)." 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": { 783 | "collapsed": false 784 | }, 785 | "outputs": [], 786 | "source": [ 787 | "nltk.download('wordnet')\n", 788 | "wordnet = nltk.WordNetLemmatizer()" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": null, 794 | "metadata": { 795 | "collapsed": false 796 | }, 797 | "outputs": [], 798 | "source": [ 799 | "wordnet.lemmatize('vacation'), wordnet.lemmatize('vacate')" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "Let's take a look at the most common lemmata in our tokenized text." 
807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": null, 812 | "metadata": { 813 | "collapsed": false 814 | }, 815 | "outputs": [], 816 | "source": [ 817 | "tok_red_lem = [snowball.stem(w) for w in tokens_reduced]\n", 818 | "fd3 = collocations.FreqDist(tok_red_lem)\n", 819 | "fd3.most_common()[:15]" 820 | ] 821 | }, 822 | { 823 | "cell_type": "markdown", 824 | "metadata": {}, 825 | "source": [ 826 | "# Part 2: High-level analysis\n", 827 | "\n", 828 | "The rest of this class will focus on high level analyses, which do most of what we just covered for you, or in one quick step. It is important to remember that it is performing the above first. To know how to correctly interpret your analysis, remember that at some point the computer decided certain things weren't important!\n", 829 | "\n", 830 | "## Sentiment\n", 831 | "\n", 832 | "Frequently, we are interested in text to learn something about the person who is speaking. One of these things we've talked about already - linguistic diversity. A similar metric was used a couple of years ago to settle the question of who has the [largest vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).\n", 833 | "\n", 834 | "> Unsurprisingly, top spots go to Canibus, Aesop Rock, and the Wu Tang Clan. E-40 is also in the top 20, but mostly because he makes up a lot of words; as are OutKast, who print their lyrics with words slurred in the actual typography\n", 835 | "\n", 836 | "Another thing we can learn is about how the speaker is feeling, with a process called sentiment analysis. Before we start, be forewarned that this is not a robust method by any stretch of the imagination. Sentiment classifiers are often trained on product reviews, which limits their ecological validity.\n", 837 | "\n", 838 | "We're going to use TextBlob because it's an easy way to work with text data, and has a built in sentiment classifier." 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": null, 844 | "metadata": { 845 | "collapsed": false 846 | }, 847 | "outputs": [], 848 | "source": [ 849 | "from textblob import TextBlob\n", 850 | "blob = TextBlob(arthur)\n", 851 | "blob.sentences[:10]" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "To check the polarity of a string, we can just iterate through Arthur's sentences. TextBlob will calculate the polarity of each sentence with `sentiment.polarity`, and we can just add it to our accumulator variable `net_pol`." 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": null, 864 | "metadata": { 865 | "collapsed": false 866 | }, 867 | "outputs": [], 868 | "source": [ 869 | "net_pol = 0\n", 870 | "for sentence in blob.sentences:\n", 871 | " pol = sentence.sentiment.polarity\n", 872 | " print(pol, sentence)\n", 873 | " net_pol += pol\n", 874 | "print()\n", 875 | "print(\"Net polarity of Arthur: \", net_pol)" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": {}, 881 | "source": [ 882 | "What's happening behind the scenes? While there are new algorithms for sentiment anaysis emerging (cf. `VADER`), most algorithms currently rely only on a `dictionary` of words and a corresponding `positive`, `negative`, or `neutral`. Based on all the words in a sentence, a value is calculated for the sentence as a whole. Not super fancy, I know. Of course, you can change the `dictionary` used in the library itself, or opt for more advanced algorithms that aim to capture context." 
883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "metadata": {}, 888 | "source": [ 889 | "## Challenge 4: Sentiment" 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "metadata": {}, 895 | "source": [ 896 | "How about we look at all characters? Create an empty list `collected_stats` and iterate through `char_dict`, calculate the net polarity of each character, and append a tuple of e.g. `(ARTHUR, 11.45)` back to `collected_stats`:" 897 | ] 898 | }, 899 | { 900 | "cell_type": "code", 901 | "execution_count": null, 902 | "metadata": { 903 | "collapsed": false 904 | }, 905 | "outputs": [], 906 | "source": [ 907 | "collected_stats = []\n", 908 | "# YOUR CODE HERE\n" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "Now `sort` this list of tuples by polarity, and print the list of characters in *Monty Python* according to their sentiment:" 916 | ] 917 | }, 918 | { 919 | "cell_type": "code", 920 | "execution_count": null, 921 | "metadata": { 922 | "collapsed": false 923 | }, 924 | "outputs": [], 925 | "source": [ 926 | "# YOUR CODE HERE\n" 927 | ] 928 | }, 929 | { 930 | "cell_type": "markdown", 931 | "metadata": {}, 932 | "source": [ 933 | "## Topic Modeling\n", 934 | "\n", 935 | "Another common NLP task is topic modeling. The math behind this is beyond the scope of this course, but the basic strategy is to represent each document as a one-dimensional array, where the indices correspond to integer ids of tokens in the document. Then, some measure of semantic similarity, like the cosine of the angle between unitized versions of the document vectors, is calculated. Finally, distinct topics are identified as leading certain groups of documents. The result is a list of `n` topics with the driving words for that topic, and a list of documents with their relation to each topic (how strongly a document fits that topic.\n", 936 | "\n", 937 | "Let's run a topic model on the characters of *Monty Python*.\n", 938 | "\n", 939 | "Luckily for us there is another Python library that takes care of the heavy lifting for us." 940 | ] 941 | }, 942 | { 943 | "cell_type": "code", 944 | "execution_count": null, 945 | "metadata": { 946 | "collapsed": false 947 | }, 948 | "outputs": [], 949 | "source": [ 950 | "from gensim import corpora, models, similarities" 951 | ] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "First we need to separate the speeches and people, but keep it ordered so we index correctly when done. For the speeches, we'll need all speech as one string, then tokenized. We also need to remove punctuation and stop words so that Python can identify important words to documents. It seems we've gotten lucky again, we already wrote *rem_punc_stop*! Finally we'll stem our tokens." 
958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": null, 963 | "metadata": { 964 | "collapsed": false 965 | }, 966 | "outputs": [], 967 | "source": [ 968 | "people = []\n", 969 | "speeches = []\n", 970 | "for k,v in char_dict.items():\n", 971 | " people.append(k)\n", 972 | " new_string = ' '.join(v) # join all dialogue pices\n", 973 | " toks = rem_punc_stop(new_string) # remove punctuation and stop words, and tokenize\n", 974 | " stems = [snowball.stem(tok) for tok in toks] # change words to stems\n", 975 | " speeches.append(stems)" 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": {}, 981 | "source": [ 982 | "First we create a gensim dictionary which will map the words in our speeches to integer IDs." 983 | ] 984 | }, 985 | { 986 | "cell_type": "code", 987 | "execution_count": null, 988 | "metadata": { 989 | "collapsed": false 990 | }, 991 | "outputs": [], 992 | "source": [ 993 | "dictionary = corpora.Dictionary(speeches)\n", 994 | "print(dictionary)" 995 | ] 996 | }, 997 | { 998 | "cell_type": "markdown", 999 | "metadata": {}, 1000 | "source": [ 1001 | "Next we will filter out words at the extremes. The `no_below` argument refers to the absolute number of documents in which a word occurs, and the `no_above` argument is a fraction of the corpus.\n", 1002 | "\n", 1003 | "Since we want to create topics around which to cluster texts, we don't want to use words that only very rarely occur, since they won't be relevant to many texts. Similarly, if a word appears in too many texts, it's probably not a very useful identifier for a subgroup of the texts, so we don't want to have it in a topic either.\n", 1004 | "\n", 1005 | "As you can imagine, in your own work you'll want to try different values to see what's best." 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": null, 1011 | "metadata": { 1012 | "collapsed": true 1013 | }, 1014 | "outputs": [], 1015 | "source": [ 1016 | "dictionary.filter_extremes(no_below=2, no_above=.70)" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | "Now we create our bag of words corpus with the `doc2bow` method. Each text is now represented as a list of tuples, in which the first item is an integer corresponding to a word, and the second item is its frequency in that text." 
1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": { 1030 | "collapsed": false 1031 | }, 1032 | "outputs": [], 1033 | "source": [ 1034 | "corpus = [dictionary.doc2bow(i) for i in speeches]\n", 1035 | "print(corpus[1])" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "markdown", 1040 | "metadata": {}, 1041 | "source": [ 1042 | "Finally we set the parameters for the [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) topic modelling (other algorithms such as [LSI](https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing) do exist, but we won't get into the differences today):" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": null, 1048 | "metadata": { 1049 | "collapsed": false 1050 | }, 1051 | "outputs": [], 1052 | "source": [ 1053 | "#we run chunks of 15 texts, and update after every 2 chunks, and make 10 passes\n", 1054 | "lda = models.LdaModel(corpus, num_topics=6, \n", 1055 | " update_every=2,\n", 1056 | " id2word=dictionary, \n", 1057 | " chunksize=15, \n", 1058 | " passes=10)\n", 1059 | "\n", 1060 | "lda.show_topics()" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "markdown", 1065 | "metadata": {}, 1066 | "source": [ 1067 | "We can use the `get_document_topics` method to get the topics for a given document:" 1068 | ] 1069 | }, 1070 | { 1071 | "cell_type": "code", 1072 | "execution_count": null, 1073 | "metadata": { 1074 | "collapsed": false 1075 | }, 1076 | "outputs": [], 1077 | "source": [ 1078 | "print(people[4])\n", 1079 | "lda.get_document_topics(corpus[4])" 1080 | ] 1081 | }, 1082 | { 1083 | "cell_type": "markdown", 1084 | "metadata": {}, 1085 | "source": [ 1086 | "Now let's iterate through our corpus and find the best matching topic for each character." 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": null, 1092 | "metadata": { 1093 | "collapsed": false 1094 | }, 1095 | "outputs": [], 1096 | "source": [ 1097 | "for i,v in enumerate(corpus):\n", 1098 | " \n", 1099 | " topic, score = max(lda.get_document_topics(v), key = lambda x:x[1])\n", 1100 | " print(\"Character: \" + people[i])\n", 1101 | " print()\n", 1102 | " print(\"Highest topic score: \" + str(score))\n", 1103 | " print()\n", 1104 | " print(\"Topic: \" + str(lda.show_topic(topic)))\n", 1105 | " print()" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "metadata": {}, 1111 | "source": [ 1112 | "## Word embeddings and word2vec" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "markdown", 1117 | "metadata": {}, 1118 | "source": [ 1119 | "Word embeddings are the first successful attempt to move away from the \"bag of words\" model of language. Instead of looking at word frequencies, and vocabulary usage, word embeddings aim to retain syntactic information. 
Generally, a word2vec model *will not* remove stopwords or punctuation, because they are vital to the model itself.\n", 1120 | "\n", 1121 | "word2vec simply changes a tokenized sentence into a vector of numbers, with each unique token being its own number.\n", 1122 | "\n", 1123 | "e.g.:\n", 1124 | "\n", 1125 | "~~~\n", 1126 | "[[\"I\", \"like\", \"coffee\", \".\"], [\"I\", \"like\", \"my\", \"coffee\", \"without\", \"sugar\", \".\"]]\n", 1127 | "~~~\n", 1128 | "\n", 1129 | "is tranformed to:\n", 1130 | "\n", 1131 | "~~~\n", 1132 | "[[43, 75, 435, 98], [43, 75, 10, 435, 31, 217, 98]]\n", 1133 | "~~~\n", 1134 | "\n", 1135 | "Notice, the \"I\"s, the \"likes\", the \"coffees\", and the \".\"s, all have the same assignment.\n", 1136 | "\n", 1137 | "The model is created by taking these numbers, and creating a high dimensional vector by mapping every word to its surrounding, creating a sort of \"cloud\" of words, where words used in a similar syntactic, and often semantic, fashion, will cluster closer together.\n", 1138 | "\n", 1139 | "One of the drawbacks of word2vec is the volume of data necessary for a decent analysis. So we will read in a copy of the King James Bible and hope it will provide enough data, it then needs to be broken into sentences and tokenized:" 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": null, 1145 | "metadata": { 1146 | "collapsed": false 1147 | }, 1148 | "outputs": [], 1149 | "source": [ 1150 | "with open(\"King_James_Bible.txt\", \"r\") as f:\n", 1151 | " bible = f.read()\n", 1152 | "\n", 1153 | "from nltk.tokenize import sent_tokenize\n", 1154 | "\n", 1155 | "bible = sent_tokenize(bible)\n", 1156 | "bible = [word_tokenize(s) for s in bible]" 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": null, 1162 | "metadata": { 1163 | "collapsed": false 1164 | }, 1165 | "outputs": [], 1166 | "source": [ 1167 | "bible[10]" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "Now we can actually train the model on the language of the Bible:" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "code", 1179 | "execution_count": null, 1180 | "metadata": { 1181 | "collapsed": false 1182 | }, 1183 | "outputs": [], 1184 | "source": [ 1185 | "import gensim\n", 1186 | "model = gensim.models.word2vec.Word2Vec(bible, size=300, window=5, min_count=5, workers=4)\n", 1187 | "model.train(bible)" 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "markdown", 1192 | "metadata": {}, 1193 | "source": [ 1194 | "Once the model is trained, we can look at how words are situated in this cloud:" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "metadata": { 1201 | "collapsed": false 1202 | }, 1203 | "outputs": [], 1204 | "source": [ 1205 | "model.most_similar('man')" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "execution_count": null, 1211 | "metadata": { 1212 | "collapsed": false 1213 | }, 1214 | "outputs": [], 1215 | "source": [ 1216 | "model.most_similar('woman')" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "We can even create little equations, so what would be a:\n", 1224 | "\n", 1225 | "KING + WOMAN - MAN = ?" 
1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "execution_count": null, 1231 | "metadata": { 1232 | "collapsed": false 1233 | }, 1234 | "outputs": [], 1235 | "source": [ 1236 | "model.most_similar(positive=['king', 'woman'], negative=['man'])" 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "markdown", 1241 | "metadata": {}, 1242 | "source": [ 1243 | "## Challenge 5: word2vec\n", 1244 | "\n", 1245 | "Play around with the word2vec model above and try to put into words exactly what the model does, and how one should interpret the results. How would you contrast this with the \"bag of words\" model?" 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "markdown", 1250 | "metadata": {}, 1251 | "source": [] 1252 | } 1253 | ], 1254 | "metadata": { 1255 | "kernelspec": { 1256 | "display_name": "Python 3", 1257 | "language": "python", 1258 | "name": "python3" 1259 | }, 1260 | "language_info": { 1261 | "codemirror_mode": { 1262 | "name": "ipython", 1263 | "version": 3 1264 | }, 1265 | "file_extension": ".py", 1266 | "mimetype": "text/x-python", 1267 | "name": "python", 1268 | "nbconvert_exporter": "python", 1269 | "pygments_lexer": "ipython3", 1270 | "version": "3.5.1" 1271 | } 1272 | }, 1273 | "nbformat": 4, 1274 | "nbformat_minor": 0 1275 | } 1276 | -------------------------------------------------------------------------------- /Intro_to_TextAnalysis/Intro_to_TextAnalysis_ANSWERS.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Text Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Challenge 1: Regex parsing" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Using the regex pattern `p` above, print the `set` of unique characters in *Monty Python*:" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "matches = re.findall(p, document)\n", 33 | "names = set([x[0] for x in matches])\n", 34 | "print(names, len(names))" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "You should have 84 different characters.\n", 42 | "\n", 43 | "Now use the `set` you made above to gather all dialogue into a character `dictionary`, with the keys being the character name and the value being a list of that character's lines:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "# char_dict[\"ARTHUR\"] should give you a list of strings with his dialogue\n", 55 | "\n", 56 | "# Solution 1\n", 57 | "\n", 58 | "char_dict = {}\n", 59 | "\n", 60 | "for name in names:\n", 61 | " lines = []\n", 62 | " for line in matches:\n", 63 | " if name == line[0]:\n", 64 | " lines.append(line[1])\n", 65 | " char_dict[name] = lines" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "# Solution 2 (list comprehension)\n", 77 | "\n", 78 | "char_dict = {}\n", 79 | "\n", 80 | "for name in names:\n", 81 | " char_dict[name] = [line[1] for line in matches if name == line[0]]" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": { 88 | "collapsed": 
true 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "# Solution 3 (dictionary comprehension)\n", 93 | "\n", 94 | "char_dict = {name: [line[1] for line in re.findall(p, document) if line[0] == name] for name in names}" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "char_dict[\"ARTHUR\"]" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## Challenge 2: Removing noise" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Write a function below that takes a string as an argument and returns that string without punctuation or stopwords (HINT: You can get a good start for a list of stopwords here: `from nltk.corpus import stopwords`)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "def rem_punc_stop(text_string):\n", 131 | " \n", 132 | " from string import punctuation\n", 133 | " from nltk.corpus import stopwords\n", 134 | "\n", 135 | " for char in punctuation:\n", 136 | " text_string = text_string.replace(char, \"\")\n", 137 | "\n", 138 | " toks = word_tokenize(text_string)\n", 139 | " toks_reduced = [x for x in toks if x.lower() not in stopwords.words('english')]\n", 140 | " \n", 141 | " return toks_reduced" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "## Challenge 3: POS Frequency" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Create a frequency distribution for Arthur's parts of speech:" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "#Solution 1\n", 167 | "\n", 168 | "tags = []\n", 169 | "\n", 170 | "for sentence in tagged_sents:\n", 171 | " \n", 172 | " for word in sentence:\n", 173 | " \n", 174 | " tags.append(word[1])\n", 175 | " \n", 176 | "tag_fd = nltk.FreqDist(tags)\n", 177 | "tag_fd.most_common()" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "#Solution 2 (list comprehensions)\n", 189 | "\n", 190 | "tag_fd = nltk.FreqDist(tag for (word, tag) in [item for sublist in tagged_sents for item in sublist])\n", 191 | "tag_fd.most_common()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## Challenge 4: Sentiment" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "How about we look at all characters? Create an empty list `collected_stats` and iterate through `char_dict`, calculate the net polarity of each character, and append a tuple of e.g. 
`(ARTHUR, 11.45)` back to `collected_stats`:" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": { 212 | "collapsed": false 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "collected_stats = []\n", 217 | "for k in char_dict.keys():\n", 218 | " blob = TextBlob(' '.join(char_dict[k]))\n", 219 | " net_pol = 0\n", 220 | " for sentence in blob.sentences:\n", 221 | " pol = sentence.sentiment.polarity\n", 222 | " net_pol += pol\n", 223 | " collected_stats.append((k, net_pol))" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Now `sort` this list of tuples by polarity, and print the list of characters in *Monty Python* according to their sentiment:" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "sorted_stats = sorted(collected_stats, key=lambda x: x[1])\n", 242 | "for t in sorted_stats:\n", 243 | " print(t[0], t[1])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "## Challenge 5: word2vec\n", 251 | "\n", 252 | "Play around with the word2vec model above and try to put into words exactly what the model does, and how one should interpret the results. How would you contrast this with the \"bag of words\" model?" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [] 259 | } 260 | ], 261 | "metadata": { 262 | "kernelspec": { 263 | "display_name": "Python 3", 264 | "language": "python", 265 | "name": "python3" 266 | }, 267 | "language_info": { 268 | "codemirror_mode": { 269 | "name": "ipython", 270 | "version": 3 271 | }, 272 | "file_extension": ".py", 273 | "mimetype": "text/x-python", 274 | "name": "python", 275 | "nbconvert_exporter": "python", 276 | "pygments_lexer": "ipython3", 277 | "version": "3.5.1" 278 | } 279 | }, 280 | "nbformat": 4, 281 | "nbformat_minor": 0 282 | } 283 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Creative Commons Attribution-NonCommercial 4.0 International Public License 3 | 4 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-NonCommercial 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. 5 | 6 | Section 1 – Definitions. 7 | 8 | Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. 
9 | Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. 10 | Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. 11 | Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. 12 | Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. 13 | Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License. 14 | Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. 15 | Licensor means the individual(s) or entity(ies) granting rights under this Public License. 16 | NonCommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. For purposes of this Public License, the exchange of the Licensed Material for other material subject to Copyright and Similar Rights by digital file-sharing or similar means is NonCommercial provided there is no payment of monetary compensation in connection with the exchange. 17 | Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. 18 | Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. 19 | You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. 20 | Section 2 – Scope. 21 | 22 | License grant. 23 | Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: 24 | reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and 25 | produce, reproduce, and Share Adapted Material for NonCommercial purposes only. 26 | Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 27 | Term. The term of this Public License is specified in Section 6(a). 
28 | Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 29 | Downstream recipients. 30 | Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. 31 | No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 32 | No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). 33 | Other rights. 34 | 35 | Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 36 | Patent and trademark rights are not licensed under this Public License. 37 | To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties, including when the Licensed Material is used other than for NonCommercial purposes. 38 | Section 3 – License Conditions. 39 | 40 | Your exercise of the Licensed Rights is expressly made subject to the following conditions. 41 | 42 | Attribution. 43 | 44 | If You Share the Licensed Material (including in modified form), You must: 45 | 46 | retain the following if it is supplied by the Licensor with the Licensed Material: 47 | identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); 48 | a copyright notice; 49 | a notice that refers to this Public License; 50 | a notice that refers to the disclaimer of warranties; 51 | a URI or hyperlink to the Licensed Material to the extent reasonably practicable; 52 | indicate if You modified the Licensed Material and retain an indication of any previous modifications; and 53 | indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 
54 | You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 55 | If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. 56 | If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License. 57 | Section 4 – Sui Generis Database Rights. 58 | 59 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: 60 | 61 | for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only; 62 | if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and 63 | You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. 64 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. 65 | Section 5 – Disclaimer of Warranties and Limitation of Liability. 66 | 67 | Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You. 68 | To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You. 69 | The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. 70 | Section 6 – Term and Termination. 71 | 72 | This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. 
73 | Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 74 | 75 | automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 76 | upon express reinstatement by the Licensor. 77 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. 78 | For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. 79 | Sections 1, 5, 6, 7, and 8 survive termination of this Public License. 80 | Section 7 – Other Terms and Conditions. 81 | 82 | The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 83 | Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. 84 | Section 8 – Interpretation. 85 | 86 | For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. 87 | To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. 88 | No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. 89 | Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. 90 | -------------------------------------------------------------------------------- /NLP_NLTK/NLP_NLTK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction\n", 8 | "\n", 9 | "Today's workshop will address various concepts in Natural Language Processing, primarily through the use of NLTK. Students should already have a fundamental understanding of Python. We'll work with a corpus of documents and learn how to identify different types of linguistic structure in the text, which can help in classifying the documents or extracting useful information from them. We'll cover:\n", 10 | "\n", 11 | "1. NLTK Corpora\n", 12 | "2. Tokenization\n", 13 | "3. Part-of-Speech (POS) Tagging\n", 14 | "4. Phrase Chunking\n", 15 | "5. Named Entity Recognition (NER)\n", 16 | "6. Dependency Parsing\n", 17 | "\n", 18 | "You will need:\n", 19 | "\n", 20 | "* NLTK (in Bash $ pip install nltk)\n", 21 | "\n", 22 | "* NLTK Book corpora and packages (In Python >>> nltk.download() )\n", 23 | "\n", 24 | "* NumPy package (in Bash $ pip install numpy)\n", 25 | "\n", 26 | "* Stanford Parser: Download Stanford Parser 3.6.0 and unzip to a location that's easy for you to find (e.g. a folder called SourceCode in your Documents folder).
Link: http://nlp.stanford.edu/software/lex-parser.shtml#Download\n", 27 | "\n", 28 | "This workshop will further help to solidfy understandings of regex and list comprehensions.\n", 29 | "\n", 30 | "Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ ." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## Motivation\n", 38 | "\n", 39 | "Why would we use natural language processing? How does it relate to other things we might be doing -- or trying to do -- with text in our research?\n", 40 | "\n", 41 | "Natural language processing is a field of computer science and linguistics; it aims to enable computers to process and derive meaning from input in human language. NLP research is being used to automate tasks like translation, question answering, voice recognition, and language generation.\n", 42 | "\n", 43 | "For social scientists and humanists, we use NLP concepts to improve our analysis of texts of \n", 44 | "interest in our research. Reasons you might use these methods include:\n", 45 | "\n", 46 | "1. You want to be able to better classify documents, or\n", 47 | "2. You want to be able to extract information from those documents.\n", 48 | "\n", 49 | "We'll set up an example of each of these two tasks, look at how well we can accomplish that task without NLP, and then see what we gain by adding each of the concepts we'll cover today." 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Task 1. Document Classification\n", 57 | "\n", 58 | "We're often interested in characterizing text from different sources, e.g. measuring the ideology of different politicians based on the language used in their speeches. A simple case would be a situation in which we have a bunch of documents that we want to label as \"positive\" or \"negative\". This is often called \"sentiment analysis\", and it can be very difficult, despite only having two categories, because sentiment is a subjective and often subtle idea.\n", 59 | "\n", 60 | "Since sentiment analysis involves human judgment about the meaning of language, we'll need to do this in a supervised manner, using training data that has already been labeled. We'll need to use a bit of machine learning for this task, but we'll use one of the existing classifiers provided by NLTK. These classifiers take a set of training documents that have already been categorized, and learn how to predict the categories of other documents. We'll use the NLTK Movie Reviews corpus for our training data, and the NLTK Naive Bayes classifer.\n", 61 | "\n", 62 | "We can't give a classifier raw text, because it wouldn't know how to use human language in its calculations. Instead, we represent each document by a vector of features (numeric or boolean values) and then the machine learns what combinations of those features fall into each category. Throughout the workshop today, we'll learn how NLP tools can help us extract different features from documents, to represent the documents with more meaningful or relevant information that might help classify them more accurately." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### Task 2. Information Extraction\n", 70 | "\n", 71 | "Sometimes, we aren't interested in characterizing the documents we've collected, we just want to use them as a source of information about something else. 
For instance, I might not care *how* different news media outlets describe or articulate elections, I just want to figure out how many instances of elections occurred in a particular place or time. Or I might want to discover what people or groups were involved in elections -- e.g. who voted, who won.\n", 72 | "\n", 73 | "For this type of task, the simplest approach might just be a keyword search, looking for all of the news articles with the word \"election\" (or maybe \"elected\" or \"voted\") and then pulling out the articles or sentences in which those keywords appear. But that seems very blunt, we don't know if an appearance of a word really means an election has occurred, and a keyword search won't help us isolate the names of entities involved.\n", 74 | "\n", 75 | "Today, we'll learn how NLP tools can help us identify different actions and actors in a text, to get closer to being able to extract instances of events and the actors that filled specific roles in those events. This will be an exploratory task; we don't yet have annotated training data to allow us to test the accuracy of our automated extraction process. But we can compare different approaches to see what different kinds of information we might be able to identify." 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## 1) Starting with a Corpus\n", 83 | "\n", 84 | "We can use NLTK on strings, lists or dictionaries of strings, or files containing text. We call the overall body of texts we're working with a \"corpus\", which is a collection of written documents or texts (plural \"corpora\"). You might have a single .csv file of sentences, titles, tweets, etc. that you want to read in to Python all at once and then analyze. If you have your documents in different files, however, NLTK provides a class called PlainTextCorpusReader for working with a corpus as a group of text files. We can declare an NLTK corpus object containing all text files in the current working directory or subdirectories as follows:" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "from nltk.corpus import PlaintextCorpusReader\n", 96 | "\n", 97 | "corpus_root = \"\" # relative path, i.e. 
current working directory\n", 98 | "my_corpus = PlaintextCorpusReader(corpus_root, '.*txt')" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "We can list all of the files we've just included in the corpus as follows:" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "my_corpus.fileids()" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "We can read in the contents of a file in our corpus as a string using the .raw() method:" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "my_corpus.raw('example.txt')" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "We can also extract either all the words, or sentences and their words, as lists of strings:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "my_corpus.words('example.txt')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "sents = my_corpus.sents('example.txt')\n", 164 | "print(sents)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "source": [ 173 | "NLTK comes with a variety of downloadable corpora that can be used for trying out the methods in the toolkit. We'll use two of these corpora in just a bit, when we set up our practical tasks." 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "## 2) Tokenization\n", 181 | "\n", 182 | "We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call \"tokens\". An NLTK corpus object does this for us, allowing us to read in lists of words from our text files. But NLTK also provides tools that enable us to \"tokenize\" strings ourselves, if we've read them in in longer form.\n", 183 | "\n", 184 | "To understand how to pre-process raw text, let's read in the contents of 'example.txt' in your current directory, using the .raw() method to get one long string." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": { 191 | "collapsed": false 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "example_text = my_corpus.raw('example.txt')\n", 196 | "print(example_text)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "Now, you might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words." 
204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "example_sents = example_text.split('.')\n", 215 | "example_sents_toks = [sent.split(' ') for sent in example_sents]\n", 216 | "\n", 217 | "for sent_toks in example_sents_toks:\n", 218 | " print(sent_toks)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "This doesn't look right. Not all periods divide sentences (periods may also be used in abbreviations), and not all sentences end in a period (some end in question marks or exclamation points). Words might also be separated not only by single spaces, but by tabs or newlines. We can use the 're' package's split method with regular expressions that capture these various possibilities:" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": { 232 | "collapsed": false 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "import re\n", 237 | "\n", 238 | "example_sents = re.split('(?<=[a-z])[.?!]\s', example_text)\n", 239 | "example_sents_toks = [re.split('\s+', sent) for sent in example_sents]\n", 240 | "\n", 241 | "for sent_toks in example_sents_toks:\n", 242 | " print(sent_toks)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "This looks better, though we've lost the punctuation at the end of each sentence, except for the period at the end of the string (since we only split sentences on a period followed by white space). That last period has remained attached to the word 'out', since we only split words on white space. We could instead use 're.findall()' to search for all sequences of alphanumeric characters. This would split apart contractions, which might be useful if we want to consider 'I' and ''m' (short for 'am') to represent separate words.\n", 250 | "\n", 251 | "We'll stop there, because NLTK provides handy classes to do this for us:" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": { 258 | "collapsed": false 259 | }, 260 | "outputs": [], 261 | "source": [ 262 | "import nltk\n", 263 | "\n", 264 | "example_sents = nltk.sent_tokenize(example_text)\n", 265 | "example_sents_toks = [nltk.word_tokenize(sent) for sent in example_sents]\n", 266 | "\n", 267 | "for sent_toks in example_sents_toks:\n", 268 | " print(sent_toks)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "These lists of tokens are what we get with the words() method applied to an NLTK corpus. We'll work with the NLTK corpora from here on, but now you know how to turn your own documents into lists of words, either by creating an NLTK corpus object containing your own text files, or by reading in longer strings and then using NLTK functions to tokenize them." 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Putting into Practice" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "### Task 1: Classifying Documents\n", 290 | "\n", 291 | "Now we're ready to set up our first task, sentiment analysis (i.e. classifying documents as positive or negative). For this task, we'll use the NLTK Movie Reviews corpus, which contains 2,000 movie reviews already categorized as positive or negative."
292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "from nltk.corpus import movie_reviews\n", 303 | "\n", 304 | "movie_reviews.categories()" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": { 311 | "collapsed": false 312 | }, 313 | "outputs": [], 314 | "source": [ 315 | "movie_reviews.fileids()[:10]" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "Let's imagine that we have much less training data, though, which might be more realistic. This will make it easier to compare different options, since we don't expect the accuracy to be very high at first. And it'll also shorten the time it takes to process the text, since we don't have a lot of time in a workshop setting. Let's just load the first 200 documents from each of the two categories (negative and positive). We'll need to store each document as a list of tokens, in a tuple with its category. Then we'll shuffle the documents so that the negative and positive ones are interspersed." 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "collapsed": true 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "import random\n", 334 | "\n", 335 | "docs_tuples = [(movie_reviews.words(fileid), category)\n", 336 | " for category in movie_reviews.categories()\n", 337 | " for fileid in movie_reviews.fileids(category)[:200]]\n", 338 | "\n", 339 | "random.shuffle(docs_tuples)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "For our initial set of classification features, we'll look for whether certain words appear in each review (regardless of word order, etc). We could use a pre-defined list of words like \"good\" and \"bad\", or we can select a list of words from the corpus. We'll create a list of the most frequent words in the corpus, using the NLTK class FreqDist. It takes a list of words and counts each word's frequency, which can then be sorted from largest to smallest using the method most_common(). We'll take the top 1000 most common words from the 400 documents we read in." 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": { 353 | "collapsed": false 354 | }, 355 | "outputs": [], 356 | "source": [ 357 | "movie_words = [word.lower() for (wordlist, cat) in docs_tuples for word in wordlist]\n", 358 | "all_wordfreqs = nltk.FreqDist(movie_words)\n", 359 | "top_wordfreqs = all_wordfreqs.most_common()[:1000]\n", 360 | "print(top_wordfreqs[:10])" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": { 367 | "collapsed": false 368 | }, 369 | "outputs": [], 370 | "source": [ 371 | "feature_words = [x[0] for x in top_wordfreqs]\n", 372 | "print(feature_words[:25])" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "To use these words as document features for classification, we need to define a function that takes each document (as a list of tokens) and returns a set of features representing which words are in that document. NLTK requires us to provide each document's features as a dictionary object, in which feature names are paired with numeric values. Let's make each word feature a 1 or 0, depending on whether that word appears in the document or not. 
We'll do this for all 1000 of the top words in the corpus, which is now in our 'feature_words' list (so that each document has the same set features). We'll create feature names of the form \"contains(x)\" for each word x in feature_words." 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": true 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "def document_features(doc_toks):\n", 391 | " document_words = set(doc_toks)\n", 392 | " features = {}\n", 393 | " for word in feature_words:\n", 394 | " features['contains({})'.format(word)] = 1 if word in document_words else 0\n", 395 | " return features" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Then we'll use our document_features() function on each document's word list, and store a new tuple with the resulting features and the document's category in a list of feature sets. (This is the format we provide to an NLTK classifier: a list of document tuples, each tuple containing a dictionary object of features plus a single value category or label for that document.)" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": { 409 | "collapsed": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Then we split the documents into a training set and a test set so we can see how well we do. We'll use the first 300 documents for training and the last 100 documents for test." 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "collapsed": true 428 | }, 429 | "outputs": [], 430 | "source": [ 431 | "from nltk import NaiveBayesClassifier\n", 432 | "\n", 433 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 434 | "classifier = NaiveBayesClassifier.train(train_set)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "nltk.classify.accuracy(classifier, test_set)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "The NLTK classifier also lets us look at the most informative features (i.e. the words that were most useful for classifying the documents as positive or negative), and sure enough, we see some very positive and very negative words." 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": { 459 | "collapsed": false 460 | }, 461 | "outputs": [], 462 | "source": [ 463 | "classifier.show_most_informative_features(10)" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "### Task 2. Information Extraction\n", 471 | "\n", 472 | "News articles are often used for information extraction task, since they provide information about major events and actors of interest fairly consistently over time. For this task we'll use the NLTK Brown Corpus. The Brown Corpus contains text from 500 sources, categorized by genre, such as news, editorial, and so on. The full list of files in each genre is available here: http://clu.uni.no/icame/brown/bcm-los.html. Let's look at the genres available, and the files in the 'news' genre." 
473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": { 479 | "collapsed": false 480 | }, 481 | "outputs": [], 482 | "source": [ 483 | "from nltk.corpus import brown\n", 484 | "\n", 485 | "print(brown.categories())" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": { 492 | "collapsed": false 493 | }, 494 | "outputs": [], 495 | "source": [ 496 | "print(brown.fileids(categories='news'))" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "Let's read in a list of sentences for each news document in the Brown Corpus. We can print out the first couple sentences from the first document to make sure it looks ok." 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "metadata": { 510 | "collapsed": false 511 | }, 512 | "outputs": [], 513 | "source": [ 514 | "news_docs = [brown.sents(fileid) for fileid in brown.fileids(categories='news')]\n", 515 | "print(news_docs[0][:2])" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "A simple way to approach information extraction might be to extract all sentences that contain certain keywords, e.g. sentences containing the word \"election\"." 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [], 532 | "source": [ 533 | "elect_sents = []\n", 534 | "for doc in news_docs:\n", 535 | " for sent in doc:\n", 536 | " if 'election' in sent:\n", 537 | " elect_sents.append(sent)\n", 538 | " \n", 539 | "len(elect_sents)" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": { 546 | "collapsed": false 547 | }, 548 | "outputs": [], 549 | "source": [ 550 | "print(elect_sents[:2])" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "That's getting some relevant sentences, but it seems like there should be more. We could instead create a regular expression that would match either the root \"elect\" or \"vote\" with any ending (e.g. \"elected\", \"elects\", etc). Then we'd need to look for a match to this regular expression for each token in each sentence." 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": null, 563 | "metadata": { 564 | "collapsed": false 565 | }, 566 | "outputs": [], 567 | "source": [ 568 | "elect_regexp = 'elect|vote'\n", 569 | "\n", 570 | "elect_sents = []\n", 571 | "for doc in news_docs:\n", 572 | " for sent in doc:\n", 573 | " for tok in sent:\n", 574 | " if re.match(elect_regexp, tok):\n", 575 | " elect_sents.append(sent)\n", 576 | " break # Break out of the token for loop, so we only add the sentence once\n", 577 | " \n", 578 | "len(elect_sents)" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "## 3) Part-of-Speech Tagging\n", 586 | "\n", 587 | "One of the fundamental aspects of words that we can use to begin to understand them is their parts of speech -- whether a word is a noun, verb, adjective, etc. Labeling each word in a sequence with its part of speech is called \"part-of-speech tagging\" or \"POS tagging\". 
A part of speech represents a syntactic function; the aim here is to identify the grammatical components of a sentence.\n", 588 | "\n", 589 | "Some parts of speech are easier to spot than others, because they follow certain morphological patterns (e.g. verb endings). We can use regular expressions to find these recognizable words individually:" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": null, 595 | "metadata": { 596 | "collapsed": true 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "patterns = [\n", 601 | " (r'.*ing$', 'VBG'), # gerunds\n", 602 | " (r'.*ed$', 'VBD'), # simple past\n", 603 | " (r'.*es$', 'VBZ'), # 3rd singular present\n", 604 | " (r'.*ould$', 'MD'), # modals\n", 605 | " (r'.*\\'s$', 'NN$'), # possessive nouns\n", 606 | " (r'.*s$', 'NNS'), # plural nouns\n", 607 | " (r'.*ly', 'RB'), # adverbs\n", 608 | " (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers\n", 609 | " (r'.*', 'NN') # nouns (default)\n", 610 | " ]" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "NLTK has a regular expression tagger that we can use, providing it with our own patterns. Let's see how many words we're able to tag in our example text." 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "from nltk import RegexpTagger\n", 629 | "\n", 630 | "regexp_tagger = RegexpTagger(patterns)\n", 631 | "\n", 632 | "sent = nltk.word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n", 633 | "sent_tagged = regexp_tagger.tag(sent)\n", 634 | "\n", 635 | "print(sent_tagged)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": {}, 641 | "source": [ 642 | "That didn't work so well, but no problem, this was a very naïve attempt. We can evaluate the accuracy nonetheless, using the POS-tagged Brown corpus:" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": null, 648 | "metadata": { 649 | "collapsed": false 650 | }, 651 | "outputs": [], 652 | "source": [ 653 | "brown_tagged_sents = brown.tagged_sents(categories='news')\n", 654 | "regexp_tagger.evaluate(brown_tagged_sents)" 655 | ] 656 | }, 657 | { 658 | "cell_type": "markdown", 659 | "metadata": { 660 | "collapsed": true 661 | }, 662 | "source": [ 663 | "Some words may have different parts of speech depending on their context. In the sentence two fields above, the words \"refuse\" and \"permit\" both appear as verbs and as nouns. We can think of sequences that might indicate parts of speech. For instance, a word after \"the\" or \"an\" is more likely to be a noun, while a word after \"did\" or \"does\" is more likely to be a verb.\n", 664 | "\n", 665 | "State-of-the-art POS taggers rely on probabilistic models and machine learning to tag tokens sequentially, or jointly across the whole sentence at the same time (finding the most likely combination of tags that make sense in relation to each other). Fortunately, you don't need to do this yourself. 
NLTK comes with an off-the-shelf POS tagger for English language text:" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": { 672 | "collapsed": false 673 | }, 674 | "outputs": [], 675 | "source": [ 676 | "sent = nltk.word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n", 677 | "sent_tagged = nltk.pos_tag(sent)\n", 678 | "print(sent_tagged)" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "Some corpora come already tagged. If we read in the Brown Corpus in raw text format (rather than tokenized), we'll actually see pairs of tokens and tags, separated by a '/'." 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": { 692 | "collapsed": false 693 | }, 694 | "outputs": [], 695 | "source": [ 696 | "news_raw = brown.raw('ca01').strip()\n", 697 | "print(news_raw[:200])" 698 | ] 699 | }, 700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "We can read these tagged tokens in as a list of tuples using the corpus object's tagged_words() method:" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": null, 710 | "metadata": { 711 | "collapsed": false 712 | }, 713 | "outputs": [], 714 | "source": [ 715 | "news_tagged = brown.tagged_words('ca01')\n", 716 | "print(news_tagged[:10])" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "Different POS taggers use different tags (collectively a \"tagset\"). The NLTK pos_tagger uses the Penn Treebank POS tagset. Good documentation can be found here: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html.\n", 724 | "\n", 725 | "The Brown Corpus is tagged with a different tagset, but you can see the similarities." 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": {}, 731 | "source": [ 732 | "## Putting into Practice" 733 | ] 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "### Task 1. Classifying Documents\n", 740 | "\n", 741 | "Part-of-speech tags may be useful in document classification. For sentiment analysis in particular, we might care more about adjectives than about nouns and verbs, and probably much more than about articles or prepositions.\n", 742 | "\n", 743 | "Let's run the code we used in the first version of Task 1, then modify it to only use adjectives as the features for each document. 
Here's the original code:" 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": { 750 | "collapsed": false 751 | }, 752 | "outputs": [], 753 | "source": [ 754 | "# Read in a list of document (wordlist, category) tuples, and shuffle\n", 755 | "docs_tuples = [(movie_reviews.words(fileid), category)\n", 756 | " for category in movie_reviews.categories()\n", 757 | " for fileid in movie_reviews.fileids(category)[:200]]\n", 758 | "random.shuffle(docs_tuples)\n", 759 | "\n", 760 | "# Create a list of the most frequent words in the entire corpus\n", 761 | "movie_words = [word.lower() for (wordlist, cat) in docs_tuples for word in wordlist]\n", 762 | "all_wordfreqs = nltk.FreqDist(movie_words)\n", 763 | "top_wordfreqs = all_wordfreqs.most_common()[:1000]\n", 764 | "feature_words = [x[0] for x in top_wordfreqs]\n", 765 | "\n", 766 | "# Define a function to extract features of the form containts(word) for each document\n", 767 | "def document_features(doc_toks):\n", 768 | " document_words = set(doc_toks)\n", 769 | " features = {}\n", 770 | " for word in feature_words:\n", 771 | " features['contains({})'.format(word)] = 1 if word in document_words else 0\n", 772 | " return features\n", 773 | "\n", 774 | "# Create feature sets of document (features, category) tuples\n", 775 | "featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]\n", 776 | "\n", 777 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 778 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 779 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 780 | "print(nltk.classify.accuracy(classifier, test_set))\n", 781 | "print(classifier.show_most_informative_features(10))" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "Now write a modified version of the section that builds a list of feature words from the corpus. Run the POS tagger on the full list of movie_words from the corpus, and pull out only the ones tagged as an adjective ('JJ'). Then create a list of the 1000 most frequent adjectives, and assign that to feature_words.\n", 789 | "\n", 790 | "(We're leaving the first step the way we already did it above, so we only shuffle the document tuples once and leave them in the same order for both versions. Then we can test how well the classifier worked, when trained and tested on the same subset of documents.)" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": null, 796 | "metadata": { 797 | "collapsed": false 798 | }, 799 | "outputs": [], 800 | "source": [ 801 | "# Create a list of the most frequent adjectives in the entire corpus\n" 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": {}, 807 | "source": [ 808 | "We'll leave the rest of the code the same (as long as our new list of adjectives is called \"feature_words\" again). All we've changed is the words we're looking for in each document. Run this code again and see what happens to the accuracy and informative features." 
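For reference, here is one possible way to build the adjective list, shown only as a sketch (not necessarily the intended solution). It assumes movie_words and nltk are already defined in the cells above; adj_wordfreqs is just an illustrative name, and tagging the full word list may take a minute or two.

```python
# Sketch: tag the flat movie_words list and keep only tokens tagged as adjectives ('JJ').
# Assumes movie_words and nltk are defined in the earlier cells.
movie_words_tagged = nltk.pos_tag(movie_words)
adjectives = [word for (word, tag) in movie_words_tagged if tag == 'JJ']

# Count adjective frequencies and keep the 1000 most common as the new feature_words.
adj_wordfreqs = nltk.FreqDist(adjectives)
feature_words = [w for (w, freq) in adj_wordfreqs.most_common()[:1000]]
print(feature_words[:25])
```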
809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "metadata": { 815 | "collapsed": false 816 | }, 817 | "outputs": [], 818 | "source": [ 819 | "# Define a function to extract features of the form containts(word) for each document\n", 820 | "def document_features(doc_toks):\n", 821 | " document_words = set(doc_toks)\n", 822 | " features = {}\n", 823 | " for word in feature_words:\n", 824 | " features['contains({})'.format(word)] = 1 if word in document_words else 0\n", 825 | " return features\n", 826 | "\n", 827 | "# Create feature sets of document (features, category) tuples\n", 828 | "featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]\n", 829 | "\n", 830 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 831 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 832 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 833 | "print(nltk.classify.accuracy(classifier, test_set))\n", 834 | "print(classifier.show_most_informative_features(10))" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "### Task 2. Information Extraction\n", 842 | "\n", 843 | "Part-of-speech tags might also help us find the entities involved in elections. The code below is what we used to extract full sentences if they contained an election-related keyword." 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": null, 849 | "metadata": { 850 | "collapsed": false 851 | }, 852 | "outputs": [], 853 | "source": [ 854 | "# Read in all news docs as a list of sentences, each sentence a list of tokens\n", 855 | "news_docs = [brown.sents(fileid) for fileid in brown.fileids(categories='news')]\n", 856 | "\n", 857 | "# Create regular expression to search for election-related words\n", 858 | "elect_regexp = 'elect|vote'\n", 859 | "\n", 860 | "# Loop through documents and extract each sentence containing an election-related word\n", 861 | "elect_sents = []\n", 862 | "for doc in news_docs:\n", 863 | " for sent in doc:\n", 864 | " for tok in sent:\n", 865 | " if re.match(elect_regexp, tok):\n", 866 | " elect_sents.append(sent)\n", 867 | " break # Break out of last for loop, so we only add the sentence once\n", 868 | " \n", 869 | "len(elect_sents)" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "See if you can write new code that will only extract the nouns from each sentence, if the sentence contains a keyword related to elections." 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": null, 882 | "metadata": { 883 | "collapsed": false 884 | }, 885 | "outputs": [], 886 | "source": [ 887 | "# Loop through docs and extract nouns from each sentence containing an election-related word\n" 888 | ] 889 | }, 890 | { 891 | "cell_type": "markdown", 892 | "metadata": {}, 893 | "source": [ 894 | "Now see if you can write code that will only extract the nouns from each sentence, if that sentence contains a *verb* that's related to elections." 
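If you get stuck on this verb-based version, here is one possible sketch (not necessarily the intended solution). It assumes news_docs, elect_regexp, re, and nltk from the earlier cells; elect_verb_nouns is just an illustrative name, and POS-tagging the candidate sentences may take a little while.

```python
# Sketch: keep a sentence's nouns only if some token both matches elect_regexp
# and is tagged as a verb ('VB...'). Assumes news_docs, elect_regexp, re, nltk from above.
elect_verb_nouns = []
for doc in news_docs:
    for sent in doc:
        # Skip sentences with no election-related token at all, so we tag fewer sentences
        if not any(re.match(elect_regexp, tok) for tok in sent):
            continue
        sent_tagged = nltk.pos_tag(sent)
        if any(re.match(elect_regexp, tok) and tag.startswith('VB')
               for (tok, tag) in sent_tagged):
            nouns = [tok for (tok, tag) in sent_tagged if tag.startswith('NN')]
            elect_verb_nouns.append(nouns)

print(len(elect_verb_nouns))
print(elect_verb_nouns[:2])
```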
895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": null, 900 | "metadata": { 901 | "collapsed": false 902 | }, 903 | "outputs": [], 904 | "source": [ 905 | "# Loop through docs and extract nouns from each sentence containing an election-related verb\n" 906 | ] 907 | }, 908 | { 909 | "cell_type": "markdown", 910 | "metadata": {}, 911 | "source": [ 912 | "## 4) Phrase Chunking\n", 913 | "\n", 914 | "We may want to work with larger segments of text than single words (but still smaller than a sentence). For instance, in the sentence \"The black cat climbed over the tall fence\", we might want to treat \"The black cat\" as one thing (the subject), \"climbed over\" as a distinct act, and \"the tall fence\" as another thing (the object). The first and third sequences are noun phrases, and the second is a verb phrase.\n", 915 | "\n", 916 | "We can separate these phrases by \"chunking\" the sentence, i.e. splitting it into larger chunks than individual tokens. This is also an important step toward identifying entities, which are often represented by more than one word. You can probably imagine certain patterns that would define a noun phrase, using part of speech tags. For instance, a determiner (e.g. an article like \"the\") could be concatenated onto the noun that follows it. If there's an adjective between them, we can include that too.\n", 917 | "\n", 918 | "To define rules about how to structure words based on their part of speech tags, we use a grammar (in this case, a \"chunk grammar\"). NLTK provides a RegexpParser that takes as input a grammar composed of regular expressions. The grammar is defined as a string, with one line for each rule we define. Each rule starts with the label we want to assign to the chunk (e.g. NP for \"noun phrase\"), followed by a colon, then an expression in regex-like notation that will be matched to tokens' POS tags.\n", 919 | "\n", 920 | "We can define a single rule for a noun phrase like this. The rule allows 0 or 1 determiner, then 0 or more adjectives, and finally at least 1 noun. (By using 'NN.*' as the last POS tag, we can match 'NN', 'NNP' for a proper noun, or 'NNS' for a plural noun.) If a matching sequence of tokens is found, it will be labeled 'NP'." 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": { 927 | "collapsed": true 928 | }, 929 | "outputs": [], 930 | "source": [ 931 | "grammar = \"NP: {
<DT>?<JJ>*<NN.*>+}\"" 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "We create a chunk parser object by supplying this grammar, then use it to parse a sentence into chunks. The sentence we want to parse must already be POS-tagged, since our grammar uses those POS tags to identify chunks. Let's try this on the first sentence in the election-related sentences we just extracted." 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": null, 944 | "metadata": { 945 | "collapsed": false 946 | }, 947 | "outputs": [], 948 | "source": [ 949 | "from nltk import RegexpParser\n", 950 | "\n", 951 | "cp = RegexpParser(grammar)\n", 952 | "\n", 953 | "sent = elect_sents[0]\n", 954 | "sent_tagged = nltk.pos_tag(sent)\n", 955 | "sent_chunked = cp.parse(sent_tagged)\n", 956 | "\n", 957 | "print(sent_chunked)" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "metadata": {}, 963 | "source": [ 964 | "When we called print() on this chunked sentence, it printed out a nested list of nodes. Some are phrases (labeled 'NP'), and others that didn't get chunked into a phrase are just the original tagged tokens (e.g. the verbs).\n", 965 | "\n", 966 | "The chunked sentence is actually an NLTK tree object, which we can confirm by calling type() on the output from the RegexpParser:" 967 | ] 968 | }, 969 | { 970 | "cell_type": "code", 971 | "execution_count": null, 972 | "metadata": { 973 | "collapsed": false 974 | }, 975 | "outputs": [], 976 | "source": [ 977 | "type(sent_chunked)" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": {}, 983 | "source": [ 984 | "The tree object has a number of methods we can use to interact with its components. For instance, we can use the method draw() to see a more graphical representation. This will open a separate window.\n", 985 | "\n", 986 | "The tree is pretty flat, because we defined a grammar that only grouped words into non-overlapping noun phrases, with no additional hierarchy above them. This is sometimes referred to as \"shallow parsing\". We'll get to more complex parsing later." 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": null, 992 | "metadata": { 993 | "collapsed": false 994 | }, 995 | "outputs": [], 996 | "source": [ 997 | "sent_chunked.draw()" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": {}, 1003 | "source": [ 1004 | "Since the tree is essentially flat, if we want to move through the chunks and look at certain phrases, we can use a 'for' loop to iterate through all of the nodes in the order they were printed above. Some of the nodes are themselves NLTK tree objects, containing the noun phrases we chunked. Other nodes are just (token, tag) tuples that didn't make it into a chunk.\n", 1005 | "\n", 1006 | "If a node is a tree object, it has a method label(), in this case marked 'NP'. It also has a method leaves() that will give us the list of tagged tokens (tuples) in the phrase. If we pull out the first token from each tuple, and concatenate these, we can get the original phrase back."
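As a side note, the manual loop in the next code cell can also be written with the tree's built-in subtrees() method, which takes a filter function; because this tree is flat, the result is essentially the same. A minimal sketch, assuming sent_chunked from the cell above:

```python
# Sketch: the same NP extraction as the loop below, using Tree.subtrees() with a filter.
# Assumes sent_chunked (an nltk.tree.Tree) has already been created in the cell above.
for np_subtree in sent_chunked.subtrees(filter=lambda t: t.label() == 'NP'):
    phrase = [tok for (tok, tag) in np_subtree.leaves()]
    print(' '.join(phrase))
```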
1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": null, 1012 | "metadata": { 1013 | "collapsed": false 1014 | }, 1015 | "outputs": [], 1016 | "source": [ 1017 | "for node in sent_chunked:\n", 1018 | " if type(node)==nltk.tree.Tree and node.label()=='NP':\n", 1019 | " phrase = [tok for (tok, tag) in node.leaves()]\n", 1020 | " print(' '.join(phrase))" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "metadata": {}, 1026 | "source": [ 1027 | "## 5) Named Entity Recognition\n", 1028 | "\n", 1029 | "Once we have noun phrases separated out, we might find it useful to figure out what categories of things these nouns refer to. Especially if the noun phrase is a proper noun, i.e. a name of something, we might be able to tell if it is the name of a person, an organization, a place, or some other thing. Labeling noun phrases as different types of named entities is called \"Named Entity Recognition\" or \"NER\".\n", 1030 | "\n", 1031 | "Named Entity Recognition involves meaning (semantics) as well as grammar (syntax). The name of a person or an organization might appear in the same place in the exact same sentence, so we also have to know something about existing person and organization names to be able to tell them apart. For that reason, NER taggers are usually trained from labeled training data, using supervised machine learning techniques. NLTK comes with a pre-trained NER tagger we can use for general English text:" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": null, 1037 | "metadata": { 1038 | "collapsed": false 1039 | }, 1040 | "outputs": [], 1041 | "source": [ 1042 | "sent_nes = nltk.ne_chunk(sent_tagged)\n", 1043 | "print(sent_nes)" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "Now we can extract named entities and their NER categories. For instance, we might want to pull out a list of all of the organizations or people mentioned in the document:" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": null, 1056 | "metadata": { 1057 | "collapsed": false 1058 | }, 1059 | "outputs": [], 1060 | "source": [ 1061 | "entities = {'ORGANIZATION':[], 'PERSON':[], 'LOCATION':[]}\n", 1062 | "for node in sent_nes:\n", 1063 | " if type(node)==nltk.tree.Tree:\n", 1064 | " phrase = [tok for (tok, tag) in node.leaves()]\n", 1065 | " if node.label() in entities.keys():\n", 1066 | " entities[node.label()].append(' '.join(phrase))\n", 1067 | "\n", 1068 | "for key, value in entities.items():\n", 1069 | " print(key, value)" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "## Putting into Practice" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "metadata": {}, 1082 | "source": [ 1083 | "### Task 1. Classifying Documents\n", 1084 | "\n", 1085 | "Not all adjectives are alike; maybe what matters is which adjectives modify certain nouns: \"awful movie\" clearly sounds like a negative review, while \"awful lines\" or \"awful crowds\" might actually indicate that the movie was popular.\n", 1086 | "\n", 1087 | "Let's try using noun phrases as the features for our sentiment classifier. To do so, we'll need to identify a list of common noun phrases from the corpus, then also chunk each document to see if each noun phrase appears there. These operations are time-consuming, we don't want to do them twice for each document. 
So let's first chunk all of the documents in our doc_tuples list, and extract a list of the noun phrases in each.\n", 1088 | "\n", 1089 | "We might also want to change our grammar slightly, so that it just looks for noun phrases with an adjective followed by a noun (i.e. no articles, no nouns by themselves)." 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": null, 1095 | "metadata": { 1096 | "collapsed": false 1097 | }, 1098 | "outputs": [], 1099 | "source": [ 1100 | "grammar = \"NP: {<JJ><NN.*>}\"\n", 1101 | "cp = RegexpParser(grammar)\n", 1102 | "\n", 1103 | "def extract_nps(wordlist):\n", 1104 | " wordlist_tagged = nltk.pos_tag(wordlist)\n", 1105 | " wordlist_chunked = cp.parse(wordlist_tagged)\n", 1106 | " nps = []\n", 1107 | " for node in wordlist_chunked:\n", 1108 | " if type(node)==nltk.tree.Tree and node.label()=='NP':\n", 1109 | " phrase = [tok for (tok, tag) in node.leaves()]\n", 1110 | " nps.append(' '.join(phrase))\n", 1111 | " return nps\n", 1112 | "\n", 1113 | "docs_tuples_nps = [(extract_nps(wordlist), cat) for (wordlist, cat) in docs_tuples]" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "Now instead of adjectives, write new code to identify the 1000 most common noun phrases in the corpus, using the chunking grammar and RegexpParser we created above." 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": null, 1126 | "metadata": { 1127 | "collapsed": false 1128 | }, 1129 | "outputs": [], 1130 | "source": [ 1131 | "# Create a list of the most frequent noun phrases in the entire corpus\n" 1132 | ] 1133 | }, 1134 | { 1135 | "cell_type": "markdown", 1136 | "metadata": {}, 1137 | "source": [ 1138 | "Now we will also need to modify the last line, so that we pass in each document's list of nps to the function document_features, since we've already chunked all of the documents in our corpus." 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": null, 1144 | "metadata": { 1145 | "collapsed": false 1146 | }, 1147 | "outputs": [], 1148 | "source": [ 1149 | "# Create feature sets of document (features, category) tuples\n" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "markdown", 1154 | "metadata": {}, 1155 | "source": [ 1156 | "Run the rest of the code as-is, and see what happens to accuracy and informative features." 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": null, 1162 | "metadata": { 1163 | "collapsed": false 1164 | }, 1165 | "outputs": [], 1166 | "source": [ 1167 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 1168 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 1169 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 1170 | "print(nltk.classify.accuracy(classifier, test_set))\n", 1171 | "print(classifier.show_most_informative_features(10))" 1172 | ] 1173 | }, 1174 | { 1175 | "cell_type": "markdown", 1176 | "metadata": {}, 1177 | "source": [ 1178 | "### Task 2. Information Extraction\n", 1179 | "\n", 1180 | "Named Entity Recognition is especially useful for information extraction.
Let's say we're especially interested in identifying all of the named people and organizations involved in reported elections, but we don't care about locations or other entity names.\n", 1181 | "\n", 1182 | "Again, to remind you what we started with, the code below is what we used to extract full sentences if they contained a keyword related to elections." 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": null, 1188 | "metadata": { 1189 | "collapsed": false 1190 | }, 1191 | "outputs": [], 1192 | "source": [ 1193 | "# Loop through documents and extract each sentence containing a election-related word\n", 1194 | "elect_sents = []\n", 1195 | "for doc in news_docs:\n", 1196 | " for sent in doc:\n", 1197 | " for tok in sent:\n", 1198 | " if re.match(elect_regexp, tok):\n", 1199 | " elect_sents.append(sent)\n", 1200 | " break # Break out of last for loop, so we only add the sentence once\n", 1201 | " \n", 1202 | "len(elect_sents)" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "markdown", 1207 | "metadata": {}, 1208 | "source": [ 1209 | "Try writing new code to extract all of the named entities that are either a PERSON or ORGANIZATION from a sentence that contains an election-related word." 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": null, 1215 | "metadata": { 1216 | "collapsed": false 1217 | }, 1218 | "outputs": [], 1219 | "source": [] 1220 | }, 1221 | { 1222 | "cell_type": "markdown", 1223 | "metadata": {}, 1224 | "source": [ 1225 | "Now we aren't getting places or things, but we're also missing relevant entities like \"voters\" because they aren't named entities. Try opening it back up to any noun phrase, but being even more specific about position in the sentence, extracting a noun phrase only if it appears immediately before or immediately after an elect word. What might those entities represent?" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "execution_count": null, 1231 | "metadata": { 1232 | "collapsed": false 1233 | }, 1234 | "outputs": [], 1235 | "source": [] 1236 | }, 1237 | { 1238 | "cell_type": "markdown", 1239 | "metadata": {}, 1240 | "source": [ 1241 | "## 6) Parsing\n", 1242 | "\n", 1243 | "Breaking down parts of a sentence and identifying their grammatical roles constitutes parsing. There are two main types of parsing in NLP: constituency parsing and dependency parsing. " 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "markdown", 1248 | "metadata": { 1249 | "collapsed": true 1250 | }, 1251 | "source": [ 1252 | "### Constituency Parsing\n", 1253 | "\n", 1254 | "The grammar rule we used to chunk noun phrases represents the first layer of a constituency parse. We can add rules to the grammar to identify other types of phrases as well.\n", 1255 | "\n", 1256 | "The most common additional types of phrases are prepositional phrases and verb phrases. The standard approach is not just to put prepositions in prepositional phrases, and verbs into verb phrases. Instead, we label a verb plus the noun object it's acting on as a verb phrase. Similarly, a preposition plus the following noun is a prepositional phrase.\n", 1257 | "\n", 1258 | "In other words, these phrases are nested. The noun phrases we identified with our first rule become components (or constituents) of the verb or prepositional phrases in the next layer up, until we get to the level of the sentence overall.\n", 1259 | "\n", 1260 | "Let's add those additional phrase types to our grammar. 
We'll use three quotation marks to indicate a string that covers multiple lines." 1261 | ] 1262 | }, 1263 | { 1264 | "cell_type": "code", 1265 | "execution_count": null, 1266 | "metadata": { 1267 | "collapsed": true 1268 | }, 1269 | "outputs": [], 1270 | "source": [ 1271 | "grammar = r\"\"\"\n", 1272 | " NP: {
<DT>?<JJ>*<NN>+} # Chunk sequences of DT, JJ, NN\n", 1273 | " PP: {<IN><NP>} # Chunk prepositions followed by NP\n", 1274 | " VP: {<VB.*><NP|PP|CLAUSE>+} # Chunk verbs and their arguments\n", 1275 | " CLAUSE: {<NP><VP>} # Chunk NP, VP into a clause\n", 1276 | " \"\"\"\n", 1277 | "cp = RegexpParser(grammar)" 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "markdown", 1282 | "metadata": {}, 1283 | "source": [ 1284 | "Now let's parse the first sentence in our election-related sentences again." 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "execution_count": null, 1290 | "metadata": { 1291 | "collapsed": false 1292 | }, 1293 | "outputs": [], 1294 | "source": [ 1295 | "sent = elect_sents[0]\n", 1296 | "sent_tagged = nltk.pos_tag(sent)\n", 1297 | "sent_chunked = cp.parse(sent_tagged)\n", 1298 | "print(sent_chunked)" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "markdown", 1303 | "metadata": {}, 1304 | "source": [ 1305 | "We have more nested phrases now, NPs within PPs within VPs. We can draw the tree to better visualize the greater depth." 1306 | ] 1307 | }, 1308 | { 1309 | "cell_type": "code", 1310 | "execution_count": null, 1311 | "metadata": { 1312 | "collapsed": true 1313 | }, 1314 | "outputs": [], 1315 | "source": [ 1316 | "sent_chunked.draw()" 1317 | ] 1318 | }, 1319 | { 1320 | "cell_type": "markdown", 1321 | "metadata": {}, 1322 | "source": [ 1323 | "Since this is no longer a flat list of noun phrases and other nodes, a simple for loop over the nodes is no longer the best way to look for phrases of interest. A tree structure is best traversed using recursion: define a function to perform the operations you want to do on one node, then have the function call itself on each of its children." 1324 | ] 1325 | }, 1326 | { 1327 | "cell_type": "code", 1328 | "execution_count": null, 1329 | "metadata": { 1330 | "collapsed": false 1331 | }, 1332 | "outputs": [], 1333 | "source": [ 1334 | "def extract_nps_recurs(tree):\n", 1335 | " nps = []\n", 1336 | " if not type(tree)==nltk.tree.Tree:\n", 1337 | " return nps\n", 1338 | " if tree.label()=='NP':\n", 1339 | " nps.append(' '.join([tok for (tok, tag) in tree.leaves()]))\n", 1340 | " for subtree in tree:\n", 1341 | " nps.extend(extract_nps_recurs(subtree))\n", 1342 | " return nps" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": null, 1348 | "metadata": { 1349 | "collapsed": false 1350 | }, 1351 | "outputs": [], 1352 | "source": [ 1353 | "extract_nps_recurs(sent_chunked)" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "markdown", 1358 | "metadata": {}, 1359 | "source": [ 1360 | "### Dependency Parsing\n", 1361 | "\n", 1362 | "Another way to parse sentences is to identify which words are syntactically dependent on other words, and what their dependency relationship is. Dependency parsing usually places the main verb of a sentence at the root of the tree, then assigns the verb's subject, direct object, and indirect objects as dependents. An indirect object will usually be connected to a root verb through a preposition. And nouns can have dependents too, which modify or describe some aspect of the noun." 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "> #### Prepositional phrase attachment\n", 1370 | "> \n", 1371 | "> Dependency parsing is very complex; determining which words depend on which other words \n", 1372 | "> involves not only part-of-speech tags, but other information that's more specific to each \n", 1373 | "> verb or noun in the given sequence. 
Here's a classic example:\n", 1374 | "> \n", 1375 | "> * \"He ate pizza with olives.\"\n", 1376 | "> * \"He ate pizza with a fork.\"\n", 1377 | "> \n", 1378 | "> Which word in each sentence does the final prepositional phrase modify? In the first sentence, the olives are \n", 1379 | "> on the pizza, so \"with olives\" modifies the noun. Saying \"He ate with olives\" wouldn't make sense without \n", 1380 | "> the pizza. In the second sentence, we aren't talking about a thing called \"a pizza with a \n", 1381 | "> fork\"; that wouldn't make sense. The fork modifies the verb \"ate\": \"He ate with a fork\"." 1382 | ] 1383 | }, 1384 | { 1385 | "cell_type": "markdown", 1386 | "metadata": {}, 1387 | "source": [ 1388 | "Because of these nuances, dependency parsers are usually built using extensive training data, in the form of \"treebanks\" of sentences annotated with dependency relations. Several major dependency parsers are available in pre-trained form for English-language text. It is also possible to train open-source dependency parsers on other publicly available treebanks (such as those from the Universal Dependencies project, which offers annotated treebanks in many languages).\n", 1389 | "\n", 1390 | "Today, we'll work with the Stanford Parser, which is part of the Stanford CoreNLP toolkit. Stanford CoreNLP provides a number of state-of-the-art NLP tools and is widely used by computer scientists as well as social scientists and humanists. It is written in Java, but there are APIs that enable you to access some of the tools from Python. Several of the most popular tools can be used through NLTK." 1391 | ] 1392 | }, 1393 | { 1394 | "cell_type": "markdown", 1395 | "metadata": {}, 1396 | "source": [ 1397 | "### Stanford CoreNLP\n", 1398 | "\n", 1399 | "To get started, you'll need to download the Stanford Parser from this website: http://nlp.stanford.edu/software/lex-parser.html#Download and unzip it to a location on your computer that's easy to find (e.g. a folder called SourceCode in your Documents folder).\n", 1400 | "\n", 1401 | "Then in Python, import the StanfordDependencyParser class from NLTK's parse package. You'll also need to import the module 'os' and set the following environment variables to the location on your computer where you put the unzipped Stanford Parser folder." 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": null, 1407 | "metadata": { 1408 | "collapsed": false 1409 | }, 1410 | "outputs": [], 1411 | "source": [ 1412 | "import os\n", 1413 | "from nltk.parse.stanford import StanfordDependencyParser\n", 1414 | "\n", 1415 | "os.environ['STANFORD_PARSER'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2016-10-31'\n", 1416 | "os.environ['STANFORD_MODELS'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2016-10-31'" 1417 | ] 1418 | }, 1419 | { 1420 | "cell_type": "markdown", 1421 | "metadata": {}, 1422 | "source": [ 1423 | "Now let's create a dependency parser object and try parsing our election-related sentences.\n",
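"\n", "Once that parser object exists (it's created in the next cell), it can help to parse one short throwaway sentence with `raw_parse` before committing to the whole corpus. The sketch below is only a sanity check under that assumption: the example sentence is invented, and `raw_parse` is assumed to return an iterator over DependencyGraph objects, as in recent NLTK versions.\n", "\n", "```python\n", "# Hypothetical smoke test: parse one made-up sentence and inspect its dependency triples\n", "test_graph = next(dependency_parser.raw_parse('The challenger won the election'))\n", "for triple in test_graph.triples():\n", "    print(triple)\n", "```"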
1424 | ] 1425 | }, 1426 | { 1427 | "cell_type": "code", 1428 | "execution_count": null, 1429 | "metadata": { 1430 | "collapsed": false 1431 | }, 1432 | "outputs": [], 1433 | "source": [ 1434 | "dependency_parser = StanfordDependencyParser(model_path=\"edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz\")\n", 1435 | "sents_parsed = dependency_parser.parse_sents(elect_sents)" 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "markdown", 1440 | "metadata": {}, 1441 | "source": [ 1442 | "The NLTK interface to the Stanford Parser returns an iterator (of iterators) over NLTK DependencyGraph objects. To be able to access the graph objects more than once, we can convert this into a list:" 1443 | ] 1444 | }, 1445 | { 1446 | "cell_type": "code", 1447 | "execution_count": null, 1448 | "metadata": { 1449 | "collapsed": false 1450 | }, 1451 | "outputs": [], 1452 | "source": [ 1453 | "sents_parseobjs = [obj for sent in sents_parsed for obj in sent]" 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "code", 1458 | "execution_count": null, 1459 | "metadata": { 1460 | "collapsed": false 1461 | }, 1462 | "outputs": [], 1463 | "source": [ 1464 | "len(sents_parseobjs)" 1465 | ] 1466 | }, 1467 | { 1468 | "cell_type": "markdown", 1469 | "metadata": {}, 1470 | "source": [ 1471 | "The graph object contains a method .tree() to depict the parse tree. (If we add .draw(), it will open in a separate window.)" 1472 | ] 1473 | }, 1474 | { 1475 | "cell_type": "code", 1476 | "execution_count": null, 1477 | "metadata": { 1478 | "collapsed": false 1479 | }, 1480 | "outputs": [], 1481 | "source": [ 1482 | "sents_parseobjs[0].tree()" 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "markdown", 1487 | "metadata": {}, 1488 | "source": [ 1489 | "This tree shows us the dependencies (i.e. the arcs), but it doesn't show us the labeled dependency relations, which are a huge part of the value of dependency parsing. In other words, it shows us that \"investigation\" and \"evidence\" are both dependents of the verb \"produced\", but it doesn't show which was the subject and which the object of the action.\n", 1490 | "\n", 1491 | "The method .triples() extracts dependency triples of the form: ((head word, head tag), rel, (dep word, dep tag)). So for every pair of head word and dependent word, it will give us a triple, with the dependency relation label in between. (The method .triples() also returns an iterator; here we'll just use a for loop to print out each triple.)" 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": null, 1497 | "metadata": { 1498 | "collapsed": false 1499 | }, 1500 | "outputs": [], 1501 | "source": [ 1502 | "for triple in sents_parseobjs[0].triples():\n", 1503 | " print(triple)" 1504 | ] 1505 | }, 1506 | { 1507 | "cell_type": "markdown", 1508 | "metadata": {}, 1509 | "source": [ 1510 | "This list of triples repeats a lot of head words, in order to capture all of their relations. Another format in which we can view the parse information is to convert it into CoNLL format. (CoNLL is the Conference on Computational Natural Language Learning, organized by SIGNLL; it runs annual shared tasks relating to syntactic and semantic parsing.) The CoNLL-formatted output is a string with one line for each word in the original sentence. The lines contain the word, its part-of-speech tag (two versions), the line number for the head word it is directly dependent on, and the label for that dependency relation.\n",
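"\n", "If we want to group together all of the words that depend on the same head, one option is to split those CoNLL lines ourselves. The sketch below assumes the 10-column, tab-separated layout that .to_conll(10) typically produces (index, word, lemma, coarse tag, tag, feats, head, relation, plus two unused columns); if your NLTK version lays the columns out differently, adjust the indices to match.\n", "\n", "```python\n", "# Group dependents by the index of their head word (column layout assumed as described above)\n", "from collections import defaultdict\n", "\n", "def conll_heads(conll_str):\n", "    heads = defaultdict(list)\n", "    for line in conll_str.strip().split('\\n'):\n", "        cols = line.split('\\t')\n", "        idx, word, head, rel = cols[0], cols[1], cols[6], cols[7]\n", "        heads[head].append((idx, word, rel))\n", "    return heads\n", "\n", "conll_heads(sents_parseobjs[0].to_conll(10))\n", "```"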
1511 | ] 1512 | }, 1513 | { 1514 | "cell_type": "code", 1515 | "execution_count": null, 1516 | "metadata": { 1517 | "collapsed": false 1518 | }, 1519 | "outputs": [], 1520 | "source": [ 1521 | "print(sents_parseobjs[0].to_conll(10))" 1522 | ] 1523 | }, 1524 | { 1525 | "cell_type": "code", 1526 | "execution_count": null, 1527 | "metadata": { 1528 | "collapsed": false 1529 | }, 1530 | "outputs": [], 1531 | "source": [ 1532 | "print(dir(sents_parseobjs[0]))" 1533 | ] 1534 | }, 1535 | { 1536 | "cell_type": "markdown", 1537 | "metadata": {}, 1538 | "source": [ 1539 | "## Putting into Practice" 1540 | ] 1541 | }, 1542 | { 1543 | "cell_type": "markdown", 1544 | "metadata": {}, 1545 | "source": [ 1546 | "### Task 2. Information Extraction\n", 1547 | "\n", 1548 | "Parse trees can be used to extract features for document classification, but today we'll focus on applying dependency parsing to the task of information extraction. This should finally enable us to identify entities with particular roles in elections, like voters, candidates, winners, and losers.\n", 1549 | "\n", 1550 | "Let's try looking for winners. What nouns, in relation to what verbs, would represent the winner of an election? We can look for the subject of verbs like \"win\" (or \"won\"), or maybe \"defeated\" (the direct object would be the loser). We can also look for the direct object of verbs like \"elected\".\n", 1551 | "\n", 1552 | "Note: Dependency roles \"subject\" or \"direct object\" are syntactic (or grammatical) roles. Roles in events like \"winner\" or \"loser\" are semantic (or meaningful) roles. When we use a dependency parse and then add our own rules to extract certain entities that mean something in a real-world event, we're doing \"semantic role labeling.\" This is a hard, complicated task, and we're just scratching the surface of it with this simple example.\n", 1553 | "\n", 1554 | "Since we're only looking at the dependency relation between a verb and one subject or object in this case, we can use the NLTK graph object's method to get dependency triples. See if you can fill in the rest of the code below to extract words that might represent a winner of an election." 1555 | ] 1556 | }, 1557 | { 1558 | "cell_type": "code", 1559 | "execution_count": null, 1560 | "metadata": { 1561 | "collapsed": false 1562 | }, 1563 | "outputs": [], 1564 | "source": [] 1565 | }, 1566 | { 1567 | "cell_type": "markdown", 1568 | "metadata": {}, 1569 | "source": [ 1570 | "## Wrapping Up, Further Exploration:\n", 1571 | "\n", 1572 | "This last exercise didn't give us a lot of clear entities. But you can see where we're headed. That may be because \"elect\" and \"vote\" aren't often the main verb in a sentence about elections. We could also look for subjects of verbs like \"won\" or \"defeated\", but that would only show us the candidates, not who elected them.\n", 1573 | "\n", 1574 | "We also aren't getting entities' full names, just the head word of a noun phrase. And many of the subjects and objects we extracted are pronouns like \"it\" or \"they\", so we'd need to look at a previous sentence or clause to figure out what entity that pronoun refers to, adding another NLP task called coreference resolution.\n", 1575 | "\n", 1576 | "Finally, if we wanted to know which voters or group elected which candidate, we'd need to look at multiple dependents of the same verb. The dependency triples don't allow us to do that. Instead, we could look in the CoNLL-formatted output for lines that have the same line number in the \"head\" column. 
Or we could construct a tree of nodes out of the CoNLL output, then traverse it with a recursive function.\n", 1577 | "\n", 1578 | "Clearly, information extraction is complicated. Exploring these additional options is beyond the scope of this workshop. The NLTK book discusses additional resources (see Chapter 7 for Information Extraction: http://www.nltk.org/book/ch07.html.) And the Stanford CoreNLP toolkit provides other tools that can help as well (full suite here: http://nlp.stanford.edu/software/)." 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": null, 1584 | "metadata": { 1585 | "collapsed": true 1586 | }, 1587 | "outputs": [], 1588 | "source": [] 1589 | } 1590 | ], 1591 | "metadata": { 1592 | "anaconda-cloud": {}, 1593 | "kernelspec": { 1594 | "display_name": "Python [default]", 1595 | "language": "python", 1596 | "name": "python3" 1597 | }, 1598 | "language_info": { 1599 | "codemirror_mode": { 1600 | "name": "ipython", 1601 | "version": 3 1602 | }, 1603 | "file_extension": ".py", 1604 | "mimetype": "text/x-python", 1605 | "name": "python", 1606 | "nbconvert_exporter": "python", 1607 | "pygments_lexer": "ipython3", 1608 | "version": "3.5.2" 1609 | } 1610 | }, 1611 | "nbformat": 4, 1612 | "nbformat_minor": 1 1613 | } 1614 | -------------------------------------------------------------------------------- /NLP_NLTK/NLP_NLTK_Answers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Task 1: Classifying Documents" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "#### Using Tokenization (and basic bag-of-words features)\n", 15 | "\n", 16 | "Here is the code we went over at the start, to get started classifying documents by sentiment." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": { 23 | "collapsed": false 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "import re\n", 28 | "import random\n", 29 | "import nltk\n", 30 | "from nltk.corpus import movie_reviews\n", 31 | "\n", 32 | "# Read in a list of document (wordlist, category) tuples, and shuffle\n", 33 | "docs_tuples = [(movie_reviews.words(fileid), category)\n", 34 | " for category in movie_reviews.categories()\n", 35 | " for fileid in movie_reviews.fileids(category)[:200]]\n", 36 | "random.shuffle(docs_tuples)\n", 37 | "\n", 38 | "# Create a list of the most frequent words in the entire corpus\n", 39 | "movie_words = [word.lower() for (wordlist, cat) in docs_tuples for word in wordlist]\n", 40 | "all_wordfreqs = nltk.FreqDist(movie_words)\n", 41 | "top_wordfreqs = all_wordfreqs.most_common()[:1000]\n", 42 | "feature_words = [x[0] for x in top_wordfreqs]\n", 43 | "\n", 44 | "# Define a function to extract features of the form contains(word) for each document\n", 45 | "def document_features(doc_toks):\n", 46 | " document_words = set(doc_toks)\n", 47 | " features = {}\n", 48 | " for word in feature_words:\n", 49 | " features['contains({})'.format(word)] = 1 if word in document_words else 0\n", 50 | " return features\n", 51 | "\n", 52 | "# Create feature sets of document (features, category) tuples\n", 53 | "featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]\n", 54 | "\n", 55 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 56 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 57 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 58 | "print(nltk.classify.accuracy(classifier, test_set))\n", 59 | "print(classifier.show_most_informative_features(10))" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "#### Using POS Tagging" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We left the first part of the code the same as above, but created a new list of most common adjectives as our feature words:" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# Create a list of the most frequent adjectives in the entire corpus\n", 85 | "from nltk import FreqDist\n", 86 | "\n", 87 | "movie_tokstags = nltk.pos_tag(movie_words)\n", 88 | "movie_adjs = [tok for (tok,tag) in movie_tokstags if re.match('JJ', tag)]\n", 89 | "all_adjfreqs = FreqDist(movie_adjs)\n", 90 | "top_adjfreqs = all_adjfreqs.most_common()[:1000]\n", 91 | "feature_words = [x[0] for x in top_adjfreqs]" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Then we left the document_features() function and remaining code the same:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "# Define a function to extract features of the form contains(word) for each document\n", 110 | "def document_features(doc_toks):\n", 111 | " document_words = set(doc_toks)\n", 112 | " features = {}\n", 113 | " for word in feature_words:\n", 114 | " features['contains({})'.format(word)] = 1 if word in document_words else 0\n", 115 | " return features\n", 116 | "\n", 117 | "# Create feature sets of document 
(features, category) tuples\n", 118 | "featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]\n", 119 | "\n", 120 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 121 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 122 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 123 | "print(nltk.classify.accuracy(classifier, test_set))\n", 124 | "print(classifier.show_most_informative_features(10))" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "#### Using Phrase Chunking" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Now we created a new list of most common noun phrases, and also modified the line where we use the document_features() function to extract the noun phrase features from each document's noun phrase list:" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "# Create a list of the most frequent noun phrases in the entire corpus\n", 150 | "from nltk import RegexpParser\n", 151 | "\n", 152 | "grammar = \"NP: {<DT>?<JJ>*<NN>+}\"\n", 153 | "cp = RegexpParser(grammar)\n", 154 | "\n", 155 | "def extract_nps(wordlist):\n", 156 | " wordlist_tagged = nltk.pos_tag(wordlist)\n", 157 | " wordlist_chunked = cp.parse(wordlist_tagged)\n", 158 | " nps = []\n", 159 | " for node in wordlist_chunked:\n", 160 | " if type(node)==nltk.tree.Tree and node.label()=='NP':\n", 161 | " phrase = [tok for (tok, tag) in node.leaves()]\n", 162 | " nps.append(' '.join(phrase))\n", 163 | " return nps\n", 164 | "\n", 165 | "docs_tuples_nps = [(extract_nps(wordlist), cat) for (wordlist, cat) in docs_tuples]\n", 166 | "\n", 167 | "movie_nps = [np for (nplist, cat) in docs_tuples_nps for np in nplist]\n", 168 | "all_npfreqs = FreqDist(movie_nps)\n", 169 | "top_npfreqs = all_npfreqs.most_common()[:1000]\n", 170 | "feature_nps = [x[0] for x in top_npfreqs]" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "# Create feature sets of document (features, category) tuples\n", 182 | "featuresets = [(document_features(nplist), cat) for (nplist, cat) in docs_tuples_nps]" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "We left the last part of the code the same:" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "collapsed": false 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "# Separate train and test sets, train the classifier, print accuracy and best features\n", 201 | "train_set, test_set = featuresets[:-100], featuresets[-100:]\n", 202 | "classifier = nltk.NaiveBayesClassifier.train(train_set)\n", 203 | "print(nltk.classify.accuracy(classifier, test_set))\n", 204 | "print(classifier.show_most_informative_features(10))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "This actually doesn't do that well. One reason is that we're using more complex sequences of words as our noun phrase features, each of which is going to appear far less frequently across documents. 
We might need to increase the number of noun phrases we use, or limit the pattern we're looking for to a single adjective followed by a single noun (leaving out articles, etc). But it might also be the case that adjectives are really the best features to use for sentiment classification in the domain of movie reviews, and the version of this task we did using POS tags was the right way to go." 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "## Task 2. Information Extraction" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "#### Using Tokenization (and basic keyword search)\n", 226 | "\n", 227 | "Here is the code we went over at the start, to initially extract election-related sentences." 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | "collapsed": false 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "from nltk.corpus import brown\n", 239 | "\n", 240 | "# Read in all news docs as a list of sentences, each sentence a list of tokens\n", 241 | "news_docs = [brown.sents(fileid) for fileid in brown.fileids(categories='news')]\n", 242 | "\n", 243 | "# Create regular expression to search for election-related words\n", 244 | "elect_regexp = 'elect|vote'\n", 245 | "\n", 246 | "# Loop through documents and extract each sentence containing an election-related word\n", 247 | "elect_sents = []\n", 248 | "for doc in news_docs:\n", 249 | " for sent in doc:\n", 250 | " for tok in sent:\n", 251 | " if re.match(elect_regexp, tok):\n", 252 | " elect_sents.append(sent)\n", 253 | " break # Break out of the innermost for loop, so we only add the sentence once\n", 254 | " \n", 255 | "len(elect_sents)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "#### Using POS Tagging\n", 263 | "\n", 264 | "We used the election-related sentences we identified in the first step (so we don't waste time tagging irrelevant text). Then we looped through each sentence, ran the POS tagger, and extracted all the nouns." 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": null, 270 | "metadata": { 271 | "collapsed": false 272 | }, 273 | "outputs": [], 274 | "source": [ 275 | "# Extract nouns from election-related sentences\n", 276 | "elect_nouns = []\n", 277 | "for sent in elect_sents:\n", 278 | " sent_tagged = nltk.pos_tag(sent)\n", 279 | " for (tok, tag) in sent_tagged:\n", 280 | " if re.match('N', tag):\n", 281 | " elect_nouns.append(tok)\n", 282 | "\n", 283 | "print(len(elect_nouns))\n", 284 | "print(elect_nouns[:50])" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Once we've POS-tagged the sentence, we can also check whether the token matching the election regexp is actually tagged as a verb, and only add the sentence's nouns if the sentence passes this more specific test.\n",
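"\n", "With either version, the list elect_nouns can get long, so a quick frequency tally makes it easier to see which nouns come up most often around election words. A minimal sketch (the cutoff of 20 is arbitrary, chosen just for display):\n", "\n", "```python\n", "# Tally the extracted nouns and show the most common ones\n", "noun_freqs = nltk.FreqDist(elect_nouns)\n", "print(noun_freqs.most_common(20))\n", "```"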
292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "# Extract nouns if the sentence contains an election-related verb\n", 303 | "elect_nouns = []\n", 304 | "for sent in elect_sents:\n", 305 | " sent_nouns = []\n", 306 | " contains_elect_verb = False\n", 307 | " sent_tagged = nltk.pos_tag(sent)\n", 308 | " for (tok, tag) in sent_tagged:\n", 309 | " if re.match('V', tag) and re.match(elect_regexp, tok):\n", 310 | " contains_elect_verb = True\n", 311 | " elif re.match('N', tag):\n", 312 | " sent_nouns.append(tok)\n", 313 | " if contains_elect_verb:\n", 314 | " elect_nouns.extend(sent_nouns)\n", 315 | "\n", 316 | "print(len(elect_nouns))\n", 317 | "print(elect_nouns[:50])" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "#### Using Phrase Chunking and NER Tagging\n", 325 | "\n", 326 | "Next we used the NLTK NER tagger (which chunks a sentence into named entity noun phrases, labeled by entity category), to extract named entities for either people or organizations mentioned in election-related sentences." 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": { 333 | "collapsed": false 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "elect_entities = {'ORGANIZATION':[], 'PERSON':[]}\n", 338 | "for sent in elect_sents:\n", 339 | " sent_tagged = nltk.pos_tag(sent)\n", 340 | " sent_nes = nltk.ne_chunk(sent_tagged)\n", 341 | " for node in sent_nes:\n", 342 | " if type(node)==nltk.tree.Tree and node.label() in elect_entities:\n", 343 | " phrase = [tok for (tok, tag) in node.leaves()]\n", 344 | " elect_entities[node.label()].append(' '.join(phrase))\n", 345 | "\n", 346 | "for key, value in elect_entities.items():\n", 347 | " print(key, value, '\\n')" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "We also extracted noun phrases if they appeared right before or after an election-related word." 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": { 361 | "collapsed": false 362 | }, 363 | "outputs": [], 364 | "source": [ 365 | "grammar = \"NP: {
<DT>?<JJ>*<NN>+}\"\n", 366 | "cp = RegexpParser(grammar)\n", 367 | "\n", 368 | "entities_before = []\n", 369 | "entities_after = []\n", 370 | "\n", 371 | "for sent in elect_sents:\n", 372 | " sent_tokstags = nltk.pos_tag(sent)\n", 373 | " sent_chunks = cp.parse(sent_tokstags)\n", 374 | " for n in range(len(sent_chunks)):\n", 375 | " node = sent_chunks[n]\n", 376 | " if type(node)!=nltk.tree.Tree and re.match(elect_regexp, node[0]):\n", 377 | " if n > 0:\n", 378 | " node_prev = sent_chunks[n-1]\n", 379 | " if type(node_prev)==nltk.tree.Tree:\n", 380 | " phrase = ' '.join([tok for (tok, tag) in node_prev.leaves()])\n", 381 | " entities_before.append(phrase)\n", 382 | " if n < len(sent_chunks)-1:\n", 383 | " node_after = sent_chunks[n+1]\n", 384 | " if type(node_after)==nltk.tree.Tree:\n", 385 | " phrase = ' '.join([tok for (tok, tag) in node_after.leaves()])\n", 386 | " entities_after.append(phrase)\n", 387 | " \n", 388 | "print('BEFORE:', entities_before)\n", 389 | "print('AFTER:', entities_after)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "#### Using Dependency Parsing" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": { 403 | "collapsed": false 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "import os\n", 408 | "from nltk.parse.stanford import StanfordDependencyParser\n", 409 | "\n", 410 | "os.environ['STANFORD_PARSER'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2015-12-09'\n", 411 | "os.environ['STANFORD_MODELS'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2015-12-09'\n", 412 | "\n", 413 | "dependency_parser = StanfordDependencyParser(model_path=\"edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz\")\n", 414 | "sents_parsed = dependency_parser.parse_sents(elect_sents)\n", 415 | "sents_parseobjs = [obj for sent in sents_parsed for obj in sent]" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": { 422 | "collapsed": false, 423 | "scrolled": true 424 | }, 425 | "outputs": [], 426 | "source": [ 427 | "elect_winners = []\n", 428 | "\n", 429 | "for sent_parseobj in sents_parseobjs:\n", 430 | " sent_triples = sent_parseobj.triples()\n", 431 | " for triple in sent_triples:\n", 432 | " # Insert your code here\n", 433 | " if re.match('win|won|defeat|gain|secure|achieve|got', triple[0][0]):\n", 434 | " if re.match('nsubj', triple[1]):\n", 435 | " elect_winners.append(triple[2][0])\n", 436 | " elif re.match('elect|vote|choose|pick', triple[0][0]):\n", 437 | " if re.match('dobj', triple[1]):\n", 438 | " elect_winners.append(triple[2][0])\n", 439 | "\n", 440 | "print(elect_winners)" 441 | ] 442 | } 443 | ], 444 | "metadata": { 445 | "anaconda-cloud": {}, 446 | "kernelspec": { 447 | "display_name": "Python [default]", 448 | "language": "python", 449 | "name": "python3" 450 | }, 451 | "language_info": { 452 | "codemirror_mode": { 453 | "name": "ipython", 454 | "version": 3 455 | }, 456 | "file_extension": ".py", 457 | "mimetype": "text/x-python", 458 | "name": "python", 459 | "nbconvert_exporter": "python", 460 | "pygments_lexer": "ipython3", 461 | "version": "3.5.2" 462 | } 463 | }, 464 | "nbformat": 4, 465 | "nbformat_minor": 1 466 | } 467 | -------------------------------------------------------------------------------- /NLP_NLTK/example.txt: -------------------------------------------------------------------------------- 1 | Welcome to natural language processing! Is it NLP or N.L.P.? 
Let's work with NLTK to process, classify, and extract information from texts. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/dlab-berkeley/python-text-analysis) 2 | 3 | # Python Text Analysis Workshops 4 | These notebooks serve as material for instructors teaching text analysis in Python at the UC Berkeley D-Lab. 5 | 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | nltk 2 | gensim 3 | textblob 4 | scikit-learn 5 | pandas 6 | matplotlib 7 | beautifulsoup4 8 | --------------------------------------------------------------------------------