├── Chapter 1.ipynb ├── Chapter 2.ipynb ├── Chapter 3.ipynb ├── Chapter 4.ipynb ├── Chapter 5.ipynb ├── LICENSE ├── README.md ├── corpus.txt └── word_freq.txt /Chapter 3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 12 | "import re\n", 13 | "import pprint\n", 14 | "import random\n", 15 | "from urllib import request\n", 16 | "from nltk import word_tokenize\n", 17 | "from nltk.corpus import brown\n", 18 | "from nltk.corpus import wordnet as wn" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "**1. Define a string ** `s = 'colorless'` **. Write a Python statement that changes this to \"colourless\" using only the slice and concatenation operations.**" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "s = 'colorless'\n", 37 | "s = s[:4] + 'u' + s[4:]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "**2. We can use the slice notation to remove morphological endings on words. For example, ** `'dogs'[:-1]` ** removes the last character of ** `dogs` **, leaving ** `dog` **. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): ** `dish-es` **, ** `run-ning` **, ** `nation-ality` **, ** `un-do` **, ** `pre-heat` **.**" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 3, 50 | "metadata": { 51 | "collapsed": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "dish = 'dishes'[:-2]\n", 56 | "run = 'running'[:-4]\n", 57 | "nation = 'nationality'[:-5]\n", 58 | "do = 'undo'[2:]\n", 59 | "heat = 'preheat'[3:]" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "**3. We saw how we can generate an ** `IndexError` ** by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?**" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "Yes, that is possible. Given a string `s`, `s[-(len(s)+1)]` will generate an `IndexError` since it goes too far to the left." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "**4. We can specify a \"step\" size for the slice. The following returns every second character within the slice: ** `monty[6:11:2]` **. It also works in the reverse direction: ** `monty[10:5:-2]` ** Try these for yourself, then experiment with different step values.**" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "Omitted." 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "**5. What happens if you ask the interpreter to evaluate ** `monty[::-1]` **? 
Explain why this is a reasonable result.**" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 4, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "'nohtyP ytnoM'" 106 | ] 107 | }, 108 | "execution_count": 4, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "monty = 'Monty Python'\n", 115 | "monty[::-1]" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "Reverse the string. `monty[:]` is the string itself, and `:-1` takes the reverse order." 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "**6. Describe the class of strings matched by the following regular expressions.** \n", 130 | "a. `[a-zA-Z]+` \n", 131 | "b. `[A-Z][a-z]*` \n", 132 | "c. `p[aeiou]{,2}t` \n", 133 | "d. `\\d+(\\.\\d+)?` \n", 134 | "e. `([^aeiou][aeiou][^aeiou])*` \n", 135 | "f. `\\w+|[^\\w\\s]+` \n", 136 | "**Test your answers using ** `nltk.re_show()`." 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "a. Normal words(with one or more letters in either upper or lower case) \n", 144 | "b. Titled words(first letter is upper case) \n", 145 | "c. Words starting with `p`, ending with `t`, and with 0 to 2 vowel(s) between. E.g., `pt`, `pet`, `poet`, etc. \n", 146 | "d. Real numbers(integers and fractions) \n", 147 | "e. [Consonant-Vowel-Consonant] with zero or more times \n", 148 | "f. Alphanumeric character(s) or non-whitespace character(s), can be used for tokenizing" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "**7. Write regular expressions to match the following classes of strings:** \n", 156 | "a. **A single determiner (assume that ** `a`**, **`an`**, and ** `the` ** are the only determiners).** \n", 157 | "b. **An arithmetic expression using integers, addition, and multiplication, such as ** `2*3+8`." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 5, 163 | "metadata": { 164 | "collapsed": true 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "re_a = r'(\\ban?\\b|\\bthe\\b)'\n", 169 | "re_b = r'[\\d\\*\\+]+'" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "**8. Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use ** `from urllib import request` ** and then ** ` request.urlopen('http://nltk.org/').read().decode('utf8')` ** to access the contents of the URL.**" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "from bs4 import BeautifulSoup\n", 186 | "def content_of_URL(URL):\n", 187 | " html = request.urlopen(URL).read().decode('utf8')\n", 188 | " raw = BeautifulSoup(html).get_text()\n", 189 | " tokens = word_tokenize(raw)\n", 190 | " return tokens\n", 191 | "\n", 192 | "# well, I haven't installed BeautifulSoup so I skip running this block" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "**9. Save some text into a file ** `corpus.txt` **. Define a function ** `load(f)` ** that reads from the file named in its sole argument, and returns a string containing the text of the file.** \n", 200 | "a. 
**Use ** `nltk.regexp_tokenize()` ** to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).** \n", 201 | "b. **Use ** `nltk.regexp_tokenize()` ** to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.**" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 6, 207 | "metadata": { 208 | "collapsed": true 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "def load_punctuations(f):\n", 213 | " file = open(f)\n", 214 | " raw = file.read()\n", 215 | " pattern = r'''(?x) # set flag to allow verbose regexps\n", 216 | " [,\\.] # comma, period\n", 217 | " | [\\[\\](){}<>] # brackets () {} [] <>\n", 218 | " | ['\"“] # quotation marks\n", 219 | " | [?!] # question mark and exclamation mark\n", 220 | " | [:;] # colon and semicolon\n", 221 | " | \\.\\.\\. # ellipsis\n", 222 | " | [,。?!、‘:;] # some Chinese punctuations\n", 223 | " '''\n", 224 | " return nltk.regexp_tokenize(raw, pattern)\n", 225 | "\n", 226 | "# load_punctuations('corpus.txt')" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 7, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "['$1,000', '£999.99', '¥1000']" 238 | ] 239 | }, 240 | "execution_count": 7, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "def load_monetary(f):\n", 247 | " file = open(f)\n", 248 | " raw = file.read()\n", 249 | " pattern = r'''(?x)\n", 250 | " \\$\\d+(?:,\\d+)*(?:\\.\\d+)? # USD\n", 251 | " | £\\d+(?:,\\d+)*(?:\\.\\d+)? # GBP\n", 252 | " | ¥\\d+(?:\\.\\d+)? # CNY\n", 253 | " '''\n", 254 | " return nltk.regexp_tokenize(raw, pattern)\n", 255 | "\n", 256 | "load_monetary('corpus.txt')" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 8, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/plain": [ 267 | "['2018-08-06', '2018.08.06', '08/06/20', '06/08/20', '06/08/18', '06-08-20']" 268 | ] 269 | }, 270 | "execution_count": 8, 271 | "metadata": {}, 272 | "output_type": "execute_result" 273 | } 274 | ], 275 | "source": [ 276 | "def load_date(f):\n", 277 | " file = open(f)\n", 278 | " raw = file.read()\n", 279 | " pattern = r'''(?x)\n", 280 | " \\d{,4}[/\\.-]\\d{1,2}[/\\.-]\\d{1,2} # big-endian, e.g., 1996-10-23, 1996.10.23, 1996/10/23\n", 281 | " | \\d{1,2}[/\\.-]\\d{1,2}[/\\.-]\\d{,4} # little-endian or middle-endian, dd/mm/yyyy or mm/dd/yyyy \n", 282 | " '''\n", 283 | " # There are dates with month spelled out in full or in abbreviation as well.\n", 284 | " # But the pattern expression can be extremly tedious so I just leave them out.\n", 285 | " \n", 286 | " return nltk.regexp_tokenize(raw, pattern)\n", 287 | "load_date('corpus.txt')" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "**10. Rewrite the following loop as a list comprehension:**\n", 295 | "\n", 296 | "```Python\n", 297 | ">>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']\n", 298 | ">>> result = []\n", 299 | ">>> for word in sent:\n", 300 | "... word_len = (word, len(word))\n", 301 | "... 
result.append(word_len)\n", 302 | ">>> result\n", 303 | "[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]\n", 304 | "```" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 9, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "text/plain": [ 315 | "[('The', 3),\n", 316 | " ('dog', 3),\n", 317 | " ('gave', 4),\n", 318 | " ('John', 4),\n", 319 | " ('the', 3),\n", 320 | " ('newspaper', 9)]" 321 | ] 322 | }, 323 | "execution_count": 9, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']\n", 330 | "result = [(word, len(word)) for word in sent]\n", 331 | "result" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "**11. Define a string ** `raw` ** containing a sentence of your own choosing. Now, split ** `raw` ** on some character other than space, such as ** `'s'`." 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 10, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "data": { 348 | "text/plain": [ 349 | "['Define a ', 'tring raw containing a ', 'entence of your own choo', 'ing.']" 350 | ] 351 | }, 352 | "execution_count": 10, 353 | "metadata": {}, 354 | "output_type": "execute_result" 355 | } 356 | ], 357 | "source": [ 358 | "raw = 'Define a string raw containing a sentence of your own choosing.'\n", 359 | "raw.split('s')" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "**12. Write a ** `for` ** loop to print out the characters of a string, one per line.**" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 11, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "H\n", 379 | "e\n", 380 | "l\n", 381 | "l\n", 382 | "o\n", 383 | " \n", 384 | "w\n", 385 | "o\n", 386 | "r\n", 387 | "l\n", 388 | "d\n" 389 | ] 390 | } 391 | ], 392 | "source": [ 393 | "s = 'Hello world'\n", 394 | "for char in s:\n", 395 | " print(char)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "**13. What is the difference between calling split on a string with no argument or with ** `' '` ** as the argument, e.g. ** `sent.split()` ** versus ** `sent.split(' ')` **? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use ** `'\\t'` ** to enter a tab character.)**" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 12, 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "name": "stdout", 412 | "output_type": "stream", 413 | "text": [ 414 | "['Hello\\t', 'World\\nNLTK']\n", 415 | "['Hello', 'World', 'NLTK']\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "sent = 'Hello\\t World\\nNLTK'\n", 421 | "print(sent.split(' '))\n", 422 | "print(sent.split())" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "`sent.split(' ')` will not split other blank characters like `\\t`, `\\n`." 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "**14. Create a variable ** `words` ** containing a list of words. Experiment with ** `words.sort()` ** and ** `sorted(words)` **. 
What is the difference?**" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "`words.sort()` modify the original variable `words`, and it will not output in default. \n", 444 | "`sorted(words)` return a sorted list without changing the original list." 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "**15. Explore the difference between strings and integers by typing the following at a Python prompt: ** `\"3\" * 7` ** and ** `3 * 7` **. Try converting between strings and integers using ** `int(\"3\")` ** and ** `str(3)` **.**" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": 13, 457 | "metadata": {}, 458 | "outputs": [ 459 | { 460 | "name": "stdout", 461 | "output_type": "stream", 462 | "text": [ 463 | "3333333\n", 464 | "21\n", 465 | "\n", 466 | "\n" 467 | ] 468 | } 469 | ], 470 | "source": [ 471 | "print(\"3\" * 7)\n", 472 | "print(3 * 7)\n", 473 | "print(type(int(\"3\")))\n", 474 | "print(type(str(3)))" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "**16. Use a text editor to create a file called ** `prog.py` ** containing the single line ** `monty = 'Monty Python'` **. Next, start up a new session with the Python interpreter, and enter the expression ** `monty` ** at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the ** `.py` ** part of the filename):**\n", 482 | "```Python\n", 483 | "from prog import monty\n", 484 | "monty\n", 485 | "```\n", 486 | "**This time, Python should return with a value. You can also try ** `import prog` **, in which case Python should be able to evaluate the expression ** `prog.monty` ** at the prompt.**" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "Omitted." 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "**17. What happens when the formatting strings ** `%6s` ** and ** `%-6s` ** are used to display strings that are longer than six characters?**" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 14, 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "name": "stdout", 510 | "output_type": "stream", 511 | "text": [ 512 | "helloworld\n", 513 | "helloworld\n" 514 | ] 515 | } 516 | ], 517 | "source": [ 518 | "s = 'helloworld'\n", 519 | "print('%6s' %s)\n", 520 | "print('%-6s' %s)\n", 521 | "# There seems no difference." 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "**18. Read in some text from a corpus, tokenize it, and print the list of all ** *wh-* **word types that occur. (** *wh-* **words in English are used in questions, relative clauses and exclamations: ** *who*, *which*, *what* **, and so on.) Print them in order. 
Are any words duplicated in this list, because of the presence of case distinctions or punctuation?**" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 15, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "['What', 'Why', 'who']\n" 541 | ] 542 | } 543 | ], 544 | "source": [ 545 | "f = 'corpus.txt'\n", 546 | "file = open(f)\n", 547 | "raw = file.read()\n", 548 | "tokens = word_tokenize(raw)\n", 549 | "print([wh for wh in tokens if wh.lower().startswith('wh')])" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "**19. Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. ** `fuzzy 53` **. Read the file into a Python list using ** ` open(filename).readlines()` **. Next, break each line into its two fields using ** `split()` **, and convert the number into an integer using ** `int()` **. The result should be a list of the form: ** `[['fuzzy', 53], ...]` **.**" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 16, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "[['fuzzy', 53], ['natural', 14], ['language', 12], ['processing', 16]]" 568 | ] 569 | }, 570 | "execution_count": 16, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "filename = 'word_freq.txt'\n", 577 | "lines = open(filename).readlines()\n", 578 | "fields = []\n", 579 | "for line in lines:\n", 580 | " field = line.split()\n", 581 | " field[1] = int(field[1])\n", 582 | " fields.append(field)\n", 583 | "fields" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "**20. Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.**" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 17, 596 | "metadata": {}, 597 | "outputs": [ 598 | { 599 | "data": { 600 | "text/plain": [ 601 | "99" 602 | ] 603 | }, 604 | "execution_count": 17, 605 | "metadata": {}, 606 | "output_type": "execute_result" 607 | } 608 | ], 609 | "source": [ 610 | "url = 'https://weather.com/weather/5day/l/CHXX0044:1:CH'\n", 611 | "html = request.urlopen(url).read().decode('utf8')\n", 612 | "high = int(re.findall(r'High (\\d+)F', html)[0]) \n", 613 | "high\n", 614 | "# I just use regular expression instead of BeautifulSoup, which is a bit tricky:D" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "**21. Write a function ** `unknown()` ** that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using ** `re.findall()` **) and remove any items from this set that occur in the Words Corpus ** `(nltk.corpus.words)` **. 
Try to categorize these words manually and discuss your findings.**" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "metadata": {}, 628 | "outputs": [], 629 | "source": [ 630 | "def unknown(url):\n", 631 | " html = request.urlopen(url).read().decode('utf8')\n", 632 | " lowers = re.findall(r'\\b[a-z]+', html)\n", 633 | " unknowns = [w for w in lowers if w not in nltk.corpus.words.words()]\n", 634 | " return unknowns\n", 635 | "\n", 636 | "unknown('https://en.wikipedia.org')\n", 637 | "# the function is quite slow..." 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "**22. Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.**" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 18, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "data": { 654 | "text/plain": [ 655 | "['html',\n", 656 | " 'html',\n", 657 | " 'lang',\n", 658 | " 'pw',\n", 659 | " 'charset',\n", 660 | " 'utf',\n", 661 | " 'viewport',\n", 662 | " 'http',\n", 663 | " 'equiv',\n", 664 | " 'ompatible',\n", 665 | " 'google',\n", 666 | " 'bx',\n", 667 | " 'oqt',\n", 668 | " 'fdr',\n", 669 | " 'gxr',\n", 670 | " 'tj',\n", 671 | " 'href',\n", 672 | " 'bbc',\n", 673 | " 'co',\n", 674 | " 'uk',\n", 675 | " 'preconnect',\n", 676 | " 'crossorigin',\n", 677 | " 'href',\n", 678 | " 'files',\n", 679 | " 'bbci',\n", 680 | " 'co',\n", 681 | " 'uk',\n", 682 | " 'preconnect',\n", 683 | " 'crossorigin',\n", 684 | " 'href',\n", 685 | " 'nav',\n", 686 | " 'files',\n", 687 | " 'bbci',\n", 688 | " 'co',\n", 689 | " 'uk',\n", 690 | " 'preconnect',\n", 691 | " 'crossorigin',\n", 692 | " 'href',\n", 693 | " 'ichef',\n", 694 | " 'bbci',\n", 695 | " 'co',\n", 696 | " 'uk',\n", 697 | " 'preconnect',\n", 698 | " 'crossorigin',\n", 699 | " 'dns',\n", 700 | " 'prefetch',\n", 701 | " 'href',\n", 702 | " 'mybbc',\n", 703 | " 'files',\n", 704 | " 'bbci',\n", 705 | " 'co',\n", 706 | " 'uk',\n", 707 | " 'dns']" 708 | ] 709 | }, 710 | "execution_count": 18, 711 | "metadata": {}, 712 | "output_type": "execute_result" 713 | } 714 | ], 715 | "source": [ 716 | "file = open('BBC.html').read() \n", 717 | "# For the reason of GFW, BBC is not accessible now. \n", 718 | "# I don't know how to set up proxy in Jupyter Notebook\n", 719 | "# so I save HTML via safari and deal with the file instead.\n", 720 | "\n", 721 | "lowers = re.findall(r'[a-z]+', file)\n", 722 | "unknowns = [w for w in lowers[:100] if w not in nltk.corpus.words.words()]\n", 723 | "# It's too costly to judge whether a word is in nltk's words corpus.\n", 724 | "# Therefore, I choose the first words \n", 725 | "unknowns" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": {}, 731 | "source": [ 732 | "**23. Are you able to write a regular expression to tokenize text in such a way that the word ** *don't* ** is tokenized into** *do* ** and ** *n't* **? 
Explain why this regular expression won't work: ** `«n't|\\w+»` **.**" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 19, 738 | "metadata": { 739 | "collapsed": true 740 | }, 741 | "outputs": [], 742 | "source": [ 743 | "pattern = r\"\\w+(?:'t)?\"" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "The regular expression would detect `don` first and leave out `'t`, which doesn't match the expression `n't` and `t` will be matched to `\\w+`." 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": {}, 756 | "source": [ 757 | "**24. Try to write code to convert text into hAck3r, using regular expressions and substitution, where ** `e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8` **. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: ** `$` ** for word-initial ** `s` **, and ** `5` ** for word-internal ** `s`**.**" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 21, 763 | "metadata": {}, 764 | "outputs": [ 765 | { 766 | "data": { 767 | "text/plain": [ 768 | "'I 8 an app|3 y35t3rday 1n my (ar, $a1d by T0m5w33t!'" 769 | ] 770 | }, 771 | "execution_count": 21, 772 | "metadata": {}, 773 | "output_type": "execute_result" 774 | } 775 | ], 776 | "source": [ 777 | "text = 'I ate an apple yesterday in my car, said by Tom.'\n", 778 | "\n", 779 | "# text = re.sub(pattern, repl, text)\n", 780 | "text = re.sub(r'ate', '8', text) # ate -> 8, replace it first, or 'e' will be replaced\n", 781 | "text = re.sub(r'e', '3', text) # e -> 3\n", 782 | "text = re.sub(r'i', '1', text) # i -> 1\n", 783 | "text = re.sub(r'o', '0', text) # o -> 0\n", 784 | "text = re.sub(r'l', '|', text) # l -> |\n", 785 | "text = re.sub(r'\\.', '5w33t!', text) # . -> 5w33t!\n", 786 | "\n", 787 | "text = re.sub(r'(\\b)(s)', r'\\1$', text) # word-initial s\n", 788 | "text = re.sub(r'(\\w)(s)', r'\\g<1>5', text) # word-internal s\n", 789 | "# reference: https://stackoverflow.com/questions/5984633/python-re-sub-group-number-after-number\n", 790 | "\n", 791 | "text = re.sub(r'g', '9', text) # g -> 9\n", 792 | "text = re.sub(r'c', '(', text) # c -> (\n", 793 | "text" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "**25. ** *Pig Latin* ** is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ** *ay* **, e.g.** *string → ingstray*, *idle → idleay* **. http://en.wikipedia.org/wiki/Pig_Latin** \n", 801 | "a. **Write a function to convert a word to Pig Latin.** \n", 802 | "b. **Write code that converts text, instead of individual words.** \n", 803 | "c. **Extend it further to preserve capitalization, to keep ** `qu` ** together (i.e. so that ** `quiet` ** becomes ** `ietquay` **), and to detect when ** `y` ** is used as a consonant (e.g. ** `yellow` **) vs a vowel (e.g. 
** `style` **).**" 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": 22, 809 | "metadata": {}, 810 | "outputs": [ 811 | { 812 | "data": { 813 | "text/plain": [ 814 | "'ingstray'" 815 | ] 816 | }, 817 | "execution_count": 22, 818 | "metadata": {}, 819 | "output_type": "execute_result" 820 | } 821 | ], 822 | "source": [ 823 | "def pig_latin_word(word):\n", 824 | " pattern = r'\\b([^aeiou]*)(\\w*)'\n", 825 | " repl = r'\\2\\1ay'\n", 826 | " word = re.sub(pattern, repl, word)\n", 827 | " # word += 'ay'\n", 828 | " return word\n", 829 | "\n", 830 | "pig_latin_word('string')\n", 831 | "\n", 832 | "# word = 'idle'\n", 833 | "# word = re.sub(r'\\b([^aeiou]*)(\\w*)', r'\\2\\1', word)\n", 834 | "# word += 'ay'\n", 835 | "# word" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 23, 841 | "metadata": {}, 842 | "outputs": [ 843 | { 844 | "data": { 845 | "text/plain": [ 846 | "'ingstray idleay'" 847 | ] 848 | }, 849 | "execution_count": 23, 850 | "metadata": {}, 851 | "output_type": "execute_result" 852 | } 853 | ], 854 | "source": [ 855 | "def pig_latin_text(text):\n", 856 | " pattern = r'\\b([b-df-hj-np-tv-z]*)(\\w*)\\b'\n", 857 | " repl = r'\\2\\1ay'\n", 858 | " text = re.sub(pattern, repl, text)\n", 859 | " return text\n", 860 | "\n", 861 | "pig_latin_text('string idle')" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": 24, 867 | "metadata": {}, 868 | "outputs": [ 869 | { 870 | "data": { 871 | "text/plain": [ 872 | "'itequay IDLEay'" 873 | ] 874 | }, 875 | "execution_count": 24, 876 | "metadata": {}, 877 | "output_type": "execute_result" 878 | } 879 | ], 880 | "source": [ 881 | "def pig_latin_extention(text):\n", 882 | " # Well, 'Y' as the initial letter seems to be just consonant?\n", 883 | " pattern = r'(?i)\\b(qu|[b-df-hj-np-tv-z]*)(\\w*)\\b'\n", 884 | " repl = r'\\2\\1ay'\n", 885 | " text = re.sub(pattern, repl, text)\n", 886 | " return text\n", 887 | "\n", 888 | "pig_latin_extention('quite IDLE')" 889 | ] 890 | }, 891 | { 892 | "cell_type": "markdown", 893 | "metadata": {}, 894 | "source": [ 895 | "**26. Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.**" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": {}, 901 | "source": [ 902 | "Omitted." 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "**27. Python's random module includes a function ** `choice()` ** which randomly chooses an item from a sequence, e.g. ** `choice(\"aehh \")` ** will produce one of four possible characters, with the letter ** `h` ** being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string ** `\"aehh \"` **, and put this expression inside a call to the ** `''.join()` ** function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: ** `he haha ee heheeh eha` **. 
Use ** `split()` ** and ** `join()` ** again to normalize the whitespace in this string.**" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 25, 915 | "metadata": {}, 916 | "outputs": [ 917 | { 918 | "data": { 919 | "text/plain": [ 920 | "'ehhahehhahe ehaahhe heheha hh hhha e haeehhhhh hhahaheeehehh hahh eeeaehe hhhhheaaaheeahea eeeh heh hhae h ehehaa hheee eeh h haaa heehehee aeaee haehaaha hha e hh a hhee aehh e h hhehahh ahahhhhh aheh h hhh hh ah h aea a heehhhaeeheehha aeaha h aaeaahaaaeaha haaeahahheehe ahae hh haheeehh hhh aha ehhh ehhee aahhee ahhaeee h hehheahehhaa e hehhhe ahhhhhh hhh ahe h a e hahahhhhehh hhehaah heh hhh hh hahh heeh ehhhaaaahahhhehh eeaheeaeahahhhe eaah a aaeh eahhahae hhhahhaeaee hhh'" 921 | ] 922 | }, 923 | "execution_count": 25, 924 | "metadata": {}, 925 | "output_type": "execute_result" 926 | } 927 | ], 928 | "source": [ 929 | "s = []\n", 930 | "for i in range(500):\n", 931 | " s.append(random.choice(\"aehh \"))\n", 932 | "ori = ''.join(s) # the original form may contain multiple blank spaces at the same time\n", 933 | "' '.join(ori.split()) # normalize the whitespace" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": {}, 939 | "source": [ 940 | "**28. Consider the numeric expressions in the following sentence from the MedLine Corpus: ** *The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.* ** Should we say that the numeric expression ** *4.53 +/- 0.15%* ** is three words? Or should we say that it's a single compound word? Or should we say that it is actually ** *nine* ** words, since it's read \"four point five three, plus or minus zero point fifteen percent\"? Or should we say that it's not a \"real\" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?**" 941 | ] 942 | }, 943 | { 944 | "cell_type": "markdown", 945 | "metadata": {}, 946 | "source": [ 947 | "Well, all the explanations are reasonable. But I don't understand why it is *nine*, shouldn't it be *eleven*?" 948 | ] 949 | }, 950 | { 951 | "cell_type": "markdown", 952 | "metadata": {}, 953 | "source": [ 954 | "**29. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define ** `μw` ** to be the average number of letters per word, and ** `μs` ** to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: ** `4.71 μw + 0.5 μs - 21.43` **. Compute the ARI score for various sections of the Brown Corpus, including section ** `f`** (lore) and ** `j` ** (learned). 
Make use of the fact that ** `nltk.corpus.brown.words()` ** produces a sequence of words, while ** `nltk.corpus.brown.sents()` ** produces a sequence of sentences.**" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 26, 960 | "metadata": {}, 961 | "outputs": [ 962 | { 963 | "name": "stdout", 964 | "output_type": "stream", 965 | "text": [ 966 | "adventure 4.0841684990890705\n", 967 | "belles_lettres 10.987652885621749\n", 968 | "editorial 9.471025332953673\n", 969 | "fiction 4.9104735321302115\n", 970 | "government 12.08430349501021\n", 971 | "hobbies 8.922356393630267\n", 972 | "humor 7.887805248319808\n", 973 | "learned 11.926007043317348\n", 974 | "lore 10.254756197101155\n", 975 | "mystery 3.8335518942055167\n", 976 | "news 10.176684595052684\n", 977 | "religion 10.203109907301261\n", 978 | "reviews 10.769699888473433\n", 979 | "romance 4.34922419804213\n", 980 | "science_fiction 4.978058336905399\n" 981 | ] 982 | } 983 | ], 984 | "source": [ 985 | "def miu_w(category):\n", 986 | " word_length = sum(len(w) for w in brown.words(categories=category))\n", 987 | " word_number = len(brown.words(categories=category))\n", 988 | " return word_length / word_number\n", 989 | "\n", 990 | "def miu_s(category):\n", 991 | " sent_length = sum(len(s) for s in brown.sents(categories=category))\n", 992 | " sent_number = len(brown.sents(categories=category))\n", 993 | " return sent_length / sent_number\n", 994 | "\n", 995 | "def ari(category):\n", 996 | " return 4.71 * miu_w(category) + 0.5 * miu_s(category) - 21.43\n", 997 | "\n", 998 | "for category in brown.categories():\n", 999 | " print(category, ari(category))" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "metadata": {}, 1005 | "source": [ 1006 | "**30. Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.**" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 27, 1012 | "metadata": {}, 1013 | "outputs": [], 1014 | "source": [ 1015 | "raw = \"\"\"THE Dawn of Love is an oil painting by English artist \n", 1016 | "William Etty, first exhibited in 1828. Loosely based on a passage \n", 1017 | "from John Milton's 1634 Comus, it shows Venus leaning across to \n", 1018 | "wake the sleeping Love by stroking his wings. It was very poorly \n", 1019 | "received when first exhibited; the stylised Venus was thought unduly \n", 1020 | "influenced by foreign artists such as Rubens as well as being overly \n", 1021 | "voluptuous and unrealistically coloured, while the painting as a whole \n", 1022 | "was considered tasteless and obscene. The Dawn of Love was omitted \n", 1023 | "from the major 1849 retrospective exhibition of Etty's works, and \n", 1024 | "its exhibition in Glasgow in 1899 drew complaints for its supposed \n", 1025 | "obscenity. In 1889 it was bought by Merton Russell-Cotes, and has \n", 1026 | "remained in the collection of the Russell-Cotes Art Gallery & Museum ever since.\"\"\"\n", 1027 | "# from Wikipedia 2018-08-08's featured article\n", 1028 | "tokens = word_tokenize(raw)\n", 1029 | "\n", 1030 | "porter = nltk.PorterStemmer()\n", 1031 | "lancaster = nltk.LancasterStemmer()\n", 1032 | "\n", 1033 | "porter_output = [porter.stem(t) for t in tokens] \n", 1034 | "lancaster_output = [lancaster.stem(t) for t in tokens]" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | "**31. 
Define the variable saying to contain the list ** `['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',\n", 1042 | "'is', 'said', 'than', 'done', '.']` **. Process this list using a ** `for` ** loop, and store the length of each word in a new list lengths. Hint: begin by assigning the empty list to ** `lengths` **, using ** `lengths = []` **. Then each time through the loop, use ** `append()` ** to add another length value to the list. Now do the same thing using a list comprehension.**" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "code", 1047 | "execution_count": 28, 1048 | "metadata": {}, 1049 | "outputs": [], 1050 | "source": [ 1051 | "var = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']\n", 1052 | "lengths = []\n", 1053 | "\n", 1054 | "for w in var:\n", 1055 | " lengths.append(len(w))" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": 29, 1061 | "metadata": {}, 1062 | "outputs": [], 1063 | "source": [ 1064 | "lengths = [len(w) for w in var]" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": {}, 1070 | "source": [ 1071 | "**32. Define a variable ** `silly` ** to contain the string: ** 'newly formed bland ideas are inexpressible in an infuriating way' **. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:** \n", 1072 | "a. **Split ** `silly` ** into a list of strings, one per word, using Python's ** `split()` ** operation, and save this to a variable called ** `bland`. \n", 1073 | "b. **Extract the second letter of each word in ** `silly` ** and join them into a string, to get ** 'eoldrnnnna'. \n", 1074 | "c. **Combine the words in ** `bland` ** back into a single string, using ** `join()` **. Make sure the words in the resulting string are separated with whitespace.** \n", 1075 | "d. **Print the words of ** `silly` ** in alphabetical order, one per line.**" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": 30, 1081 | "metadata": {}, 1082 | "outputs": [ 1083 | { 1084 | "name": "stdout", 1085 | "output_type": "stream", 1086 | "text": [ 1087 | "['an', 'are', 'bland', 'formed', 'ideas', 'in', 'inexpressible', 'infuriating', 'newly', 'way']\n" 1088 | ] 1089 | } 1090 | ], 1091 | "source": [ 1092 | "silly = 'newly formed bland ideas are inexpressible in an infuriating way'\n", 1093 | "bland = silly.split() # a\n", 1094 | "''.join(w[1] for w in bland) # b\n", 1095 | "' '.join(bland) # c\n", 1096 | "print(sorted(bland)) # d" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": {}, 1102 | "source": [ 1103 | "**33. The ** `index()` ** function can be used to look up items in sequences. For example, ** `'inexpressible'.index('e')` ** tells us the index of the first position of the letter ** `e`. \n", 1104 | "a. **What happens when you look up a substring, e.g. ** `'inexpressible'.index('re')` **?** \n", 1105 | "b. **Define a variable ** `words` ** containing a list of words. Now use ** `words.index()` ** to look up the position of an individual word.** \n", 1106 | "c. **Define a variable ** `silly` ** as in the exercise above. 
Use the ** `index()` ** function in combination with list slicing to build a list ** `phrase` ** consisting of all the words up to (but not including) ** `in` ** in silly.**" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": 31, 1112 | "metadata": {}, 1113 | "outputs": [ 1114 | { 1115 | "data": { 1116 | "text/plain": [ 1117 | "5" 1118 | ] 1119 | }, 1120 | "execution_count": 31, 1121 | "metadata": {}, 1122 | "output_type": "execute_result" 1123 | } 1124 | ], 1125 | "source": [ 1126 | "'inexpressible'.index('re')" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "code", 1131 | "execution_count": 32, 1132 | "metadata": {}, 1133 | "outputs": [ 1134 | { 1135 | "data": { 1136 | "text/plain": [ 1137 | "2" 1138 | ] 1139 | }, 1140 | "execution_count": 32, 1141 | "metadata": {}, 1142 | "output_type": "execute_result" 1143 | } 1144 | ], 1145 | "source": [ 1146 | "words = ['a', 'list', 'of', 'words']\n", 1147 | "words.index('of')" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "code", 1152 | "execution_count": 33, 1153 | "metadata": {}, 1154 | "outputs": [ 1155 | { 1156 | "data": { 1157 | "text/plain": [ 1158 | "['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']" 1159 | ] 1160 | }, 1161 | "execution_count": 33, 1162 | "metadata": {}, 1163 | "output_type": "execute_result" 1164 | } 1165 | ], 1166 | "source": [ 1167 | "phrase = bland[:bland.index('in')] # use bland rather than silly here\n", 1168 | "phrase" 1169 | ] 1170 | }, 1171 | { 1172 | "cell_type": "markdown", 1173 | "metadata": {}, 1174 | "source": [ 1175 | "**34. Write code to convert nationality adjectives like ** *Canadian* ** and ** *Australian* ** to their corresponding nouns ** *Canada* ** and ** *Australia* ** (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).**" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "code", 1180 | "execution_count": 35, 1181 | "metadata": {}, 1182 | "outputs": [ 1183 | { 1184 | "data": { 1185 | "text/plain": [ 1186 | "'Canada'" 1187 | ] 1188 | }, 1189 | "execution_count": 35, 1190 | "metadata": {}, 1191 | "output_type": "execute_result" 1192 | } 1193 | ], 1194 | "source": [ 1195 | "# the link should be\n", 1196 | "# https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations\n", 1197 | "\n", 1198 | "# Argentina - Argentinian\n", 1199 | "# Australia - Australian\n", 1200 | "# Austria - Austrian\n", 1201 | "# to be finished...\n", 1202 | "\n", 1203 | "pattern = r'(\\w+)ian'\n", 1204 | "repl = r'\\1a'\n", 1205 | "re.sub(pattern, repl, 'Canadian')" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "markdown", 1210 | "metadata": {}, 1211 | "source": [ 1212 | "**35. Read the LanguageLog post on phrases of the form ** *as best as p can* ** and ** *as best p can* **, where ** *p* ** is a pronoun. Investigate this phenomenon with the help of a corpus and the ** `findall()` ** method for searching tokenized text described in 3.5. 
http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html**" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "code", 1217 | "execution_count": 36, 1218 | "metadata": {}, 1219 | "outputs": [ 1220 | { 1221 | "data": { 1222 | "text/plain": [ 1223 | "['as best I can', 'as best as I can', 'As best as she can']" 1224 | ] 1225 | }, 1226 | "execution_count": 36, 1227 | "metadata": {}, 1228 | "output_type": "execute_result" 1229 | } 1230 | ], 1231 | "source": [ 1232 | "text = \"\"\" I wil straight dispose, as best I can, th'inferiour Magistrate ...\n", 1233 | "And I haue thrust my selfe into this maze, Happily to wiue and thriue, as best I may ...\n", 1234 | "In fine, my life is that of a great schoolboy, getting into scrapes for the fun of it,\n", 1235 | "and fighting my way out as best as I can!\n", 1236 | "As best as she can she hides herself in the full sunlight\n", 1237 | "\"\"\"\n", 1238 | "# text sample from the given url link\n", 1239 | "re.findall(r'(?i)as best (?:as )?(?:I|we|you|he|she|they|it) can', text)" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "markdown", 1244 | "metadata": {}, 1245 | "source": [ 1246 | "**36. Study the ** *lolcat* ** version of the book of Genesis, accessible as ** `nltk.corpus.genesis.words('lolcat.txt')` **, and the rules for converting text into ** *lolspeak* ** at http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to convert English words into corresponding lolspeak words.**" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 37, 1252 | "metadata": {}, 1253 | "outputs": [ 1254 | { 1255 | "data": { 1256 | "text/plain": [ 1257 | "'siet kiet dood ovah kitteh littel'" 1258 | ] 1259 | }, 1260 | "execution_count": 37, 1261 | "metadata": {}, 1262 | "output_type": "execute_result" 1263 | } 1264 | ], 1265 | "source": [ 1266 | "# nltk.corpus.genesis.words('lolcat.txt')\n", 1267 | "text = 'sight kite dude over kitty little'\n", 1268 | "# just implement some easy-to-check rules\n", 1269 | "text = re.sub(r'ight', 'iet', text) # ight -> iet\n", 1270 | "text = re.sub(r'\\bdude\\b', 'dood', text) # dude -> dood\n", 1271 | "text = re.sub(r'([b-df-hj-np-tv-z])(e)\\b', r'\\2\\1', text) # exchange the consonant and the endding 'e'\n", 1272 | "text = re.sub(r'er\\b', 'ah', text) # -er -> -ah\n", 1273 | "text = re.sub(r'y\\b', 'eh', text) # -y -> -eh\n", 1274 | "text = re.sub(r'le\\b', 'el', text) # -le -> -el\n", 1275 | "text" 1276 | ] 1277 | }, 1278 | { 1279 | "cell_type": "markdown", 1280 | "metadata": {}, 1281 | "source": [ 1282 | "**37. Read about the ** `re.sub()` ** function for string substitution using regular expressions, using ** `help(re.sub)` ** and by consulting the further readings for this chapter. Use ** `re.sub` ** in writing code to remove HTML tags from an HTML file, and to normalize whitespace.**" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "execution_count": 38, 1288 | "metadata": {}, 1289 | "outputs": [], 1290 | "source": [ 1291 | "file = open('BBC.html').read()\n", 1292 | "file = re.sub(r'<.*>', '', file)\n", 1293 | "file = re.sub(r'\\s+', ' ', file)" 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "markdown", 1298 | "metadata": {}, 1299 | "source": [ 1300 | "**38. An interesting challenge for tokenization is words that have been split across a line-break. E.g. if ** *long-term* ** is split, then we have the string ** *long-\\nterm*. \n", 1301 | "a. **Write a regular expression that identifies words that are hyphenated at a line-break. 
The expression will need to include the ** `\\n` ** character.** \n", 1302 | "b. **Use ** `re.sub()` ** to remove the \\n character from these words.** \n", 1303 | "c. **How might you identify words that should not remain hyphenated once the newline is removed, e.g. ** 'encyclo-\\npedia'?" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "code", 1308 | "execution_count": 39, 1309 | "metadata": {}, 1310 | "outputs": [ 1311 | { 1312 | "data": { 1313 | "text/plain": [ 1314 | "['long-\\nterm']" 1315 | ] 1316 | }, 1317 | "execution_count": 39, 1318 | "metadata": {}, 1319 | "output_type": "execute_result" 1320 | } 1321 | ], 1322 | "source": [ 1323 | "text = \"\"\"long-\n", 1324 | "term\"\"\"\n", 1325 | "pattern = r'\\w+-\\n\\w+'\n", 1326 | "re.findall(pattern, text)" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": 40, 1332 | "metadata": {}, 1333 | "outputs": [ 1334 | { 1335 | "data": { 1336 | "text/plain": [ 1337 | "'long-term'" 1338 | ] 1339 | }, 1340 | "execution_count": 40, 1341 | "metadata": {}, 1342 | "output_type": "execute_result" 1343 | } 1344 | ], 1345 | "source": [ 1346 | "pattern = r'(\\w+-)(\\n)(\\w+)'\n", 1347 | "re.findall(pattern, text)\n", 1348 | "re.sub(pattern, r'\\1\\3', text)" 1349 | ] 1350 | }, 1351 | { 1352 | "cell_type": "markdown", 1353 | "metadata": {}, 1354 | "source": [ 1355 | "Check whether the hyphenated word is in the word corpus." 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "markdown", 1360 | "metadata": {}, 1361 | "source": [ 1362 | "**39. Read the Wikipedia entry on ** *Soundex* **. Implement this algorithm in Python.**" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "code", 1367 | "execution_count": 41, 1368 | "metadata": {}, 1369 | "outputs": [ 1370 | { 1371 | "data": { 1372 | "text/plain": [ 1373 | "'H555'" 1374 | ] 1375 | }, 1376 | "execution_count": 41, 1377 | "metadata": {}, 1378 | "output_type": "execute_result" 1379 | } 1380 | ], 1381 | "source": [ 1382 | "# https://en.wikipedia.org/wiki/Soundex\n", 1383 | "# cumbersome implementation...\n", 1384 | "def soundex(word):\n", 1385 | " word = word.upper() # convert the word to upper case for convenience\n", 1386 | " \n", 1387 | " # Step 1: Retain the first letter\n", 1388 | " sound = word[0]\n", 1389 | "\n", 1390 | " # Step 3: If two or more letters with the same number are adjacent \n", 1391 | " # in the original name (before step 1), only retain the first letter;\n", 1392 | " word = re.sub(r'([BFPV])[BFPV]', r'\\1', word) # \n", 1393 | " word = re.sub(r'([CGJKQSXZ])[CGJKQSXZ]', r'\\1', word)\n", 1394 | " word = re.sub(r'([DT])[DT]', r'\\1', word)\n", 1395 | " word = re.sub(r'LL', r'L', word)\n", 1396 | " word = re.sub(r'([MN])[MN]', r'\\1', word)\n", 1397 | " word = re.sub(r'RR', r'R', word)\n", 1398 | " \n", 1399 | " # Step 3: two letters with the same number separated by 'h' or 'w' are coded as a single number\n", 1400 | " word = re.sub(r'([BFPV])([HW])[BFPV]', r'\\1\\2', word)\n", 1401 | " word = re.sub(r'([CGJKQSXZ])([HW])[CGJKQSXZ]', r'\\1\\2', word)\n", 1402 | " word = re.sub(r'([DT])([HW])[DT]', r'\\1\\2', word)\n", 1403 | " word = re.sub(r'L([HW])L', r'L\\1', word)\n", 1404 | " word = re.sub(r'([MN])([HW])[MN]', r'\\1\\2', word)\n", 1405 | " word = re.sub(r'R([HW])R', r'R\\1', word)\n", 1406 | " \n", 1407 | " # Replace consonants with digits as follows (after the first letter)\n", 1408 | " word = re.sub(r'[AEIOUYHW]', r'', word)\n", 1409 | " word = re.sub(r'[BFPV]', '1', word)\n", 1410 | " word = re.sub(r'[CGJKQSXZ]', '2', word)\n", 1411 | " word = 
re.sub(r'[DT]', '3', word)\n", 1412 | " word = re.sub(r'L', '4', word)\n", 1413 | " word = re.sub(r'[MN]', '5', word)\n", 1414 | " word = re.sub(r'R', '6', word)\n", 1415 | " \n", 1416 | " # Step 4: If you have too few letters in your word that you can't assign three numbers, \n", 1417 | " # append with zeros until there are three numbers. If you have more than 3 letters, \n", 1418 | " # just retain the first 3 numbers.\n", 1419 | " if sound in 'AEIOUYHW':\n", 1420 | " sound = (sound + word + '000')[:4]\n", 1421 | " else:\n", 1422 | " sound = (sound + word[1:] + '000')[:4]\n", 1423 | " return sound\n", 1424 | "\n", 1425 | "soundex('Honeyman')" 1426 | ] 1427 | }, 1428 | { 1429 | "cell_type": "markdown", 1430 | "metadata": {}, 1431 | "source": [ 1432 | "**40. Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (** `nltk.corpus.abc` **). Use Punkt to perform sentence segmentation.**" 1433 | ] 1434 | }, 1435 | { 1436 | "cell_type": "code", 1437 | "execution_count": 42, 1438 | "metadata": {}, 1439 | "outputs": [ 1440 | { 1441 | "name": "stdout", 1442 | "output_type": "stream", 1443 | "text": [ 1444 | "10.66074843699441\n", 1445 | "10.703963706930097\n" 1446 | ] 1447 | } 1448 | ], 1449 | "source": [ 1450 | "# nltk.corpus.abc.fileids()\n", 1451 | "\n", 1452 | "def ari(fileid):\n", 1453 | " words = nltk.corpus.abc.words(fileids=fileid)\n", 1454 | " \n", 1455 | " text = nltk.corpus.abc.raw(fileids=fileid)\n", 1456 | " sents = nltk.sent_tokenize(text)\n", 1457 | " \n", 1458 | " word_number = len(words)\n", 1459 | " word_length = sum(len(w) for w in words)\n", 1460 | " miu_w = word_length / word_number\n", 1461 | "\n", 1462 | " sent_length = sum(len(s.split()) for s in sents)\n", 1463 | " sent_number = len(sents)\n", 1464 | " miu_s = sent_length / sent_number\n", 1465 | " \n", 1466 | " ari = 4.71 * miu_w + 0.5 * miu_s - 21.43\n", 1467 | " return ari\n", 1468 | "print(ari('rural.txt'))\n", 1469 | "print(ari('science.txt'))" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "markdown", 1474 | "metadata": {}, 1475 | "source": [ 1476 | "**41. Rewrite the following nested loop as a nested list comprehension:** \n", 1477 | "```Python\n", 1478 | ">>> words = ['attribution', 'confabulation', 'elocution',\n", 1479 | "... 'sequoia', 'tenacious', 'unidirectional']\n", 1480 | ">>> vsequences = set()\n", 1481 | ">>> for word in words:\n", 1482 | "... vowels = []\n", 1483 | "... for char in word:\n", 1484 | "... if char in 'aeiou':\n", 1485 | "... vowels.append(char)\n", 1486 | "... vsequences.add(''.join(vowels))\n", 1487 | ">>> sorted(vsequences)\n", 1488 | "['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']\n", 1489 | "```" 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": 43, 1495 | "metadata": {}, 1496 | "outputs": [ 1497 | { 1498 | "data": { 1499 | "text/plain": [ 1500 | "['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']" 1501 | ] 1502 | }, 1503 | "execution_count": 43, 1504 | "metadata": {}, 1505 | "output_type": "execute_result" 1506 | } 1507 | ], 1508 | "source": [ 1509 | "words = ['attribution', 'confabulation', 'elocution', 'sequoia', 'tenacious', 'unidirectional']\n", 1510 | "vsequences = [''.join(re.findall(r'[aeiou]', v)) for v in words]\n", 1511 | "sorted(vsequences)" 1512 | ] 1513 | }, 1514 | { 1515 | "cell_type": "markdown", 1516 | "metadata": {}, 1517 | "source": [ 1518 | "**42. 
Use WordNet to create a semantic index for a text collection. Extend the concordance search program in 3.6, indexing each word using the offset of its first synset, e.g. ** `wn.synsets('dog')[0].offset` ** (and optionally the offset of some of its ancestors in the hypernym hierarchy).**" 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": 44, 1524 | "metadata": {}, 1525 | "outputs": [ 1526 | { 1527 | "name": "stdout", 1528 | "output_type": "stream", 1529 | "text": [ 1530 | "[ King Arthur music stops ] ARTHUR : Old woman ! DENNIS : Man ! ARTHUR : Man . \n", 1531 | " I did say ' sorry ' about the ' old woman ', but from the behind you looked \n", 1532 | "ver going to be any progress with the -- WOMAN : Dennis , there ' s some lovely f\n", 1533 | " the Britons . Who ' s castle is that ? WOMAN : King of the who ? ARTHUR : The \n", 1534 | "King of the who ? ARTHUR : The Britons . WOMAN : Who are the Britons ? ARTHUR : W\n", 1535 | " are all Britons , and I am your king . WOMAN : I didn ' t know we had a \n", 1536 | "utocracy in which the working classes -- WOMAN : Oh , there you go , bringing cla\n", 1537 | "am in haste . Who lives in that castle ? WOMAN : No one live there . ARTHUR : The\n", 1538 | "there . ARTHUR : Then who is your lord ? WOMAN : We don ' t have a lord . \n", 1539 | " Be quiet ! I order you to be quiet ! WOMAN : Order , eh ? Who does he think \n", 1540 | " ? Heh . ARTHUR : I am your king ! WOMAN : Well , I didn ' t vote for \n", 1541 | " ARTHUR : You don ' t vote for kings . WOMAN : Well , how did you become king t\n", 1542 | "am your king ! DENNIS : Listen , strange women lying in ponds distributing swords\n", 1543 | "icalism is a way of preserving freedom . WOMAN : Oh , Dennis , forget about freed\n", 1544 | " : Are you saying ' ni ' to that old woman ? ARTHUR : Erm , yes . ROGER : \n" 1545 | ] 1546 | } 1547 | ], 1548 | "source": [ 1549 | "class IndexedText(object):\n", 1550 | "\n", 1551 | " def __init__(self, stemmer, text):\n", 1552 | " self._text = text\n", 1553 | " self._stemmer = stemmer\n", 1554 | " # self._index = nltk.Index((self._stem(word), i)\n", 1555 | " # for (i, word) in enumerate(text))\n", 1556 | " self._index = nltk.Index((wn.synsets(self._stem(word))[0].offset(), i)\n", 1557 | " for (i, word) in enumerate(text) \n", 1558 | " if wn.synsets(self._stem(word)) != []) # to avoid list index out of range\n", 1559 | " \n", 1560 | " # basic idea: use WordNet's offset as the word's key rather than the word itself\n", 1561 | " \n", 1562 | " def concordance(self, word, width=40):\n", 1563 | " key = wn.synsets(self._stem(word))[0].offset()\n", 1564 | " wc = int(width/4) # words of context\n", 1565 | " for i in self._index[key]:\n", 1566 | " lcontext = ' '.join(self._text[i-wc:i])\n", 1567 | " rcontext = ' '.join(self._text[i:i+wc])\n", 1568 | " ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)\n", 1569 | " rdisplay = '{:{width}}'.format(rcontext[:width], width=width)\n", 1570 | " print(ldisplay, rdisplay)\n", 1571 | "\n", 1572 | " def _stem(self, word):\n", 1573 | " return self._stemmer.stem(word).lower()\n", 1574 | " \n", 1575 | "porter = nltk.PorterStemmer()\n", 1576 | "grail = nltk.corpus.webtext.words('grail.txt')\n", 1577 | "text = IndexedText(porter, grail)\n", 1578 | "text.concordance('women')" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "markdown", 1583 | "metadata": {}, 1584 | "source": [ 1585 | "**43. 
With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (** `nltk.corpus.udhr` **), and NLTK's frequency distribution and rank correlation functionality (** `nltk.FreqDist`, `nltk.spearman_correlation` **), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.**" 1586 | ] 1587 | }, 1588 | { 1589 | "cell_type": "code", 1590 | "execution_count": 45, 1591 | "metadata": {}, 1592 | "outputs": [ 1593 | { 1594 | "name": "stdout", 1595 | "output_type": "stream", 1596 | "text": [ 1597 | "English-Latin1\n", 1598 | "French_Francais-Latin1\n", 1599 | "German_Deutsch-Latin1\n", 1600 | "Italian-Latin1\n", 1601 | "Spanish-Latin1\n" 1602 | ] 1603 | } 1604 | ], 1605 | "source": [ 1606 | "def guess_language(text):\n", 1607 | " candidate_language = ['English-Latin1', 'French_Francais-Latin1', \n", 1608 | " 'German_Deutsch-Latin1', 'Italian-Latin1', 'Spanish-Latin1']\n", 1609 | "\n", 1610 | " fdist = nltk.FreqDist(lang for lang in candidate_language\n", 1611 | " for w in text if w in nltk.corpus.udhr.words(lang))\n", 1612 | " return fdist\n", 1613 | "\n", 1614 | "# well, I just don't want to show the text in multiple lines since the words doesn't matter\n", 1615 | "text_english = \"Wikipedia is a project dedicated to the building of free encyclopedias in all languages of the world. The project started with the English-language Wikipedia on January 15, 2001. On March 23, 2001 it was joined by a French Wikipedia, and shortly afterwards by many other languages. Large efforts are underway to highlight the international nature of the project. On 20 September 2004 Wikipedia reached a total of 1,000,000 articles in over 100 languages.\".split()\n", 1616 | "text_french = \"Wikipédia Écouter est un projet d'encyclopédie universelle, multilingue, créé par Jimmy Wales et Larry Sanger le 15 janvier 2001 en wiki sous le nom de domaine wikipedia.org. Les versions des différentes langues utilisent le même logiciel de publication, MediaWiki, et ont la même apparence, mais elles comportent des variations dans leurs contenus, leurs structures et leurs modalités d'édition et de gestion.\".split()\n", 1617 | "text_german = \"Wikipedia ist ein am 15. Januar 2001 gegründetes gemeinnütziges Projekt zur Erstellung einer Enzyklopädie in zahlreichen Sprachen mit Hilfe des Wiki­prinzips. Gemäß Publikumsnachfrage und Verbreitung gehört Wikipedia unterdessen zu den Massenmedien. Aufgrund der für die Entstehung und Weiterentwicklung dieser Enzyklopädie charakteristischen kollaborativen Erstellungs-, Kontroll- und Aushandlungsprozesse der ehrenamtlichen Beteiligten zählt Wikipedia zugleich zu den Social Media.\".split()\n", 1618 | "text_italian = \"Wikipedia (pronuncia: vedi sotto) è un'enciclopedia online a contenuto libero, collaborativa, multilingue e gratuita, nata nel 2001, sostenuta e ospitata dalla Wikimedia Foundation, un'organizzazione non a scopo di lucro statunitense. Lanciata da Jimmy Wales e Larry Sanger il 15 gennaio 2001, inizialmente nell'edizione in lingua inglese, nei mesi successivi ha aggiunto edizioni in numerose altre lingue. Sanger ne suggerì il nome,[1] una parola macedonia nata dall'unione della radice wiki al suffisso pedia (da enciclopedia).\".split()\n", 1619 | "text_spanish = \"Wikipedia es una enciclopedia libre, políglota y editada de manera colaborativa. 
Es administrada por la Fundación Wikimedia, una organización sin ánimo de lucro cuya financiación está basada en donaciones. Sus más de 46 millones de artículos en 288 idiomas han sido redactados conjuntamente por voluntarios de todo el mundo, lo que hace un total de más de 2000 millones de ediciones, y prácticamente cualquier persona con acceso al proyecto6​ puede editarlos, salvo que la página se encuentre protegida contra vandalismos para evitar problemas y/o trifulcas.\".split()\n", 1620 | "\n", 1621 | "print(guess_language(text_english).max())\n", 1622 | "print(guess_language(text_french).max())\n", 1623 | "print(guess_language(text_german).max())\n", 1624 | "print(guess_language(text_italian).max())\n", 1625 | "print(guess_language(text_spanish).max())\n", 1626 | "\n", 1627 | "# I don't know how to use rank correlation functionality :(" 1628 | ] 1629 | }, 1630 | { 1631 | "cell_type": "markdown", 1632 | "metadata": {}, 1633 | "source": [ 1634 | "**44. Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)**" 1635 | ] 1636 | }, 1637 | { 1638 | "cell_type": "code", 1639 | "execution_count": 46, 1640 | "metadata": {}, 1641 | "outputs": [], 1642 | "source": [ 1643 | "def novel_sense(text):\n", 1644 | " for word in text:\n", 1645 | " all_synsets = wn.synsets(word)\n", 1646 | " context_synsets = []\n", 1647 | " for other_word in text:\n", 1648 | " for synset in all_synsets:\n", 1649 | " if other_word in synsest:\n", 1650 | " context_synsets.append(synset)\n", 1651 | " # after this I don't know what to do...\n", 1652 | " # for s1 in all_synsets:\n", 1653 | " # for s2 in context_synsets:\n", 1654 | " # s1.path_similarity(s2) ?" 1655 | ] 1656 | }, 1657 | { 1658 | "cell_type": "markdown", 1659 | "metadata": {}, 1660 | "source": [ 1661 | "**45. Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.**" 1662 | ] 1663 | }, 1664 | { 1665 | "cell_type": "code", 1666 | "execution_count": 47, 1667 | "metadata": { 1668 | "collapsed": true 1669 | }, 1670 | "outputs": [], 1671 | "source": [ 1672 | "# paper link:\n", 1673 | "# http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.200&rep=rep1&type=pdf\n", 1674 | "\n", 1675 | "# well, I don't have much time reading such a paper now." 
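(Referring back to exercise 43 above, which left the rank-correlation part open: a minimal sketch of one way to use it, assuming `spearman_correlation()` and `ranks_from_sequence()` from `nltk.metrics.spearman` — names taken from recent NLTK releases, not from this notebook. The idea is to rank characters by frequency in the unseen text and in each UDHR sample, then pick the language whose ranking correlates best with the text's ranking.)

```Python
# Sketch for exercise 43: language guessing via Spearman rank correlation of
# character-frequency rankings (characters are used so the two rankings always
# share items, unlike word rankings on short texts).
import nltk
from nltk.corpus import udhr
from nltk.metrics.spearman import spearman_correlation, ranks_from_sequence

def char_ranks(words):
    # (character, rank) pairs, most frequent character first
    fd = nltk.FreqDist(ch for w in words for ch in w.lower())
    return list(ranks_from_sequence(ch for ch, _ in fd.most_common()))

def guess_language_spearman(text):
    candidates = ['English-Latin1', 'French_Francais-Latin1',
                  'German_Deutsch-Latin1', 'Italian-Latin1', 'Spanish-Latin1']
    text_ranks = char_ranks(text)
    scored = [(spearman_correlation(char_ranks(udhr.words(lang)), text_ranks), lang)
              for lang in candidates]
    return max(scored)[1]  # language whose character ranking correlates best

# e.g. guess_language_spearman(text_german) is expected to return 'German_Deutsch-Latin1'
```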
1676 | ] 1677 | }, 1678 | { 1679 | "cell_type": "code", 1680 | "execution_count": null, 1681 | "metadata": { 1682 | "collapsed": true 1683 | }, 1684 | "outputs": [], 1685 | "source": [] 1686 | } 1687 | ], 1688 | "metadata": { 1689 | "kernelspec": { 1690 | "display_name": "Python 3", 1691 | "language": "python", 1692 | "name": "python3" 1693 | }, 1694 | "language_info": { 1695 | "codemirror_mode": { 1696 | "name": "ipython", 1697 | "version": 3 1698 | }, 1699 | "file_extension": ".py", 1700 | "mimetype": "text/x-python", 1701 | "name": "python", 1702 | "nbconvert_exporter": "python", 1703 | "pygments_lexer": "ipython3", 1704 | "version": "3.6.3" 1705 | } 1706 | }, 1707 | "nbformat": 4, 1708 | "nbformat_minor": 2 1709 | } 1710 | -------------------------------------------------------------------------------- /Chapter 4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 174, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 12 | "import re\n", 13 | "from nltk.corpus import wordnet as wn\n", 14 | "from operator import itemgetter\n", 15 | "from timeit import Timer" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "**1. Find out more about sequence objects using Python's help facility. In the interpreter, type ** `help(str)`, `help(list)` **, and ** `help(tuple)` **. This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscore; as the help documentation shows, each such function corresponds to something more familiar. For example ** `x.__getitem__(y)` ** is just a long-winded way of saying ** `x[y]` **.**" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 5, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "name": "stdout", 32 | "output_type": "stream", 33 | "text": [ 34 | "Help on class tuple in module builtins:\n", 35 | "\n", 36 | "class tuple(object)\n", 37 | " | tuple() -> empty tuple\n", 38 | " | tuple(iterable) -> tuple initialized from iterable's items\n", 39 | " | \n", 40 | " | If the argument is a tuple, the return value is the same object.\n", 41 | " | \n", 42 | " | Methods defined here:\n", 43 | " | \n", 44 | " | __add__(self, value, /)\n", 45 | " | Return self+value.\n", 46 | " | \n", 47 | " | __contains__(self, key, /)\n", 48 | " | Return key in self.\n", 49 | " | \n", 50 | " | __eq__(self, value, /)\n", 51 | " | Return self==value.\n", 52 | " | \n", 53 | " | __ge__(self, value, /)\n", 54 | " | Return self>=value.\n", 55 | " | \n", 56 | " | __getattribute__(self, name, /)\n", 57 | " | Return getattr(self, name).\n", 58 | " | \n", 59 | " | __getitem__(self, key, /)\n", 60 | " | Return self[key].\n", 61 | " | \n", 62 | " | __getnewargs__(...)\n", 63 | " | \n", 64 | " | __gt__(self, value, /)\n", 65 | " | Return self>value.\n", 66 | " | \n", 67 | " | __hash__(self, /)\n", 68 | " | Return hash(self).\n", 69 | " | \n", 70 | " | __iter__(self, /)\n", 71 | " | Implement iter(self).\n", 72 | " | \n", 73 | " | __le__(self, value, /)\n", 74 | " | Return self<=value.\n", 75 | " | \n", 76 | " | __len__(self, /)\n", 77 | " | Return len(self).\n", 78 | " | \n", 79 | " | __lt__(self, value, /)\n", 80 | " | Return self integer -- return number of occurrences of value\n", 99 | " | \n", 100 | " | index(...)\n", 101 | " | T.index(value, [start, [stop]]) -> integer -- return first index of value.\n", 
102 | " | Raises ValueError if the value is not present.\n", 103 | "\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "# help(str)\n", 109 | "# help(list)\n", 110 | "help(tuple)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "**2. Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.**" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 39, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "# https://docs.python.org/3/library/stdtypes.html#typesseq\n", 129 | "\n", 130 | "exp_list = ['natural', 'language', 'processing']\n", 131 | "exp_tuple = 'natural', 'language', 'processing'\n", 132 | "\n", 133 | "# Common operations\n", 134 | "'language' in exp_list # in\n", 135 | "'language' not in exp_tuple # not in\n", 136 | "\n", 137 | "exp_list[0] # subsciption\n", 138 | "exp_tuple[1:] # slicing\n", 139 | "\n", 140 | "len(exp_list) # length\n", 141 | "len(exp_tuple)\n", 142 | "\n", 143 | "min(exp_list) # smallest item\n", 144 | "max(exp_tuple) # largest item\n", 145 | "\n", 146 | "exp_list.index('language') # index of the first occurrence\n", 147 | "exp_tuple.index('processing', 1) # index of the first occurrence (at or after index 1)\n", 148 | "\n", 149 | "# List operations(mutable) that cannot be performed on tuples\n", 150 | "exp_list[0] = 'Natural'\n", 151 | "del exp_list[1]\n", 152 | "exp_list.append('understanding')\n", 153 | "exp_list.insert(1, 'Language')\n", 154 | "exp_list.pop()\n", 155 | "exp_list.remove('Natural')\n", 156 | "exp_list.clear()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 50, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "tuple_only = 'natural', ['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']\n", 168 | "# though it seems okay to create a list with ['natural', ['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']]" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "**3. Find out how to create a tuple consisting of a single item. There are at least two ways to do this.**" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 62, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "\n", 188 | "\n" 189 | ] 190 | } 191 | ], 192 | "source": [ 193 | "tuple1 = 'single',\n", 194 | "tuple2 = tuple(['single'])\n", 195 | "print(type(tuple1))\n", 196 | "print(type(tuple2))" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "**4. Create a list ** `words = ['is', 'NLP', 'fun', '?']` **. Use a series of assignment statements (e.g. ** `words[1] = words[2]` **) and a temporary variable ** `tmp` ** to transform this list into the list ** `['NLP', 'is', 'fun', '!']` **. 
Now do the same transformation using tuple assignment.**" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 64, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "['NLP', 'is', 'fun', '!']" 215 | ] 216 | }, 217 | "execution_count": 64, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "words = ['is', 'NLP', 'fun', '?']\n", 224 | "words[0], words[1] = words[1], words[0] # tmp is not necessary\n", 225 | "words[-1] = '!'\n", 226 | "words" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 70, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "['NLP', 'is', 'fun', '!']" 238 | ] 239 | }, 240 | "execution_count": 70, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "words_tuple = 'is', 'NLP', 'fun', '?'\n", 247 | "words_new = words_tuple[1], words_tuple[0], words_tuple[2], '!'\n", 248 | "list(words_new)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "**5. Read about the built-in comparison function ** `cmp` **, by typing ** `help(cmp)` **. How does it differ in behavior from the comparison operators?**" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "Well, `cmp()` compare the two objects *x* and *y* and return an integer according to the outcome. The return value is negative if `x < y`, zero if `x == y` and strictly positive if `x > y`. \n", 263 | "However, it was deprecated in Python3 and `cmp` is no longer a built-in comparison." 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "**6. Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: ** *n* = 1, **and ** *n* = `len(sent)` **?**" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 74, 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "data": { 280 | "text/plain": [ 281 | "[['The', 'dog', 'gave', 'John', 'the', 'newspaper']]" 282 | ] 283 | }, 284 | "execution_count": 74, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']\n", 291 | "n = len(sent) # n = 1\n", 292 | "[sent[i:i+n] for i in range(len(sent)-n+1)]\n", 293 | "# The two boundary cases both works." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "**7. We pointed out that when empty strings and empty lists occur in the condition part of an ** `if` ** clause, they evaluate to ** `False` **. In this case, they are said to be occurring in a Boolean context. 
Experiment with different kind of non-Boolean expressions in Boolean contexts, and see whether they evaluate as ** `True` ** or ** `False` **.**" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 78, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "name": "stdout", 310 | "output_type": "stream", 311 | "text": [ 312 | "True\n" 313 | ] 314 | } 315 | ], 316 | "source": [ 317 | "a = 1\n", 318 | "if a:\n", 319 | " print('True')\n", 320 | "else:\n", 321 | " print('False')" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "`False`, `None`, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true." 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "**8. Use the inequality operators to compare strings, e.g. ** `'Monty' < 'Python'` **. What happens when you do ** `'Z' < 'a'` **? Try pairs of strings which have a common prefix, e.g. ** `'Monty' < 'Montague'` **. Read up on \"lexicographical sort\" in order to understand what is going on here. Try comparing structured objects, e.g. ** `('Monty', 1) < ('Monty', 2)` **. Does this behave as expected?**" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 85, 341 | "metadata": {}, 342 | "outputs": [ 343 | { 344 | "name": "stdout", 345 | "output_type": "stream", 346 | "text": [ 347 | "True\n", 348 | "True\n", 349 | "False\n", 350 | "True\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "print('Monty' < 'Python')\n", 356 | "\n", 357 | "print('Z' < 'a') # 90('Z') and 97('a') in ASCII\n", 358 | "\n", 359 | "print('Monty' < 'Montague')\n", 360 | "# Lexicographical sort is introduced in Discrete Mathematics and Its Applications\n", 361 | "# https://en.wikipedia.org/wiki/Lexicographical_order\n", 362 | "\n", 363 | "print(('Monty', 1) < ('Monty', 2))" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "**9. Write code that removes whitespace at the beginning and end of a string, and normalizes whitespace between words to be a single space character.** \n", 371 | "1. **do this task using ** `split()` ** and ** `join()` \n", 372 | "2. **do this task using regular expression substitutions**" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 95, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/plain": [ 383 | "'this is a sample sentence.'" 384 | ] 385 | }, 386 | "execution_count": 95, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "s = ' this is \\n a sample\\t sentence. '\n", 393 | "' '.join(s.split())" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 97, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "data": { 403 | "text/plain": [ 404 | "'this is a sample sentence.'" 405 | ] 406 | }, 407 | "execution_count": 97, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "import re\n", 414 | "s_ = re.sub(r'^\\s|\\s$', '', s) # remove whitespace at the beginning and end of a string\n", 415 | "re.sub(r'\\s+', ' ', s_) # normalize whitespace between words" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "**10. Write a program to sort words by length. 
Define a helper function ** `cmp_len` ** which uses the ** `cmp` ** comparison function on word lengths.**" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 99, 428 | "metadata": {}, 429 | "outputs": [ 430 | { 431 | "data": { 432 | "text/plain": [ 433 | "['a', 'to', 'by', 'sort', 'words', 'length', 'program']" 434 | ] 435 | }, 436 | "execution_count": 99, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "word_list = 'a program to sort words by length'.split()\n", 443 | "sorted(word_list, key=len)" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": { 449 | "collapsed": true 450 | }, 451 | "source": [ 452 | "**11. Create a list of words and store it in a variable ** `sent1` **. Now assign ** `sent2 = sent1` **. Modify one of the items in ** `sent1` ** and verify that ** `sent2` ** has changed.** \n", 453 | "a. **Now try the same exercise but instead assign ** `sent2 = sent1[:]` **. Modify ** `sent1` ** again and see what happens to ** `sent2` **. Explain.** \n", 454 | "b. **Now define ** `text1` ** to be a list of lists of strings (e.g. to represent a text consisting of multiple sentences. Now assign ** `text2 = text1[:]` **, assign a new value to one of the words, e.g. ** `text1[1][1] = 'Monty'` **. Check what this did to ** `text2` **. Explain.** \n", 455 | "c. **Load Python's ** `deepcopy()` ** function (i.e. ** `from copy import deepcopy` **), consult its documentation, and test that it makes a fresh copy of any object.**" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": 2, 461 | "metadata": {}, 462 | "outputs": [ 463 | { 464 | "data": { 465 | "text/plain": [ 466 | "['a', 'list', 'of', 'words']" 467 | ] 468 | }, 469 | "execution_count": 2, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "sent1 = ['a', 'list', 'of', 'word']\n", 476 | "sent2 = sent1\n", 477 | "sent1[3] = 'words'\n", 478 | "sent2" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 5, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "data": { 488 | "text/plain": [ 489 | "['a', 'list', 'of', 'Word']" 490 | ] 491 | }, 492 | "execution_count": 5, 493 | "metadata": {}, 494 | "output_type": "execute_result" 495 | } 496 | ], 497 | "source": [ 498 | "sent2 = sent1[:]\n", 499 | "sent1[3] = 'Word'\n", 500 | "sent2\n", 501 | "\n", 502 | "# sent2 = sent1[:] makes a copy of each element of sent1.\n", 503 | "# Since the elements are type of string,\n", 504 | "# the copy is just copy by value.\n", 505 | "# So the modification of sent1 doesn't affect sent2" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 8, 511 | "metadata": {}, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "[['Hush,', 'little', 'baby,', \"don't\", 'say', 'a', 'word,'],\n", 517 | " [\"Papa's\", 'Monty', 'to', 'buy', 'you', 'a', 'mockingbird.'],\n", 518 | " ['If', 'that', 'mockingbird', \"won't\", 'sing,'],\n", 519 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'diamond', 'ring.'],\n", 520 | " ['If', 'that', 'diamond', 'ring', 'turns', 'to', 'brass,'],\n", 521 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'looking-glass.'],\n", 522 | " ['If', 'that', 'looking-glass', 'gets', 'broke,'],\n", 523 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'billy-goat.'],\n", 524 | " ['If', 'that', 'billy-goat', 'runs', 'away,'],\n", 525 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'another', 'today.']]" 526 | ] 
527 | }, 528 | "execution_count": 8, 529 | "metadata": {}, 530 | "output_type": "execute_result" 531 | } 532 | ], 533 | "source": [ 534 | "text1 = [\"Hush, little baby, don't say a word,\".split(),\n", 535 | " \"Papa's going to buy you a mockingbird.\".split(),\n", 536 | " \"If that mockingbird won't sing,\".split(),\n", 537 | " \"Papa's going to buy you a diamond ring.\".split(),\n", 538 | " \"If that diamond ring turns to brass,\".split(),\n", 539 | " \"Papa's going to buy you a looking-glass.\".split(),\n", 540 | " \"If that looking-glass gets broke,\".split(),\n", 541 | " \"Papa's going to buy you a billy-goat.\".split(),\n", 542 | " \"If that billy-goat runs away,\".split(),\n", 543 | " \"Papa's going to buy you another today.\".split()]\n", 544 | "text2 = text1[:]\n", 545 | "text1[1][1] = 'Monty'\n", 546 | "text2\n", 547 | "\n", 548 | "# text2 = text1[:] makes a copy of each element of sent1\n", 549 | "# The elements are lists and lists are objects.\n", 550 | "# Therefore, the copy is copy by reference.\n", 551 | "# The modification of text1 will affect text2 as well." 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 10, 557 | "metadata": {}, 558 | "outputs": [ 559 | { 560 | "data": { 561 | "text/plain": [ 562 | "[['Hush,', 'little', 'baby,', \"don't\", 'say', 'a', 'word,'],\n", 563 | " [\"Papa's\", 'Monty', 'to', 'buy', 'you', 'a', 'mockingbird.'],\n", 564 | " ['If', 'that', 'mockingbird', \"won't\", 'sing,'],\n", 565 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'diamond', 'ring.'],\n", 566 | " ['If', 'that', 'diamond', 'ring', 'turns', 'to', 'brass,'],\n", 567 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'looking-glass.'],\n", 568 | " ['If', 'that', 'looking-glass', 'gets', 'broke,'],\n", 569 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'a', 'billy-goat.'],\n", 570 | " ['If', 'that', 'billy-goat', 'runs', 'away,'],\n", 571 | " [\"Papa's\", 'going', 'to', 'buy', 'you', 'another', 'today.']]" 572 | ] 573 | }, 574 | "execution_count": 10, 575 | "metadata": {}, 576 | "output_type": "execute_result" 577 | } 578 | ], 579 | "source": [ 580 | "# https://docs.python.org/3/library/copy.html\n", 581 | "from copy import deepcopy\n", 582 | "text3 = deepcopy(text1)\n", 583 | "text1[1][1] = 'going'\n", 584 | "text3\n", 585 | "\n", 586 | "# the modification of original list won't affect the copy" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "**12. Initialize an n-by-m list of lists of empty strings using list multiplication, e.g. ** `word_table = [[''] * n] * m` **. What happens when you set one of its values, e.g. ** `word_table[1][2] = \"hello\"` **? Explain why this happens. 
Now write an expression using ** `range()` ** to construct a list of lists, and show that it does not have this problem.**" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 18, 599 | "metadata": {}, 600 | "outputs": [ 601 | { 602 | "data": { 603 | "text/plain": [ 604 | "[['', '', 'hello'], ['', '', 'hello'], ['', '', 'hello'], ['', '', 'hello']]" 605 | ] 606 | }, 607 | "execution_count": 18, 608 | "metadata": {}, 609 | "output_type": "execute_result" 610 | } 611 | ], 612 | "source": [ 613 | "m = 4\n", 614 | "n = 3\n", 615 | "word_table = [[''] * n] * m\n", 616 | "word_table[1][2] = \"hello\"\n", 617 | "word_table\n", 618 | "\n", 619 | "# Phenomenon:\n", 620 | "# Each row's third column's value is changed to \"hello\".\n", 621 | "\n", 622 | "# Explanation\n", 623 | "# word_table is a list(M) with m elements of type list(N).\n", 624 | "# The multiplication(*m) just copy the reference of N.\n", 625 | "# The modification of N will affect other references as well.\n", 626 | "# Since N's element is type of string, the copy is value-copy,\n", 627 | "# so the first two element of N is not influenced." 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": 24, 633 | "metadata": {}, 634 | "outputs": [ 635 | { 636 | "data": { 637 | "text/plain": [ 638 | "[['', '', ''], ['', '', 'hello'], ['', '', ''], ['', '', '']]" 639 | ] 640 | }, 641 | "execution_count": 24, 642 | "metadata": {}, 643 | "output_type": "execute_result" 644 | } 645 | ], 646 | "source": [ 647 | "new_table = [['' for _ in range(n)] for _ in range(m)]\n", 648 | "\n", 649 | "new_table[1][2] = \"hello\"\n", 650 | "new_table" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "**13. Write code to initialize a two-dimensional array of sets called ** `word_vowels` ** and process a list of words, adding each word to ** `word_vowels[l][v]` ** where ** `l` ** is the length of the word and ** `v` ** is the number of vowels it contains.**" 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": 39, 663 | "metadata": { 664 | "collapsed": true 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "n = 10\n", 669 | "word_vowels = [[set() for _ in range(n)] for _ in range(n)]\n", 670 | "\n", 671 | "word_list = 'Write code to initialize an array and process a list of words'.split()\n", 672 | "for word in word_list:\n", 673 | " l = len(word)\n", 674 | " v = sum(1 for letter in word.lower() if letter in 'aeiou')\n", 675 | " if l < n: # in case the length of word is larger than the array size n\n", 676 | " word_vowels[l][v].add(word)" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "**14. Write a function ** `novel10(text)` ** that prints any word that appeared in the last 10% of a text that had not been encountered earlier.**" 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 44, 689 | "metadata": { 690 | "collapsed": true 691 | }, 692 | "outputs": [], 693 | "source": [ 694 | "def novel10(text):\n", 695 | " novels = set()\n", 696 | " text_list = text.split()\n", 697 | " text_len = len(text_list)\n", 698 | " for word in text_list[int(0.9 * text_len):]:\n", 699 | " if word not in text_list[:int(0.9 * text_len)]:\n", 700 | " novels.add(word)\n", 701 | " for word in novels:\n", 702 | " print(word)" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "**15. 
Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.**" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": 60, 715 | "metadata": {}, 716 | "outputs": [ 717 | { 718 | "name": "stdout", 719 | "output_type": "stream", 720 | "text": [ 721 | "Get: 1\n", 722 | "Write: 1\n", 723 | "a: 3\n", 724 | "alphabetical: 1\n", 725 | "and: 2\n", 726 | "as: 1\n", 727 | "counts: 1\n", 728 | "each: 1\n", 729 | "expressed: 1\n", 730 | "frequency: 1\n", 731 | "in: 1\n", 732 | "it: 2\n", 733 | "line: 1\n", 734 | "one: 1\n", 735 | "order: 1\n", 736 | "out: 1\n", 737 | "per: 1\n", 738 | "print: 1\n", 739 | "program: 1\n", 740 | "sentence: 1\n", 741 | "single: 1\n", 742 | "splits: 1\n", 743 | "string: 1\n", 744 | "takes: 1\n", 745 | "that: 1\n", 746 | "the: 2\n", 747 | "to: 1\n", 748 | "up: 1\n", 749 | "word: 1\n", 750 | "word's: 1\n", 751 | "words: 1\n" 752 | ] 753 | } 754 | ], 755 | "source": [ 756 | "def split_and_word_freq(sent):\n", 757 | " # splits = re.findall(r'\\w+', sent)\n", 758 | " splits = re.findall(r\"\\w+(?:[-']\\w+)*\", sent) # dealing with phrases like it's, warm-hearted\n", 759 | "\n", 760 | " # create the word frequency using dictionary\n", 761 | " word_freq = {}\n", 762 | " for word in splits:\n", 763 | " if word in word_freq:\n", 764 | " word_freq[word] = word_freq[word] + 1\n", 765 | " else:\n", 766 | " word_freq[word] = 1\n", 767 | " \n", 768 | " # print word's frequency in alphabetical order\n", 769 | " for key in sorted(word_freq.keys()):\n", 770 | " print(key + ': ' + str(word_freq[key]))\n", 771 | " \n", 772 | "sent = \"Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.\"\n", 773 | "split_and_word_freq(sent)" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "**16. Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (http://en.wikipedia.org/wiki/Gematria, http://essenes.net/gemcal.htm).** \n", 781 | "a. **Write a function ** `gematria()` ** that sums the numerical values of the letters of a word, according to the letter values in ** `letter_vals` **:**\n", 782 | "```Python\n", 783 | ">>> letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,\n", 784 | "... 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,\n", 785 | "... 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}\n", 786 | "```\n", 787 | "b. **Process a corpus (e.g. ** `nltk.corpus.state_union` **) and for each document, count how many of its words have the number 666.** \n", 788 | "c. 
**Write a function ** `decode()` ** to process a text, randomly replacing words with their Gematria equivalents, in order to discover the \"hidden meaning\" of the text.**" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 73, 794 | "metadata": { 795 | "collapsed": true 796 | }, 797 | "outputs": [], 798 | "source": [ 799 | "def gematria(word):\n", 800 | " value = 0\n", 801 | " letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,\n", 802 | " 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,\n", 803 | " 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}\n", 804 | " for c in word:\n", 805 | " if c in letter_vals:\n", 806 | " value += letter_vals[c]\n", 807 | " return value" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 74, 813 | "metadata": {}, 814 | "outputs": [ 815 | { 816 | "name": "stdout", 817 | "output_type": "stream", 818 | "text": [ 819 | "1945-Truman.txt 2\n", 820 | "1946-Truman.txt 13\n", 821 | "1947-Truman.txt 0\n", 822 | "1948-Truman.txt 2\n", 823 | "1949-Truman.txt 2\n", 824 | "1950-Truman.txt 1\n", 825 | "1951-Truman.txt 0\n", 826 | "1953-Eisenhower.txt 1\n", 827 | "1954-Eisenhower.txt 6\n", 828 | "1955-Eisenhower.txt 3\n", 829 | "1956-Eisenhower.txt 1\n", 830 | "1957-Eisenhower.txt 2\n", 831 | "1958-Eisenhower.txt 5\n", 832 | "1959-Eisenhower.txt 1\n", 833 | "1960-Eisenhower.txt 5\n", 834 | "1961-Kennedy.txt 0\n", 835 | "1962-Kennedy.txt 11\n", 836 | "1963-Johnson.txt 0\n", 837 | "1963-Kennedy.txt 5\n", 838 | "1964-Johnson.txt 1\n", 839 | "1965-Johnson-1.txt 0\n", 840 | "1965-Johnson-2.txt 0\n", 841 | "1966-Johnson.txt 0\n", 842 | "1967-Johnson.txt 2\n", 843 | "1968-Johnson.txt 3\n", 844 | "1969-Johnson.txt 0\n", 845 | "1970-Nixon.txt 0\n", 846 | "1971-Nixon.txt 1\n", 847 | "1972-Nixon.txt 0\n", 848 | "1973-Nixon.txt 1\n", 849 | "1974-Nixon.txt 0\n", 850 | "1975-Ford.txt 0\n", 851 | "1976-Ford.txt 3\n", 852 | "1977-Ford.txt 0\n", 853 | "1978-Carter.txt 1\n", 854 | "1979-Carter.txt 2\n", 855 | "1980-Carter.txt 0\n", 856 | "1981-Reagan.txt 4\n", 857 | "1982-Reagan.txt 0\n", 858 | "1983-Reagan.txt 2\n", 859 | "1984-Reagan.txt 1\n", 860 | "1985-Reagan.txt 1\n", 861 | "1986-Reagan.txt 1\n", 862 | "1987-Reagan.txt 1\n", 863 | "1988-Reagan.txt 2\n", 864 | "1989-Bush.txt 1\n", 865 | "1990-Bush.txt 2\n", 866 | "1991-Bush-1.txt 0\n", 867 | "1991-Bush-2.txt 0\n", 868 | "1992-Bush.txt 3\n", 869 | "1993-Clinton.txt 1\n", 870 | "1994-Clinton.txt 2\n", 871 | "1995-Clinton.txt 1\n", 872 | "1996-Clinton.txt 2\n", 873 | "1997-Clinton.txt 1\n", 874 | "1998-Clinton.txt 4\n", 875 | "1999-Clinton.txt 1\n", 876 | "2000-Clinton.txt 3\n", 877 | "2001-GWBush-1.txt 1\n", 878 | "2001-GWBush-2.txt 0\n", 879 | "2002-GWBush.txt 0\n", 880 | "2003-GWBush.txt 3\n", 881 | "2004-GWBush.txt 2\n", 882 | "2005-GWBush.txt 2\n", 883 | "2006-GWBush.txt 0\n" 884 | ] 885 | } 886 | ], 887 | "source": [ 888 | "for fileid in nltk.corpus.state_union.fileids():\n", 889 | " cnt = 0\n", 890 | " for word in nltk.corpus.state_union.words(fileid):\n", 891 | " if word.isalpha() and gematria(word.lower()) == 666:\n", 892 | " cnt += 1\n", 893 | " print(fileid, cnt)" 894 | ] 895 | }, 896 | { 897 | "cell_type": "code", 898 | "execution_count": 75, 899 | "metadata": { 900 | "collapsed": true 901 | }, 902 | "outputs": [], 903 | "source": [ 904 | "# not tested\n", 905 | "def decode(text):\n", 906 | " gematrias = {}\n", 907 | " splits = text.split()\n", 908 | "\n", 909 | " # create dictionary \n", 910 | " # key: 
gematria value\n", 911 | " # value: list of words with that gematria value\n", 912 | " for word in splits:\n", 913 | " if gematria(word) in gematrias:\n", 914 | " gematrias[gematria(word)].add(word)\n", 915 | " else:\n", 916 | " gematrias[gematria(word)] = [word]\n", 917 | " \n", 918 | " for i in range(len(splits)):\n", 919 | " splits[i] = random.choice(gematrias[gematria(word)])\n", 920 | " \n", 921 | " return ' '.join(splits)" 922 | ] 923 | }, 924 | { 925 | "cell_type": "markdown", 926 | "metadata": {}, 927 | "source": [ 928 | "**17. Write a function ** `shorten(text, n)` ** to process a text, omitting the ** *n* ** most frequently occurring words of the text. How readable is it?**" 929 | ] 930 | }, 931 | { 932 | "cell_type": "code", 933 | "execution_count": 26, 934 | "metadata": {}, 935 | "outputs": [ 936 | { 937 | "data": { 938 | "text/plain": [ 939 | "'Write function shorten to process omitting most frequently occurring words of How readable is it'" 940 | ] 941 | }, 942 | "execution_count": 26, 943 | "metadata": {}, 944 | "output_type": "execute_result" 945 | } 946 | ], 947 | "source": [ 948 | "def shorten(text, n):\n", 949 | " splits = re.findall(r\"\\w+(?:[-']\\w+)*\", text)\n", 950 | " fdist = nltk.FreqDist(splits)\n", 951 | " # most_freq = fdist.most_common(n)\n", 952 | " # print(most_freq)\n", 953 | " most_freq = []\n", 954 | " for sample in fdist.most_common(n):\n", 955 | " most_freq.append(sample[0])\n", 956 | " for i in range(len(splits)):\n", 957 | " if splits[i] in most_freq:\n", 958 | " splits[i] = '' # in fact, the element should be deleted in the list\n", 959 | " return ' '.join(splits)\n", 960 | "\n", 961 | "text = 'Write a function shorten(text, n) to process a text, omitting the n most frequently occurring words of the text. How readable is it?'\n", 962 | "shorten(text, 4)" 963 | ] 964 | }, 965 | { 966 | "cell_type": "markdown", 967 | "metadata": { 968 | "collapsed": true 969 | }, 970 | "source": [ 971 | "**18. Write code to print out an index for a lexicon, allowing someone to look up words according to their meanings (or pronunciations; whatever properties are contained in lexical entries).**" 972 | ] 973 | }, 974 | { 975 | "cell_type": "markdown", 976 | "metadata": {}, 977 | "source": [ 978 | "Not understand yet... Just exchange the key and value in nldk.corpus.cmudict?" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "**19. Write a list comprehension that sorts a list of WordNet synsets for proximity to a given synset. 
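(A sketch for exercise 18 above, before continuing with exercise 19 — assuming the CMU Pronouncing Dictionary as the lexicon, the "index" can simply invert the word-to-pronunciation mapping so that entries are looked up by pronunciation rather than by headword.)

```Python
# Sketch for exercise 18: an index into a lexicon keyed on a lexical property
# (here the pronunciation), so entries can be looked up by that property.
import nltk
from nltk.corpus import cmudict

entries = cmudict.entries()                            # (word, [phonemes]) pairs
pron_index = nltk.Index((tuple(pron), word) for word, pron in entries)

some_pron = tuple(entries[0][1])                       # pronunciation of the first entry
print(some_pron, '->', pron_index[some_pron])          # all words sharing that pronunciation
```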
For example, given the synsets ** `minke_whale.n.01, orca.n.01, novel.n.01` **, and ** `tortoise.n.01` **, sort them according to their ** `shortest_path_distance()` ** from ** `right_whale.n.01` **.**" 986 | ] 987 | }, 988 | { 989 | "cell_type": "code", 990 | "execution_count": 44, 991 | "metadata": {}, 992 | "outputs": [ 993 | { 994 | "data": { 995 | "text/plain": [ 996 | "[Synset('lesser_rorqual.n.01'),\n", 997 | " Synset('killer_whale.n.01'),\n", 998 | " Synset('tortoise.n.01'),\n", 999 | " Synset('novel.n.01')]" 1000 | ] 1001 | }, 1002 | "execution_count": 44, 1003 | "metadata": {}, 1004 | "output_type": "execute_result" 1005 | } 1006 | ], 1007 | "source": [ 1008 | "minke = wn.synset('minke_whale.n.01')\n", 1009 | "orca = wn.synset('orca.n.01')\n", 1010 | "novel = wn.synset('novel.n.01')\n", 1011 | "tortoise = wn.synset('tortoise.n.01')\n", 1012 | "right = wn.synset('right_whale.n.01')\n", 1013 | "\n", 1014 | "wn_list = [minke, orca, novel, tortoise]\n", 1015 | "sorted(wn_list,\n", 1016 | " key=lambda x: right.shortest_path_distance(x))" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": { 1022 | "collapsed": true 1023 | }, 1024 | "source": [ 1025 | "**20. Write a function that takes a list of words (containing duplicates) and returns a list of words (with no duplicates) sorted by decreasing frequency. E.g. if the input list contained 10 instances of the word ** `table` ** and 9 instances of the word ** `chair` **, then ** `table` ** would appear before ** `chair` ** in the output list.**" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": 18, 1031 | "metadata": { 1032 | "collapsed": true 1033 | }, 1034 | "outputs": [], 1035 | "source": [ 1036 | "def decreasing_freq_with_no_duplicates(words):\n", 1037 | " wordset = set(words)\n", 1038 | " fdist = nltk.FreqDist(words)\n", 1039 | " return sorted(wordset,\n", 1040 | " key=lambda x:fdist[x],\n", 1041 | " reverse=True)" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": 19, 1047 | "metadata": {}, 1048 | "outputs": [ 1049 | { 1050 | "data": { 1051 | "text/plain": [ 1052 | "['table', 'chair']" 1053 | ] 1054 | }, 1055 | "execution_count": 19, 1056 | "metadata": {}, 1057 | "output_type": "execute_result" 1058 | } 1059 | ], 1060 | "source": [ 1061 | "words = ['table'] * 10 + ['chair'] * 9\n", 1062 | "decreasing_freq_with_no_duplicates(words)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "markdown", 1067 | "metadata": {}, 1068 | "source": [ 1069 | "**21. Write a function that takes a text and a vocabulary as its arguments and returns the set of words that appear in the text but not in the vocabulary. Both arguments can be represented as lists of strings. 
Can you do this in a single line, using ** `set.difference()` **?**" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": 20, 1075 | "metadata": { 1076 | "collapsed": true 1077 | }, 1078 | "outputs": [], 1079 | "source": [ 1080 | "def diff(text, vocab):\n", 1081 | " return set(text).difference(vocab)" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": 21, 1087 | "metadata": {}, 1088 | "outputs": [ 1089 | { 1090 | "data": { 1091 | "text/plain": [ 1092 | "{'and', 'text'}" 1093 | ] 1094 | }, 1095 | "execution_count": 21, 1096 | "metadata": {}, 1097 | "output_type": "execute_result" 1098 | } 1099 | ], 1100 | "source": [ 1101 | "text = 'a text and a vocabulary'.split()\n", 1102 | "vocab = 'a vocabulary'.split()\n", 1103 | "diff(text, vocab)" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "**22. Import the ** `itemgetter()` ** function from the operator module in Python's standard library (i.e. ** `from operator import itemgetter` **). Create a list words containing several words. Now try calling: ** `sorted(words, key=itemgetter(1))` **, and ** `sorted(words, key=itemgetter(-1))` **. Explain what ** `itemgetter()` ** is doing.**" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": 28, 1116 | "metadata": {}, 1117 | "outputs": [ 1118 | { 1119 | "data": { 1120 | "text/plain": [ 1121 | "['wOrds', 'several', 'list', 'containing', 'words']" 1122 | ] 1123 | }, 1124 | "execution_count": 28, 1125 | "metadata": {}, 1126 | "output_type": "execute_result" 1127 | } 1128 | ], 1129 | "source": [ 1130 | "words = 'list wOrds containing several words'.split()\n", 1131 | "sorted(words, key=itemgetter(1))" 1132 | ] 1133 | }, 1134 | { 1135 | "cell_type": "markdown", 1136 | "metadata": {}, 1137 | "source": [ 1138 | "[operator.itemgetter()](https://docs.python.org/3/library/operator.html#operator.itemgetter) \n", 1139 | "`operator.itemgetter()` returns a callable object that fetches item from its operand using the operand’s `__getitem__()` method. If multiple items are specified, returns a tuple of lookup values. For example:\n", 1140 | "\n", 1141 | "After `f = itemgetter(2)`, the call `f(r)` returns `r[2]`. \n", 1142 | "After `g = itemgetter(2, 5, 3)`, the call `g(r)` returns `(r[2], r[5], r[3])`." 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "markdown", 1147 | "metadata": {}, 1148 | "source": [ 1149 | "**23. Write a recursive function ** `lookup(trie, key)` ** that looks up a key in a trie, and returns the value it finds. Extend the function to return a word when it is uniquely determined by its prefix (e.g. 
** `vanguard` ** is the only word that starts with ** `vang-` **, so ** `lookup(trie, 'vang')` ** should return the same thing as ** `lookup(trie, 'vanguard'))` **.**" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "code", 1154 | "execution_count": 58, 1155 | "metadata": { 1156 | "collapsed": true 1157 | }, 1158 | "outputs": [], 1159 | "source": [ 1160 | "def insert(trie, key, value):\n", 1161 | " if key:\n", 1162 | " first, rest = key[0], key[1:]\n", 1163 | " if first not in trie:\n", 1164 | " trie[first] = {}\n", 1165 | " insert(trie[first], rest, value)\n", 1166 | " else:\n", 1167 | " trie['value'] = value" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": 92, 1173 | "metadata": {}, 1174 | "outputs": [ 1175 | { 1176 | "name": "stdout", 1177 | "output_type": "stream", 1178 | "text": [ 1179 | "{'c': {'h': {'a': {'i': {'r': {'value': 'flesh'}},\n", 1180 | " 't': {'value': 'cat'}},\n", 1181 | " 'i': {'c': {'value': 'stylish'},\n", 1182 | " 'e': {'n': {'value': 'dog'}}}}}}\n" 1183 | ] 1184 | } 1185 | ], 1186 | "source": [ 1187 | "trie = {}\n", 1188 | "insert(trie, 'chat', 'cat')\n", 1189 | "insert(trie, 'chien', 'dog')\n", 1190 | "insert(trie, 'chair', 'flesh')\n", 1191 | "insert(trie, 'chic', 'stylish')\n", 1192 | "pprint.pprint(trie, width=40)" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": 95, 1198 | "metadata": { 1199 | "collapsed": true 1200 | }, 1201 | "outputs": [], 1202 | "source": [ 1203 | "# original edition\n", 1204 | "# must type the word completely\n", 1205 | "def lookup(trie, key):\n", 1206 | " if key:\n", 1207 | " return lookup(trie[key[0]], key[1:])\n", 1208 | " else:\n", 1209 | " return trie['value']" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": 97, 1215 | "metadata": {}, 1216 | "outputs": [ 1217 | { 1218 | "data": { 1219 | "text/plain": [ 1220 | "'flesh'" 1221 | ] 1222 | }, 1223 | "execution_count": 97, 1224 | "metadata": {}, 1225 | "output_type": "execute_result" 1226 | } 1227 | ], 1228 | "source": [ 1229 | "lookup(trie, 'chair')" 1230 | ] 1231 | }, 1232 | { 1233 | "cell_type": "code", 1234 | "execution_count": 117, 1235 | "metadata": { 1236 | "collapsed": true 1237 | }, 1238 | "outputs": [], 1239 | "source": [ 1240 | "# modification edition\n", 1241 | "# can look up a word when it is uniquely determined by its prefix\n", 1242 | "# (though the code is not elegant TAT\n", 1243 | "def lookup1(trie, key):\n", 1244 | " if key:\n", 1245 | " return lookup1(trie[key[0]], key[1:])\n", 1246 | " else:\n", 1247 | " if 'value' in trie:\n", 1248 | " return trie['value']\n", 1249 | " if len(trie) == 1:\n", 1250 | " return lookup1(trie[list(trie)[0]], '')\n", 1251 | " else:\n", 1252 | " print('Invalid look up')" 1253 | ] 1254 | }, 1255 | { 1256 | "cell_type": "code", 1257 | "execution_count": 127, 1258 | "metadata": {}, 1259 | "outputs": [ 1260 | { 1261 | "data": { 1262 | "text/plain": [ 1263 | "True" 1264 | ] 1265 | }, 1266 | "execution_count": 127, 1267 | "metadata": {}, 1268 | "output_type": "execute_result" 1269 | } 1270 | ], 1271 | "source": [ 1272 | "insert(trie, 'vanguard', 'pioneer')\n", 1273 | "lookup1(trie, 'vang') == lookup1(trie, 'vanguard')" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "markdown", 1278 | "metadata": {}, 1279 | "source": [ 1280 | "**24. Read up on \"keyword linkage\" (chapter 5 of (Scott & Tribble, 2006)). 
Extract keywords from NLTK's Shakespeare Corpus and using the NetworkX package, plot keyword linkage networks.**" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "Scott & Tribble, 2006 is not available. And I haven't instaled NetworkX neither." 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "markdown", 1292 | "metadata": {}, 1293 | "source": [ 1294 | "**25. Read about string edit distance and the Levenshtein Algorithm. Try the implementation provided in nltk.edit_distance(). In what way is this using dynamic programming? Does it use the bottom-up or top-down approach? [See also http://norvig.com/spell-correct.html]**" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "The blog is to be read." 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "markdown", 1306 | "metadata": { 1307 | "collapsed": true 1308 | }, 1309 | "source": [ 1310 | "**26. The Catalan numbers arise in many applications of combinatorial mathematics, including the counting of parse trees (6). The series can be defined as follows: ** $C_0=1$ **, and ** $C_{n+1}=\\sum_{0..n}(C_iC_{n-i})$. \n", 1311 | "a. **Write a recursive function to compute ** *n* **th Catalan number** $C_n$ \n", 1312 | "b. **Now write another function that does this computation using dynamic programming.** \n", 1313 | "c. **Use the ** `timeit` ** module to compare the performance of these functions as ** *n* ** increases.**" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": 162, 1319 | "metadata": { 1320 | "collapsed": true 1321 | }, 1322 | "outputs": [], 1323 | "source": [ 1324 | "def catalan_recursive(n):\n", 1325 | " if n == 0 or n == 1:\n", 1326 | " return 1\n", 1327 | " else:\n", 1328 | " c = 0\n", 1329 | " for i in range(n):\n", 1330 | " c += catalan_recursive(i) * catalan_recursive(n - i - 1)\n", 1331 | " return c" 1332 | ] 1333 | }, 1334 | { 1335 | "cell_type": "code", 1336 | "execution_count": 163, 1337 | "metadata": {}, 1338 | "outputs": [ 1339 | { 1340 | "data": { 1341 | "text/plain": [ 1342 | "42" 1343 | ] 1344 | }, 1345 | "execution_count": 163, 1346 | "metadata": {}, 1347 | "output_type": "execute_result" 1348 | } 1349 | ], 1350 | "source": [ 1351 | "catalan_recursive(5)" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "code", 1356 | "execution_count": 164, 1357 | "metadata": { 1358 | "collapsed": true 1359 | }, 1360 | "outputs": [], 1361 | "source": [ 1362 | "def catalan_dp(n, lookup={0:1, 1:1}):\n", 1363 | " if n not in lookup:\n", 1364 | " c = 0\n", 1365 | " for i in range(n):\n", 1366 | " c += catalan_dp(i) * catalan_dp(n - i - 1)\n", 1367 | " lookup[n] = c\n", 1368 | " return lookup[n]" 1369 | ] 1370 | }, 1371 | { 1372 | "cell_type": "code", 1373 | "execution_count": 169, 1374 | "metadata": {}, 1375 | "outputs": [ 1376 | { 1377 | "data": { 1378 | "text/plain": [ 1379 | "42" 1380 | ] 1381 | }, 1382 | "execution_count": 169, 1383 | "metadata": {}, 1384 | "output_type": "execute_result" 1385 | } 1386 | ], 1387 | "source": [ 1388 | "catalan_dp(5)" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": 177, 1394 | "metadata": {}, 1395 | "outputs": [ 1396 | { 1397 | "name": "stdout", 1398 | "output_type": "stream", 1399 | "text": [ 1400 | "i = 0\n", 1401 | "Recursive: 0.002184530982049182\n", 1402 | "Dynamic Programming 0.0023424150131177157\n", 1403 | "i = 1\n", 1404 | "Recursive: 0.0020310399995651096\n", 1405 | "Dynamic Programming 0.002242524002213031\n", 1406 | 
"i = 2\n", 1407 | "Recursive: 0.01463045398122631\n", 1408 | "Dynamic Programming 0.002049253002041951\n", 1409 | "i = 3\n", 1410 | "Recursive: 0.04760096099926159\n", 1411 | "Dynamic Programming 0.0019739810086321086\n", 1412 | "i = 4\n", 1413 | "Recursive: 0.12554282997734845\n", 1414 | "Dynamic Programming 0.001775078009814024\n", 1415 | "i = 5\n", 1416 | "Recursive: 0.4283594549924601\n", 1417 | "Dynamic Programming 0.001807728986022994\n", 1418 | "i = 6\n", 1419 | "Recursive: 1.2319063549803104\n", 1420 | "Dynamic Programming 0.0018030449864454567\n", 1421 | "i = 7\n", 1422 | "Recursive: 3.4462423990189563\n", 1423 | "Dynamic Programming 0.001796147000277415\n", 1424 | "i = 8\n", 1425 | "Recursive: 10.815954695019173\n", 1426 | "Dynamic Programming 0.001826263003749773\n", 1427 | "i = 9\n", 1428 | "Recursive: 30.390052033006214\n", 1429 | "Dynamic Programming 0.0017969440086744726\n" 1430 | ] 1431 | } 1432 | ], 1433 | "source": [ 1434 | "for i in range(10):\n", 1435 | " print('i =', i)\n", 1436 | " print(\"Recursive:\", Timer('catalan_recursive(n)', 'n = i', globals=globals()).timeit(10000))\n", 1437 | " print(\"Dynamic Programming\", Timer('catalan_dp(n)', 'n = i', globals=globals()).timeit(10000))" 1438 | ] 1439 | }, 1440 | { 1441 | "cell_type": "code", 1442 | "execution_count": null, 1443 | "metadata": { 1444 | "collapsed": true 1445 | }, 1446 | "outputs": [], 1447 | "source": [] 1448 | } 1449 | ], 1450 | "metadata": { 1451 | "kernelspec": { 1452 | "display_name": "Python 3", 1453 | "language": "python", 1454 | "name": "python3" 1455 | }, 1456 | "language_info": { 1457 | "codemirror_mode": { 1458 | "name": "ipython", 1459 | "version": 3 1460 | }, 1461 | "file_extension": ".py", 1462 | "mimetype": "text/x-python", 1463 | "name": "python", 1464 | "nbconvert_exporter": "python", 1465 | "pygments_lexer": "ipython3", 1466 | "version": "3.6.3" 1467 | } 1468 | }, 1469 | "nbformat": 4, 1470 | "nbformat_minor": 2 1471 | } 1472 | -------------------------------------------------------------------------------- /Chapter 5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 12 | "import datetime\n", 13 | "from nltk.corpus import brown" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "**1. Search the web for \"spoof newspaper headlines\", to find such gems as: ** *British Left Waffles on Falkland Islands* **, and ** *Juvenile Court to Try Shooting Defendant* **. 
Manually tag these headlines to see if knowledge of the part-of-speech tags removes the ambiguity.**" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/plain": [ 31 | "[('Juvenile', 'NOUN'),\n", 32 | " ('Court', 'NOUN'),\n", 33 | " ('to', 'PRT'),\n", 34 | " ('Try', 'VERB'),\n", 35 | " ('Shooting', 'ADJ'),\n", 36 | " ('Defendant', 'NOUN')]" 37 | ] 38 | }, 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "# here TRY means to examine evidence in court and decide whether sb is innocent or guilty\n", 46 | "headline = 'Juvenile/NOUN Court/NOUN to/PRT Try/VERB Shooting/ADJ Defendant/NOUN'\n", 47 | "[nltk.tag.str2tuple(t) for t in headline.split()]" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "**2. Working with someone else, take turns to pick a word that can be either a noun or a verb (e.g. ** *contest* **); the opponent has to predict which one is likely to be the most frequent in the Brown corpus; check the opponent's prediction, and tally the score over several turns.**" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Omitted." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "**3. Tokenize and tag the following sentence: ** *They wind back the clock, while we chase after the wind.* ** What different pronunciations and parts of speech are involved?**" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "data": { 78 | "text/plain": [ 79 | "[('They', 'PRP'),\n", 80 | " ('wind', 'VBP'),\n", 81 | " ('back', 'RB'),\n", 82 | " ('the', 'DT'),\n", 83 | " ('clock', 'NN'),\n", 84 | " (',', ','),\n", 85 | " ('while', 'IN'),\n", 86 | " ('we', 'PRP'),\n", 87 | " ('chase', 'VBP'),\n", 88 | " ('after', 'IN'),\n", 89 | " ('the', 'DT'),\n", 90 | " ('wind', 'NN'),\n", 91 | " ('.', '.')]" 92 | ] 93 | }, 94 | "execution_count": 4, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "sent = 'They wind back the clock, while we chase after the wind.'\n", 101 | "nltk.pos_tag(nltk.word_tokenize(sent))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "**4. Review the mappings in 3.1. Discuss any other examples of mappings you can think of. What type of information do they map from and to?**" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "| More Linguistic Object | Maps From | Maps To |\n", 116 | "| --- | --- | --- |\n", 117 | "| Word Frequency | Word | Number of occurrences in a text |\n", 118 | "| Word Prounciation | Word | List of the word's prounciation |\n", 119 | "| Abbreviation | Acronym | List of the full name |" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "**5. Using the Python interpreter in interactive mode, experiment with the dictionary examples in this chapter. Create a dictionary ** `d` **, and add some entries. What happens if you try to access a non-existent entry, e.g. 
** `d['xyz']` **?** " 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "```\n", 134 | "Traceback (most recent call last):\n", 135 | " File \"\", line 1, in \n", 136 | "KeyError: 'xyz'\n", 137 | "```" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "**6. Try deleting an element from a dictionary d, using the syntax ** `del d['abc']` **. Check that the item was deleted.**" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Omitted." 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "**7. Create two dictionaries, ** `d1` ** and ** `d2` **, and add some entries to each. Now issue the command ** `d1.update(d2)` **. What did this do? What might it be useful for?**" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "{'hello': 1, 'language': 4, 'natural': 3, 'processing': 5, 'world': 2}" 170 | ] 171 | }, 172 | "execution_count": 6, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "d1 = {'hello': 1, 'world': 2, 'natural': 0}\n", 179 | "d2 = {'natural': 3, 'language': 4, 'processing': 5}\n", 180 | "d1.update(d2)\n", 181 | "d1" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "Update the dictionary with the key/value pairs from other, overwriting existing keys. \n", 189 | "Useful when merging two dictionaries." 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "**8. Create a dictionary ** `e` **, to represent a single lexical entry for some word of your choice. Define keys like ** `headword, part-of-speech, sense, ` ** and ** `example` **, and assign them suitable values.**" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 7, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "e = {}\n", 208 | "e['headword'] = ['NOUN', 'a word or term placed at the beginning (as of a chapter or an entry in an encyclopedia)']\n", 209 | "e['part-of-speech'] = ['PHRASE', 'a traditional class of words distinguished according to the kind of idea denoted and the function performed in a sentence']\n", 210 | "e['sense'] = ['NOUN', 'a meaning conveyed or intended']\n", 211 | "e['example'] = ['NOUN', 'one that serves as a pattern to be imitated or not to be imitated']" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "collapsed": true 218 | }, 219 | "source": [ 220 | "**9. Satisfy yourself that there are restrictions on the distribution of ** *go* ** and ** *went* **, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d) in 7.**" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "'We *went* on the excursion.' means the tense is past. \n", 228 | "'We *go* on the excursion.' well, is it unlikely used in daily life?" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "**10. Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. 
Why not?**" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 6, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "[('hello', None),\n", 247 | " ('world', 'NN'),\n", 248 | " ('natural', 'JJ'),\n", 249 | " ('language', 'NN'),\n", 250 | " ('processing', 'NN')]" 251 | ] 252 | }, 253 | "execution_count": 6, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "brown_tagged_sents = brown.tagged_sents(categories='news')\n", 260 | "unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)\n", 261 | "\n", 262 | "test_text = ['hello', 'world', 'natural', 'language', 'processing']\n", 263 | "unigram_tagger.tag(test_text)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "The words doesn't appear in the training text, and therefore the tagger can't speculate the word's tag." 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "**11. Learn about the affix tagger (type ** `help(nltk.AffixTagger)` **). Train an affix tagger and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Discuss your findings.**" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 14, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "text/plain": [ 288 | "[('Experiment', 'NN-TL'),\n", 289 | " ('with', None),\n", 290 | " ('different', 'JJ'),\n", 291 | " ('settings', 'NN'),\n", 292 | " ('for', None),\n", 293 | " ('the', None),\n", 294 | " ('affix', None),\n", 295 | " ('length', None),\n", 296 | " ('and', None),\n", 297 | " ('the', None),\n", 298 | " ('minimum', 'NNS'),\n", 299 | " ('word', None),\n", 300 | " ('length', None)]" 301 | ] 302 | }, 303 | "execution_count": 14, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "# help(nltk.AffixTagger)\n", 310 | "affix_tagger = nltk.AffixTagger(brown_tagged_sents, affix_length=3, min_stem_length=4)\n", 311 | "test_text = 'Experiment with different settings for the affix length and the minimum word length'.split()\n", 312 | "affix_tagger.tag(test_text)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "**12. Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?**" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "Temporarily omitted." 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "**13. We can use a dictionary to specify the values to be substituted into a formatting string. 
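(A sketch for exercise 12 above, before continuing with exercise 13 — it reuses `brown_tagged_sents` from the unigram-tagger cell and the book-era `evaluate()` method, which newer NLTK versions rename to `accuracy()`.)

```Python
# Sketch for exercise 12: a bigram tagger with no backoff, scored on its own
# training data and then on held-out data.
size = int(len(brown_tagged_sents) * 0.9)
train_sents, test_sents = brown_tagged_sents[:size], brown_tagged_sents[size:]

bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.evaluate(train_sents))   # high: every context was seen in training
print(bigram_tagger.evaluate(test_sents))    # much lower: an unseen (previous tag, word)
                                             # context yields None, and that None context
                                             # derails tagging for the rest of the sentence
```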
Read Python's library documentation for formatting strings http://docs.python.org/lib/typesseq-strings.html** *(404 NOT FOUND)* ** and use this method to display today's date in two different formats.**" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 2, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "'2018-08-31'" 345 | ] 346 | }, 347 | "execution_count": 2, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "datetime.datetime.today().strftime(\"%Y-%m-%d\")" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": { 359 | "collapsed": true 360 | }, 361 | "source": [ 362 | "**14. Use ** `sorted()` ** and ** `set()` ** to get a sorted list of tags used in the Brown corpus, removing duplicates.**" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 18, 368 | "metadata": { 369 | "collapsed": true 370 | }, 371 | "outputs": [], 372 | "source": [ 373 | "list_of_tags = sorted(set([tag for (_, tag) in brown.tagged_words()]))" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "**15. Write programs to process the Brown Corpus and find answers to the following questions:** \n", 381 | "a. **Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the ** *-s* ** suffix.)** \n", 382 | "b. **Which word has the greatest number of distinct tags. What are they, and what do they represent?** \n", 383 | "c. **List tags in order of decreasing frequency. What do the 20 most frequent tags represent?** \n", 384 | "d. **Which tags are nouns most commonly found after? What do these tags represent?**" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 89, 390 | "metadata": { 391 | "collapsed": true 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "brown_tagged = brown.tagged_words()\n", 396 | "cfd = nltk.ConditionalFreqDist(brown_tagged)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 34, 402 | "metadata": { 403 | "collapsed": true 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "# Which nouns are more common in their plural form, rather than their singular form? \n", 408 | "# (Only consider regular plurals, formed with the -s suffix.)\n", 409 | "\n", 410 | "common_plural = set()\n", 411 | "for word in set(brown.words()):\n", 412 | " if cfd[word+'s']['NNS'] > cfd[word]['NN']:\n", 413 | " common_plural.add(word)" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 77, 419 | "metadata": { 420 | "collapsed": true 421 | }, 422 | "outputs": [], 423 | "source": [ 424 | "# Which word has the greatest number of distinct tags. 
What are they, and what do they represent?\n",
    "\n",
    "tag_dict = {k:len(cfd[k]) for k in cfd}\n",
    "greatest = max(tag_dict, key=lambda key: tag_dict[key])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*that* \n",
    "CS, CS-HL, CS-NC, DT, DT-NC, NIL, QL, WPO, WPO-NC, WPS, WPS-HL, WPS-NC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('NN', 152470),\n",
       " ('IN', 120557),\n",
       " ('AT', 97959),\n",
       " ('JJ', 64028),\n",
       " ('.', 60638),\n",
       " (',', 58156),\n",
       " ('NNS', 55110),\n",
       " ('CC', 37718),\n",
       " ('RB', 36464),\n",
       " ('NP', 34476),\n",
       " ('VB', 33693),\n",
       " ('VBN', 29186),\n",
       " ('VBD', 26167),\n",
       " ('CS', 22143),\n",
       " ('PPS', 18253),\n",
       " ('VBG', 17893),\n",
       " ('PP$', 16872),\n",
       " ('TO', 14918),\n",
       " ('PPSS', 13802),\n",
       " ('CD', 13510)]"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List tags in order of decreasing frequency. What do the 20 most frequent tags represent?\n",
    "\n",
    "helper_list = [t for (_, t) in brown_tagged] # extract the tags to a list \n",
    "fd = nltk.FreqDist(helper_list)\n",
    "fd.most_common(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['IN', '.', ',', 'CC', 'NN', 'NNS', 'VBD', 'CS', 'MD', 'BEZ']"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Which tags are nouns most commonly found after? What do these tags represent?\n",
    "\n",
    "word_tag_pairs = nltk.bigrams(brown_tagged)\n",
    "noun_after = [b[1] for (a, b) in word_tag_pairs if a[1].startswith('NN')]\n",
    "fdist = nltk.FreqDist(noun_after)\n",
    "[tag for (tag, _) in fdist.most_common(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**16. Explore the following issues that arise in connection with the lookup tagger:** \n",
    "a. **What happens to the tagger performance for the various model sizes when a backoff tagger is omitted?** \n",
    "b. **Consider the curve in 4.2; suggest a good size for a lookup tagger that balances memory and performance. Can you come up with scenarios where it would be preferable to minimize memory usage, or to maximize performance with no regard for memory usage?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "a. When the backoff tagger is omitted, performance still improves as the model size grows, because fewer words fall outside the lookup table and are left untagged (`None`). \n",
    "b. If memory is limited, a model of about 8,000 words, which already reaches roughly 90% accuracy in Figure 4.2, is a sensible compromise; a tagger embedded in a memory-constrained device would want to stay that small, while a one-off batch job on a server can simply use as large a model as possible (keeping model-building time, and overfitting to the training corpus, in mind). A sketch of the no-backoff experiment is given in the cell below.\n"
   ]
  },
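  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the experiment behind the curve in 4.2, but with the backoff tagger left out; the model sizes tried and the restriction to the news category are arbitrary choices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 16a sketch: lookup taggers of growing size, with no backoff tagger\n",
    "words_by_freq = [w for (w, _) in nltk.FreqDist(brown.words(categories='news')).most_common()]\n",
    "news_cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))\n",
    "for size in (100, 1000, 8000):\n",
    "    model = {word: news_cfd[word].max() for word in words_by_freq[:size]}\n",
    "    lookup_tagger = nltk.UnigramTagger(model=model)  # no backoff: words outside the table get None\n",
    "    print(size, lookup_tagger.evaluate(brown.tagged_sents(categories='news')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**17. What is the upper limit of performance for a lookup tagger, assuming no limit to the size of its table? 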
(Hint: write a program to work out what percentage of tokens of a word are assigned the most likely tag for that word, on average.)**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The upper limit is the proportion of all tokens that carry the most likely tag for their word: sum, over every word type, the count of its most frequent tag, then divide by the total number of tokens. A lookup tagger can never do better than this, because it always assigns each word its single most likely tag."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**18. Generate some statistics for tagged data to answer the following questions:** \n",
    "a. **What proportion of word types are always assigned the same part-of-speech tag?** \n",
    "b. **How many words are ambiguous, in the sense that they appear with at least two tags?** \n",
    "c. **What percentage of word ** *tokens* ** in the Brown Corpus involve these ambiguous words?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "brown_tag = brown.tagged_words(tagset='universal')\n",
    "cfd = nltk.ConditionalFreqDist(brown_tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "proportion = sum(1 for word in cfd if len(cfd[word]) == 1) / len(cfd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "ambiguous = sum(1 for word in cfd if len(cfd[word]) > 1)"
   ]
  },
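  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For part c, a minimal sketch that reuses the `brown_tag` and `cfd` objects defined above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 18c sketch: percentage of word tokens that involve an ambiguous word\n",
    "ambiguous_words = set(word for word in cfd if len(cfd[word]) > 1)\n",
    "ambiguous_tokens = sum(1 for (word, _) in brown_tag if word in ambiguous_words)\n",
    "100 * ambiguous_tokens / len(brown_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**19. The ** `evaluate()` ** method works out how accurately the tagger performs on this text. For example, if the supplied tagged text was ** `[('the', 'DT'), ('dog', 'NN')]` ** and the tagger produced the output ** `[('the', 'NN'), ('dog', 'NN')]` **, then the score would be ** `0.5` **. Let's try to figure out how the evaluation method works:** \n",
    "a. **A tagger ** `t` ** takes a list of words as input, and produces a list of tagged words as output. However,** ` t.evaluate()` ** is given correctly tagged text as its only parameter. 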
What must it do with this input before performing the tagging?**" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": { 587 | "collapsed": true 588 | }, 589 | "outputs": [], 590 | "source": [] 591 | } 592 | ], 593 | "metadata": { 594 | "kernelspec": { 595 | "display_name": "Python 3", 596 | "language": "python", 597 | "name": "python3" 598 | }, 599 | "language_info": { 600 | "codemirror_mode": { 601 | "name": "ipython", 602 | "version": 3 603 | }, 604 | "file_extension": ".py", 605 | "mimetype": "text/x-python", 606 | "name": "python", 607 | "nbconvert_exporter": "python", 608 | "pygments_lexer": "ipython3", 609 | "version": "3.6.3" 610 | } 611 | }, 612 | "nbformat": 4, 613 | "nbformat_minor": 2 614 | } 615 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Hael Chan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NLP-with-Python-Solutions 2 | This repository stores my solutions to the exercises of [Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit](http://www.nltk.org/book/) 3 | -------------------------------------------------------------------------------- /corpus.txt: -------------------------------------------------------------------------------- 1 | NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. 2 | 3 | Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project. 
4 | 5 | NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.” 6 | 7 | Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at http://nltk.org/book_1ed.) 8 | 9 | This costs $1,000 10 | That costs £999.99 11 | And that costs ¥1000 12 | 2018-08-06 13 | 2018.08.06 14 | 08/06/2018 15 | 06/08/2018 16 | 06/08/18 17 | 06-08-2018 18 | 19 | What? 20 | Why 21 | who... -------------------------------------------------------------------------------- /word_freq.txt: -------------------------------------------------------------------------------- 1 | fuzzy 53 2 | natural 14 3 | language 12 4 | processing 16 --------------------------------------------------------------------------------