├── .gitignore ├── README.md ├── part_i ├── 01_working_with_jupyter_notebooks.ipynb └── 02_getting_started_with_python.ipynb ├── part_ii ├── 01_basic_text_processing.ipynb ├── 02_basic_text_processing_continued.ipynb ├── 03_basic_nlp.ipynb ├── 04_basic_nlp_continued.ipynb ├── 05_evaluating_nlp.ipynb ├── 06_managing_data.ipynb ├── data │ ├── NYT_1991-01-16-A15.txt │ ├── WP_1990-08-10-25A.txt │ ├── WP_1991-01-17-A1B.txt │ ├── pickled_df.pkl │ └── socc_gnm_articles.csv └── img │ └── spacy_pipeline.png ├── part_iii ├── 01_multilingual_nlp.ipynb ├── 02_universal_dependencies.ipynb ├── 03_pattern_matching.ipynb ├── 04_embeddings.ipynb ├── 05_embeddings_continued.ipynb ├── 06_text_linguistics.ipynb ├── data │ ├── GUM_whow_parachute.conllu │ └── occupy.txt └── img │ ├── alignment.svg │ ├── parasyn.svg │ └── type_token.svg └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .idea 3 | .ipynb_checkpoints 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Learning materials for the Applied Language Technology MOOC 2 | 3 | This repository contains learning materials for the [Applied Language Technology MOOC](https://applied-language-technology.mooc.fi/) as interactive Jupyter Notebooks. 4 | -------------------------------------------------------------------------------- /part_i/01_working_with_jupyter_notebooks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The elements of a Jupyter Notebook" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "tags": [ 15 | "remove-input" 16 | ] 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# Run this cell to view a YouTube video related to this topic\n", 21 | "from IPython.display import YouTubeVideo\n", 22 | "YouTubeVideo('cthzk6B80ds', height=350, width=600)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Jupyter Notebooks are made up of cells, which contain either content or code. \n", 30 | "\n", 31 | "Content cells are written in Markdown, which is a markup language for formatting content.\n", 32 | "\n", 33 | "Code cells, in turn, contain code, which in our case is written in Python 3." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "tags": [ 40 | "remove-cell" 41 | ] 42 | }, 43 | "source": [ 44 | "To see the difference between Markdown and code cells in Jupyter Notebook, run the cells below.\n", 45 | "\n", 46 | "To run a cell, press the _Run_ button in the toolbar on top of the Jupyter Notebook or press the Shift and Enter keys on your keyboard at the same time." 47 | ] 48 | }, 49 | { 50 | "cell_type": "raw", 51 | "metadata": {}, 52 | "source": [ 53 | "This is a markdown cell." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "print(\"This is a code cell.\")" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "tags": [ 69 | "remove-cell" 70 | ] 71 | }, 72 | "source": [ 73 | "As you can see, running the cell moves the *cursor* (indicated by the coloured bounding box around the cell) to the next cell." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "The code cells, such as the one above, are typically run one by one, while documenting and describing the process using the content cells. \n", 81 | "\n", 82 | "You can also run all cells in a notebook by choosing _Run All_ in the _Cell_ menu on top of the Jupyter Notebook. Press H on your keyboard for a list of shortcuts for various commands.\n", 83 | "\n", 84 | "The number in brackets on the left-hand side of a cell indicates the order in which cells have been executed.\n", 85 | "\n", 86 | "In most cases, cells must be run in a sequential order for the program to work." 87 | ] 88 | } 89 | ], 90 | "metadata": { 91 | "celltoolbar": "Edit Metadata", 92 | "kernelspec": { 93 | "display_name": "Python 3 (ipykernel)", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.9.7" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /part_i/02_getting_started_with_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Getting started with Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "tags": [ 15 | "remove-input" 16 | ] 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# Run this cell to view a YouTube video related to this topic\n", 21 | "from IPython.display import YouTubeVideo\n", 22 | "YouTubeVideo('65u7GK9c78o', height=350, width=600)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "This brief introduction will give you an idea of Python syntax. \n", 30 | "\n", 31 | "You will learn about key concepts such as variables, what they are, and how they are created and updated. \n", 32 | "\n", 33 | "You will also learn about various types of objects defined in Python and how the type of an object determines its behaviour." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Variables\n", 41 | "\n", 42 | "Understanding the concept of a variable is crucial when getting started with Python and other programming languages.\n", 43 | "\n", 44 | "To put it simply, variables are unique names for objects defined in the program. If an object does not have a name, it cannot be referred to elsewhere in the program.\n", 45 | "\n", 46 | "In Python, variables are assigned on the fly using a single equal sign `=`. \n", 47 | "\n", 48 | "The name of the variable is positioned left of the equal sign, while the object that the variable refers to is placed on the right-hand side.\n", 49 | "\n", 50 | "Let's create a variable named `var` containing a _string_ object and call this object by its name.\n", 51 | "\n", 52 | "Note that string objects are always surrounded by single or double quotation marks!" 
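The note about quotation marks can be illustrated with a small sketch that is not part of the original notebook; the variable names below are made up for the example. It simply shows that single and double quotation marks both produce ordinary string objects.

```python
# Both quotation styles create ordinary string objects with the same content
single_quoted = 'This is a string in single quotes.'
double_quoted = "This is a string in double quotes."

# Either way, Python stores the text as a str object
print(type(single_quoted), type(double_quoted))
```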
53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "var = \"This is a variable.\"" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "The following cell simply calls the variable, returning the object that the variable refers to." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "var" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "As the notion of a *variable* suggests, the value of a variable can be changed or updated." 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "var = \"Yes, the variable name stays the same but the contents change.\"" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "var" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "If you happen to need a placeholder for some object, you can also assign the value `None` to a variable." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "var = None" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "var" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Variable names can be chosen freely and thus the names should be informative. \n", 135 | "\n", 136 | "Variable names are case sensitive, which means that `var` and `Var` are interpreted as different variables." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "Var" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "Calling the variable `Var` raises a `NameError`, because a variable with this name has not been defined." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Naming variables is only limited by keywords that are part of Python's syntax. \n", 160 | "\n", 161 | "Running the following cell prints out these keywords." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "import keyword\n", 171 | "\n", 172 | "keywords = keyword.kwlist\n", 173 | "\n", 174 | "print(keywords)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "Printing out a list of keywords introduces several important aspects of Python: the `import` command can be used to load additional modules and make their functionalities available in Python. \n", 182 | "\n", 183 | "We will frequently use the `import` command to import various external libraries and/or their parts for natural language processing and other tasks.\n", 184 | "\n", 185 | "In this case, the _module_ `keyword` has an _attribute_ called `kwlist`, which contains a _list_ of keywords. We assign this list to the variable `keywords` and print out its contents using the `print()` _function_." 
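As a complementary sketch, the `keyword` module also provides the `iskeyword()` function, which checks whether a given string is a reserved keyword; this can be handy when choosing variable names. The example names below are purely illustrative.

```python
import keyword

# 'lambda' is a reserved keyword, so it cannot be used as a variable name
print(keyword.iskeyword('lambda'))   # True

# 'corpus' is not reserved, so it is safe to use as a variable name
print(keyword.iskeyword('corpus'))   # False
```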
186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": { 191 | "tags": [ 192 | "remove-cell" 193 | ] 194 | }, 195 | "source": [ 196 | "### Quick exercise\n", 197 | "\n", 198 | "Choose a name for a variable and assign a string object that contains some text to the variable. Remember the quotation marks around string objects!" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": { 205 | "tags": [ 206 | "remove-cell" 207 | ] 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "## Objects\n", 219 | "\n", 220 | "A list is just one _type_ of object defined in Python. More specifically, a list is one kind of _data structure_ in Python.\n", 221 | "\n", 222 | "We can use the `type()` _function_ to check the type of an object. To get the type of an object assigned to some variable, place its name within parentheses." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "type(keywords)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "Remember our variable `var`? Let's check its type as well." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "type(var)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "The `type()` function is essential when hunting for errors in code.\n", 255 | "\n", 256 | "Knowing the type of a Python object is useful, because it determines what can be done with the object. \n", 257 | "\n", 258 | "For instance, brackets that follow the variable name can be used to access _items_ contained in a _list_. \n", 259 | "\n", 260 | "Note that Python lists are zero-indexed, which means that counting starts from zero, not one." 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "keywords[3]" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "This returns the fourth item in the `keywords` list. \n", 277 | "\n", 278 | "Can we do the same with the variable `var`?" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "var[3]" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "This will not work, since we set `var` to `None`, which is a special type of object called _NoneType_.\n", 295 | "\n", 296 | "Python raises a `TypeError`, because unlike a _list_ object, a _NoneType_ object cannot contain any other objects.\n", 297 | "\n", 298 | "Let's return to the list of Python keywords under the variable `keywords` and check the type of the fourth _item_ in the _list_." 
299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "type(keywords[3])" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "As you can see, a _list_ can contain other types of objects.\n", 315 | "\n", 316 | "Both strings and lists are common types when working with textual data.\n", 317 | "\n", 318 | "Let's define a toy example consisting of a string with some HTML (Hypertext Markup Language, the language used for creating webpages) tags." 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "text = \"
<p>This is an example string with some HTML tags thrown in.</p>
\"" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "text" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "Python provides various methods for manipulating strings such as the one stored under the variable `text`. \n", 344 | "\n", 345 | "The `split()` method, for instance, splits a _string_ into a _list_.\n", 346 | "\n", 347 | "The `sep` argument defines the character that is used as the boundary for a split. \n", 348 | "\n", 349 | "By default, the separator is a _whitespace_ or empty space.\n", 350 | "\n", 351 | "Let's use the `split()` method to split the string under `text` at empty space." 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "tokens = text.split(sep=' ')" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "We assign the result to the varible `tokens`. Calling the variable returns a list." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "tokens" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "We can just as easily define some other separator, such as the less than symbol (<) marking the beginning of an HTML tag." 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "text.split('<')" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "As you can see, the `split()` method is destructive: the character that we defined as the boundary is deleted from each string in the list.\n", 400 | "\n", 401 | "Note that we do not necessarily have to give the arguments such as `sep` explicitly: a correct type (string, `':'`) at the correct position (as the first *argument*) is enough." 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "What if we would like to remove the HTML tags from our example string?\n", 409 | "\n", 410 | "Let's go back to our original string stored under the variable `text`." 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "text" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "Python strings also have a `replace()` method, which allows replacing specific characters or their sequences in a string.\n", 427 | "\n", 428 | "Let's begin by replacing the initial tag `
<p>` in `text` by providing `'<p>
'` as input to its `replace` method.\n", 429 | "\n", 430 | "Note that the tag `
<p>
` is in quotation marks, as the `replace` method requires the input to be a string.\n", 431 | "\n", 432 | "The `replace` method takes two inputs: the string to be replaced (`
<p>
`) and the replacement (`''`). By providing an empty string as input to the second argument, we essentially remove any matches from the string." 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "text = text.replace('
<p>
', '')" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": null, 447 | "metadata": {}, 448 | "outputs": [], 449 | "source": [ 450 | "text" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "Success! The first tag `
<p>
` is no longer present in the string. The other strings, however, remain in place." 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": { 463 | "tags": [ 464 | "remove-cell" 465 | ] 466 | }, 467 | "source": [ 468 | "### Quick exercise\n", 469 | "\n", 470 | "What about the remaining tags? Replace the `` tag in `text` with an empty string." 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": { 477 | "tags": [ 478 | "remove-cell" 479 | ] 480 | }, 481 | "outputs": [], 482 | "source": [ 483 | "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "Although the `replace` method allowed us to easily replace parts of a string, it is not the most effective way to do so. What if the data contains dozens of HTML tags or other kind of markup? For this reason, we will explore more efficient ways of manipulating text data in Part II.\n", 491 | "\n", 492 | "This introduction should have given you a first taste of Python and its syntax. We will continue to learn more Python while working with actual examples." 493 | ] 494 | } 495 | ], 496 | "metadata": { 497 | "celltoolbar": "Edit Metadata", 498 | "kernelspec": { 499 | "display_name": "Python 3", 500 | "language": "python", 501 | "name": "python3" 502 | }, 503 | "language_info": { 504 | "codemirror_mode": { 505 | "name": "ipython", 506 | "version": 3 507 | }, 508 | "file_extension": ".py", 509 | "mimetype": "text/x-python", 510 | "name": "python", 511 | "nbconvert_exporter": "python", 512 | "pygments_lexer": "ipython3", 513 | "version": "3.9.9" 514 | }, 515 | "nbsphinx": { 516 | "allow_errors": true 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 2 521 | } 522 | -------------------------------------------------------------------------------- /part_ii/01_basic_text_processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Manipulating text using Python\n", 8 | "\n", 9 | "This section introduces you to the very basics of manipulating text in Python.\n", 10 | "\n", 11 | "After reading this section, you should:\n", 12 | "\n", 13 | " - understand the difference between rich text, structured text and plain text\n", 14 | " - understand the concept of text encoding\n", 15 | " - know how to load plain text files into Python and manipulate their content" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Computers and text\n", 23 | "\n", 24 | "Computers can store and represent text in different formats. Knowing the distinction between different types of text is crucial for processing them programmatically." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "deletable": false, 32 | "tags": [ 33 | "remove-input" 34 | ] 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "# Run this cell to view a YouTube video related to this topic\n", 39 | "from IPython.display import YouTubeVideo\n", 40 | "YouTubeVideo('P-om89HKx80', height=350, width=600)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### What is rich text?\n", 48 | "\n", 49 | "Word processors, such as Microsoft Word, produce [rich text](https://en.wikipedia.org/wiki/Formatted_text), that is, text whose appearance has been formatted or styled in a specific way.\n", 50 | "\n", 51 | "Rich text allows defining specific visual styles for document elements. Headers, for example, may use a different font than the body text, which may in turn feature *italic* or **bold** fonts for emphasis. Rich text can also include various types of images, tables and other document elements.\n", 52 | "\n", 53 | "Rich text is the default format for modern what-you-see-is-what-you-get word processors.\n", 54 | "\n", 55 | "### What is plain text?\n", 56 | "\n", 57 | "Unlike rich text, [plain text](https://en.wikipedia.org/wiki/Plain_text) does not contain any information about the visual appearance of text, but consists of *characters* only.\n", 58 | "\n", 59 | "Characters, in this context, refers to letters, numbers, punctuation marks, spaces and line breaks.\n", 60 | "\n", 61 | "The definition of plain text is fairly loose, but generally the term refers to text which lacks any formatting or style information.\n", 62 | "\n", 63 | "\n", 64 | "### What is structured text?\n", 65 | "\n", 66 | "Structured text may be thought of as a special case of plain text, which includes character sequences that are used to format the text for display.\n", 67 | "\n", 68 | "Forms of structured text include text described using mark-up languages such as XML, Markdown or HTML.\n", 69 | "\n", 70 | "The example below shows a plain text sentence wrapped into HTML tags for paragraphs `
<p>
`. \n", 71 | "\n", 72 | "The opening tag `
<p>` and the closing tag `</p>
` instruct the computer that any content placed between these tags form a paragraph.\n", 73 | "\n", 74 | "```\n", 75 | "
<p>This is an example sentence.</p>
\n", 76 | "```\n", 77 | "\n", 78 | "This information is used for structuring plain text when *rendering* text for display, typically by styling its appearance." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "tags": [ 85 | "remove-cell" 86 | ] 87 | }, 88 | "source": [ 89 | "If you double-click any content cell in this Jupyter Notebook, you will see the underlying structured text in Markdown.\n", 90 | "\n", 91 | "Running the cell renders the structured text for visual inspection!" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "### Why does this matter?\n", 99 | "\n", 100 | "If you collect a bunch of texts for a corpus, chances are that some originated in rich or structured format, depending on the medium these texts came from.\n", 101 | "\n", 102 | "If you collect printed documents that have been digitized using a technique such as [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) and subsequently converted from rich into plain text, the removal of formatting information is likely to introduce errors into the resulting plain text. Working with this kind of \"dirty\" OCR can have an impact on the results of text analysis (Hill & Hengchen [2019](https://doi.org/10.1093/llc/fqz024)).\n", 103 | "\n", 104 | "If you collect digital documents by scraping discussion forums or websites, you are likely to encounter traces of structured text in the form of markup tags, which may be carried over to plain text during conversion.\n", 105 | "\n", 106 | "Plain text is by far the most interchangeable format for text, as it is easy to read for computers. This is why programming languages work with plain text, and if you plan to use programming languages to manipulate text, you need to know what plain text is. \n", 107 | "\n", 108 | "To summarise, when working with plain text, you may need to deal with traces left by conversion from rich or structured text." 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## Text encoding\n", 116 | "\n", 117 | "To be read by computers, plain text needs to be *encoded*. This is achieved using *character encoding*, which maps characters (letters, numbers, punctuation, whitespace ...) to a numerical representation understood by the computer.\n", 118 | "\n", 119 | "Ideally, we should not have to deal with low-level operations such as character encoding, but practically we do, because there are multiple systems for encoding characters, and these codings are not compatible with each other. This is the source of endless misery and headache when working with plain text.\n", 120 | "\n", 121 | "There are two character encoding systems that you are likely to encounter: ASCII and Unicode.\n", 122 | "\n", 123 | "### ASCII\n", 124 | "\n", 125 | "[ASCII](https://en.wikipedia.org/wiki/ASCII), which stands for American Standard Code for Information Interchange, is a pioneering character encoding system that has provided a foundation for many modern character encoding systems.\n", 126 | "\n", 127 | "ASCII is still widely used, but is very limited in terms of its character range. 
If your language happens to include characters such as ä or ö, you are out of luck with ASCII.\n", 128 | "\n", 129 | "### Unicode\n", 130 | "\n", 131 | "[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text in most writing systems used across the world, covering nearly 140 000 characters in modern and historic scripts, symbols and emoji.\n", 132 | "\n", 133 | "For example, the pizza slice emoji 🍕 has the Unicode \"code\" `U+1F355`, whereas the corresponding code for a whitespace is `U+0020`.\n", 134 | "\n", 135 | "Unicode can be implemented by different character encodings, such as [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which is defined by the Unicode standard.\n", 136 | "\n", 137 | "UTF-8 is backwards compatible with ASCII. In other words, the ASCII character encodings form a subset of UTF-8, which makes our life much easier. \n", 138 | "\n", 139 | "Even if a plain text file has been *encoded* in ASCII, we can *decode* it using UTF-8, but **not vice versa**." 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": { 145 | "tags": [] 146 | }, 147 | "source": [ 148 | "## Loading plain text files into Python" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": { 155 | "deletable": false, 156 | "tags": [ 157 | "remove-input" 158 | ] 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "# Run this cell to view a YouTube video related to this topic\n", 163 | "from IPython.display import YouTubeVideo\n", 164 | "YouTubeVideo('ulRFLvNkhHA', height=350, width=600)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Plain text files can be loaded into Python using the `open()` function.\n", 172 | "\n", 173 | "The first argument to the `open()` function must be a string, which contains a *path* to the file that is being opened.\n", 174 | "\n", 175 | "In this case, we have a path that points towards a file named `NYT_1991-01-16-A15.txt`, which is located in a directory named `data`. In the definition of the path, the directory and the filename are separated by a backslash `/`.\n", 176 | "\n", 177 | "To access the file that this path points to, we must provide the path as a string object to the `file` argument of the `open()` function:\n", 178 | "\n", 179 | "```\n", 180 | "open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8')\n", 181 | "```\n", 182 | "\n", 183 | "Before proceeding any further, let's focus on the other arguments provided to the `open()` function.\n", 184 | "\n", 185 | "By default, Python 3 assumes that the text is encoded using UTF-8, but we can make this explicit using the `encoding` argument. \n", 186 | "\n", 187 | "The `encoding` argument takes a string as input: we pass the string `utf-8` to the argument to declare that the plain text is encoded in UTF-8.\n", 188 | "\n", 189 | "We also use the `mode` argument to define that we only want to open the file for *reading*, which is done by passing the string `r` to the argument.\n", 190 | "\n", 191 | "Finally, we use the `open()` function in combination with the `with` statement, which ensures that the file will be closed after performing whatever we do within the *indented* block of code that follows the `with` statement. This prevents the file from consuming memory and resources after we no longer need it." 
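To make clear what the `with` statement saves us from, the following sketch shows the equivalent procedure with a manual `close()` call. It is only an illustration and assumes the same file path used in the notebook; the `with` version used below remains the recommended approach.

```python
# Without the 'with' statement, the file must be closed explicitly
file = open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8')

try:
    # Read the file contents into the variable 'text'
    text = file.read()

finally:
    # If this step is forgotten, the file stays open and consumes resources
    file.close()
```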
192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "# Open a file and assign it to the variable 'file'\n", 201 | "with open(file='data/NYT_1991-01-16-A15.txt', mode='r', encoding='utf-8') as file:\n", 202 | " \n", 203 | " # The 'with' statement must be followed by an indented code block.\n", 204 | " # Here we call the read() method to read the file contents and \n", 205 | " # assign the result under the variable 'text'.\n", 206 | " text = file.read()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "As you can see, the `with` statement and the `open()` function are followed by the statement `as` and a variable named `file`.\n", 214 | "\n", 215 | "This tells Python to assign whatever is returned by the `open()` function under the variable `file`.\n", 216 | "\n", 217 | "If we now call the variable `file`, we get a Python `TextIOWrapper` object that contains three arguments: the path to the file under the argument `name` and the `mode` and `encoding` arguments that we specified above." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "# Call the variable to examine the object\n", 227 | "file" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "Keep in mind that in the indented code block following the `with` statement, we called the `read()` method of the `TextIOWrapper` object.\n", 235 | "\n", 236 | "This method read the contents of the file, which we assigned under the variable `text`.\n", 237 | "\n", 238 | "However, if we attempt to call the `read()` method for the variable `file` outside the `with` statement, Python will raise an error, because the file has been closed." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "tags": [] 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "# Attempt to use the read() method to read the file content\n", 250 | "file.read()" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "This behaviour is expected, as we want the file to be closed, so that it does not consume memory or resources now that we no longer need it. This is especially important when working with thousands of files, as every open file will take up memory and resources.\n", 258 | "\n", 259 | "Let's check the output from applying the `read()` method, which we stored under the variable `text` within the `with` statement.\n", 260 | "\n", 261 | "The text is fairly long, so let's just take a slice of the text containing the first 500 characters, which can be achieved using brackets `[:500]`.\n", 262 | "\n", 263 | "As we learned in [Part I](../part_i/02_getting_started_with_python.ipynb), adding brackets directly after the name of a variable allows accessing parts of the object, if the object in question allows this.\n", 264 | "\n", 265 | "For example, the expression `text[1]` would retrieve the character at position 1 in the string object under the variable `text`.\n", 266 | "\n", 267 | "Adding the colon `:` as a prefix to the number instructs Python to retrieve all characters contained in the string up to the 500th character." 
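Slicing may be easier to grasp on a short toy string; the variable `example` below is made up for this illustration and does not appear in the original notebook.

```python
example = "plain text"

# A single index retrieves one character (counting starts from zero)
print(example[0])    # 'p'

# A slice retrieves a range of characters up to, but not including, the end index
print(example[0:5])  # 'plain'

# Omitting the start index means "from the beginning of the string"
print(example[:5])   # 'plain'
```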
268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "# Retrieve the first 500 characters under the variable 'text'\n", 277 | "text[:500]" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "Most of the text is indeed legible, but there are some strange character sequences, such as `\\ufeff` in the very beginning of the text, and the numerous `\\n` sequences occurring throughout the text.\n", 285 | "\n", 286 | "The `\\ufeff` sequence is simply an explicit declaration (\"signature\") that the file has been encoded using UTF-8. Not all UTF-8 encoded files contain this sequence.\n", 287 | "\n", 288 | "The `\\n` sequences, in turn, indicate a line change.\n", 289 | "\n", 290 | "This becomes evident if we use Python's `print()` function to print the first 1000 characters stored in the `text` variable." 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "# Print the first 1000 characters under the variable 'text'\n", 300 | "print(text[:1000])" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "As you can see, Python knows how to interpret `\\n` character sequences and inserts a line break if it encounters this sequence when printing the contents of the string object.\n", 308 | "\n", 309 | "We can also see that the first few lines of the file contain metadata on the article, such as its name, author and source. This information precedes the body text." 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "## Manipulating text" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": { 323 | "deletable": false, 324 | "tags": [ 325 | "remove-input" 326 | ] 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "# Run this cell to view a YouTube video related to this topic\n", 331 | "from IPython.display import YouTubeVideo\n", 332 | "YouTubeVideo('v4FY6TXt0PU', height=350, width=600)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Because the entire text stored under the variable `text` is a string object, we can use all methods available for manipulating strings in Python.\n", 340 | "\n", 341 | "Let's use the `replace()` method to replace all line breaks `\"\\n\"` with empty strings `\"\"` and store the result under the variable `processed_text`. \n", 342 | "\n", 343 | "We then use the `print()` function to print out a slice containing the first 1000 characters using the brackets `[:1000]`." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Replace line breaks \\n with empty strings and assign the result to \n", 353 | "# the variable 'processed_text'\n", 354 | "processed_text = text.replace('\\n', '')\n", 355 | "\n", 356 | "# Print out the first 1000 characters under the variable 'processed_text'\n", 357 | "print(processed_text[:1000])" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "As you can see, all of the text is now clumped together. 
We can, however, still identify the beginning of each paragraph, which are marked by three whitespaces.\n", 365 | "\n", 366 | "Note that replacing the line breaks also causes the article metadata to form a single paragraph, which is also missing some whitespace characters. For this reason, one must always pay attention to unwanted effects of replacements and other transformations!\n", 367 | "\n", 368 | "However, if we were only interested in the body text of the article, we can now easily remove the metadata, as we know that it is separated from the body text by three whitespace characters.\n", 369 | "\n", 370 | "The easiest way to do this is to use the `split()` method to split the string into a list by using three whitespace characters as the separator.\n", 371 | "\n", 372 | "Let's assign the result under the same variable, that is, `processed_text`, and print out the result." 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Use the split() method with three whitespaces as a separator. Assign the\n", 382 | "# result under the variable 'processed_text'.\n", 383 | "processed_text = processed_text.split(sep=' ')\n", 384 | "\n", 385 | "# Print out the result under 'processed_text'\n", 386 | "print(processed_text)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "If you examine the output, you will see that the `split()` method returned a list of string objects. Let's quickly verify this by checking the type of the object stored under `processed_text`." 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# Check the type of the object under 'processed_text'\n", 403 | "type(processed_text)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "The metadata is stored in the first item in the list, as the first sequence of three whitespace characters was found where the metadata ends. This is where we first split the string object.\n", 411 | "\n", 412 | "Let's fetch the first item in the list – remember that Python starts counting from zero, which means that the item we want to access can be found at index `0`." 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "# Retrieve the string object at index 0 from the list 'processed_text'\n", 422 | "processed_text[0]" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "If we want to remove the metadata and retain just the body text, we can use the `pop()` method of a list object.\n", 430 | "\n", 431 | "This method expects an integer as input, which corresponds to the index of an item that we want to remove from the list." 
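A quick sketch on a toy list (the variable name is invented for the example) shows that `pop()` both returns the removed item and modifies the list in place.

```python
# A toy list standing in for the article: metadata followed by body text
items = ['metadata', 'first paragraph', 'second paragraph']

# pop() returns the removed item...
removed = items.pop(0)
print(removed)  # 'metadata'

# ...and the list itself has been modified in place
print(items)    # ['first paragraph', 'second paragraph']
```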
432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [ 440 | "# Call the pop() method of the list under 'processed_text' and the\n", 441 | "# index of the item to be removed.\n", 442 | "processed_text.pop(0)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "If you are wondering why we do not assign the result into a variable, the answer is because Python lists are *mutable*, that is, they can be manipulated in place.\n", 450 | "\n", 451 | "In other words, the `pop()` method can modify the list without \"updating\" the variable by reassigning the value under the same variable name.\n", 452 | "\n", 453 | "Let's check the result by retrieving the first three items in the list `processed_text`." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "# Retrieve the first three items in the list 'processed_text'\n", 463 | "processed_text[:3]" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "As you can see, the first item in the list no longer corresponds to the metadata!" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": { 477 | "deletable": false, 478 | "tags": [ 479 | "remove-input" 480 | ] 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "# Run this cell to view a YouTube video related to this topic\n", 485 | "from IPython.display import YouTubeVideo\n", 486 | "YouTubeVideo('iabwQKS5lVk', height=350, width=600)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "If we want to convert the list back into a string, we can use the `join()` method of a string object.\n", 494 | "\n", 495 | "The `join()` method expects an *iterable* as input, that is, something that can be iterated over, such as a Python list or a dictionary.\n", 496 | "\n", 497 | "This is where things may get a little confusing: the `join()` method must be called on a *string* that will be used to join the items in the iterable!\n", 498 | "\n", 499 | "In this case, we want to use the original sequence of characters that were used to separate paragraphs of text – a line break and three whitespaces – as the string object that joins the items." 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "# Use the join() method to join the items in the list 'processed_text' using\n", 509 | "# the string object '\\n ' – a line break and three whitespaces. Store the \n", 510 | "# result under the variable of the same name.\n", 511 | "processed_text = '\\n '.join(processed_text)\n", 512 | "\n", 513 | "# Check the result by printing the first 1000 characters of the resulting \n", 514 | "# string object under 'processed_text'\n", 515 | "print(processed_text[:1000])" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "As you can see, applying the `join()` method returns a string object with the original paragraph breaks!" 
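The relationship between `split()` and `join()` can be summarised with a small round-trip sketch; the example string is made up for illustration.

```python
# Splitting a string produces a list of strings...
parts = 'one two three'.split(sep=' ')
print(parts)            # ['one', 'two', 'three']

# ...and join() is called on the separator string to reassemble the list
print(' '.join(parts))  # 'one two three'
```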
523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": { 529 | "deletable": false, 530 | "tags": [ 531 | "remove-input" 532 | ] 533 | }, 534 | "outputs": [], 535 | "source": [ 536 | "# Run this cell to view a YouTube video related to this topic\n", 537 | "from IPython.display import YouTubeVideo\n", 538 | "YouTubeVideo('HEinzGt8LG4', height=350, width=600)" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "If you examine the text closely, you can also see remnants of the digitalisation process: the application of optical character recognition, which was discussed [above](#Why-does-this-matter?), has resulted in a mixture of various types of quotation marks, such as `\"`, `“`, `”`, `’’` and `‘‘` (two single quotation marks), being used in the text.\n", 546 | "\n", 547 | "If we were interested in retrieving quotes from the body text, it would be good to use the quotation marks consistently. Let's choose `\"` (a single double-quote) as our preferred quotation mark.\n", 548 | "\n", 549 | "We could replace each quotation mark with this character using the `replace()` method, but applying this method separately for each type of quotation mark would be tedious.\n", 550 | "\n", 551 | "To make the process more efficient, we can leverage two other Python data structures: *lists* and *tuples*.\n", 552 | "\n", 553 | "Let's start by defining a list named `pipeline`. We can create and populate a list by simply placing objects within brackets `[]`. Each list item must be separated by a comma (`,`).\n", 554 | "\n", 555 | "As we saw above, the `replace()` method takes two strings as inputs.\n", 556 | "\n", 557 | "To combine two strings into a single Python object, the most obvious candidate is a data structure named *tuple*, which consist of finite, ordered lists of items.\n", 558 | "\n", 559 | "Tuples are marked by parentheses `( )`, and the items in a tuple are also separated by a comma.\n", 560 | "\n", 561 | "In each tuple, we place the character to be replaced in the first string, and its replacement in the second string." 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "# Define a list with four tuples, which each consist of two strings: the character\n", 571 | "# to be replaced and its replacement.\n", 572 | "pipeline = [('“', '\"'), ('´´', '\"'), ('”', '\"'), ('’’', '\"')]" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "This also illustrates how different data structures are often nested in Python: the list consists of tuples, and the tuples consist of string objects.\n", 580 | "\n", 581 | "We can now perform a `for` loop over each item in the list, which iterates through each item in the order in which they appear in the list.\n", 582 | "\n", 583 | "Each item in the list consists of a tuple, which contains two strings.\n", 584 | "\n", 585 | "Note that to enter a `for` loop, Python expects the next line of code to be indented. Press the Tab ↹ key on your keyboard to move the cursor.\n", 586 | "\n", 587 | "What happens next is exactly same that we did before with using the `replace()` method, but instead of manually defining the strings that we want to replace, we use the strings contained in the variables `old` and `new`!\n", 588 | "\n", 589 | "After each loop, we automatically update the string object stored under the variable `processed_text`." 
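Assigning `old` and `new` "on the fly" relies on tuple unpacking, which can also be illustrated outside a loop; the values below are examples only.

```python
# A tuple with two items can be unpacked into two variables in a single step
old, new = ('“', '"')

print(old)  # '“'
print(new)  # '"'
```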
590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": null, 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "# Loop over tuples in the list 'pipeline'. Each tuple has two values, which we \n", 599 | "# assign to variables 'old' and 'new' on the fly!\n", 600 | "for old, new in pipeline:\n", 601 | " \n", 602 | " # Use the replace() method to replace the string under the variable 'old' \n", 603 | " # with the string under the variable new 'new'\n", 604 | " processed_text = processed_text.replace(old, new)" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "Let's examine the output by printing out the string under the variable `processed_text`." 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "# Print the string\n", 621 | "print(processed_text)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "As the output shows, we could perform a series of replacements by looping over the list of tuples, which defined the patterns to be replaced and their replacements! \n", 629 | "\n", 630 | "To recap, the syntax for the `for` loop is as follows: declare the beginning of a loop using `for`, followed by a *variable* that is used to refer to items retrieved from the list.\n", 631 | "\n", 632 | "The list that is being looped over is preceded by `in` and the name of the variable assigned to the entire *list*.\n", 633 | "\n", 634 | "To better understand how a `for` loop works, let's define only one variable, `our_tuple`, to refer to the items that we fetch from the list." 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [ 643 | "# Loop over the items under the variable 'pipeline'\n", 644 | "for our_tuple in pipeline:\n", 645 | " \n", 646 | " # Print the returned object\n", 647 | " print(our_tuple)" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "This print outs the tuples!\n", 655 | "\n", 656 | "Python is smart enough to understand that a single variable refers to the single items, or *tuples* in the list, whereas for two items, it must proceed to the *strings* contained within the tuple.\n", 657 | "\n", 658 | "When writing `for` loops, pay close attention to the items contained in the list!" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "This should have given you an idea of the basic issues involved in loading and manipulating text using Python. \n", 666 | "\n", 667 | "The [following section](02_basic_text_processing_continued.ipynb) builds on these techniques to manipulate texts more efficiently." 
668 | ] 669 | } 670 | ], 671 | "metadata": { 672 | "celltoolbar": "Edit Metadata", 673 | "kernelspec": { 674 | "display_name": "Python 3 (ipykernel)", 675 | "language": "python", 676 | "name": "python3" 677 | }, 678 | "language_info": { 679 | "codemirror_mode": { 680 | "name": "ipython", 681 | "version": 3 682 | }, 683 | "file_extension": ".py", 684 | "mimetype": "text/x-python", 685 | "name": "python", 686 | "nbconvert_exporter": "python", 687 | "pygments_lexer": "ipython3", 688 | "version": "3.9.7" 689 | } 690 | }, 691 | "nbformat": 4, 692 | "nbformat_minor": 4 693 | } 694 | -------------------------------------------------------------------------------- /part_ii/02_basic_text_processing_continued.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Manipulating text at scale\n", 8 | "\n", 9 | "This section introduces you to regular expressions for manipulating text and how to apply the same procedure to several files.\n", 10 | "\n", 11 | "Ideally, Python should enable you to manipulate text at scale, that is, to apply the same procedure to ten, hundred or thousand text files *with the same effort*.\n", 12 | "\n", 13 | "To do so, we must be able to define more flexible patterns than the fixed strings that we used previously with the `replace()` method, while opening and closing files automatically.\n", 14 | "\n", 15 | "This capability is provided by Python modules for *regular expressions* and *file handling*.\n", 16 | "\n", 17 | "After reading this section, you should know:\n", 18 | "\n", 19 | " - how to manipulate multiple text files using Python\n", 20 | " - how to define simple patterns using *regular expressions*\n", 21 | " - how to save the results" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Regular expressions" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "tags": [ 36 | "remove-input" 37 | ] 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "# Run this cell to view a YouTube video related to this topic\n", 42 | "from IPython.display import YouTubeVideo\n", 43 | "YouTubeVideo('seCpHdTA-vs', height=350, width=600)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are a \"language\" that allows defining *search patterns*.\n", 51 | "\n", 52 | "These patterns can be used to find or to find and replace patterns in Python string objects.\n", 53 | "\n", 54 | "As opposed to fixed strings, regular expressions allow defining *wildcard characters* that stand in for any character, *quantifiers* that match sequences of repeated characters, and much more.\n", 55 | "\n", 56 | "Python allows using regular expressions through its `re` module.\n", 57 | "\n", 58 | "We can activate this module using the `import` command." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "import re" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "Let's begin by loading the text file, reading its contents, assigning the last 2000 characters to the variable `extract` and printing out the result." 
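The `[-2000:]` notation uses a negative index, which counts from the end of the string; the short sketch below, with a made-up variable, illustrates the idea before it is applied to the newspaper text.

```python
example = "optical character recognition"

# A negative start index counts from the end of the string:
# this retrieves the last 11 characters
print(example[-11:])  # 'recognition'
```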
75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# Define a path to the file, open the file for (r)eading using utf-8 encoding\n", 84 | "with open(file='data/WP_1990-08-10-25A.txt', mode='r', encoding='utf-8') as file:\n", 85 | "\n", 86 | " # Read the file contents using the .read() method\n", 87 | " text = file.read()\n", 88 | "\n", 89 | "# Get the *last* 2000 characters – note the minus sign before the number\n", 90 | "extract = text[-2000:]\n", 91 | "\n", 92 | "# Print the result\n", 93 | "print(extract)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "As you can see, the text has a lot of errors from optical character recognition, mainly in the form of sequences such as `....` and `,,,,`.\n", 101 | "\n", 102 | "Let's compile our first regular expression that searches for sequences of *two or more* full stops.\n", 103 | "\n", 104 | "This is done using the `compile()` function from the `re` module.\n", 105 | "\n", 106 | "The `compile()` function takes a string as an input. \n", 107 | "\n", 108 | "Note that we attach the prefix `r` to the string. This tells Python to store the string in 'raw' format. This means that the string is stored as it appears." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Compile a regular expression and assign it to the variable 'stops'\n", 118 | "stops = re.compile(r'\\.{2,}')\n", 119 | "\n", 120 | "# Let's check the type of the regular expression!\n", 121 | "type(stops)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "Let's unpack this regular expression a bit.\n", 129 | "\n", 130 | "1. The regular expression is defined using a Python string, as indicated by the single quotation marks `' '`.\n", 131 | "\n", 132 | "2. We need a backslash `\\` in front of our full stop `.`. The backslash tells Python that we are really referring to a full stop, because regular expressions use a full stop as a *wildcard* character that can stand in for *any character*.\n", 133 | "\n", 134 | "3. The curly brackets `{ }` instruct the regular expression to search for instances of the previous item `\\.` (our actual full stop) that occur two or more times (`2,`). This (hopefully) preserves true uses of a full stop!\n", 135 | "\n", 136 | "In plain language, we tell the regular expression to search for *occurrences of two or more full stops*. " 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "To apply this regular expression to some text, we will use the `sub()` method of our newly-defined regular expression object `stops`.\n", 144 | "\n", 145 | "The `sub()` method takes two arguments:\n", 146 | "\n", 147 | "1. `repl`: A string containing a string that is used to *replace* possible matches.\n", 148 | "2. `string`: A string object to be searched for matches.\n", 149 | "\n", 150 | "The method returns the modified string object.\n", 151 | "\n", 152 | "Let's apply our regular expression to the string stored under the variable `extract`." 
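Before turning to the newspaper text, a quick sketch on a made-up toy string may help show which sequences the compiled pattern `\.{2,}` actually matches and what `sub()` does to them.

```python
import re

# Compile the same pattern: two or more consecutive full stops
stops = re.compile(r'\.{2,}')

toy = 'A sentence. Noisy OCR output.... More text..'

# findall() lists the sequences that the pattern matches
print(stops.findall(toy))              # ['....', '..']

# sub() replaces the matches, leaving single full stops untouched
print(stops.sub(repl='', string=toy))  # 'A sentence. Noisy OCR output More text'
```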
153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "# Apply the regular expression to the text under 'extract' and save the output\n", 162 | "# to the same variable, essentially overwriting the old text.\n", 163 | "extract = stops.sub(repl='', string=extract)\n", 164 | "\n", 165 | "# Print the text to examine the result\n", 166 | "print(extract)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "As you can see, the sequences of full stops are gone.\n", 174 | "\n", 175 | "We can make our regular expression even more powerful by adding alternatives.\n", 176 | "\n", 177 | "Let's compile another regular expression and store it under the variable `punct`." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "# Compile a regular expression and assign it to the variable 'punct'\n", 187 | "punct = re.compile(r'(\\.|,){2,}')" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "What's new here are the parentheses `( )` and the vertical bar `|` between them, which separates our actual full stop `\\.` and the comma `,`.\n", 195 | "\n", 196 | "The characters surrounded by parentheses and separated by a vertical bar mark *alternatives*.\n", 197 | "\n", 198 | "In plain English, we tell the regular expression to search for *occurrences of two or more full stops or commas*.\n", 199 | "\n", 200 | "Let's apply our new pattern to the text under `extract`.\n", 201 | "\n", 202 | "To ensure the pattern works as intended, let's retrieve the original text from the `text` variable and assign it to the variable `extract` to overwrite our previous edits." 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# \"Reset\" the extract variable by taking the last 2000 characters of the original string\n", 212 | "extract = text[-2000:]\n", 213 | "\n", 214 | "# Apply the regular expression\n", 215 | "extract = punct.sub(repl='', string=extract)\n", 216 | "\n", 217 | "# Print out the result\n", 218 | "print(extract)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "Success! The sequences of full stops and commas can be removed using a single regular expression." 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": { 231 | "tags": [ 232 | "remove-cell" 233 | ] 234 | }, 235 | "source": [ 236 | "### Quick exercise\n", 237 | "\n", 238 | "Use `re.compile()` to compile a regular expression that matches `”`, `\"\"` and `’’` and store the result under the variable `quotes`.\n", 239 | "\n", 240 | "Find matching sequences in `extract` and replace them with `\"`.\n", 241 | "\n", 242 | "You will need parentheses `( )` and vertical bars `|` to define the alternatives." 
243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": { 249 | "tags": [ 250 | "remove-cell" 251 | ] 252 | }, 253 | "outputs": [], 254 | "source": [ 255 | "### Enter your code below this line and run the cell (press Shift and Enter at the same time)\n" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "The more irregular sequences resulting from optical character recognition errors in `extract`, such as `'-'*`, `->.\"`, `/*—.`, `-\"“` and `'\"''.` are much harder to capture.\n", 263 | "\n", 264 | "Capturing these patterns would require defining more complex regular expressions, which are harder to write. Their complexity is, however, what makes regular expressions so powerful, but at the same time, learning how to use them takes time and patience.\n", 265 | "\n", 266 | "It is therefore a good idea to use a service such as [regex101.com](https://www.regex101.com) to learn the basics of regular expressions.\n", 267 | "\n", 268 | "In practice, coming up with regular expressions that cover as many matches as possible is particularly hard. \n", 269 | "\n", 270 | "Capturing most of the errors – and perhaps distributing the manipulations over a series of steps in a pipeline – can already help prepare the text for further processing or analysis.\n", 271 | "\n", 272 | "However, keep in mind that in order to identify patterns for manipulating text programmatically, you should always look at more than one text in your corpus." 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "## Processing multiple files" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "tags": [ 287 | "remove-input" 288 | ] 289 | }, 290 | "outputs": [], 291 | "source": [ 292 | "# Run this cell to view a YouTube video related to this topic\n", 293 | "from IPython.display import YouTubeVideo\n", 294 | "YouTubeVideo('IwhhNfDYvlI', height=350, width=600)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "Many corpora contain texts in multiple files. \n", 302 | "\n", 303 | "To make manipulating high volumes of text as efficient as possible, we must open the files, read their contents, perform the requested operations and close them *programmatically*.\n", 304 | "\n", 305 | "This procedure is made fairly simple using the `Path` class from Python's `pathlib` module.\n", 306 | "\n", 307 | "Let's import the class first. Using the command `from` with `import` allows us to import only a part of the `pathlib` module, namely the `Path` class. This is useful if you only need some feature contained in a Python module or library." 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "from pathlib import Path" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "The `Path` class encodes information about *paths* in a *directory structure*.\n", 324 | "\n", 325 | "What's particularly great about the Path class is that it can automatically infer what kinds of paths your operating system uses. 
\n", 326 | "\n", 327 | "Here the problem is that operating systems such as Windows, Linux and Mac OS X use different conventions for file system paths.\n", 328 | "\n", 329 | "Using the `Path` class allows us to avoid a lot of trouble, particularly if we want our code to run on different operating systems.\n", 330 | "\n", 331 | "Our repository contains a directory named `data`, which contains the text files that we have been working with recently.\n", 332 | "\n", 333 | "Let's initialise a Path *object* that points towards this directory by providing a string with the directory name to the Path *class*. We assign the object to the variable `corpus_dir`." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "# Create a Path object that points towards the directory 'data' and assign\n", 343 | "# the object to the variable 'corpus_dir'\n", 344 | "corpus_dir = Path('data')" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "The Path object stored under `corpus_dir` has various useful methods and attributes.\n", 352 | "\n", 353 | "We can, for instance, easily check if the path is valid using the `exists()` method.\n", 354 | "\n", 355 | "This returns a Boolean value, that is, either *True* or *False*." 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "# Use the exists() method to check if the path is valid\n", 365 | "corpus_dir.exists()" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "We can also check if the path is a directory using the `is_dir()` method." 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Use the is_dir() method to check if the path points to a directory\n", 382 | "corpus_dir.is_dir()" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "Let's make sure the path does not point towards a file using the `is_file()` method." 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "# Use the is_file() method to check if the path points to a file\n", 399 | "corpus_dir.is_file()" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "Now that we know that the path points toward a directory, we can use the `glob()` method to collect all text files in the directory.\n", 407 | "\n", 408 | "`glob` stands for [*global*](https://en.wikipedia.org/wiki/Glob_(programming)), and was first implemented as a program for matching filenames and paths using wildcards.\n", 409 | "\n", 410 | "The `glob()` method requires one argument, `pattern`, which takes a string as input. This string defines the kinds of files to be collected. The asterisk symbol `*` acts as a wildcard, which can refer to *any sequence of characters* preceding the sequence `.txt`.\n", 411 | "\n", 412 | "The suffix `.txt` is commonly used for plain text files.\n", 413 | "\n", 414 | "We also instruct Python to *cast* the result into a list using the `list()` function, so we can easily loop over the files in the list.\n", 415 | "\n", 416 | "Finally, we store the result under the variable `files` and call the result."
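, "\n",
"\n",
"One caveat worth knowing: `glob()` yields the matching paths in no guaranteed order, and the order may differ from one operating system to another. If a stable, reproducible order matters for your analysis, you can sort the paths instead of simply casting them into a list – a small sketch:\n",
"\n",
"```python\n",
"# sorted() accepts the paths returned by glob() and returns a list ordered by path name\n",
"files = sorted(corpus_dir.glob(pattern='*.txt'))\n",
"```"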
417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "# Get all files with the suffix .txt in the directory 'corpus_dir' and cast the result into a list\n", 426 | "files = list(corpus_dir.glob(pattern='*.txt'))\n", 427 | "\n", 428 | "# Call the result\n", 429 | "files" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": { 436 | "tags": [ 437 | "remove-input" 438 | ] 439 | }, 440 | "outputs": [], 441 | "source": [ 442 | "# Run this cell to view a YouTube video related to this topic\n", 443 | "from IPython.display import YouTubeVideo\n", 444 | "YouTubeVideo('rM1X6u9-o8A', height=350, width=600)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "We now have a list of three Path objects that point towards three text files!\n", 452 | "\n", 453 | "This allows us to loop over the files using a `for` loop and manipulate text in each file.\n", 454 | "\n", 455 | "In the cell below, we iterate over each file defined in the Path object, read and modify its contents, and write them to a new file." 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "# Loop over the list of Path objects under 'files'. Refer to the individual files using\n", 465 | "# the variable 'file'.\n", 466 | "for file in files:\n", 467 | " \n", 468 | " # Use the read_text() method of a Path object to read the file contents. Provide \n", 469 | " # the value 'utf-8' to the 'encoding' argument to declare the file encoding.\n", 470 | " # Store the result under the variable 'text'.\n", 471 | " text = file.read_text(encoding='utf-8')\n", 472 | " \n", 473 | " # Apply the regular expression we defined above to remove excessive punctuation \n", 474 | " # from the text. Store the result under the variable 'mod_text'\n", 475 | " mod_text = punct.sub('', text)\n", 476 | " \n", 477 | " # Define a new filename which has the prefix 'mod_' by creating a new string. \n", 478 | " # The Path object contains the filename as a string under the attribute 'name'. \n", 479 | " # Combine the two strings using the '+' expression.\n", 480 | " new_filename = 'mod_' + file.name\n", 481 | " \n", 482 | " # Define a new Path object that points towards the new file. The Path object \n", 483 | " # will automatically join the directory and filename for us.\n", 484 | " new_path = Path('data', new_filename)\n", 485 | " \n", 486 | " # Print a status message using string formatting. By adding the prefix 'f' to \n", 487 | " # a string, we can use curly brackets {} to insert a variable within the string. \n", 488 | " # Here we add the current file path to the string for printing.\n", 489 | " print(f'Writing modified text to {new_path}')\n", 490 | " \n", 491 | " # Use the write_text() method to write the modified text under 'mod_text' to \n", 492 | " # the file using UTF-8 encoding. 
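Note that write_text() will overwrite the file if it already exists.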
\n", 493 | "    new_path.write_text(mod_text, encoding='utf-8')" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "As you can see from the code block above, Path objects provide two convenient methods for working with text files: `read_text()` and `write_text()`.\n", 501 | "\n", 502 | "These methods can be used to read and write text from files without using the `with` statement, which was introduced in the previous [section](01_basic_text_processing.ipynb). Just as with the `with` statement, the file that the Path object points to is closed automatically after the text has been read or written." 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": { 508 | "tags": [ 509 | "remove-cell" 510 | ] 511 | }, 512 | "source": [ 513 | "If you take a look at the directory [data](data), you should now see three files whose names have the prefix `mod_`. These are the files we just modified and wrote to disk.\n", 514 | "\n", 515 | "To keep the data directory clean, run the following cell to delete the modified files.\n", 516 | "\n", 517 | "Adding the exclamation mark `!` to the beginning of a code cell tells Jupyter that this is a command to the underlying command line interface, which can be used to manipulate the file system.\n", 518 | "\n", 519 | "In this case, we run the command `rm` to delete all files in the directory `data` whose filenames begin with the characters `mod`." 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": { 526 | "tags": [ 527 | "remove-cell" 528 | ] 529 | }, 530 | "outputs": [], 531 | "source": [ 532 | "!rm data/mod*" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "This should have given you an idea of some of the more powerful methods for manipulating text available in Python, such as regular expressions, and how to apply them to multiple files at the same time.\n", 540 | "\n", 541 | "The [following section](03_basic_nlp.ipynb) will teach you how to apply basic natural language processing techniques to texts."
542 | ] 543 | } 544 | ], 545 | "metadata": { 546 | "celltoolbar": "Edit Metadata", 547 | "kernelspec": { 548 | "display_name": "Python 3 (ipykernel)", 549 | "language": "python", 550 | "name": "python3" 551 | }, 552 | "language_info": { 553 | "codemirror_mode": { 554 | "name": "ipython", 555 | "version": 3 556 | }, 557 | "file_extension": ".py", 558 | "mimetype": "text/x-python", 559 | "name": "python", 560 | "nbconvert_exporter": "python", 561 | "pygments_lexer": "ipython3", 562 | "version": "3.9.7" 563 | } 564 | }, 565 | "nbformat": 4, 566 | "nbformat_minor": 2 567 | } 568 | -------------------------------------------------------------------------------- /part_ii/05_evaluating_nlp.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Evaluating language models\n", 8 | "\n", 9 | "This section introduces you to some basic techniques for evaluating the results of natural language processing.\n", 10 | "\n", 11 | "After reading this section, you should:\n", 12 | "\n", 13 | "- understand what is meant by a gold standard\n", 14 | "- know how to evaluate agreement between human annotators\n", 15 | "- understand simple metrics for evaluating the performance of natural language processing" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## What is a gold standard?" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "tags": [ 30 | "remove-input" 31 | ] 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "# Run this cell to view a YouTube video related to this topic\n", 36 | "from IPython.display import YouTubeVideo\n", 37 | "YouTubeVideo('eBJDxHUxRwc', height=350, width=600)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "A gold standard – also called a ground truth – refers to human-verified data that can be used as a benchmark for evaluating the performance of algorithms. \n", 45 | "\n", 46 | "In natural language processing, a gold standard records how well humans perform on some task.\n", 47 | "\n", 48 | "The goal of natural language processing is to allow computers to achieve or surpass human-level performance in some pre-defined task. \n", 49 | "\n", 50 | "Measuring whether algorithms can do so requires a benchmark, which is provided by the gold standard. Put simply, a gold standard provides a point of reference.\n", 51 | "\n", 52 | "It is important, however, to understand that gold standards are *abstractions* of language use. \n", 53 | "\n", 54 | "Consider, for instance, the task of placing words into word classes: word classes are not given to us by nature, but represent an abstraction that imposes structure on natural language.\n", 55 | "\n", 56 | "Language, however, is naturally ambiguous and subjective, and the abstractions used can be underdeveloped – we cannot be sure if all language users would categorise words in the same way.\n", 57 | "\n", 58 | "This is why we need to measure the reliability of any gold standard, that is, to what extent humans agree on the task."
59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Measuring reliability manually\n", 66 | "\n", 67 | "This section introduces how reliability, often understood as agreement between multiple annotators, can be measured manually.\n", 68 | "\n", 69 | "### Step 1: Annotate data\n", 70 | "\n", 71 | "Sentiment analysis is a task that involves determining the sentiment of a text (for a useful overview that incorporates insights from both linguistics and natural language processing, see [Taboada](https://doi.org/10.1146/annurev-linguistics-011415-040518) 2016).\n", 72 | "\n", 73 | "Training a sentiment analysis model requires collecting training data, that is, examples of texts associated with different sentiments.\n", 74 | "\n", 75 | "Classify the following tweets into three categories – *positive*, *neutral* or *negative* – based on their sentiment.\n", 76 | "\n", 77 | "Write down your decisions – one per row – but **do not discuss them or show them to the person next to you.**\n", 78 | "\n", 79 | "```\n", 80 | "1. Updated: HSL GTFS (Helsinki, Finland) https://t.co/fWEpzmNQLz\n", 81 | "2. current weather in Helsinki: broken clouds, -8°C 100% humidity, wind 4kmh, pressure 1061mb\n", 82 | "3. CNN: \"WallStreetBets Redditors go ballistic over GameStop's sinking share price\"\n", 83 | "4. Baana bicycle counter. Today: 3 Same time last week: 1058 Trend: ↓99% This year: 819 518 Last year: 802 079 #Helsinki #cycling\n", 84 | "5. Elon Musk is now tweeting about #bitcoin \n", 85 | "6. A perfect Sunday walk in the woods just a few steps from home.\n", 86 | "7. Went to Domino's today👍 It was so amazing and I think I got damn good dessert as well…\n", 87 | "8. Choo Choo 🚂 There's our train! 🎉 #holidayahead\n", 88 | "9. Happy women's day ❤️💋 kisses to all you beautiful ladies. 😚 #awesometobeawoman\n", 89 | "10. Good morning #Helsinki! Sun will rise in 30 minutes (local time 07:28)\n", 90 | "```" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": { 96 | "nbsphinx": "hidden" 97 | }, 98 | "source": [ 99 | "Double-click this cell to edit its contents and write down your classifications below:\n", 100 | "\n", 101 | " 1.\n", 102 | " 2.\n", 103 | " 3.\n", 104 | " 4.\n", 105 | " 5.\n", 106 | " 6.\n", 107 | " 7.\n", 108 | " 8.\n", 109 | " 9.\n", 110 | " 10." 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Step 2: Calculate percentage agreement\n", 118 | "\n", 119 | "When creating datasets for training models, we typically want the training data to be reliable, that is, we want the annotators to agree on whatever is being described – in this case, the sentiment of the tweets above.\n", 120 | "\n", 121 | "One way to measure this is simple *percentage agreement*, that is, how many times out of 10 you and the person next to you agreed on the sentiment of a tweet.\n", 122 | "\n", 123 | "Now compare your results and calculate percentage agreement by dividing the number of times you agreed by the number of items (10).\n", 124 | "\n", 125 | "You can calculate percentage agreement by executing the cell below: just assign the number of items you agreed on to the variable `agreement`. 
" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "# Replace this number here with the number of items you agreed on\n", 135 | "agreement = 0 \n", 136 | "\n", 137 | "# Divide the count by the number of tweets\n", 138 | "agreement = agreement / 10\n", 139 | "\n", 140 | "# Print out the variable\n", 141 | "agreement" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### Step 3: Calculate probabilities for each category\n", 149 | "\n", 150 | "Percentage agreement is actually a very poor measure of agreement, as either of you may have made lucky guesses – or perhaps you considered the task boring and classified every tweet into a random category.\n", 151 | "\n", 152 | "If you did, we have no way of knowing this, as percentage agreement cannot tell us if the result occurred by chance!\n", 153 | "\n", 154 | "Luckily, we can estimate the possibility of *chance agreement* easily.\n", 155 | "\n", 156 | "The first step is to count *how many times you used each available category* (positive, neutral or negative).\n", 157 | "\n", 158 | "Assign these counts in the variables below." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# Count how many items *you* placed in each category\n", 168 | "positive = 0\n", 169 | "neutral = 0\n", 170 | "negative = 0" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "We can convert these counts into *probabilities* by dividing them with the total number of tweets classified." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "positive = positive / 10\n", 187 | "neutral = neutral / 10\n", 188 | "negative = negative / 10\n", 189 | "\n", 190 | "# Call each variable to examine the output\n", 191 | "positive, neutral, negative" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "These probabilities represent the chance of *you* choosing that particular category.\n", 199 | "\n", 200 | "Now ask the person sitting next to you for their corresponding probabilities and tell them yours as well. \n", 201 | "\n", 202 | "Add their probabilities to the variables below." 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "nb_positive = 0\n", 212 | "nb_neutral = 0\n", 213 | "nb_negative = 0" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "Now that we know the probabilities for each class for both annotators, we can calculate the probability that both annotators choose the same category by chance.\n", 221 | "\n", 222 | "This is easy: for each category, simply multiply your probability with the corresponding probability from the person next to you.\n", 223 | "\n", 224 | "If either annotator did not assign a single tweet into a category, e.g. negative, and the other annotator did, then this effectively rules out the possibility of agreeing by chance (multiplication by zero results in zero)." 
225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "both_positive = positive * nb_positive\n", 234 | "both_neutral = neutral * nb_neutral\n", 235 | "both_negative = negative * nb_negative" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### Step 4: Estimate chance agreement\n", 243 | "\n", 244 | "Now we are ready to calculate how likely you are to agree by chance.\n", 245 | "\n", 246 | "This is known as *expected agreement*, which is calculated by summing up your combined probabilities for each category. For example, with made-up probabilities of $0.5$, $0.3$ and $0.2$ for one annotator and $0.4$, $0.4$ and $0.2$ for the other, the expected agreement would be $0.5 \\times 0.4 + 0.3 \\times 0.4 + 0.2 \\times 0.2 = 0.36$." 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "expected_agreement = both_positive + both_neutral + both_negative\n", 256 | "\n", 257 | "expected_agreement" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "Now that we know both observed percentage agreement (`agreement`) and the agreement expected by chance (`expected_agreement`), we can use this information for a more reliable measure of *agreement*.\n", 265 | "\n", 266 | "One such measure is [Cohen's kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa) ($\\kappa$), which estimates agreement on the basis of both observed and expected agreement.\n", 267 | "\n", 268 | "The formula for Cohen's $\\kappa$ is as follows:\n", 269 | "\n", 270 | "$\\kappa = \\frac{P_{observed} - P_{expected}}{1 - P_{expected}}$\n", 271 | "\n", 272 | "As all this information is stored in our variables `agreement` and `expected_agreement`, we can easily calculate the $\\kappa$ score using the code below.\n", 273 | "\n", 274 | "Note that we must wrap the subtractions in parentheses to perform them before division." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "kappa = (agreement - expected_agreement) / (1 - expected_agreement)\n", 284 | "\n", 285 | "kappa" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "This gives us the result for Cohen's $\\kappa$.\n", 293 | "\n", 294 | "Let's now consider how to interpret its value." 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "## Cohen's kappa as a measure of agreement\n", 302 | "\n", 303 | "The theoretical value of Cohen's $\\kappa$ runs from $-1$, indicating perfect disagreement, to $+1$ for perfect agreement, with $0$ standing for completely random agreement.\n", 304 | "\n", 305 | "The $\\kappa$ score is often interpreted as a measure of the strength of agreement.\n", 306 | "\n", 307 | "[Landis and Koch](https://doi.org/10.2307/2529310) (1977) famously proposed the following benchmarks, which should nevertheless be taken with a pinch of salt as the divisions are completely arbitrary.\n", 308 | "\n", 309 | "| Cohen's $\\kappa$ | Strength of agreement |\n", 310 | "|-----------|----------------------|\n", 311 | "| <0.00 | Poor |\n", 312 | "| 0.00–0.20 | Slight |\n", 313 | "| 0.21–0.40 | Fair |\n", 314 | "| 0.41–0.60 | Moderate |\n", 315 | "| 0.61–0.80 | Substantial |\n", 316 | "| 0.81–1.00 | Almost perfect |\n", 317 | "\n", 318 | "Cohen's $\\kappa$ can be used to measure agreement between **two** annotators and the categories available must be **fixed** in advance. 
\n", 319 | "\n", 320 | "For measuring agreement between more than two annotators, one must use a measure such as [Fleiss'](https://en.wikipedia.org/wiki/Fleiss%27_kappa) $\\kappa$.\n", 321 | "\n", 322 | "Cohen's $\\kappa$ and many more measures of agreement are implemented in various Python libraries, so one rarely needs to perform the calculations manually.\n", 323 | "\n", 324 | "The [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score) library (`sklearn`), for instance, includes an implementation of Cohen's $\\kappa$.\n", 325 | "\n", 326 | "Let's import the `cohen_kappa_score()` function for calculating Cohen's $\\kappa$ from scikit-learn.\n", 327 | "\n", 328 | "This function takes two *lists* as input and calculates the $\\kappa$ score between them." 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# Import the cohen_kappa_score function from the 'metrics' module of the scikit-learn library\n", 338 | "from sklearn.metrics import cohen_kappa_score" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "We can then define two lists of part-of-speech tags, which make up our toy example for calculating Cohen's $\\kappa$." 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "# Define two lists named 'a1' and 'a2'\n", 355 | "a1 = ['ADJ', 'AUX', 'NOUN', 'VERB', 'VERB']\n", 356 | "a2 = ['ADJ', 'VERB', 'NOUN', 'NOUN', 'VERB']" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "The next step is to feed the two lists, `a1` and `a2`, to the `cohen_kappa_score()` function." 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "# Use the cohen_kappa_score() function to calculate agreement between the lists\n", 373 | "cohen_kappa_score(a1, a2)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "According to the benchmark from Landis and Koch, this score would indicate moderate agreement.\n", 381 | "\n", 382 | "Generally, Cohen's $\\kappa$ can be used for measuring agreement on all kinds of tasks that involve placing items into categories.\n", 383 | "\n", 384 | "It is rarely necessary to annotate the whole dataset when measuring agreement – a random sample is often enough.\n", 385 | "\n", 386 | "If Cohen's $\\kappa$ suggests that the human annotators agree on whatever they are categorising, we can assume that the annotations are *reliable* in the sense that they are not random.\n", 387 | "\n", 388 | "However, all measures of inter-annotator agreement, Cohen's $\\kappa$ included, are affected by their underlying assumptions about what agreement is and how it is calculated. In other words, these measures never represent the absolute truth (see e.g. Di Eugenio & Glass [2004](https://dx.doi.org/10.1162/089120104773633402); Artstein & Poesio [2008](https://doi.org/10.1162/coli.07-034-R2))." 
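, "\n",
"\n",
"If you want to convince yourself that the function agrees with the manual procedure introduced above, the following sketch repeats the calculation for the toy lists `a1` and `a2` defined above:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Observed agreement: the proportion of positions where the two lists match\n",
"observed = sum(x == y for x, y in zip(a1, a2)) / len(a1)\n",
"\n",
"# Expected agreement: the probability of both annotators choosing the same tag by chance\n",
"c1, c2 = Counter(a1), Counter(a2)\n",
"expected = sum((c1[tag] / len(a1)) * (c2[tag] / len(a2)) for tag in set(a1 + a2))\n",
"\n",
"# Cohen's kappa; the result should match the value returned by cohen_kappa_score(a1, a2)\n",
"print((observed - expected) / (1 - expected))\n",
"```"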
389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "## Evaluating the performance of language models" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "tags": [ 403 | "remove-input" 404 | ] 405 | }, 406 | "outputs": [], 407 | "source": [ 408 | "# Run this cell to view a YouTube video related to this topic\n", 409 | "from IPython.display import YouTubeVideo\n", 410 | "YouTubeVideo('WiN5JCueeFQ', height=350, width=600)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "Once we have a sufficiently *reliable* gold standard, we can use the gold standard to measure the performance of language models.\n", 418 | "\n", 419 | "Let's assume that we have a reliable gold standard consisting of 10 tokens annotated for their part-of-speech tags.\n", 420 | "\n", 421 | "These part-of-speech tags are given in the list `gold_standard`." 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": null, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "# Define a list named 'gold_standard'\n", 431 | "gold_standard = ['ADJ', 'ADJ', 'AUX', 'VERB', 'AUX', 'NOUN', 'NOUN', 'ADJ', 'DET', 'PRON']" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "We then retrieve the predictions for the same tokens from some language model and store them in a list named `predictions`." 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "# Define a list named 'predictions'\n", 448 | "predictions = ['NOUN', 'ADJ', 'AUX', 'VERB', 'AUX', 'NOUN', 'VERB', 'ADJ', 'DET', 'PROPN']" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "Now that we have a toy data set with two sets of annotations to compare, let's import the entire *metrics* module from the *scikit-learn* library and apply them to our data.\n", 456 | "\n", 457 | "This module contains implementations for [various evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "# Import the 'metrics' module from the scikit-learn library (sklearn)\n", 467 | "from sklearn import metrics" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "First of all, we can calculate [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) using the `accuracy_score()` function, which is precisely the same as observed agreement that we calculated manually above.\n", 475 | "\n", 476 | "This function takes two lists as input." 
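, "\n",
"\n",
"Because accuracy is simply the proportion of items on which the two lists agree, you could also compute it by hand for a toy example like this one – a quick sketch:\n",
"\n",
"```python\n",
"# Count the matching positions and divide by the number of items; this should equal the accuracy score\n",
"sum(g == p for g, p in zip(gold_standard, predictions)) / len(gold_standard)\n",
"```"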
477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "# Use the accuracy_score() function from the 'metrics' module\n", 486 | "metrics.accuracy_score(gold_standard, predictions)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "Accuracy, however, suffers from the same shortcoming as observed agreement – the output of the language model in `predictions` may be the result of making a series of lucky guesses.\n", 494 | "\n", 495 | "However, given that we are working with a toy example, we can easily verify that 7 out of 10 part-of-speech tags match. This gives an accuracy of 0.7 or 70%.\n", 496 | "\n", 497 | "To better evaluate the performance of the language model against our gold standard, the results can be organised into what is called a *confusion matrix*.\n", 498 | "\n", 499 | "To do so, we need all part-of-speech tags that occur in `gold_standard` and `predictions`.\n", 500 | "\n", 501 | "We can easily collect unique part-of-speech tags using the `set()` function.\n", 502 | "\n", 503 | "The result is a _set_, a powerful data structure in Python, which consists of a collection of unique items.\n", 504 | "\n", 505 | "Essentially, we use a set to remove duplicates in the two lists `gold_standard` and `predictions`." 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "# Collect unique POS tags into a set by combining the two lists\n", 515 | "pos_tags = set(gold_standard + predictions)\n", 516 | "\n", 517 | "# Sort the set alphabetically and cast the result into a list\n", 518 | "pos_tags = list(sorted(pos_tags))\n", 519 | "\n", 520 | "# Print the resulting list\n", 521 | "pos_tags" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "We can use these unique categories to compile a table, in which the rows stand for the gold standard and the columns stand for predictions made by the language model. Having collected all unique part-of-speech tags at hand ensures that we can always find a place for each item.\n", 529 | "\n", 530 | "This kind of table is commonly called a *confusion matrix*.\n", 531 | "\n", 532 | "The table is populated by simply walking through each pair of items in the gold standard and model predictions, adding $+1$ to the cell for this combination.\n", 533 | "\n", 534 | "For example, the first item in `gold_standard` is ADJ, whereas the first item in `predictions` is NOUN." 
535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "# Print out the first item in each list\n", 544 | "gold_standard[0], predictions[0]" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "We then find the row for ADJ and the column for NOUN and add one to this cell.\n", 552 | "\n", 553 | " | | ADJ | AUX | DET | NOUN | PRON | PROPN | VERB |\n", 554 | " |-------|-----|-----|-----|------|------|-------|------|\n", 555 | " | ADJ | 2 | 0 | 0 | 1 | 0 | 0 | 0 | \n", 556 | " | AUX | 0 | 2 | 0 | 0 | 0 | 0 | 0 |\n", 557 | " | DET | 0 | 0 | 1 | 0 | 0 | 0 | 0 |\n", 558 | " | NOUN | 0 | 0 | 0 | 1 | 0 | 0 | 1 |\n", 559 | " | PRON | 0 | 0 | 0 | 0 | 0 | 1 | 0 |\n", 560 | " | PROPN | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", 561 | " | VERB | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n", 562 | "\n", 563 | "As you can see, the correct predictions form a roughly diagonal line across the table.\n", 564 | "\n", 565 | "We can use the table to derive two additional metrics for each class: *precision* and *recall*.\n", 566 | "\n", 567 | "Precision is the *proportion of correct predictions per class*. In plain words, precision tells you how many of the predictions made for each class, or part-of-speech tag, were correct.\n", 568 | "\n", 569 | "For example, the sum for column VERB is $2$, of which $1$ prediction is correct (the one located in the row VERB).\n", 570 | "\n", 571 | "Hence precision for VERB is $1 / 2 = 0.5$ – half of the tokens predicted to be verbs were classified correctly. The same holds true for NOUN, as this column also sums to $2$, but only $1$ prediction is in the correct row.\n", 572 | "\n", 573 | "Recall, in turn, gives the proportion of correct predictions for all examples of that class. \n", 574 | "\n", 575 | "Put differently, recall tells you *how many actual instances of a given class the model was able to \"find\"*.\n", 576 | "\n", 577 | "For example, the sum for row ADJ is $3$: there are three adjectives in the gold standard, but only two are located in the corresponding column for ADJ.\n", 578 | "\n", 579 | "This means that recall for ADJ is $2 / 3 \\approx 0.67$ – approximately 67% of the adjectives present in the gold standard were classified correctly. For NOUN, recall is $1 / 2 = 0.5$.\n", 580 | "\n", 581 | "The *scikit-learn* library provides a `confusion_matrix()` function for automatically generating confusion matrices. \n", 582 | "\n", 583 | "Run the cell below and compare the output to the manually created confusion matrix above."
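, "\n",
"\n",
"Note that `confusion_matrix()` returns only the counts – the row and column labels are not included in the output. If you would like a labelled version, one option (a sketch that assumes the *pandas* library, which is not needed for the rest of this section) is to wrap the result in a DataFrame:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Rows stand for the gold standard, columns for the predictions, ordered by the tags in 'pos_tags'\n",
"pd.DataFrame(metrics.confusion_matrix(gold_standard, predictions, labels=pos_tags),\n",
"             index=pos_tags, columns=pos_tags)\n",
"```"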
584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "# Calculate a confusion matrix for the two lists and print the result\n", 593 | "print(metrics.confusion_matrix(gold_standard, predictions))" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "To evaluate the ability of the language model to predict the correct part-of-speech tag, we can use [*precision*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score), which is implemented in the `precision_score()` function in the *scikit-learn* library.\n", 601 | "\n", 602 | "Because we have multiple classes of part-of-speech tags rather than just two (binary) classes, we must define how the results for the individual classes are processed.\n", 603 | "\n", 604 | "This option is set using the `average` argument of the `precision_score()` function. If we set `average` to `None`, the `precision_score()` function calculates precision for each class. \n", 605 | "\n", 606 | "We also set the `zero_division` argument to tell the function what to do if some class is never predicted (in our toy example, PRON never occurs in `predictions`), which would otherwise involve dividing by zero when calculating precision for that class: setting `zero_division` to 0 simply assigns a precision score of 0 in these cases.\n", 607 | "\n", 608 | "The results are organised according to a sorted *set* of labels present in `gold_standard` and `predictions`." 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": null, 614 | "metadata": {}, 615 | "outputs": [], 616 | "source": [ 617 | "# Calculate precision between the two lists for each class (part-of-speech tag)\n", 618 | "precision = metrics.precision_score(gold_standard, predictions, average=None, zero_division=0)\n", 619 | "\n", 620 | "# Call the variable to examine the result\n", 621 | "precision" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "The output is a [NumPy](https://www.numpy.org) array. NumPy is a powerful library for working with numerical data, and it can be found under the hood of many Python libraries.\n", 629 | "\n", 630 | "If we want to combine our list of labels in `pos_tags` with the precision scores in `precision`, we can do this using Python's `zip()` function, which joins together lists and/or arrays of the same size. To view the result, we must cast it into a dictionary using `dict()`." 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "# Combine the 'pos_tags' set with the 'precision' array using the zip()\n", 640 | "# function; cast the result into a dictionary\n", 641 | "dict(zip(pos_tags, precision))" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "If we want to get a single precision score for all classes, we can use the option given by the string `'macro'`, which means that each class is treated as equally important regardless of how many instances belonging to this class can be found in the data."
649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "# Calculate precision between the two lists and take their average\n", 658 | "macro_precision = metrics.precision_score(gold_standard, predictions, average='macro', zero_division=0)\n", 659 | "\n", 660 | "# Call the variable to examine the result\n", 661 | "macro_precision" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "The macro-averaged precision score is calculated by summing up the per-class precision scores and dividing the sum by the number of classes.\n", 669 | "\n", 670 | "We can easily verify this manually." 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "metadata": {}, 677 | "outputs": [], 678 | "source": [ 679 | "# Calculate macro-average precision manually by summing the precision \n", 680 | "# scores and dividing the result by the number of classes in 'precision'\n", 681 | "sum(precision) / len(precision)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": {}, 687 | "source": [ 688 | "Calculating recall is equally easy using the `recall_score()` function from the *scikit-learn* library." 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "execution_count": null, 694 | "metadata": {}, 695 | "outputs": [], 696 | "source": [ 697 | "# Calculate recall between the two lists for each class (part-of-speech tag)\n", 698 | "recall = metrics.recall_score(gold_standard, predictions, average=None, zero_division=0)\n", 699 | "\n", 700 | "# Combine the 'pos_tags' set with the 'recall' array using the zip()\n", 701 | "# function; cast the result into a dictionary\n", 702 | "dict(zip(pos_tags, recall))" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "The *scikit-learn* library provides a very useful function for summarising classification performance, called `classification_report()`.\n", 710 | "\n", 711 | "This will give you the precision and recall scores for each class, together with the [F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), which is the harmonic mean of precision and recall, that is, a balanced score to which precision and recall contribute equally. The values for the F1-score run from $0$ to $1$." 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "# Print out a classification report\n", 721 | "print(metrics.classification_report(gold_standard, predictions, zero_division=0))" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "As you can see, the macro-averaged scores on the row *macro avg* correspond to those that we calculated above.\n", 729 | "\n", 730 | "Finally, the weighted averages account for the number of instances in each class when calculating the average. The column *support* counts the number of instances observed for each class." 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "metadata": {}, 736 | "source": [ 737 | "This section should have given you an idea of how to assess the reliability of human annotations, and how reliable annotations can be used as a gold standard for benchmarking the performance of natural language processing. 
\n", 738 | "\n", 739 | "You should also understand certain basic metrics used for benchmarking performance, such as accuracy, precision, recall and F1-score.\n", 740 | "\n", 741 | "In the [following section](../part_ii/06_managing_data.ipynb), you will learn about managing textual data." 742 | ] 743 | } 744 | ], 745 | "metadata": { 746 | "celltoolbar": "Edit Metadata", 747 | "kernelspec": { 748 | "display_name": "Python 3 (ipykernel)", 749 | "language": "python", 750 | "name": "python3" 751 | }, 752 | "language_info": { 753 | "codemirror_mode": { 754 | "name": "ipython", 755 | "version": 3 756 | }, 757 | "file_extension": ".py", 758 | "mimetype": "text/x-python", 759 | "name": "python", 760 | "nbconvert_exporter": "python", 761 | "pygments_lexer": "ipython3", 762 | "version": "3.9.7" 763 | } 764 | }, 765 | "nbformat": 4, 766 | "nbformat_minor": 2 767 | } 768 | -------------------------------------------------------------------------------- /part_ii/data/NYT_1991-01-16-A15.txt: -------------------------------------------------------------------------------- 1 | U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired By JAMES BARRON New York Times (1923-Current file); Jan 16, 1991; ProQuest Historical Newspapers: The New York Times with Index pg. A15 U.S. TAKING STEPS TO CURB TERRORISM F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired By JAMES BARRON The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday. The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to anyone. Concerns about terrorist attack disrupted other daily routines as well. No fast-food deliveries are being allowed at the New York Stock Exchange. And metal detectors at the Los Angeles International Airport were fine-tuned until they were so sensitive that “even your keys will set it off," the airport manager said. Airport officials in Dallas said the Federal Aviation Administration had ordered them to post uniformed guards near ticket counters. There were new precautions at many levels of government. A larger than usual complement of Secret Service agents is on duty at the White House, and marines at the Twentynine Palms training base in California were advised to drive with their car windows rolled up and not to go jogging alone. Potential Schemes Uncovered Andrea Mohln/Tho New York Times Secret Service agents on guard duty outside the Iraqi Embassy yesterday in Washington. Security precautions were being taken around the country as the deadline for Iraq to withdraw from Kuwait neared. Law-enforcement officials said they had found no evidence that terrorist groups had chosen specific targets in New York. But in Washington, a high- ranking official said more than five potential schemes that could result in terrorist acts had been uncovered since Iraq invaded Kuwait on Aug. 2. Each of these involved people who appeared to be "lone zealots” acting independently of groups allied with President Saddam Hussein of Iraq, the official said. He would not provide further details. Only one person has been arrested in these incidents, but the investigations are continuing. 
Some of the suspects have left the country, one official said. But he refused to say who they were or why they were allowed to go. Foreigners who remain in the United States after the expiration of their visas are subject to immediate expulsion, but a senior official in the Justice Department said the purpose of having the F.B.I. track down Iraqis whose visas had expired was to determine “who is here, where they are and why they stayed.’’ As the hours ticked by and the United Nations deadline for Iraq’s withdrawal from Kuwait neared and then passed, with no evidence of terrorist actions, fear outpaced reality. Some New Yorkers filled their bathtubs with water and stocked up on powdered milk. And some decided to stay close to home. Jonathan Bond, president of the Ktrsh- enbaum & Bond advertising agency in SoHo, canceled plans to fly to Toronto today to make a presentation to a new client. "New York is a terrorist target," he said. “It's the first day of a war and I’m not going near an airport. Anything that needs to be done we can do by fax machine." In Washington there was increased security and alertness, even though the Government seemed to be trying to avoid giving the impression that the nation was on the brink of war. At the Pentagon, security guards were jittery after an unattended package was found early yesterday near an entrance to the Metro subway system. The package turned out to be be harmless, but officials canceled most tours of the Pentagon later in the day. The Secret Service would not talk about the precise steps being taken to increase security for high-ranking offi cials. "We’re redefining and testing out systems and trying to be prepared, said K. David Homes, a spokesman for the Secret Service. Access to some state capitols has also been tightened. State troopers were stationed at entrances to the Louisiana Statehouse and at the Governor's Mansion in Baton Rouge. In Albany, Gov. Mario M. Cuomo ordered the state police to stay in con stant communication with the F.B.I., although he said that "there is no threat that I’m aware of anywhere in the state.” The New York City Police Depart ment received 54 calls about bombs yesterday, nearly four times the normal number, said Sgt. Peter Berry, a police spokesman. But he said no bombs were found, and most of the reports turned out to be unattended packages or lost luggage. That was also a concern at airports around the country. At Logan International Airport in Boston, the public address system crackled with caution. ‘‘Please be advised,” one warning said. “Any luggage left unattended will be immediately ticketed and towed.” The no-nonsense mood extended to I non-passengers. A state police officer stopped a reporter from questioning a guard at Logan, saying it was "a breach of the security guard’s contract" to talk about anti-terrorism precautions. In Dallas, airport officials said the F.A.A. had raised the level of security by one step under the agency's five- step security plan. Managers at Dallas- Fort Worth Airport and Love Field said they were operating at level 2-A, the middle step and the first at which an airport is required to post uniformed guards at ticket counters. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
-------------------------------------------------------------------------------- /part_ii/data/WP_1990-08-10-25A.txt: -------------------------------------------------------------------------------- 1 | *We Don’t Stand for Bullies': Diverse Voices in Area Back Action Against Iraq Mary Jordan Washington Post Staff Writer The Washington Post (1974-Current file); Aug 10, 1990; ProQuest Historical Newspapers: The Washington Post Vetoran Stuart Hoik, 06, aaya of Saddam Hussein, “Wo have got to atop tfais madman." Diverse Voices in Area Back Action Against Iraq By Mary Jordan WaihifiRton Pool Stiff Writer It’s not rising gasoline prices or even the threat of a recession, it’s that Iraq’s Saddam Hussein has got to be stopped. “You're darn right we lad to do it,” said Stuart Belk, 65, a retired brick-, layer from Alexandria, responding to the deployment of U.S. warplanes and troops to the Middle Cast. “The worst word in the world is ‘dictator,’ and that’s what Saddam Hussein is. What do you think the United States is for, if not to help countries in trouble? We have got to stop this madman," Belk said. Forty-five years and countless reflections since a land mine took his leg and almost his life in Hitler’s Germany, Belk said there are still certain things worth risking lives for. “Oil is not one of them,” he said, pounding his fist on his wheelchair outside the Veterans Administration’s nursing home in Northwest Washington. "The reason we should be over there is because we don’t stand for dictators. We don’t stand for bullies who gobble up other countries." City and suburban residents, young and old, wealthy and poor who were interviewed supported the U.S. actions. Of the 40 or so interviewed, from a vice president of Smithy Braedon Co., a real estate brokerage firm, to a District Blocked due to copyright See full page image or microfilm. “MICHAELS. DELINE “all the guys... are backing Bush” maintenance worker, from people pumping gas in Fairfax City to those sitting at lunch counters in the District's Shepherd Park, there was a rare unanimity of opinion. They said they were willing to conserve oh gas, even pay a higher price at the pump,' so long as oil companies weren't pocketing extra profits. No one wants thousands of American men and women in a war zone, they said, yet they would understand if more were needed. Deep in their gut, they said, they knew the U.S. commitment in the Persian Gulf will be long and expensive. “I wish we had done something in the past” to thwart Saddam Hussein and his million-strong army, said Sarah Cox, a Fairfax County management analyst who lives in McLean. As she pumped $24.73 worth of Chevron gas into her blue Caprice Classic, Cox said “economic reasons alone” should not justify the U.S. armed forces "charging off like a light brigade.” While she regretted that the United States “sat on its hands until the only solution was a military one," Cox said she understood the need now to stop Hussein. , The sudden invasion of Kuwait eight days ago reminded many how quickly the landscape can change: Saddam Hus- Sco VOICES, A26, Col. 6 “Ws have a responsibility to do whatever it takes to get Iraq out of Kuwait.” — Michael S. DeLlne Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
OT CTRINA QUN0-T1IC WASHIKOTON POST Gerald Dunn, a federal government employee, says higher costs at the pump are a small price to pay to defend a “country under attack*" Area Residents Willing: To Conserve Gasoline ~ VOICES, From A25 sein, unknown or confused with Jordan's King Hussein days ago, is now denounced by little children, one mother said; Saudi Arabia was an oil-rich OPEC country to be wary of, not an ally to defend; Iran was more evil than Iraq, not the other way around; and world news was upbeat, as communism continued to collapse. , Now, George Washington University business student Michael S. DeLine said, Americans feel the pressure of being a superpower. “We have a responsibility to do whatever it takes to get Iraq out of Kuwait.” Even if it means a draft? he is asked. “I don't think it will come to that,” but if called, DeLine, 21, said those his age would see it as their duty. "We can’t let some little country wreck the world economy," DeLine said. "All the guys I talked to in the [Sigma Phi Epsilon] fraternity are totally for it; they’re backing Bush. It’s not just that Hussein took over another country. He took over resources of world importance." “If this continues, it’s going to have a dramatic impact on the way we use our automobiles,” said Douglas Fischer, a physical therapist in Silver Spring, who paid 15 cents more a gallon for gasoline on Wednesday than he did last week. “I wouldn’t want to be one of the guys readying for combat,” Fischer, 26, said. "And I wouldn’t want us over there just because of gas prices. War is never a good idea, but we have got to show support for Saudi Arabia.” While area residents interviewed support the U.S. role as democratic peace-keeper, many said they worry about the costs of that venture. The idea of combat in the deserts of the Middle East is of particular concern, especially to those who fought in previous wars. "Saudi Arabia and other countries over the'e must make it vary clear—and not in some back room—that they want us over there and that they are going to help us,” said Clyde Wray, 44, who served as a private in the Army’s 199 Light Infantry Brigade in Vietnam. “We cannot do this alone, again. Hussein's men know the area; he’s got [poisonous nerve] gas,” Wray said. “He could blow one of our ships up in the Gulf in a second," said 74-year-old District resident Willie J. Waller. “There are narrow straits, mountains and deserts. It's a dangerous place.” Wray said he’s had a feeling tliat the next war would erupt in the Middle East because "there is a lot of hate and oil over,, there.” . „„„ “Yes, Bush did the right thing,” the Vietnam veteran;'; said. "But what I’m worried ’ about now is the next step^I__ hope it’s not a misstep. We can-~ not make the mistake again of" taking half-steps, half-"' measures.” . "It’s not a small thing to keep" up the image of the U.S. as sAiv-"' ior,” said Richard Rogers, a mak" chine operator in the District/" "Isn’t that how we see oi)f.-.f selves? As peace-keeper? If yflS’ t didn't do that, what would be? Just another country.” .. Rogers, 42, said that the U.§/'.’ image as guardian of democracy .., also works as a deterrent to ter;;, rorists. If the United States and all its; might wasn't viewed as looking- out for smaller countries’ inter- ; ests, Rogers said, "things could get completely out of hand.;- Anybody could do anything/; Terrorism would spread like a cancer.” , Poor people may be the first'-'* to feel repercussions from Iraq's invasion of Kuwait. Al->." 
ready, the price of a tankful of gasoline has increased a dollar— or two at many pumps in the region. "This is going to hurt the. poor man and the U.S. car tor.... dustry,” Rogers predicted. “Even if they ration gas, ;lt won’t matter to rich people*”...;, said Belk, from the VA nursing-,:.’ home. “They will buy it on the black market. But if you don't,7 have much money to begin with;-’; it’s going to make a difference/*—. Gerald Dunn, a federal gov*” - eminent employee from Alex-* 1 andria, said that he believed'"" that most middle- and upper-in-"“ come families wouldn’t even no-.; — tice the increase in gas prices.-'; Besides, he said, it was a small -price for them to pay while the nation defended a "country un-' der attack.” ’’ Saddam Hussein, alternatively described as a madman, bully-”" and dictator, drew passionate'"''. repudiation. "No way can we let him get' away with this,” said Robert Stout, a vice president at Smithy Braedon, referring to . the Iraqi president’s decision to.... annex Kuwait. , “Saddam Hussein really,,,,.* scares me," said Falls Church,,,, music instructor Joseph Moq-, ton. "I fear we have a madmaq,,,„; out there and he is not going to;;,* stop at Kuwait. These days, it takes is a nuclear weapon; small enough to fit in a suit-.;.; case.” — ”1 just hope they stop him — quick,” said Viola Anderson, a"-" retired hotel employee in tlie District. "I really feel things could turn terrible, if we donfr stop him soon.” >•'«■« Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. -------------------------------------------------------------------------------- /part_ii/data/WP_1991-01-17-A1B.txt: -------------------------------------------------------------------------------- 1 | U.S., Allies Launch Massive Air War Against Targets in Iraq and ... Atkinson, Rick;David S Broder Washington Post Staff Writers The Washington Post (1974-Current file); Jan 17, 1991; ProQuest Historical Newspapers: The Washington Post pg. A1B Bombing Raids Hit Missile Sites, Nuclear and Chemical Plants By Rick Atkinson and David S. Broder Washington Post Suit Writers The United States and its allies went to war against Iraq last night as hundreds of warplanes unleashed a massive bombing attack on targets in Iraq and occupied Kuwait. “Tonight the battle has been joined,” President Bush said in a brief, nationally televised address. “The United States, together with the United Nations, exhausted every means at our disposal to bring this crisis to a peaceful end.” Iraqi President Saddam Hussein, Bush added, “met every overture of peace with open contempt.” No ground fighting was reported, and U.S. officials expressed surprise and relief at the lack of any effective Iraqi military response in the early hours of hostilities. Despite U.S. anxieties that Israel would be drawn into the war by an Iraqi strike, Bagh dad launched no attack on the Jewish state, officials said. Wave after wave of warplanes flown by U.S., British, Saudi and Kuwaiti pilots attacked Baghdad and other targets in Iraq and Kuwait, reportedly hitting oil refineries, the Baghdad international airport and Saddam’s palace. Saddam responded defiantly, calling Bush a “hypocritical criminal” and vowing to crush “the satanic intentions of the White House.” Declaring that “the great duel, the mother of all battles has begun,” the Iraqi leader said “the dawn of victory nears as this great showdown begins.” His remarks were made on state-run Baghdad radio five hours after the war began. 
Witnesses reported that the moonless night sky over the Iraqi capital was thick with tracer fire from antiaircraft guns and that a smoky pallor had settled over the city. Western correspondents reported power blackouts in much of the city, although not until nearly an hour after the raids began. A second wave of air attacks pounded targets in Baghdad and elsewhere this morning after a 5!/4-hour break, a Western military officer said in Bahrain. In a briefing at the Pentagon less than three hours after the war began, Defense Secretary Richard B. Cheney said initial reports from the gulf were “very, very encouraging. The operation appears to have gone very well.... We achieved a fairly high degree of tactical surprise.” The combat, Cheney added, is “likely to run a long period of time.” Casualty reports were sketchy, but officials said that few if any allied planes had been lost in the raids. A pilot interviewed by the television news pool supervised by the Pentagon reported . spotting considerable ground fire as he flew his initial mission over Iraq, but only a single surface-to-air missile. He said his group of 12 jets all returned safely from its raid. The first targets hit, according to U.S. military sources in Saudi Arabia, were ground-to-ground Scud missiles capable of striking Saudi or Israeli cities. About 50 Tomahawk cruise missiles—low-flying drones fired from U.S. warships—were used in the initial attack, along with F-117 stealth fighters, according to Pentagon sources. Bush said that the attacks also were intended to destroy Iraq’s nuclear weapons potential and chemical weapons stocks, as well as damaging Saddam’s tank force. "We will not fail,” the president added. Gen. Colin L. Powell, chairman of the Joint Chiefs of Staff, declined to provide many specifics but said that “so far there has been no resistance” from the Iraqi air force. U.S. fighters returned from their missions with air-to-air missiles still slung beneath their wings, having found no enemy planes to engage in dogfights. White House spokesman Marlin Fitzwa- ter added, “We don’t quite understand it. We are surprised that there was so little response this first night. But we are keeping in mind that this is only the beginning, and a guy with a million-man army is bound to respond.” •' The order to launch attacks marked the end of 5‘A months of diplomatic and economic efforts by the United Nations to force Saddam to roll back the forces he sent into oil-rich Kuwait on Aug. 2. It came less than 17 hours after the expiration Tuesday midnight of the United Nations deadline for Iraq to withdraw from Kuwait. An overnight Washington Post-ABC News Poll showed three of four persons interviewed approved of Bush’s action. But several hundred demonstrators gathered in See WAR, A24, CoL 1 ' BY CWUT COOK—THC WASHHGTON POST Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I WAR, From A1 r»—( front o{ tf>e White House before midnigfg, shouting, “Shame, shame, shame.VMore than a dozen were arresteit amid rock and bottle throwing. •; "Thqworjd could wait no longer," Bush sail. Congress, which had agonized Sor' months over baddng Bush’s ^hard-line stance, appeared substantially united behind the president's decision in the first few hours the war. Republicans rallied quigkly to his support, and so did sorif of his prominent Democratic cgtics. 
Senate Armed Services Committee Chairman Sam Nunn (D-Ga.), a leader oi the forces arguing that the economic embargo against Iraq should; Le sustained much longer before ;iesort to arms, said last night, *( believe we will prevail in days onjreeks. Saddam Hussein has made a$agic miscalculation.” Sen* intelligence committee Chaimv^i David L Boren (D-Okla.), who shiged Nunn’s view, said there would 5>e “unanimous support for our tropps.... You'D see Congress play thprole of supporting player." It was not quite unanimous. Rep. Ronald'V. flellums (D-Calif.) said Bush’s'decision brought ‘an inestimable Jtrigedy, for which it will take u»g lifetime to atone." Crudfc oil prices initially soared while sijck prices fell in Tokyo, but as new^reports indicated an allied success^ the mood on markets around,‘^ie world became positive. Oil prifgs retreated and the Japanese stock market surged upward. Late l$t night, the White House annourjfcd that the nation’s Strategic fStroleum Reserve will be tappedgtifurther reassuring the oil markers Althjtigh there was no report of any mafic retaliatory action by Iraq, 1 * Associated Press The Defense Department has established 24-hour telephone numbers for relatives and friehds of service members fojpbtain general information ~ concerning casualties^ the Persian Gulf war. General information: Army 1-703-614-0739; Air Force 1-800-253-9276; Navy 1- 800.J32-1206. Iilfliiediate family mem- bersjnnly: Navy 1-800-255- 38(wJVlarine Corps 1-800- 52$2fl84; Coast Guard 1800-283-8724. airline security was heightened around the world and in the nation's capital in response to earlier threats of Iraqi terrorism. FBI Director William S. Sessions said that some terrorists “have been identified as being in the United States,” but that there have not yet been anyincidents. Although Cheney later said no Iraqi attacks had been launched against Saudi Arabia, Western diplomats and a Saudi official said one oil storage facility had suffered minor damage in a bombardment. The allied attack, code named Operation Desert Storm, included extensive participation by 45 British Tornado jets flying out of Saudi Arabia and Bahrain, and 150 Saudi attack fighters, according to British and Saudi sources. U.S. military officials said the air war against Iraq could last for three to four weeks before ground forces swing into action, although Bush may order brief pauses in the bombing to give Saddam an opportunity to sue for peace. The first phase of the air war targets Iraqi air defenses, command and control centers and Iraqi Scud missiles; U.S. military officials in Saudi Arabia said they believe many of the initial targets, including the Scuds, had been obliterated. There was no indication that Israel had been hit or that Israeli forces were involved in the war. Israeli Air Force Col. Menachem Eynan told a radio reporter two hours after the air strikes began that “We’re completely out of the picture. We're not involved, and we're not acting.” One objective of the attack last night and in the coming days, Pentagon sources said, is to sever Iraqi forces in Kuwait from central government and military control in Baghdad. Subsequent phases of the air war will seek to isolate those forces logistically—by demolishing roads, railroads and supply lines— and then begin destroying the forces themselves. 
An unconfirmed report from a Kuwaiti resistance source inside Kuwait said that allied troops speaking English had been seen in a suburb of Kuwait City; some Kuwaiti citizens, the source added, took to the streets, honking their car horns, while some Iraqi soldiers tried to surrender. A military spokesman in central Saudi Arabia said the first squadron of F-15E fighter bombers took off at 4:50 p.m. EST (12:50'a.m. Thursday in Saudi Arabia). Col. Roy Davies, chief maintenance officer at the base, said, “This is history in the making." Gen. H. Norman Schwarzkopf, commander of U.S. forces in the gulf, issued a statement to his troops urging them to be “the thunder and lightning of Desert Storm.” The first official notification of the attack came from grim-looking White House press secretary Marlin Fitzwater, who appeared in the White House briefing room just after 7 p.m. Reading a statement from Bush, he said, ‘The liberation of Kuwait has begun. In conjunction with the forces of our coalition partners, the United States has moved under the code name Operation Desert Storm to enforce the mandates of the United Nations Security Council. As of 7 p.m. Eastern Standard Time, Operation Desert Storm forces were engaging targets in Kuwait and Iraq.” Fitzwater said the attack is "designed to accomplish the goal of the U.N. Security Council resolution to get Iraq unconditionaUy out of Kuwait. I think the president’s description of it a few days ago, that it would be swift and massive, would certainly apply.” Bush was in the small study off the Oval Office watching television reports from Baghdad as the attack began. With him were Vice President Quayle, national security adviser Brent Scowcroft and White House Chief of Staff John H. Sununu. Fitzwater was in and out of the office during that period. As the sound of the bombing came over the television, Bush turned to his aides and said, “It's just the way it was scheduled,” Fitzwater reported. He described the president as “very matter-of-fact and calm” as he watched the live reports from Baghdad. Moments later he said to Fitzwater, “Marlin, you’d better go do it,” the signal for the spokesman to give the official announcement. Fitzwater said the decision was made over a period of weeks, with key meetings on several recent Sunday nights after the president returned from Camp David. A final planning meeting came Tuesday morning at the White House. But the decision had been implicit from the start. From the day of Iraq’s invasion, Bush had made it plain that he thought the stakes in the gulf were big enough to justify a war. He persuaded 28 nations to send forces to the gulf and mobilized support from past adversaries, including the Soviet Union, for an embargo that largely isolated Iraq from world commerce. With the backing of an unprecedented range of countries, the United Nations Security Council passed 12 separate and increasingly severe resolutions demanding that Iraq roll back its lightning conquest of the tiny desert kingdom and its rich oil reserves. Last Saturday, Bush obtained what amounted to a congressional declaration of war, when the Senate narrowly and the House by a wider margin approved a resolution endorsing the use of “all necessary means” for expelling Iraq—the same formulation the United Nations had used in its last ultimatum to Saddam. 
Although the United States had “tilted” to Iraq’s side in the Iran- Iraq war and had sent diplomatic signals early last summer that it had no inclination to intervene in the long-simmering Iraqi territorial dispute with Kuwait, Bush insisted from the beginning that Saddam's military conquest of Kuwait “will not stand.” He never wavered from that view in the five months that led up to the most important decision of his presidency, and as H-Hour approached he seemed almost relieved that the waiting was over. His only appearance during the day came at the beginning of a meeting with educators. When asked about his mood, he said, ‘Life goes on.” When reporters commented that he looked grim, his response was to tell them to Tighten up.” Fitzwater described him early in the day as having “steeled himself” for the events ahead and as bqing “calm and confident” that he had taken the right course. Reading from a prepared statement at that morning briefing, Fitzwater said, “The president has gone beyond the extra mile for peace. Saddam Hussein has yet to take the first step.” By early afternoon, the signs of an imminent military attack, familiar from the Panama invasion a year ago, began emerging in rumor form on Capitol Hill. Unusual signs of activity were observed in the offices of House Speaker Thomas S. Foley ID- Wash.) and Senate Majority Leader George J. Mitchell (D-Maine). Between 5 and 6 p.m., congressional leaders began getting calls from Bush and other administration officials, informing them that hostilities were underway. Sen. Robert C. Byrd (D-W.Va.) recalled teUing the president, “My Bible tells me that Our Heavenly Father... will reward us. I pray for you every night.” By 6:35 p.m., the leaders had received the written certification, required by the congressional resolution, saying that all diplomatic efforts had failed. As in his televised speech, Bush’s letter put the blame for war on Iraq, which he said “has given no sign whatever that it intends to comply with the will of the international community. Nor is there any indication that diplomatic and economic means alone would ever compel Iraq to do so.” In addressing the nation from the Oval Office, Bush said that “while the world waited, Saddam Hussein systematically raped, pillaged and plundered a tiny nation—no threat to his own.... While the world waited, Saddam sought to add to the chemical weapons arsenal he now possesses, an infinitely more dangerous weapon of mass destruction, a nuclear weapon.” Bush reiterated essentiaOy the same case for military action that he has made since his Nov. 8 decision to double the size of U.S. forces in the gulf in order to provide offensive punch. Secretary of State Janies A. Baker III, who logged thousands of miles in assembling the anti-Iraq coalition and futilely seeking a diplomatic solution, spent yesterday notifying allies of the decision to go to war. Baker called Saudi Ambassador Bandar bin Sultan at 8 a.m. to tell him of Bush's decision to begin the attack. Bandar then telephoned King Fahd in Saudi Arabia and notified him through use of a secret, pre-arranged code word, according to a source close to the Saudi government The king then assented to the attack with another code word. The secretary telephoned the new Soviet foreign minister, Alexander Bessmertnykh, in Moscow. He spoke in person to the ambassadors of Israel, Kuwait, West Germany, Syria and Japan to give them notification in most cases in advance of the beginning of the attacks. 
State Department spokesman Margaret Tutwiler refused to disclose the precise time or order of the notification. The FBI refused to comment on reports that its agents were preparing in the event of hostilities to conduct searches of locations where suspected terrorists might be hiding and perhaps make some arrests. Domestic airlines and airports were ordered at about 8 p.m. to go to a heightened state of security, according to government sources, although “no specific credible threat of terrorism has been received against any airline. The government imposed what it called “preplanned security level 3", one step below the most stringent security requirements. Under level 3, more uniformed police are moved into airports and a stricter “profile* is applied to passengers in determining who to question and search. These Washington Post staff miters contributed to this coverage; Dan BaU, Bruce Broum, Ann Devroy, Lynne Duke, Thomas B. Ed- salt, Gwen Ifill, At Kamen, Sharon LaFraniere, Thomas W. Lippman, BiU McAllister, Don Phillips, R. Jeffrey Smith and Barbara Vobejda. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. -------------------------------------------------------------------------------- /part_ii/data/pickled_df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Applied-Language-Technology/notebooks/9bde63cdf1a8297787d90b5a5c2376d117e87842/part_ii/data/pickled_df.pkl -------------------------------------------------------------------------------- /part_ii/img/spacy_pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Applied-Language-Technology/notebooks/9bde63cdf1a8297787d90b5a5c2376d117e87842/part_ii/img/spacy_pipeline.png -------------------------------------------------------------------------------- /part_iii/01_multilingual_nlp.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Processing diverse languages\n", 8 | "\n", 9 | "After reading this section, you should:\n", 10 | "\n", 11 | " - know how to download and use language models in Stanza, a Python library for processing many languages\n", 12 | " - how to interface Stanza with the spaCy natural language processing library\n", 13 | " - know how to access linguistic annotations produced by Stanza language models via spaCy \n", 14 | "\n", 15 | "## Introduction\n", 16 | "\n", 17 | "Part II introduced basic natural language processing tasks using examples written in the English language.\n", 18 | "\n", 19 | "As a global *lingua franca*, English is a highly-resourced language in terms of natural language processing. Compared to many other languages, the amount of data – especially human-annotated data – available for English is greater and covers a wider range of domains (Del Gratta et al. [2021](https://doi.org/10.1007/s10579-020-09520-6)).\n", 20 | "\n", 21 | "Unfortunately, the imbalance in resources and research effort has led to a situation where the advances in processing the English language are occasionally claimed to hold for natural language in general. 
\n", 22 | "\n", 23 | "However, as Bender ([2019](https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/)) has shown, *English is not a synonym for natural language*: even if one demonstrates that computers can achieve or surpass human-level performance in some natural language processing task for the English language, this does not mean that one has solved this task or problem for *natural language as a whole*.\n", 24 | "\n", 25 | "To measure progress in the field of natural language processing and to ensure that as many languages as possible can benefit from advances in language technology, it is highly desirable to conduct research on processing languages used across the world. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "## Stanza – a Python library for processing many languages" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "tags": [ 40 | "remove-input" 41 | ] 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "# Run this cell to view a YouTube video related to this topic\n", 46 | "from IPython.display import YouTubeVideo\n", 47 | "YouTubeVideo('41aN-_NNY8g', height=350, width=600)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "To get started with working languages other than English, we can use a library named Stanza.\n", 55 | "\n", 56 | "[Stanza](https://stanfordnlp.github.io/stanza/) is a Python library for natural language processing that provides pre-trained language models for [many languages](https://stanfordnlp.github.io/stanza/available_models.html) (Qi et al. [2020](https://www.aclweb.org/anthology/2020.acl-demos.14/)).\n", 57 | "\n", 58 | "Stanza language models are trained on corpora annotated using the [Universal Dependencies](https://universaldependencies.org/) formalism, which means that these models can perform tasks such as tokenization, part-of-speech tagging, morphological tagging and dependency parsing. \n", 59 | "\n", 60 | "These are essentially the same tasks that we explored using the spaCy natural language processing library in [Part II](../part_ii/03_basic_nlp.ipynb).\n", 61 | "\n", 62 | "Let's start exploring Stanza by importing the library." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# Import the Stanza library\n", 72 | "import stanza" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "To process a given language, we must first download a Stanza language model using the `download()` function.\n", 80 | "\n", 81 | "The `download()` function requires a single argument, `lang`, which defines the language model to be downloaded.\n", 82 | "\n", 83 | "To download a language model for a given language, retrieve the two-letter language code (e.g. `wo`) for the language from [the list of available language models](https://stanfordnlp.github.io/stanza/available_models.html) and pass the language code as a string object to the `lang` argument.\n", 84 | "\n", 85 | "For example, the following code would download a model for Wolof, a language spoken in West Africa that belongs to the family of Niger-Congo languages. 
This model has been trained using the Wolof treebank (Dione [2019](https://www.aclweb.org/anthology/W19-8003/)).\n", 86 | "\n", 87 | "```python\n", 88 | "# Download Stanza language model for Wolof\n", 89 | "stanza.download(lang='wo')\n", 90 | "```\n", 91 | "\n", 92 | "For some languages, Stanza provides models that have been trained on different datasets. Stanza refers to models trained on different datasets as *packages*. By default, Stanza automatically downloads the package with the model trained on the largest dataset available for the language in question.\n", 93 | "\n", 94 | "To select a model trained on a specific dataset, pass the name of its package as a string object to the `package` argument.\n", 95 | "\n", 96 | "To exemplify, the following command would download a model for Finnish trained on the [*FinnTreeBank*](https://universaldependencies.org/treebanks/fi_ftb/index.html) (package: `ftb`) dataset instead of the default model, which is trained on the [*Turku Dependency Treebank*](https://universaldependencies.org/treebanks/fi_tdt/index.html) dataset (package: `tdt`).\n", 97 | "\n", 98 | "```python\n", 99 | "# Download a Stanza language model for Finnish trained using the FinnTreeBank (package 'ftb')\n", 100 | "stanza.download(lang='fi', package='ftb')\n", 101 | "```\n", 102 | "\n", 103 | "The package names are provided in [the list of language models](https://stanfordnlp.github.io/stanza/available_models.html) available for Stanza." 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": { 109 | "tags": [ 110 | "remove-cell" 111 | ] 112 | }, 113 | "source": [ 114 | "To install the language model into the permanent storage on [CSC Notebooks](https://notebooks.csc.fi/), we must also pass the optional `model_dir` argument to the `download()` function, which contains a string that points towards a directory in the permanent storage, namely `/home/jovyan/work`.\n", 115 | "\n", 116 | "If the models are not placed in the permanent storage, they will be deleted when the server is shut down.\n", 117 | "\n", 118 | "Run the cell below to download the Stanza language model for Wolof into the directory `../stanza_models`.\n", 119 | "\n", 120 | "Note that `..` moves up one step in the directory structure relative to this notebook, which places the model into the directory `stanza_models` under the directory `notebooks`."
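Once the download cell below has finished, you can optionally confirm that the model files ended up in the shared directory. The following is only a minimal sketch using Python's standard `pathlib` module; the directory name `../stanza_models` simply mirrors the one used in this section, and the exact file and folder names you see will depend on the model version that Stanza downloads.

```python
from pathlib import Path

# List everything under the shared model directory to confirm
# that the Wolof model files were downloaded successfully.
for path in sorted(Path('../stanza_models').rglob('*')):
    print(path)
```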
121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "tags": [ 128 | "remove-cell" 129 | ] 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# Download a Stanza language model for Wolof into the directory \"../stanza_models\"\n", 134 | "stanza.download(lang='wo', model_dir='../stanza_models')" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": { 140 | "tags": [ 141 | "remove-cell" 142 | ] 143 | }, 144 | "source": [ 145 | "### Quick exercise\n", 146 | "\n", 147 | "Check [the list of language models](https://stanfordnlp.github.io/stanza/available_models.html) available for Stanza and download a model for a language that you would like to work with.\n", 148 | "\n", 149 | "Use the code below: remember to replace the input to the `lang` argument with the code corresponding to your language of interest.\n", 150 | "\n", 151 | "```python\n", 152 | "stanza.download(lang='XX', model_dir='../stanza_models')\n", 153 | "```" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "tags": [ 161 | "remove-cell" 162 | ] 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "# Write your code below this line and press Shift and Enter to run the code\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "### Loading a language model into Stanza\n", 174 | "\n", 175 | "To load a Stanza language model into Python, we must first create a *Pipeline* object by initialising an instance of the `Pipeline()` class from the `stanza` module.\n", 176 | "\n", 177 | "To exemplify this procedure, let's initialise a pipeline with a language model for Wolof.\n", 178 | "\n", 179 | "To load a language model for Wolof into the pipeline, we must provide the string `wo` to the `lang` argument of the `Pipeline()` function.\n", 180 | "\n", 181 | "```python\n", 182 | "# Initialise a Stanza pipeline with a language model for Wolof;\n", 183 | "# assign model to variable 'nlp_wo'.\n", 184 | "nlp_wo = stanza.Pipeline(lang='wo')\n", 185 | "```" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": { 191 | "tags": [ 192 | "remove-cell" 193 | ] 194 | }, 195 | "source": [ 196 | "Because we did **not** place the language model into the default directory, we must also provide a string containing the path to the directory with Stanza language models to the `dir` argument.\n", 197 | "\n", 198 | "We then store the resulting pipeline under the variable `nlp_wo`." 
199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": { 205 | "tags": [ 206 | "remove-cell" 207 | ] 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "# Use the Pipeline() class to initialise a Stanza pipeline with a language model for Wolof, which\n", 212 | "# is assigned to the variable 'nlp_wo'.\n", 213 | "nlp_wo = stanza.Pipeline(lang='wo', dir='../stanza_models')" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "# Call the variable to examine the output\n", 223 | "nlp_wo" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Loading a language model into Stanza returns a *Pipeline* object, which consists of a number of *processors* that perform various natural language processing tasks.\n", 231 | "\n", 232 | "The output above lists the processors under the heading of the same name, together with the names of the packages used to train these processors.\n", 233 | "\n", 234 | "As we learned in [Part II](../part_ii/04_basic_nlp_continued.ipynb#Modifying-spaCy-pipelines), one might not always need all linguistic annotations created by a model, which always come with a computational cost. \n", 235 | "\n", 236 | "To speed up processing, you can define the processors to be included in the *Pipeline* object by providing the argument `processors` with a string object that contains the [processor names](https://stanfordnlp.github.io/stanza/pipeline.html#processors) to be included in the pipeline, which must be separated by commas.\n", 237 | "\n", 238 | "For example, creating a *Pipeline* using the command below would only include the processors for tokenization and part-of-speech tagging in the pipeline.\n", 239 | "\n", 240 | "```python\n", 241 | "# Initialise a Stanza pipeline with a language model for Wolof;\n", 242 | "# assign model to variable 'nlp_wo'. Only include tokenizer \n", 243 | "# and part-of-speech tagger.\n", 244 | "nlp_wo = stanza.Pipeline(lang='wo', processors='tokenize, pos')\n", 245 | "```" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Processing text using Stanza" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": { 259 | "tags": [ 260 | "remove-input" 261 | ] 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "# Run this cell to view a YouTube video related to this topic\n", 266 | "from IPython.display import YouTubeVideo\n", 267 | "YouTubeVideo('w8vvgP4dQTU', height=350, width=600)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "Now that we have initialised a Stanza *Pipeline* with a language model, we can feed some text in Wolof to the model under `nlp_wo` as a string object.\n", 275 | "\n", 276 | "We store the result under the variable `doc_wo`." 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "# Feed text to the model under 'nlp_wo'; store result under the variable 'doc_wo'\n", 286 | "doc_wo = nlp_wo(\"Réew maa ngi lebe turam wi ci dex gi ko peek ci penku ak bëj-gànnaar, te ab balluwaayam bawoo ca Fuuta Jallon ca Ginne, di Dexug Senegaal.
Ab kilimaam bu gëwéel la te di bu fendi te yor ñaari jamono: jamonoy nawet (jamonoy taw) ak ju noor (jamonoy fendi).\")\n", 287 | "\n", 288 | "# Check the type of the output\n", 289 | "type(doc_wo)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "This returns a Stanza [*Document*](https://stanfordnlp.github.io/stanza/data_objects.html#document) object, which contains the linguistic annotations created by passing the text through the pipeline.\n", 297 | "\n", 298 | "The attribute `sentences` of a Stanza *Document* object contains a list, where each item contains a single sentence.\n", 299 | "\n", 300 | "Thus we can use brackets to access the first item `[0]` in the list." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": { 307 | "tags": [ 308 | "output_scroll" 309 | ] 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "# Get the first item in the list of sentences\n", 314 | "doc_wo.sentences[0]" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "Although the output contains both brackets `[]` and curly braces `{}`, which Python typically uses for marking lists and dictionaries, respectively, the output is not a list with nested dictionaries, but a Stanza [*Sentence*](https://stanfordnlp.github.io/stanza/data_objects.html#sentence) object." 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "# Check the type of the first item in the Document object\n", 331 | "type(doc_wo.sentences[0])" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "The *Sentence* object contains [various attributes and methods](https://stanfordnlp.github.io/stanza/data_objects.html#sentence) for accessing the linguistic annotations created by the language model.\n", 339 | "\n", 340 | "If we wish to interact with the annotations using data structures native to Python, we can use the `to_dict()` method to cast the annotations into a list of dictionaries, in which each dictionary stands for a single Stanza [*Token*](https://stanfordnlp.github.io/stanza/data_objects.html#token) object.\n", 341 | "\n", 342 | "The *key* and *value* pairs in these dictionaries contain the linguistic annotations for each *Token*." 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "# Cast the first Sentence object into a Python dictionary; store under variable 'doc_dict'\n", 352 | "doc_dict = doc_wo.sentences[0].to_dict()\n", 353 | "\n", 354 | "# Get the dictionary for the first Token\n", 355 | "doc_dict[0]" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "As you can see, the dictionary consists of key and value pairs, which hold the linguistic annotations.\n", 363 | "\n", 364 | "We can retrieve a list of keys available for a Python dictionary using the `keys()` method." 
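If you would rather work with Stanza's own objects than convert the annotations into dictionaries, the same information can be read directly from the *Sentence* and *Word* objects. The following is a minimal sketch that assumes the pipeline under `nlp_wo` was created with its default processors, so that lemmas and part-of-speech tags are available.

```python
# Loop over the Sentence objects in the Document and over the Word
# objects within each Sentence, printing the word form, its lemma
# and its universal part-of-speech tag.
for sentence in doc_wo.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)
```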
365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "# Get a list of keys for the first Token in the dictionary 'doc_dict'\n", 374 | "doc_dict[0].keys()" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "Now that we have listed the keys, let's retrieve the value under the key `lemma`." 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": null, 387 | "metadata": {}, 388 | "outputs": [], 389 | "source": [ 390 | "# Get the value under key 'lemma' for the first item [0] in the dictionary 'doc_dict'\n", 391 | "doc_dict[0]['lemma']" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "This returns the lemma of the word \"réew\", which stands for \"country\"." 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "### Processing multiple texts using Stanza" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": { 412 | "tags": [ 413 | "remove-input" 414 | ] 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "# Run this cell to view a YouTube video related to this topic\n", 419 | "from IPython.display import YouTubeVideo\n", 420 | "YouTubeVideo('L2MmfJ3x5Jk', height=350, width=600)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "To process multiple documents with Stanza, the most efficient way is to first collect the documents as string objects into a Python list.\n", 428 | "\n", 429 | "Let's define a toy example with a couple of example documents in Wolof and store them as string objects into a list under the variable `str_docs`." 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "# Define a Python list consisting of two strings\n", 439 | "str_docs = ['Lislaam a ngi njëkk a tàbbi ci Senegaal ci diggante VIIIeelu xarnu ak IXeelu xarnu, ña ko fa dugal di ay yaxantukat yu araab-yu-berber.',\n", 440 | " 'Li ëpp ci gëstu yi ñu def ci wàllug Gëstu-askan (walla demogaraafi) ci Senegaal dafa sukkandiku ci Waññ (recensement) yi ñu jotoon a def ci 1976, 1988 rawati na 2002.'] " 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "Next, we create a list of Stanza *Document* objects using a Python list comprehension. These *Document* objects are annotated for their linguistic features when they are passed through a *Pipeline* object.\n", 448 | "\n", 449 | "At this stage, we simply cast each string in the list `str_docs` to a Stanza *Document* object.
We store the result into a list named `docs_wo_in`.\n", 450 | "\n", 451 | "Before proceeding to create the *Document* objects, let's examine how the list comprehension is structured by taking apart its syntax step by step.\n", 452 | "\n", 453 | "A list comprehension resembles the `for` loops introduced in [Part II](../part_ii/01_basic_text_processing.html#manipulating-text): it uses the contents of an existing list to create a new list.\n", 454 | "\n", 455 | "To begin with, just like lists, list comprehensions are marked using surrounding brackets `[]`.\n", 456 | "\n", 457 | "```python\n", 458 | "docs_wo_in = []\n", 459 | "```\n", 460 | "\n", 461 | "Next, on the right-hand side of the `for` statement, we use the variable `doc` to refer to items in the list `str_docs` that we are looping over.\n", 462 | "\n", 463 | "```python\n", 464 | "docs_wo_in = [... for doc in str_docs]\n", 465 | "```\n", 466 | "\n", 467 | "Now that we can refer to list items using the variable `doc`, we can define what we do to each item on the left-hand side of the `for` statement.\n", 468 | "\n", 469 | "```python\n", 470 | "docs_wo_in = [stanza.Document([], text=doc) for doc in str_docs]\n", 471 | "```\n", 472 | "\n", 473 | "For each item in the list `str_docs`, we initialise an empty `Document` object and pass two inputs to this object: \n", 474 | "\n", 475 | " 1. an empty list `[]` that will be populated with linguistic annotations, \n", 476 | " 2. the contents of the string variable under `doc` to the argument `text`. " 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "# Use a list comprehension to create a Python list with Stanza Document objects.\n", 486 | "docs_wo_in = [stanza.Document([], text=doc) for doc in str_docs]\n", 487 | "\n", 488 | "# Call the variable to check the output\n", 489 | "docs_wo_in" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "Don't let the output fool you here: what looks like two empty Python lists nested within a list are actually Stanza *Document* objects.\n", 497 | "\n", 498 | "Let's use the brackets to access and examine the first *Document* object in the list `docs_wo_in`." 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "# Check the type of the first item in the list 'docs_wo_in'\n", 508 | "type(docs_wo_in[0])" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "As you can see, the object is indeed a Stanza *Document* object.\n", 516 | "\n", 517 | "We can verify that our input texts made it into this document by examining the `text` attribute."
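Before doing so, a brief aside on the list comprehension itself: the same list could also be built with an ordinary `for` loop. The sketch below is equivalent to the comprehension used above; the variable name `docs_wo_loop` is used here only to avoid overwriting `docs_wo_in`.

```python
# Build the same list of Stanza Document objects with an explicit loop;
# this is equivalent to the list comprehension used above.
docs_wo_loop = []

for doc in str_docs:

    # Wrap the plain string into an empty Document object and add it to the list
    docs_wo_loop.append(stanza.Document([], text=doc))
```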
518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "# Check the contents of the 'text' attribute under the \n", 527 | "# first Sentence in the list 'docs_wo_in'\n", 528 | "docs_wo_in[0].text" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "Now that we have a list of Stanza *Document* objects, we can pass them all at once to the language model for annotation.\n", 536 | "\n", 537 | "This can be achieved by simply providing the list as input to the Wolof language model stored under `nlp_wo`.\n", 538 | "\n", 539 | "We then store the annotated Stanza *Document* objects under the variable `docs_wo_out`." 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": { 546 | "tags": [ 547 | "output_scroll" 548 | ] 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "# Pass the list of Document objects to the language model 'nlp_wo'\n", 553 | "# for annotation.\n", 554 | "docs_wo_out = nlp_wo(docs_wo_in)\n", 555 | "\n", 556 | "# Call the variable to check the output\n", 557 | "docs_wo_out" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "As you can see, passing the *Document* objects to the language model populates them with linguistic annotations, which can be then explored as introduced [above](#Processing-text-using-Stanza)." 565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "## Interfacing Stanza with spaCy" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "tags": [ 579 | "remove-input" 580 | ] 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "# Run this cell to view a YouTube video related to this topic\n", 585 | "from IPython.display import YouTubeVideo\n", 586 | "YouTubeVideo('Yqy7I7c7EXc', height=350, width=600)" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "If you are more familiar with the spaCy library for natural language processing, whose use was covered extensively in [Part II](../part_ii/03_basic_nlp.ipynb), then you will be happy to know that you can also use some of the Stanza language models in spaCy!\n", 594 | "\n", 595 | "This can be achieved using a Python library named [spacy-stanza](https://spacy.io/universe/project/spacy-stanza), which interfaces the two libraries.\n", 596 | "\n", 597 | "Given that Stanza currently has more pre-trained language models available than spaCy, the spacy-stanza library considerably increases the number of language models available for spaCy.\n", 598 | "\n", 599 | "There is, however, **one major limitation**: the language in question must be supported by both [Stanza](https://stanfordnlp.github.io/stanza/available_models.html) and [spaCy](https://spacy.io/usage/models#languages).\n", 600 | "\n", 601 | "For example, we cannot use the Stanza language model for Wolof in spaCy, because spaCy does not support the Wolof language.\n", 602 | "\n", 603 | "To start using Stanza language models in spaCy, let's start by importing the spacy-stanza library (module name: `spacy_stanza`)." 
604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": {}, 610 | "outputs": [], 611 | "source": [ 612 | "# Import the spaCy and spacy-stanza libraries\n", 613 | "import spacy\n", 614 | "import spacy_stanza" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "This imports both spaCy and spacy-stanza libraries into Python. To continue, we must ensure that we have the Stanza language model for Finnish available as well.\n", 622 | "\n", 623 | "As shown above, this model can be downloaded using the following command:\n", 624 | "\n", 625 | "```python\n", 626 | "# Download a Stanza language model for Finnish\n", 627 | "stanza.download(lang='fi')\n", 628 | "```" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": { 634 | "tags": [ 635 | "remove-cell" 636 | ] 637 | }, 638 | "source": [ 639 | "Just as with the language model for Wolof above, we download the Stanza language model into the permanent storage on the CSC server.\n", 640 | "\n", 641 | "To do so, provide a string object that points towards the directory `../stanza_models` to the argument `model_dir` of the `download()` function." 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": { 648 | "tags": [ 649 | "remove-cell" 650 | ] 651 | }, 652 | "outputs": [], 653 | "source": [ 654 | "# Download a Stanza language model for Finnish into the directory '../stanza_models'\n", 655 | "stanza.download(lang='fi', model_dir='../stanza_models')" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "Because spaCy supports [the Finnish language](https://spacy.io/usage/models#languages), we can load Stanza language models for Finnish into spaCy using the spacy-stanza library.\n", 663 | "\n", 664 | "This can be achieved using the `load_pipeline()` function available under the `spacy_stanza` module.\n", 665 | "\n", 666 | "To load Stanza language model for a given language, you must provide the two-letter code for the language in question (e.g. `fi`) to the argument `name`:\n", 667 | "\n", 668 | "```python\n", 669 | "# Load a Stanza language model for Finnish into spaCy\n", 670 | "nlp_fi = spacy_stanza.load_pipeline(name='fi')\n", 671 | "```" 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": { 677 | "tags": [ 678 | "remove-cell" 679 | ] 680 | }, 681 | "source": [ 682 | "Because we did not download the Stanza language models into the default directory, we must also provide the optional argument `dir` to the `load_pipeline()` function.\n", 683 | "\n", 684 | "The `dir` argument takes a string object as its input, which must point to the directory that contains Stanza language models." 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": { 691 | "tags": [ 692 | "remove-cell" 693 | ] 694 | }, 695 | "outputs": [], 696 | "source": [ 697 | "# Use the load_pipeline function to load a Stanza model into spaCy.\n", 698 | "# Assign the result under the variable 'nlp'.\n", 699 | "nlp_fi = spacy_stanza.load_pipeline(name='fi', dir='../stanza_models')" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "If we examine the resulting object under the variable `nlp_fi` using Python's `type()` function, we will see that the object is indeed a spaCy *Language* object. 
" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "execution_count": null, 712 | "metadata": {}, 713 | "outputs": [], 714 | "source": [ 715 | "# Check the type of the object under 'nlp_fi'\n", 716 | "type(nlp_fi)" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "Generally, this object behaves just like any other spaCy *Language* object that we learned to use in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).\n", 724 | "\n", 725 | "We can explore its use by processing a few sentences from a recent [news article](https://yle.fi/aihe/artikkeli/2021/03/08/yleiso-aanesti-tarja-halonen-on-inspiroivin-nainen-karkikolmikkoon-ylsivat-myos) in written Finnish.\n", 726 | "\n", 727 | "We feed the text as a string object to the *Language* object under `nlp_fi` and store the result under the variable `doc_fi`." 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": null, 733 | "metadata": {}, 734 | "outputs": [], 735 | "source": [ 736 | "# Feed the text to the language model under 'nlp_fi', store result under 'doc_fi'\n", 737 | "doc_fi = nlp_fi('Tove Jansson keräsi 148 ääntä eli 18,2% annetuista äänistä. Kirjailija, kuvataiteilija ja pilapiirtäjä tuli kansainvälisesti tunnetuksi satukirjoistaan ja sarjakuvistaan.')" 738 | ] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "Let's continue by retrieving sentences from the *Doc* object, which are available under the attribute `sents`, as we learned in [Part II](../part_ii/03_basic_nlp.ipynb#Sentence-segmentation).\n", 745 | "\n", 746 | "The object available under the `sents` attribute is a Python generator that yields *Doc* objects. \n", 747 | "\n", 748 | "To examine them, we must catch the objects into a suitable data structure. In this case, the data structure that best fits our needs is a Python list.\n", 749 | "\n", 750 | "Hence we cast the output from the generator object under `sents` into a list using the `list()` function." 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [ 759 | "# Get sentences contained in the Doc object 'doc_fi'.\n", 760 | "# Cast the result into list.\n", 761 | "sents_fi = list(doc_fi.sents)\n", 762 | "\n", 763 | "# Call the variable to check the output\n", 764 | "sents_fi" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": {}, 770 | "source": [ 771 | "We can also use spaCy's `displacy` submodule to visualise the syntactic dependencies.\n", 772 | "\n", 773 | "To do so for the first sentence under `sents_fi`, we must first access the first item in the list using brackets `[0]` as usual.\n", 774 | "\n", 775 | "Let's start by checking the type of this object." 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": null, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "# Check the type of the first item in the list 'sents_fi'\n", 785 | "type(sents_fi[0])" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "As you can see, the result is a spaCy *Span* object, which is a sequence of *Token* objects contained within a *Doc* object.\n", 793 | "\n", 794 | "We can then call the `render` function from the `displacy` submodule to visualise the syntactic dependencies for the *Span* object under `sents_fi[0]`." 
795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": {}, 801 | "outputs": [], 802 | "source": [ 803 | "# Import the displacy submodule\n", 804 | "from spacy import displacy\n", 805 | "\n", 806 | "# Use the render function to render the first item [0] in the list 'sents_fi'.\n", 807 | "# Pass the argument 'style' with the value 'dep' to visualise syntactic dependencies.\n", 808 | "displacy.render(sents_fi[0], style='dep')" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "Note that spaCy will raise a warning about storing custom attributes when writing the *Doc* object to disk for visualisation.\n", 816 | "\n", 817 | "We can also examine the linguistic annotations created for individual *Token* objects within this *Span* object." 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": {}, 824 | "outputs": [], 825 | "source": [ 826 | "# Loop over each Token object in the Span\n", 827 | "for token in sents_fi[0]:\n", 828 | " \n", 829 | " # Print the token, its lemma, dependency and morphological features\n", 830 | " print(token, token.lemma_, token.dep_, token.morph)" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "The examples above show how we can access the linguistic annotations created by a Stanza language model through spaCy *Doc*, *Span* and *Token* objects.\n", 838 | "\n", 839 | "This section should have given you an idea of how to begin processing diverse languages.\n", 840 | "\n", 841 | "In the [following section](02_universal_dependencies.ipynb), we will dive deeper into the Universal Dependencies framework." 842 | ] 843 | } 844 | ], 845 | "metadata": { 846 | "celltoolbar": "Edit Metadata", 847 | "execution": { 848 | "timeout": -1 849 | }, 850 | "kernelspec": { 851 | "display_name": "Python 3 (ipykernel)", 852 | "language": "python", 853 | "name": "python3" 854 | }, 855 | "language_info": { 856 | "codemirror_mode": { 857 | "name": "ipython", 858 | "version": 3 859 | }, 860 | "file_extension": ".py", 861 | "mimetype": "text/x-python", 862 | "name": "python", 863 | "nbconvert_exporter": "python", 864 | "pygments_lexer": "ipython3", 865 | "version": "3.9.7" 866 | } 867 | }, 868 | "nbformat": 4, 869 | "nbformat_minor": 4 870 | } 871 | -------------------------------------------------------------------------------- /part_iii/img/alignment.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Produced by OmniGraffle 7.18.5\n2021-04-16 06:53:00 +0000 27 | 28 | spaCy vs. Transformer Tokens 29 | 30 | 31 | Layer 1 32 | 33 | 34 | The token “Helsinki” does not exist in 35 | the Transformer’s vocabulary, which 36 | is why the word is split into three 37 | subwords 38 | found in the vocabulary 39 | 40 | 41 | 42 | 43 | Helsinki is the capital of Finland. 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | H els inki is the capital of Finland. 
67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | spaCy tokens 92 | 93 | 94 | 95 | 96 | Transformer 97 | tokens 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | -------------------------------------------------------------------------------- /part_iii/img/parasyn.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Produced by OmniGraffle 7.18\n2020-12-08 08:09:00 +0000 27 | 28 | Paradigmatic / syntagmatic 29 | 30 | Layer 1 31 | 32 | 33 | 34 | 35 | This city 36 | Helsinki 37 | The capital 38 | 39 | 40 | 41 | 42 | 43 | 44 | is 45 | hosted 46 | was founded 47 | 48 | 49 | 50 | 51 | 52 | 53 | located 54 | the capital 55 | the Olympics 56 | 57 | 58 | 59 | 60 | 61 | 62 | in 1952 63 | of Finland 64 | in 1550 65 | by the Baltic 66 | Sea 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | Paradigmatic organisation: 79 | choices among 80 | 81 | alternatives 82 | 83 | 84 | 85 | 86 | Syntagmatic organisation: 87 | combinations of choices 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | -------------------------------------------------------------------------------- /part_iii/img/type_token.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | Produced by OmniGraffle 7.18.5\n2021-04-15 09:40:24 +0000 32 | 33 | Vectors vs. contextual vectors 34 | 35 | 36 | Layer 1 37 | 38 | 39 | the 40 | 41 | 42 | 43 | 44 | capital 45 | 46 | 47 | 48 | 49 | is 50 | 51 | 52 | 53 | 54 | Helsinki 55 | 56 | 57 | 58 | 59 | of 60 | 61 | 62 | 63 | 64 | Finland 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | Individual 73 | lexical types 74 | in 75 | the model’s vocabulary 76 | 77 | 78 | 79 | 80 | [0.69, 81 | -0.78, 82 | -0.13, 83 | -0.99, 84 | -0.56, 85 | ...] 86 | 87 | 88 | 89 | 90 | [0.46, 91 | 0.13, 92 | 0.73, 93 | -0.56, 94 | -0.59, 95 | ...] 96 | 97 | 98 | 99 | 100 | [0.68, 101 | -0.86, 102 | -0.63, 103 | 0.29, 104 | -0.96, 105 | ...] 106 | 107 | 108 | 109 | 110 | [-0.42, 111 | -0.26, 112 | -0.35, 113 | -0.77, 114 | -0.32, 115 | ...] 116 | 117 | 118 | 119 | 120 | [-0.26, 121 | -0.53, 122 | 0.46, 123 | -0.06, 124 | -0.51, 125 | ...] 126 | 127 | 128 | 129 | 130 | [0.64, 131 | 0.05, 132 | -0.32, 133 | -0.91, 134 | -0.4, 135 | ...] 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | High-dimensional vector 144 | representations for 145 | lexical types 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | Tokens 158 | are instances of a particular lexical type 159 | 160 | 161 | 162 | 163 | … Stockholm is a capital city by the Baltic Sea … 164 | 165 | 166 | 167 | 168 | … Capital market is a financial market in which long-term debt … 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | Word embeddings 188 | learn 189 | representations for 190 | lexical types 191 | 192 | 193 | 194 | 195 | Contextual word embeddings 196 | learn 197 | representations for 198 | tokens 199 | , that is, 200 | instances of lexical types in their 201 | context of use! 
202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==1.0.0 2 | altair==4.1.0 3 | argon2-cffi==21.1.0 4 | astunparse==1.6.3 5 | attrs==21.2.0 6 | backcall==0.2.0 7 | bleach==4.1.0 8 | blis==0.7.5 9 | bpemb==0.3.3 10 | cachetools==4.2.4 11 | catalogue==2.0.6 12 | certifi==2021.10.8 13 | cffi==1.15.0 14 | charset-normalizer==2.0.7 15 | click==8.0.3 16 | conllu==4.4.1 17 | cycler==0.11.0 18 | cymem==2.0.6 19 | debugpy==1.5.1 20 | decorator==5.1.0 21 | defusedxml==0.7.1 22 | emoji==1.6.1 23 | en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl 24 | en-core-web-md @ https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl 25 | en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl 26 | en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl 27 | entrypoints==0.3 28 | filelock==3.4.0 29 | flatbuffers==2.0 30 | fonttools==4.28.2 31 | gast==0.4.0 32 | gensim==3.8.3 33 | google-auth==2.3.3 34 | google-auth-oauthlib==0.4.6 35 | google-pasta==0.2.0 36 | grpcio==1.42.0 37 | h5py==3.6.0 38 | huggingface-hub==0.1.2 39 | idna==3.3 40 | importlib-metadata==4.8.2 41 | importlib-resources==5.4.0 42 | ipykernel==6.5.1 43 | ipython==7.29.0 44 | ipython-genutils==0.2.0 45 | ipywidgets==7.6.5 46 | jedi==0.18.1 47 | Jinja2==3.0.3 48 | joblib==1.1.0 49 | jsonschema==4.2.1 50 | jupyter==1.0.0 51 | jupyter-client==7.1.0 52 | jupyter-console==6.4.0 53 | jupyter-core==4.9.1 54 | jupyterlab-pygments==0.1.2 55 | jupyterlab-widgets==1.0.2 56 | keras==2.7.0 57 | Keras-Preprocessing==1.1.2 58 | kiwisolver==1.3.2 59 | langcodes==3.3.0 60 | libclang==12.0.0 61 | Markdown==3.3.6 62 | MarkupSafe==2.0.1 63 | matplotlib==3.5.0 64 | matplotlib-inline==0.1.3 65 | mistune==0.8.4 66 | murmurhash==1.0.6 67 | nbclient==0.5.9 68 | nbconvert==6.3.0 69 | nbformat==5.1.3 70 | nest-asyncio==1.5.1 71 | notebook==6.4.6 72 | numpy==1.21.4 73 | oauthlib==3.1.1 74 | opt-einsum==3.3.0 75 | packaging==21.3 76 | pandas==1.3.4 77 | pandocfilters==1.5.0 78 | parso==0.8.2 79 | pathy==0.6.1 80 | pexpect==4.8.0 81 | pickleshare==0.7.5 82 | Pillow==8.4.0 83 | preshed==3.0.6 84 | prometheus-client==0.12.0 85 | prompt-toolkit==3.0.22 86 | protobuf==3.19.1 87 | ptyprocess==0.7.0 88 | pyasn1==0.4.8 89 | pyasn1-modules==0.2.8 90 | pycparser==2.21 91 | pydantic==1.8.2 92 | Pygments==2.10.0 93 | pyparsing==3.0.6 94 | pyrsistent==0.18.0 95 | python-dateutil==2.8.2 96 | pytz==2021.3 97 | PyYAML==6.0 98 | pyzmq==22.3.0 99 | qtconsole==5.2.1 100 | QtPy==1.11.2 101 | regex==2021.11.10 102 | requests==2.26.0 103 | requests-oauthlib==1.3.0 104 | rsa==4.7.2 105 | sacremoses==0.0.46 106 | scikit-learn==1.0.1 107 | scipy==1.7.2 108 | seaborn==0.11.2 109 | Send2Trash==1.8.0 110 | sentencepiece==0.1.96 111 | setuptools-scm==6.3.2 112 | six==1.16.0 113 | smart-open==5.2.1 114 | spacy==3.2.0 115 | spacy-alignments==0.8.4 116 | spacy-legacy==3.0.8 117 | spacy-loggers==1.0.1 118 | spacy-stanza==1.0.1 119 | spacy-transformers==1.1.2 120 | srsly==2.4.2 121 | stanza==1.3.0 122 | tensorboard==2.7.0 123 | 
tensorboard-data-server==0.6.1 124 | tensorboard-plugin-wit==1.8.0 125 | tensorflow==2.7.0 126 | tensorflow-estimator==2.7.0 127 | tensorflow-io-gcs-filesystem==0.22.0 128 | termcolor==1.1.0 129 | terminado==0.12.1 130 | testpath==0.5.0 131 | thinc==8.0.13 132 | threadpoolctl==3.0.0 133 | tokenizers==0.10.3 134 | tomli==1.2.2 135 | toolz==0.11.2 136 | torch==1.10.0 137 | tornado==6.1 138 | tqdm==4.62.3 139 | traitlets==5.1.1 140 | transformers==4.11.3 141 | typer==0.4.0 142 | typing_extensions==4.0.0 143 | urllib3==1.26.7 144 | wasabi==0.8.2 145 | wcwidth==0.2.5 146 | webencodings==0.5.1 147 | Werkzeug==2.0.2 148 | whatlies==0.6.5 149 | widgetsnbextension==3.5.2 150 | wrapt==1.13.3 151 | zipp==3.6.0 152 | --------------------------------------------------------------------------------
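The pinned versions in `requirements.txt` document the environment the notebooks were written against and are typically installed into a fresh virtual environment with `pip install -r requirements.txt`. The small sketch below is a hypothetical helper, not part of the repository: it compares the pins against whatever is installed in the current environment and skips the spaCy model wheels, which are specified with `@` URLs rather than version pins.

```python
# A hypothetical helper (not part of the repository): report packages whose
# installed version differs from the pin in requirements.txt. Entries installed
# from direct URLs (the spaCy model wheels) are skipped.
from importlib.metadata import PackageNotFoundError, version

with open("requirements.txt", encoding="utf-8") as requirements:
    for line in requirements:
        line = line.strip()
        if not line or "@" in line:
            continue
        name, _, pinned = line.partition("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            print(f"{name}: expected {pinned}, found {installed}")
```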