├── .gitignore
├── 1-intro-to-python-and-text.ipynb
├── 2-python-basics-recap-and-more.ipynb
├── 3-choosing-and-collecting-text.ipynb
├── 4-cleaning-and-exploring-text.ipynb
├── 5-analysis-and-visualisation.ipynb
├── LICENSE
├── README.md
├── assets
│   ├── cd-directory.png
│   ├── conda-install-pip.png
│   ├── create.png
│   ├── deepnote-python39.png
│   ├── download-or-clone.png
│   ├── download-zip.png
│   ├── function-black-box.png
│   ├── jupyter-notebooks.png
│   ├── list-comprehension-with-condition.png
│   ├── list-comprehension.png
│   ├── new-env.png
│   └── pipeline.png
├── data
│   ├── CLEAN-2199-0.txt
│   ├── ORIGINAL-2199-0.txt
│   └── PREPPED-2199-0.txt
├── exercises
│   ├── 1-reversing-strings-solution.ipynb
│   └── 1-reversing-strings.ipynb
├── poetry.lock
├── pyproject.toml
├── requirements.txt
└── runtime.txt

/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # PyCharm 132 | .idea 133 | -------------------------------------------------------------------------------- /1-intro-to-python-and-text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "\n", 10 | "# Introduction to Python, Jupyter Notebooks and Working with Text\n", 11 | "\n", 12 | "---\n", 13 | "---\n", 14 | "\n", 15 | "## The Programming Language Python\n", 16 | "Python is a **general-purpose programming language** that can be used in many different ways; for example, it can be used to analyse data, create websites and automate boring stuff. \n", 17 | "\n", 18 | "Python is **excellent for beginners** because its basic syntax is simple and uncluttered with punctuation. Coding in Python often feels like writing in natural language. 
Python is a very popular language and has a large, friendly community of users around the world, so there are many tutorials and helpful experts out there to help you get started.\n", 19 | "\n", 20 | "![Logo of the Python programming language](https://www.python.org/static/img/python-logo.png \"Logo of the Python programming language\")\n", 21 | "\n", 22 | "---\n", 23 | "---\n", 24 | "\n", 25 | "## Jupyter Notebooks\n", 26 | "\n", 27 | "This 'document' you are reading right now is a Jupyter Notebook. It allows you to combine explanatory **text** and Python **code** that executes to produce the results you can see on the same page. You can also create and display visualisations from your data in the same document.\n", 28 | "\n", 29 | "Notebooks are particularly useful for *exploring* your data at an early stage and *documenting* exactly what steps you have taken (and why) to get to your results. This documentation is extremely important to record what you did so that others can reproduce your work... and because otherwise you are guaranteed to forget what you did in the future.\n", 30 | "\n", 31 | "![Logo of Jupyter Notebooks](https://jupyter.org/assets/nav_logo.svg \"Logo of Jupyter Notebooks\")\n", 32 | "\n", 33 | "For a more in-depth tutorial on getting started with Jupyter Notebooks try this [Jupyter Notebook for Beginners Tutorial](https://towardsdatascience.com/jupyter-notebook-for-beginners-a-tutorial-f55b57c23ada)." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Notebook Basics\n", 41 | "\n", 42 | "#### Text cells\n", 43 | "\n", 44 | "The box this text is written in is called a *cell*. It is a *text cell* marked up in a very simple language called 'Markdown'. Here is a useful [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). You can edit and then run cells to produce a result. 
Running this text cell produces formatted text.\n", 45 | "\n", 46 | "---\n", 47 | "> Exercise: Double-click on this cell now to edit it. Run the cell with the keyboard shortcut `Ctrl+Enter`, or by clicking the Run button in the toolbar at the top.\n", 48 | "\n", 49 | "\"Click\n", 50 | "\n", 51 | "---\n", 52 | "\n", 53 | "#### Code cells\n", 54 | "\n", 55 | "The other main kind of cell is a *code cell*. The cell immediately below this one is a code cell. Running a code cell runs the code in the cell (marked by the **`In [1]:`**) and produces a result (marked by the **`Out [1]:`**). We say the code is **evaluated**.\n", 56 | "\n", 57 | "---\n", 58 | "> Exercise: Try running the code cell below to evaluate the Python code. Then change the sum to try and get a different result.\n", 59 | "\n", 60 | "---"
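A minimal sketch of why running order matters (the name `greeting` is made up for illustration): a cell that uses a name can only run after the cell that assigns it.

```python
# Cell 1: assigns a name.
greeting = 'Hello, notebook!'

# Cell 2: uses the name. Run before Cell 1, this line would fail
# with: NameError: name 'greeting' is not defined
print(greeting.upper())  # HELLO, NOTEBOOK!
```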
97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "---\n", 104 | "---\n", 105 | "\n", 106 | "## How to Join In with Coding\n", 107 | "\n", 108 | "* **Edit** any cell and try changing the code, or delete it and write your own.\n", 109 | "\n", 110 | "* Before running a cell, try to **guess** what the output will be by thinking through what will happen.\n", 111 | "\n", 112 | "* If you encounter an **error**, realise this is normal. Errors happen all the time and by reading the error message you will learn something new.\n", 113 | "\n", 114 | "* Remember: you cannot 'break' your computer by editing this code, so **don't be afraid to experiment**." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "---\n", 122 | "---\n", 123 | "\n", 124 | "## Simple String Manipulation in Python\n", 125 | "\n", 126 | "When we want to store and manipulate books, archives, records or other textual data in a computer, we have to do so with **strings**. Strings are the way that Python (and most programming languages) deal with text.\n", 127 | "\n", 128 | "A string is a simple *sequence of characters*, for example, the string `coffee` is a sequence of the individual characters `c` `o` `f` `f` `e` `e`.\n", 129 | "\n", 130 | "![A cup of coffee](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Cup-o-coffee-simple.svg/240px-Cup-o-coffee-simple.svg.png \"A cup of coffee\")\n", 131 | "\n", 132 | "This section introduces some basic things you can do in Python to create and manipulate strings. What you learn here forms the basis of any and all text-mining you may do with Python in the future. It's worth getting the fundamentals right." 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "### Create and Store Strings with Names\n", 140 | "Strings are simple to create in Python. 
You can simply write some characters in quote marks (either single `'` or double `\"` is fine in general).\n", 141 | "\n", 142 | "By running the cell below Python will evaluate the code, recognise it is a new string, and then print it out. Try this now." 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "'Butterflies are important as pollinators.'" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "In order to do something useful with this string, other than print it out, we need to store it by using the *assignment operator* `=` (equals sign). Whatever is on the right-hand side of the `=` is stored with the _name_ on the left-hand side. In other words, the name is assigned to the value. The name is like a label that sticks to the value and can be used to refer to it at a later point.\n", 159 | "\n", 160 | "The pattern is as follows:\n", 161 | "\n", 162 | "`name = value`\n", 163 | "\n", 164 | "(In other programming languages, this is called creating a *variable*.)\n", 165 | "\n", 166 | "Run the code cell below." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "my_sentence = 'Butterflies are important as pollinators.'" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Notice that nothing is printed to the screen.\n", 183 | "\n", 184 | "That's because the string is stored with the name `my_sentence` rather than being printed out. In order to see what is 'inside' `my_sentence` we can simply write `my_sentence` in a code cell, run it, and the interpreter will print it out for us.\n", 185 | "\n", 186 | "Remember to run every code cell you come across in the notebook from now on." 
187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "my_sentence" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "---\n", 203 | "> Exercise: Try creating your own string in the code cell below and assign it the name `my_string`. Then add a second line of code that will cause the string to be printed out. You should have two lines of code before you run the code cell.\n", 204 | "\n", 205 | "---\n", 206 | "\n", 207 | "If you are in need of inspiration, copy and paste a string from the [Cambridge Digital Library Darwin-Hooker letters](https://cudl.lib.cam.ac.uk/collections/darwinhooker/1)." 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Write code here" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "### Concatenate Strings\n", 224 | "If you add two numbers together with a plus sign (`+`) you get addition. With strings, if you add two (or more) of them together with `+` they are concatenated, that is, the sequences of characters are joined." 
225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "another_sentence = my_sentence + ' Bees are too.'\n", 234 | "another_sentence" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "# Write code here to create and print a new string that concatenates 3 strings together of your choice" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "### Manipulate Whole Strings with Methods\n", 251 | "Python strings have some built-in *methods* that allow you to manipulate a whole string at once. You apply the method using what is called *dot notation*, with the name of the string first, followed by a dot (`.`), then the name of the method, and finally a pair of parentheses (`()`), which tells Python to run the method.\n", 252 | "\n", 253 | "The pattern is as follows:\n", 254 | "\n", 255 | "`name_of_my_string.method_name()`\n", 256 | "\n", 257 | "You can change all characters to lowercase or uppercase:" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "my_string.lower()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "my_string.upper()" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "Note that these functions do not change the original string but instead create a new one. 
The original string is still the same as it was before:" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "my_string" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "In order to 'save' the newly manipulated string, you have to assign it a new name:" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "new_string = my_string.lower()\n", 308 | "new_string" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "### Test Strings with Methods\n", 316 | "\n", 317 | "You can also test a string to see if it passes some test, e.g. is the string all alphabetic characters only?" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "my_sentence.isalpha()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "There are many different string methods to try.\n", 334 | "\n", 335 | "The [full documentation on string methods](https://docs.python.org/3.10/library/stdtypes.html#string-methods) gives all the technical details. It is a skill and art to read code documentation, and you should start to learn it as soon as you can on your code journey. 
But it's not necessary for the session today.\n", 336 | "\n", 337 | "---\n", 338 | "> Exercise: Try three of the string methods listed in the documentation above on your own string `my_string`.\n", 339 | "\n", 340 | "---\n" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# Write code here" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### Access Individual Characters with an Index\n", 357 | "A string is just a sequence of characters, much like a list. You can access individual characters in a string by specifying which ones you want. To do this we use what is called an *index* number in square brackets (`[]`). Like this:" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "my_sentence[1]" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "Here the index is `1` and we expect to get the first character of the string... or do we? Did you notice something unexpected?\n", 374 | "\n", 375 | "It gave us `u` instead of `B`.\n", 376 | "\n", 377 | "In programming, things are often counted from 0 rather than 1, which is called *zero-based numbering*. Thus, in the example above, `1` gives us the *second* character in the string, not the first like you might expect.\n", 378 | "\n", 379 | "If you want the first character in the string, you need to specify the index `0`." 
380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "# Write code here to get the first character in the string" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "### Access a Range of Characters with Slicing\n", 396 | "\n", 397 | "You can also pick out a range of characters from within a string, by giving a start index, followed by an end index, with a colon (`:`) in between. This is called *slice notation*.\n", 398 | "\n", 399 | "The example below gives us the character at index `0` all the way up to, but **not** including, the character at index `20`." 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "my_sentence[0:20]" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "# Write code here to access the word \"important\" from the string" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "We can also slice in jumps through a string. To do this, we add what is known as the *step*. \n", 425 | "\n", 426 | "The pattern is as follows:\n", 427 | " \n", 428 | "`my_sentence[start:stop:step]`\n", 429 | "\n", 430 | "So, to go in jumps of 2 characters:" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "my_sentence[0:20:2]" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "### Lists of Strings\n", 447 | "Another important and useful thing we can do with strings is storing them in a *list*. 
For example, you could have each sentence in a document stored as a list of strings.\n", 448 | "\n", 449 | "To create a new list of strings we list them inside square brackets `[]` on the right-hand side of the assignment operator (`=`):" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "my_list = ['Butterflies are important as pollinators',\n", 459 | " 'Butterflies feed primarily on nectar from flowers',\n", 460 | " 'Butterflies are widely used in objects of art']" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "# Write code here to print out the list" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "Yes, we have used square brackets (`[]`) before for indexing individual characters (above), but this is a different use of square brackets to create lists. If you are unfamiliar with Python, I can reassure you that eventually you get used to these different uses of square brackets." 
477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### Access Individual Strings in a List with an Index\n", 484 | "Just like with strings, we can access individual items inside a list by index number:" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "my_list[0]" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "### Access a Range of Strings in a List with Slicing\n", 501 | "Likewise, we can access a range of items inside a list by using slice notation:" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "my_list[0:2]" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "Just like with strings, we can also slice in steps." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "# Write code here to slice every other item in `my_list` (i.e. the first and third item)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "To access the whole list we can use the shorthand:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "my_list[:]" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Why would we want to do this? Well, combine this trick with a negative step, and we can go *backwards* through the whole list!" 
550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "my_list[::-1]" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "---\n", 566 | "---\n", 567 | "## Summary\n", 568 | "\n", 569 | "Let's take a moment to summarise what we have covered so far.\n", 570 | "\n", 571 | "* Python is a general-purpose programming language that is good for beginners.\n", 572 | "* Jupyter notebooks have two main types of **cell** to run: **code** and **text** (Markdown).\n", 573 | "* How to create and store a **string** with a **name** (_aka_ variables).\n", 574 | "* How to manipulate strings by:\n", 575 | " * **Concatenating** two or more strings together;\n", 576 | " * **Accessing** an individual character with an **index**;\n", 577 | " * **Accessing** a range of characters with **slicing**;\n", 578 | " * **Changing** whole strings or testing strings with **string methods**.\n", 579 | "* How to create and store a **list** of strings with a name.\n", 580 | "* How to access strings in a list with an index and with slicing." 
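The points covered so far can be condensed into a few lines of code (the names and strings here are illustrative):

```python
# Create and store a string with a name
s = 'coffee'

# Concatenate strings
drink = s + ' cup'            # 'coffee cup'

# Index (zero-based) and slice
first = s[0]                  # 'c'
part = s[0:3]                 # 'cof'

# String methods return new strings; `s` itself is unchanged
s_upper = s.upper()           # 'COFFEE'

# A list of strings, indexed and sliced the same way
words = ['latte', 'mocha', 'espresso']
backwards = words[::-1]       # ['espresso', 'mocha', 'latte']
```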
581 | ] 582 | } 583 | ], 584 | "metadata": { 585 | "kernelspec": { 586 | "display_name": "Python 3 (ipykernel)", 587 | "language": "python", 588 | "name": "python3" 589 | }, 590 | "language_info": { 591 | "codemirror_mode": { 592 | "name": "ipython", 593 | "version": 3 594 | }, 595 | "file_extension": ".py", 596 | "mimetype": "text/x-python", 597 | "name": "python", 598 | "nbconvert_exporter": "python", 599 | "pygments_lexer": "ipython3", 600 | "version": "3.9.10" 601 | } 602 | }, 603 | "nbformat": 4, 604 | "nbformat_minor": 1 605 | } 606 | -------------------------------------------------------------------------------- /2-python-basics-recap-and-more.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# More Python Basics\n", 10 | "\n", 11 | "---\n", 12 | "---\n", 13 | "\n", 14 | "## Going Further with Python for Text-mining\n", 15 | "\n", 16 | "This notebook builds on the [previous notebook](1-intro-to-python-and-text.ipynb) to teach you a bit more Python so you can understand the text-mining examples presented in the following notebooks.\n", 17 | "\n", 18 | "These are the fundamentals of working with strings in Python, and other basics that every Python user may use every day. However, this is just an introduction and it is not expected that you will be ready and capable of simply diving into coding straight after completing this course.\n", 19 | "\n", 20 | "Rather, these notebooks and the accompanying live teaching sessions are supposed to give you just a taster of what text-mining with Python is about. 
By the end of the course I hope you will come away with either: an interest to learn more; or equally valid, an informed feeling that coding is not for you.\n", 21 | "\n", 22 | "Having said this, there are many approaches to learning programming, and it is often only once you happen upon the right approach for you that you make good progress. It is worth trying different topics, teachers, media and learning styles. I tried to learn programming several times over many years and eventually found the right course that kickstarted my own coding journey.\n", 23 | "\n", 24 | "---\n", 25 | "---\n", 26 | "\n", 27 | "## Recap of Strings\n", 28 | "Welcome back! Here's a quick recap of what we learnt in [1-intro-to-python-and-text](1-intro-to-python-and-text.ipynb). Strings are the way that Python deals with text. \n", 29 | "\n", 30 | "Create a *string* and store it with a *name*:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "my_sentence = 'The Moon formed 4.51 billion years ago.'\n", 40 | "my_sentence" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "_Concatenate_ strings together:" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "my_sentence + \" \" + \"It is the fifth-largest satellite in the Solar System.\"" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "_Index_ a string. Remember that indexing in Python starts at 0." 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "my_sentence[16]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "_Slice_ a string. Remember that the slice goes from the first index up to but _not_ including the second index." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "my_sentence[0:20]" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "Transform a string with _string methods_. Important: the original string `my_sentence` is unchanged. Instead, a string method _returns_ a new string." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "my_sentence.swapcase()" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "Test a string with string methods:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "my_sentence.islower()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "Create a _list_ of strings:" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "my_list = ['The Moon formed 4.51 billion years ago',\n", 137 | " \"The Moon is Earth's only permanent natural satellite\",\n", 138 | " 'The Moon was first reached in September 1959']\n", 139 | "my_list" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "_Slice_ a list. Add a _step_ to jump through a string or list by more than one. Use a step of `-1` to go backwards. 
" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "my_list[0:3:2]" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "---\n", 163 | "---\n", 164 | "\n", 165 | "## Create a List of Strings with List Comprehensions\n", 166 | "\n", 167 | "Let's get going on some new material.\n", 168 | "\n", 169 | "We can create new lists in a quick and elegant way by using _list comprehensions_. Essentially, a list comprehension _loops_ over each item in a list, one by one, and returns something each time, and creates a new list.\n", 170 | "\n", 171 | "For example, here is a list of strings:\n", 172 | "\n", 173 | "`['banana', 'apple', 'orange', 'kiwi']`\n", 174 | "\n", 175 | "We could use a list comprehension to loop over this list and create a new list with every item made UPPERCASE. The resulting list would look like this:\n", 176 | "\n", 177 | "`['BANANA', 'APPLE', 'ORANGE', 'KIWI']`\n", 178 | "\n", 179 | "The code for doing this is below:" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "fruit = ['banana', 'apple', 'orange', 'kiwi']\n", 189 | "fruit_u = [item.upper() for item in fruit]\n", 190 | "fruit_u" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "The pattern is as follows:\n", 198 | "\n", 199 | "`[return_something for each_item in list]`\n", 200 | "\n", 201 | "First thing to say is the `for` and `in` are *keywords*, that is, they are special reserved words in Python. 
These must be present exactly in this order in every list comprehension.\n", 202 | "\n", 203 | "The other words (`return_something`, `each_item`, `list`) are placeholders for whatever variables (names) you are working with in your case.\n", 204 | "\n", 205 | "![List comprehensions diagram](assets/list-comprehension.png)\n", 206 | "\n", 207 | "> Let's look at some of the details:\n", 208 | " * A list comprehension goes inside square brackets (`[]`), which tells Python to create a new list.\n", 209 | " * `list` is the name of your list. It has to be a list you have already created in a previous step.\n", 210 | " * The `each_item in list` part is the loop.\n", 211 | " * `each_item` is the name you assign to each item as it is selected by the loop. The name you choose should be something descriptive that helps you remember what it is.\n", 212 | " * The `return_something for` part is what happens each time it loops over an item. The `return_something` could just be the original item, or it could be something fairly complicated.\n", 213 | "\n", 214 | "The most basic example is just to return each item unchanged as the loop passes over it, so that every item ends up in the new list.\n", 215 | "\n", 216 | "Here is an example where we have taken our original list `my_list` and created a new list `new_list` with the exact same items unchanged:" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "new_list = [item for item in my_list]\n", 226 | "new_list" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "Why do this? There does not seem to be much point in creating the same list again. 
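One way to see the point is that the identity comprehension is a template: it does the same work as an explicit `for` loop, and it builds a new, separate list rather than reusing the old one. A sketch, assuming the 'Moon' list `my_list` from earlier:

```python
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']

# The comprehension...
new_list = [item for item in my_list]

# ...is shorthand for this explicit loop:
looped_list = []
for item in my_list:
    looped_list.append(item)

print(new_list == looped_list)  # True: the same items
print(new_list is my_list)      # False: a distinct list object
```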
" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### Manipulate Lists with String Methods\n", 241 | "\n", 242 | "By adding a string method to a list comprehension we have a powerful way to manipulate a list.\n", 243 | "\n", 244 | "We have already seen this in the `fruit` example above. Here's another example of the same thing with the 'Moon' list we've been working with. Every time the Python loops over an item it transforms it to uppercase before adding it to the new list:" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "new_list_upper = [item.upper() for item in my_list]\n", 254 | "new_list_upper" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "# Write code to transform every item in the list with a string method (of your choice)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Hint: see the [full documentation on string methods](https://docs.python.org/3.10/library/stdtypes.html#string-methods)." 
271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "### Filter Lists with a Condition\n", 278 | "\n", 279 | "We can _filter_ a list by adding a _condition_ so that only certain items are included in the new list:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": {}, 286 | "outputs": [], 287 | "source": [ 288 | "new_list_p = [item for item in my_list if 'p' in item]\n", 289 | "new_list_p" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "The pattern is as follows:\n", 297 | "\n", 298 | "`[return_something for each_item in list if some_condition]`\n", 299 | "\n", 300 | "![List comprehensions with condition diagram](assets/list-comprehension-with-condition.png)\n", 301 | "\n", 302 | "Essentially, what we are saying here is that **if** the character \"p\" is **in** the item when Python loops over it, keep it and add it to the new list, otherwise ignore it and throw it away.\n", 303 | "\n", 304 | "Thus, the new list has only two of the strings in it. The first string has a \"p\" in \"permanent\"; the second has a \"p\" in \"September\"." 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "# Write code to filter the list for items that include a number (of your choice)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "---\n", 321 | "---\n", 322 | "## Adding New Capabilities with Imports\n", 323 | "\n", 324 | "Python has a lot of amazing capabilities built-in to the language itself, like being able to manipulate strings. However, in any Python project you are likely to want to use Python code written by someone else to go beyond the built-in capabilities. 
Code 'written by someone else' comes in the form of a file (or files) separate from the one you are currently working on.\n", 325 | "\n", 326 | "An external Python file (or sometimes a *package* of files) is called a *module*, and in order to use one in your code, you need to *import* it.\n", 327 | "\n", 328 | "This is a simple process using the keyword `import` and the name of the module. Just make sure that you `import` something _before_ you want to use it!\n", 329 | "\n", 330 | "The pattern is as follows:\n", 331 | "\n", 332 | "`import module_name`\n", 333 | "\n", 334 | "Here is a series of examples. See if you can guess what each one is doing before running it." 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "import math\n", 344 | "math.pi" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "import random\n", 354 | "random.random()" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "import locale\n", 364 | "locale.getlocale()" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "The answers are: the value of the mathematical constant *pi*, a random number (different every time you run it), and the current locale that the computer thinks it is working in." 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "---\n", 379 | "---\n", 380 | "## Reusing Code with Functions\n", 381 | "\n", 382 | "A function is a _reusable block of code_ that has been wrapped up and given a _name_. The function might have been written by someone else, or it could have been written by you. 
We don't cover how to write functions in this course; just how to run functions written by someone else.\n", 383 | "\n", 384 | "In order to run the code of a function, we use the name followed by parentheses `()`. \n", 385 | "\n", 386 | "The pattern is as follows:\n", 387 | "\n", 388 | "`name_of_function()`\n", 389 | "\n", 390 | "We have already seen this earlier. Here are a selection of functions (or methods) we have run so far:" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [ 399 | "# 'lower()' is the function (aka method)\n", 400 | "my_sentence = 'Butterflies are important as pollinators.'\n", 401 | "my_sentence.lower()" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "# 'isalpha()' is the function (aka method)\n", 411 | "my_sentence.isalpha()" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "# 'random()' is the function\n", 421 | "random.random()" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "---\n", 429 | "#### Functions and Methods\n", 430 | "There is a technical difference between functions and methods. You don't need to worry about the distinction for our course. We will treat all functions and methods as the same.\n", 431 | "\n", 432 | "If you are interested in learning more about functions and methods try this [Datacamp Python Functions Tutorial](https://www.datacamp.com/community/tutorials/functions-python-tutorial).\n", 433 | "\n", 434 | "---" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "### Functions that Take Arguments\n", 442 | "If we need to pass particular information to a function, we put that information _in between_ the `()`. 
Like this:" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "math.sqrt(25)" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "The `25` is the value we want to pass to the `sqrt()` function so it can do its work. This value is called an _argument_ to the function. Functions may take any number of arguments, depending on what the function needs.\n", 459 | "\n", 460 | "Here is another function with an argument:" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "import calendar\n", 470 | "calendar.isleap(2020)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "Essentially, you can think of a function as a box. \n", 478 | "\n", 479 | "![Function black box diagram](assets/function-black-box.png)\n", 480 | "\n", 481 | "You put an input into the box (the input may be nothing), the box does something with the input, and then the box gives you back an output. You generally don't need to worry _how_ the function does what it does (unless you really want to, in which case you can look at its code). 
You just know that it works.\n", 482 | "\n", 483 | "> ***Functions are the basis of how we 'get stuff done' in Python.***\n", 484 | "\n", 485 | "For example, we can use the `requests` module to get the text of a Web page:" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | "import requests\n", 495 | "response = requests.get('https://www.wikipedia.org/')\n", 496 | "response.text[136:267]" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "The string `'https://www.wikipedia.org/'` is the argument we pass to the `get()` function for it to open the Web page and read it for us.\n", 504 | "\n", 505 | "Why not try your own URL? What happens if you print the whole of `response.text` instead of slicing out some of the characters?" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "---\n", 513 | "---\n", 514 | "## Summary\n", 515 | "\n", 516 | "Here's what we've covered - how to:\n", 517 | "\n", 518 | "* Create and manipulate a new list with a **list comprehension**.\n", 519 | "* Filter a list with a **condition**.\n", 520 | "* **import** a **module** to add new capabilities.\n", 521 | "* Run a **function** with parentheses.\n", 522 | "* Pass input **arguments** into a function." 
523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 3 (ipykernel)", 529 | "language": "python", 530 | "name": "python3" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 3 536 | }, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython3", 542 | "version": "3.9.10" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 1 547 | } -------------------------------------------------------------------------------- /3-choosing-and-collecting-text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Step 1: Choosing and Collecting Text\n", 10 | "\n", 11 | "---\n", 12 | "---\n", 13 | "\n", 14 | "## The Text-Mining Pipeline: 5 Steps of Text-Mining\n", 15 | "There is no set way to do text-mining, but typically a workflow will involve steps like these:\n", 16 | "1. Choosing and collecting your text\n", 17 | "2. Cleaning and preparing your text\n", 18 | "3. Exploring your data\n", 19 | "4. Analysing your data\n", 20 | "5. Presenting the results of your analysis\n", 21 | "\n", 22 | "![Text-mining pipeline](assets/pipeline.png)\n", 23 | "\n", 24 | "You may go through these steps more than once to refine your data and results, and frequently steps may be merged together. The important thing to realise is that steps 1-2 are critical in ensuring your data is capable of actually answering your research questions. You are likely to spend significant time on cleaning and preparing your text.\n", 25 | "\n", 26 | "Remember:\n", 27 | "\n", 28 | "> **Rubbish in = rubbish out**\n", 29 | "\n", 30 | "This notebook covers step 1. 
The next notebook [4-cleaning-and-exploring-text](4-cleaning-and-exploring-text.ipynb) will show steps 2-3 and notebook [5-analysis-and-visualisation](5-analysis-and-visualisation.ipynb) will show steps 4-5.\n", 31 | "\n", 32 | "No matter your research subject, you need to be aware of the many issues of electronic data collection. We cannot cover them all here, but you should ask yourself some questions as you start to collect data, such as:\n", 33 | "* What sort of data do I need to answer my research questions?\n", 34 | "* What data is available?\n", 35 | "* What is the quality of the data?\n", 36 | "* How can I get the data?\n", 37 | "* Am I allowed to use it for text-mining?" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "---\n", 45 | "---\n", 46 | "## A Simple Example: Top Words Used in Homer's *Iliad*\n", 47 | "\n", 48 | "Our research question will be:\n", 49 | "\n", 50 | "> What are the top 10 words used in Homer's Iliad in English translation?\n", 51 | "\n", 52 | "### What sort of data do I need to answer my research questions?\n", 53 | "\n", 54 | "I need a copy of Homer's *Iliad* in English translation. In this instance, I am not bothered by which translation.\n", 55 | "\n", 56 | "### What data is available?\n", 57 | "\n", 58 | "[Project Gutenberg](http://www.gutenberg.org/) is the first provider of free electronic books and has over 58,000. \"You will find the world's great literature here, with focus on older works for which U.S. copyright has expired. Thousands of volunteers digitized and diligently proofread the eBooks, for enjoyment and education.\"\n", 59 | "\n", 60 | "Here is Homer's *Iliad*, translated by Samuel Butler in 1898: http://www.gutenberg.org/ebooks/2199\n", 61 | "\n", 62 | "### What is the quality of the data?\n", 63 | "\n", 64 | "Potentially variable. 
When some books are digitised by OCR ([Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)) they don't get corrected before being published online. Common OCR errors occur because the text has been 'transcribed' incorrectly and OCR software often makes the same mistakes time and again. We are very fortunate that this text does not suffer from common OCR errors because it has been corrected by volunteers.\n", 65 | "\n", 66 | "I won't be covering what to do about OCR errors in this course, but if you are curious you can read more about [ways of correcting predictable errors](https://usesofscale.com/gritty-details/basic-ocr-correction/) and how the British Library has dealt with this in a blog post [Dealing with Optical Character Recognition errors in Victorian newspapers](https://blogs.bl.uk/digital-scholarship/2016/07/dealing-with-optical-character-recognition-errors-in-victorian-newspapers.html).\n", 67 | "\n", 68 | "### How can I get the data?\n", 69 | "\n", 70 | "Project Gutenberg clearly states on their [Terms of Use](https://www.gutenberg.org/policy/terms_of_use.html) that their website is 'intended for human users only'. If you want to use code to get their data you must use one of their [mirror sites](http://www.gutenberg.org/MIRRORS.ALL) -- you should pick the one that is nearest to your location.\n", 71 | "\n", 72 | "We will be using the text file at http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/2/1/9/2199/2199-0.txt\n", 73 | "\n", 74 | "### Am I allowed to use it for text-mining?\n", 75 | "\n", 76 | "Project Gutenberg says in their [Permission: How To](https://www.gutenberg.org/policy/permission.html) that \"The vast majority of Project Gutenberg eBooks are in the public domain in the US.\" However, since UK copyright is different from US copyright, we still have to check for ourselves. 
This is a complicated area, but broadly we can say that UK copyright expires 70 years after the death of the author. Since [Samuel Butler](https://en.wikipedia.org/wiki/Samuel_Butler_(novelist)) died in 1902, we are probably OK to use his work.\n", 77 | "\n", 78 | "UK law now allows for using copyright material for computational analysis. The onus is on the individual researcher to ensure that their use is covered under the exception, so make sure that you check your plans before you start to collect your source materials. Cambridge University Library provides a [Text & Data Mining: Law on TDM](http://libguides.cam.ac.uk/tdm/law) guide and enquiry support for doing this.\n", 79 | "\n", 80 | "---\n", 81 | "---\n", 82 | "## Getting a Copy of Homer's *Iliad* Text\n", 83 | "We saw in the last notebook that we can use a Python library called `requests` to get Web pages. We can therefore get a copy of the text file like this:" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "import requests\n", 93 | "response = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/2/1/9/2199/2199-0.txt')\n", 94 | "iliad = response.text\n", 95 | "iliad[23664:23990]" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "We can find out how many characters the file has by using the built-in `len()` function." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "len(iliad)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "We can search for a particular string in the file. The string method `find()` returns the index of the _first_ matching string it finds." 
119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "word = 'shield'\n", 128 | "iliad.find(word)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# Write code to get the immediate context of this word 'shield'" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "---\n", 145 | "---\n", 146 | "## Inspecting the Text\n", 147 | "The first thing to do is inspect the text to check it is what we expected, and to give us an idea of the sort of cleaning we'll need to do in the next step. \n", 148 | "\n", 149 | "First, download a copy to your computer by going to the main URL http://www.gutenberg.org/ebooks/2199 and clicking on the link 'Plain Text UTF-8'. You can either view it in your browser or save it locally and open it with the text editor of your choice.\n", 150 | "\n", 151 | "Looking again at the text by eye we can see that the book starts with a load of front matter we don't want.\n", 152 | "\n", 153 | "The book actually starts after the text \"`*** START OF THIS PROJECT GUTENBERG EBOOK THE ILIAD ***`\":" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "iliad[708:854]" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "There is also unwanted matter at the end after \"`***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`\" that we should get rid of too.\n", 170 | "\n", 171 | "Why does the text have all these `\\r` and `\\n` in them? The character sequence `\\r\\n` is used to signify the start of a new line. 
In a typical text editor, the software reads this sequence and displays it as a newline, suppressing the actual characters in the text file and removing them from view. The output in a Jupyter notebook is less sophisticated than this and shows all the characters as they appear in the file.\n", 172 | "\n", 173 | "---\n", 174 | "### Editing Text Files\n", 175 | "Text files are files that should be *opened* as plain text and nothing else. They often have file extensions such as `.txt`, `.html`, `.xml`, `.csv`. Microsoft Word documents (`.doc` and `.docx`) are not plain text and you should never edit your `.txt` files in Microsoft Word or WordPad. You need a proper *text editor*.\n", 176 | "\n", 177 | "Recommended text editors:\n", 178 | "* All platforms: [Atom](https://atom.io/) or [Sublime Text](https://www.sublimetext.com/)\n", 179 | "\n", 180 | "---" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "---\n", 188 | "---\n", 189 | "## Creating and Preparing a Local Copy\n", 190 | "\n", 191 | "It is not very efficient (nor allowed) to keep making Web requests to Project Gutenberg, especially with a very large corpus. I have therefore downloaded a copy for us (`ORIGINAL-2199-0.txt`) and placed it in the project under the `data` folder. \n", 192 | "\n", 193 | "I have also taken some steps to prepare a usable version of the file on your behalf (`PREPPED-2199-0.txt`). We will use this local copy instead from now on. In the spirit of full transparency and documentation, here is what I have done:\n", 194 | "\n", 195 | "* Removed the unwanted Gutenberg-related matter at the front and back of the book.\n", 196 | "* Made sure that the character encoding is 'UTF-8'.\n", 197 | "\n", 198 | "You don't need to worry about the details of _character encoding_ for this course. 
You only need to know that Python works most easily with UTF-8 files and so we must have the file in that encoding to avoid problems.\n", 199 | "\n", 200 | "---\n", 201 | "### Character Encoding\n", 202 | "Character encoding is a very important topic, but it is not an easy one. If you end up dealing with a lot of text files in building up your corpus you will have to be aware that dealing with files that have different, or unknown, character encodings can get very messy. If you don't know, or wrongly assume, the character encoding of a file, you can end up with this sort of thing: � �ࡻࢅ�࢖\n", 203 | "\n", 204 | "This is especially important if your corpus is written in a non-English language, because the accents or non-Latin alphabet characters of the text may get mangled. The short answer to the problem is to always make sure your files are encoded with 'UTF-8' if you can.\n", 205 | "\n", 206 | "> **'UTF-8' is often not the default encoding on Windows machines. This means that you can quickly end up in a mess when you edit and save your 'UTF-8' text files on Windows. The encoding may be automatically changed to 'ISO-8859-1' or 'latin-1'. You should find out how to save files as 'UTF-8' in your text editor.**\n", 207 | "\n", 208 | "---" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "---\n", 216 | "---\n", 217 | "## Summary\n", 218 | "\n", 219 | "In this notebook:\n", 220 | "\n", 221 | "* Listing the 5 typical steps of a **text-mining pipeline**.\n", 222 | "* Considering **issues of electronic data collection** such as choice, availability, quality, access and copyright.\n", 223 | "* Getting and saving a **local copy**.\n", 224 | "* **Inspecting** the text by eye for issues." 
225 | ] 226 | } 227 | ], 228 | "metadata": { 229 | "kernelspec": { 230 | "display_name": "Python 3 (ipykernel)", 231 | "language": "python", 232 | "name": "python3" 233 | }, 234 | "language_info": { 235 | "codemirror_mode": { 236 | "name": "ipython", 237 | "version": 3 238 | }, 239 | "file_extension": ".py", 240 | "mimetype": "text/x-python", 241 | "name": "python", 242 | "nbconvert_exporter": "python", 243 | "pygments_lexer": "ipython3", 244 | "version": "3.9.10" 245 | } 246 | }, 247 | "nbformat": 4, 248 | "nbformat_minor": 1 249 | } 250 | -------------------------------------------------------------------------------- /4-cleaning-and-exploring-text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Steps 2 and 3: Cleaning and Exploring Text\n", 8 | "\n", 9 | "---\n", 10 | "---\n", 11 | "Before we can analyse text we have to clean and prepare it.\n", 12 | "\n", 13 | "## Tokenising the Text\n", 14 | "_Tokenising_ means splitting a text into meaningful elements, such as words, sentences, or symbols.\n", 15 | "\n", 16 | "To do this we are going to use [spaCy](https://spacy.io), a free and open-source Natural Language Processing (NLP) package. If you're interested in learning how to work with spaCy more broadly for a variety of NLP tasks I recommend the tutorial [Natural Language Processing with spaCy in Python](https://realpython.com/natural-language-processing-spacy-python/).\n", 17 | "\n", 18 | "---\n", 19 | "### spaCy and NLTK\n", 20 | "[NLTK](https://www.nltk.org/) was the first open-source Python library for Natural Language Processing (NLP), originally released in 2001, and it is still a valuable tool for teaching and research. 
Much of the literature uses NLTK code in its examples, and researchers publish models, code and data sets for obscure languages designed for use with NLTK.\n", 21 | "\n", 22 | "If you are interested in using NLTK for the tasks presented in these notebooks instead of spaCy, please see notebooks from the first iteration of this course: https://github.com/mchesterkadwell/intro-to-text-mining-with-python\n", 23 | "\n", 24 | "NLTK has been overtaken in efficiency and ease of use by other more modern libraries, such as [spaCy](https://spacy.io/), which uses machine learning. It is designed to use less computer memory and split workloads across multiple processor cores (or even computers) so that it can handle very large corpora easily. It also has excellent documentation. However, out of the box, spaCy only has support for a limited set of modern languages and is not designed for research projects where you want control over every algorithm. For other languages or more specialised use cases, you need to train your own spaCy models.\n", 25 | "\n", 26 | "I am using spaCy in these notebooks to cut down on the amount of code I am presenting. The idea is to reduce the complexity so that beginners can concentrate on first principles.\n", 27 | "\n", 28 | "---" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "---\n", 36 | "---\n", 37 | "## Loading Data from a File\n", 38 | "\n", 39 | "Firstly, we need to reload into memory the file that was saved at the end of the notebook [3-choosing-and-collecting-text](3-choosing-and-collecting-text.ipynb).\n", 40 | "\n", 41 | "---\n", 42 | "### Opening and Reading Text Files\n", 43 | "We don't have enough time in this course to cover opening, reading and closing text files. 
Fortunately, it is not necessary to understand the next block of code to understand the rest of the notebook.\n", 44 | "\n", 45 | "If you would like to learn more about this by yourself, try this guide [Reading and Writing Files in Python](https://realpython.com/read-write-files-python/).\n", 46 | "\n", 47 | "---" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# Import a module that helps with filepaths\n", 57 | "from pathlib import Path\n", 58 | "\n", 59 | "# Create a filepath for the file\n", 60 | "text_file = Path('data', 'PREPPED-2199-0.txt')\n", 61 | "\n", 62 | "# Open the file, read it and store the text with the name `iliad`\n", 63 | "with open(text_file, encoding='utf-8') as file:\n", 64 | " iliad = file.read()\n", 65 | "\n", 66 | "iliad[0:200]" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "---\n", 74 | "---\n", 75 | "## Tokenising with spaCy\n", 76 | "\n", 77 | "First we import and load an English language model provided by spaCy (`en_core_web_sm`) and give it the name `nlp`, ready to do the work on the text." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "import en_core_web_sm\n", 87 | "nlp = en_core_web_sm.load()" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "Then we pass the text as an argument to the function `nlp` and spaCy does the rest. spaCy processes the text into a _document_ that contains a lot of information about the text.\n", 95 | "\n", 96 | "This may take a while as the book is long. 
Wait until the `In [*]:` marker to the left of the code cell has turned into an output number.\n", 97 | "\n", 98 | "> **Important**: If you are running this notebook on **Binder**, you will need to modify the next line to something like `document = nlp(iliad[0:200000])` so that less of the file is processed at once. Binder has a memory (RAM) limit of around 2GB, and the *Iliad* is big and takes 2-3GB to process. If Binder goes over its memory limit, the kernel dies. You can now keep an eye on how much memory you are using in the top right-hand corner of the menu-bar, at the top of the page. If you are running the notebooks locally on your own machine this shouldn't be an issue." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "document = nlp(iliad)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "We print out the document text stored in `document.text` just to check that spaCy has correctly parsed it. 
115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "document.text[0:500]" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "The document can be treated like a list of word tokens (and information about those tokens), which we can print out using a list comprehension:" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "tokens = [word.text for word in document]\n", 140 | "tokens" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "# Write code here to get just the first 20 tokens" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "spaCy has split the text into sentences for us too. The document stores these sentences in `document.sents` and we can also print them out using a list comprehension." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "sentences = [sent.text for sent in document.sents]\n", 166 | "sentences" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "There are a number of problems with the word tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.\n", 174 | "\n", 175 | "---\n", 176 | "---\n", 177 | "## Normalising to Lowercase\n", 178 | "Normalising all words to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. 
we want 'Shield', 'shield' and 'SHIELD' to be recognised as the same word.\n", 179 | "\n", 180 | "However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.\n", 181 | "\n", 182 | "How can we lowercase all the tokens in the list of tokens `tokens`? By using the string method `lower()` and a list comprehension." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "tokens_lower = [token.lower() for token in tokens]\n", 192 | "tokens_lower[0:20]" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "---\n", 200 | "---\n", 201 | "## Removing Punctuation\n", 202 | "Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words \"poet\" and \"poet,\" might be considered to be different words.\n", 203 | "\n", 204 | "This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask. It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.\n", 205 | "\n", 206 | "We will replace *all* punctuation with the empty string `''`. 
(You do not need to understand this code fully.)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# Import a module that helps with string processing\n", 216 | "import string\n", 217 | "\n", 218 | "# Make a table that translates all punctuation to an empty value (`None`)\n", 219 | "table = str.maketrans('', '', string.punctuation)\n", 220 | "punc_table = {chr(key):value for (key, value) in table.items()}\n", 221 | "punc_table" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "tokens_nopunct = [token.translate(table) for token in tokens_lower]\n", 231 | "tokens_nopunct[0:20]" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "---\n", 239 | "---\n", 240 | "## Removing Non-Word Tokens\n", 241 | "\n", 242 | "We are still left with some problematic tokens that are not useful words, such as empty tokens (`''`) and newline characters (`\\r`, `\\n`).\n", 243 | "\n", 244 | "We can try a filter condition for the empty tokens. The operator `!=` is the negative equality operator, so `if token != ''` means \"if token is _not_ equal to the empty string\"." 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "tokens_notempty = [token for token in tokens_nopunct if token != '']\n", 254 | "tokens_notempty[0:10]" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "The operator `==` is the equality operator. If you just want a list of empty tokens, write the list comprehension replacing the `!=` with `==`." 
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "# Write code here to get a list of empty tokens" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "Finally, we can remove all the newline characters by adding a condition that filters out all non-alphabetic characters. The string method to use is `isalpha()`." 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [ 286 | "tokens = [token for token in tokens_notempty if token.isalpha()]\n", 287 | "tokens" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "---\n", 295 | "---\n", 296 | "## Saving Data to a File\n", 297 | "\n", 298 | "Now we need to save the clean tokens into a file (`CLEAN-2199-0.txt`) and place it under the `data` folder. This is the reverse process to loading the data from a file that we did at the beginning of this notebook." 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "# Import a module that helps with filepaths\n", 308 | "from pathlib import Path\n", 309 | "\n", 310 | "# Create a filepath for the file\n", 311 | "tokens_file = Path('data', 'CLEAN-2199-0.txt')\n", 312 | "\n", 313 | "# Open a file and save the list of tokens inside it\n", 314 | "with open(tokens_file, 'w', encoding='utf-8') as file:\n", 315 | " file.write(' '.join(tokens))" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "You should inspect the file now to see what it looks like."
323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": { 328 | "pycharm": { 329 | "name": "#%% md\n" 330 | } 331 | }, 332 | "source": [ 333 | "---\n", 334 | "---\n", 335 | "## Summary\n", 336 | "\n", 337 | "In this notebook we have: \n", 338 | "\n", 339 | "* **Loaded** text from a file.\n", 340 | "* **Tokenised** the text into words and sentences with **spaCy**.\n", 341 | "* **Normalised** the words into **lowercase** and **removed non-word** tokens (punctuation and empty tokens).\n", 342 | "* **Saved** the clean tokens to file." 343 | ] 344 | } 345 | ], 346 | "metadata": { 347 | "kernelspec": { 348 | "display_name": "Python 3 (ipykernel)", 349 | "language": "python", 350 | "name": "python3" 351 | }, 352 | "language_info": { 353 | "codemirror_mode": { 354 | "name": "ipython", 355 | "version": 3 356 | }, 357 | "file_extension": ".py", 358 | "mimetype": "text/x-python", 359 | "name": "python", 360 | "nbconvert_exporter": "python", 361 | "pygments_lexer": "ipython3", 362 | "version": "3.9.10" 363 | } 364 | }, 365 | "nbformat": 4, 366 | "nbformat_minor": 1 367 | } 368 | -------------------------------------------------------------------------------- /5-analysis-and-visualisation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Steps 4 and 5: Analysing Data and Visualising Results\n", 10 | "---\n", 11 | "---\n", 12 | "Firstly, we need to reload the cleaned list of word tokens we were using in the [previous notebook](4-cleaning-and-exploring-text.ipynb) that we saved in a file.
(You don't need to understand what is happening here in detail to follow the rest of the notebook.)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "# Import a module that helps with filepaths\n", 22 | "from pathlib import Path\n", 23 | "\n", 24 | "# Create a filepath for the file\n", 25 | "tokens_file = Path('data', 'CLEAN-2199-0.txt')\n", 26 | "\n", 27 | "# Create an empty list to hold the tokens\n", 28 | "tokens = []\n", 29 | "\n", 30 | "# Open the text file and append all the words to a list of tokens\n", 31 | "with open(tokens_file, encoding='utf-8') as file:\n", 32 | " for token in file.read().split():\n", 33 | " tokens.append(token)\n", 34 | "\n", 35 | "tokens[0:20]" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "---\n", 43 | "---\n", 44 | "## Step 4: Analysing Data with Frequency Analysis\n", 45 | "Let's take a moment to remember our research question:\n", 46 | "\n", 47 | "> What are the top 10 words used in Homer's Iliad in English translation?\n", 48 | "\n", 49 | "In order to answer this question we need to count the occurrences of each unique word in the text. Then we can see which are the 10 most popular, or frequent, words. This is called a *frequency distribution*. \n", 50 | "\n", 51 | "### English Stopwords\n", 52 | "Before we start, we need to take a moment to think about what sort of words we are actually interested in counting. \n", 53 | "\n", 54 | "We are not interested in common words in English that carry little meaning, such as \"the\", \"a\" and \"its\". These are called *stopwords*. There is no definitive list of stopwords, but most Python packages used for Natural Language Processing provide one as a starting point, and spaCy is no exception."
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "# Import the spaCy standard stopwords list\n", 64 | "from spacy.lang.en.stop_words import STOP_WORDS\n", 65 | "stopwords = [stop for stop in STOP_WORDS]\n", 66 | "\n", 67 | "# Sort the stopwords in alphabetical order to make them easier to inspect\n", 68 | "sorted(stopwords)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "# Write code here to count the number of stopwords" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "> **Exercise**: What do you notice about these stopwords?\n", 85 | "\n", 86 | "For your own research you will need to consider which stopwords are most appropriate:\n", 87 | "* Will standard stopword lists for modern languages be suitable for that language written 10, 50, 200 years ago?\n", 88 | "* Are there special stopwords specific to the topic or style of literature?\n", 89 | "* How might you find or create your own stopword list?\n", 90 | "\n", 91 | "Now we can filter out the stopwords that match this list:" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "tokens_nostops = [token for token in tokens if token not in stopwords]\n", 101 | "tokens_nostops" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Creating a Frequency Distribution\n", 109 | "At last, we are ready to create a frequency distribution by counting the frequency of each unique word in the text.\n", 110 | "\n", 111 | "First, we create a frequency distribution:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "# Import a module that helps with 
counting\n", 121 | "from collections import Counter\n", 122 | "\n", 123 | "# Count the frequency of words\n", 124 | "word_freq = Counter(tokens_nostops)\n", 125 | "word_freq" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "This `Counter` maps each word to the number of times it appears in the text, e.g. `'coward': 17`. By scrolling down the list you can inspect what look like common and infrequent words.\n", 133 | "\n", 134 | "Now we can get precisely the 10 most common words using the `most_common()` method:" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "common_words = word_freq.most_common(10)\n", 144 | "common_words" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "> **Exercise**: Investigate what is further down the list of top words." 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "---\n", 159 | "---\n", 160 | "## Step 5: Presenting Results of the Analysis Visually\n", 161 | "There are many options for displaying anything from simple charts to very complex data in Jupyter notebooks.
We are going to use the most well-known library called [Matplotlib](https://matplotlib.org/), although it is perhaps not the easiest to use compared with some others.\n", 162 | "\n", 163 | "To create a Matplotlib plot we need to:\n", 164 | "\n", 165 | "* Import the matplotlib plot function\n", 166 | "* Arrange our data into a pair of lists: one for the x-axis, one for the y-axis\n", 167 | "* Set the appearance of titles, labels, ticks and gridlines\n", 168 | "* Pass the data into the plot function\n", 169 | "\n", 170 | "Let's display our results as a simple line plot:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# Display the plot inline in the notebook with interactive controls\n", 180 | "# Comment out this line if you are running the notebook in Deepnote\n", 181 | "%matplotlib notebook\n", 182 | "\n", 183 | "# Import the matplotlib plot function\n", 184 | "import matplotlib.pyplot as plt\n", 185 | "\n", 186 | "# Get a list of the most common words\n", 187 | "words = [word for word,_ in common_words]\n", 188 | "\n", 189 | "# Get a list of the frequency counts for these words\n", 190 | "freqs = [count for _,count in common_words]\n", 191 | "\n", 192 | "# Set titles, labels, ticks and gridlines\n", 193 | "plt.title(\"Top 10 Words used in Homer's Iliad in English translation\")\n", 194 | "plt.xlabel(\"Word\")\n", 195 | "plt.ylabel(\"Count\")\n", 196 | "plt.xticks(range(len(words)), [str(s) for s in words], rotation=90)\n", 197 | "plt.grid(visible=True, which='major', color='#333333', linestyle='--', alpha=0.2)\n", 198 | "\n", 199 | "# Plot the frequency counts\n", 200 | "plt.plot(freqs)\n", 201 | "\n", 202 | "# Show the plot\n", 203 | "plt.show()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "With this interactive plot you can:\n", 211 | "\n", 212 | "* Resize the plot by dragging the bottom right-hand 
corner.\n", 213 | "* Pan across the plot to see values further to the right (if there are any to display).\n", 214 | "* Zoom into the plot.\n", 215 | "\n", 216 | "> **Exercise**: Change the code to explore different data and ways of displaying your data. \n", 217 | "\n", 218 | "There are also lots of other graphs that Matplotlib can create, and alternative plotting libraries to use instead, but these are beyond the scope of our course.\n", 219 | "\n", 220 | "---\n", 221 | "---\n", 222 | "## Review and Reflection\n", 223 | "Now that you have seen the data and graph we have generated, no doubt you can see many ways we could improve them. \n", 224 | "\n", 225 | "The process of text-mining a corpus (or individual text) is an iterative one. As you clean and explore the data, you will go back over your workflow again and again: from the collection stage, through to cleaning, analysis and presentation.\n", 226 | "\n", 227 | "> **Exercise**: List the ways you think we should improve the pipeline, from collection to plot.\n", 228 | "\n", 229 | "Fortunately, when you do your text-mining in code (and write explanatory text to document it) you know exactly what you did and can rerun and modify the process.\n", 230 | "\n", 231 | "---\n", 232 | "### Going Further: Libraries Libraries Libraries\n", 233 | "\n", 234 | "By now, you will be getting the idea that much of what you want to do in Python involves importing libraries to help you.
Remember, libraries are _just code that someone else has written_.\n", 235 | "\n", 236 | "As a reminder, here are some of the useful libraries we have used or mentioned in these notebooks:\n", 237 | "* [Requests](http://docs.python-requests.org/en/master/) - HTTP (web) requests library\n", 238 | "* [spaCy](https://spacy.io/) - natural language processing library\n", 239 | "* [Matplotlib](https://matplotlib.org/) - 2D plotting library\n", 240 | "\n", 241 | "---" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "---\n", 249 | "---\n", 250 | "## Summary\n", 251 | "\n", 252 | "Finally, we have: \n", 253 | "\n", 254 | "* **Loaded** clean token data from a file into a list.\n", 255 | "* Removed English **stopwords** from the list of tokens.\n", 256 | "* Created a **frequency distribution** and found the 10 most frequent words.\n", 257 | "* Visualised the frequency distribution in a **line plot**.\n", 258 | "\n", 259 | "---\n", 260 | "---\n", 261 | "## What's Next?\n", 262 | "You will get the most out of this course if you can follow up on the learning over the next few days and weeks before you forget it all. This is particularly important when learning to code. Abstract concepts need to be reinforced little and often.\n", 263 | "\n", 264 | "### Recommended book on Python\n", 265 | "\n", 266 | "* Sweigart, A. 2019. _Automate the Boring Stuff with Python_ (2nd ed.) San Francisco: No Starch Press. [Available online](https://automatetheboringstuff.com/).
This is just an all-round good book on Python and how to give yourself computer superpowers.\n", 267 | "\n", 268 | "### Text-mining and NLP in General\n", 269 | "\n", 270 | "* Work through this series of [Programming Historian tutorials](https://programminghistorian.org/en/lessons/working-with-text-files) to get some more practice with basic text files and basic text-mining techniques.\n", 271 | "* Follow a more in-depth set of Jupyter notebooks with [The Art of Literary Text Analysis](https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb).\n", 272 | "* Read a practical and well-explained approach to Natural Language Processing (NLP) in Python: Srinivasa-Desikan, B., 2018. _Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras._ Birmingham: Packt Publishing. This book has chapters on text pre-processing steps, various NLP techniques, and comes with Jupyter notebooks to follow.\n", 273 | "\n", 274 | "### Python for Digital Humanities\n", 275 | "\n", 276 | "* Work through [Python Programming for the Humanities](http://www.karsdorp.io/python-course/).\n", 277 | "* Browse a big list of resources for [Teaching Yourself to Code in DH](http://scottbot.net/teaching-yourself-to-code-in-dh/)."
278 | ] 279 | } 280 | ], 281 | "metadata": { 282 | "kernelspec": { 283 | "display_name": "Python 3 (ipykernel)", 284 | "language": "python", 285 | "name": "python3" 286 | }, 287 | "language_info": { 288 | "codemirror_mode": { 289 | "name": "ipython", 290 | "version": 3 291 | }, 292 | "file_extension": ".py", 293 | "mimetype": "text/x-python", 294 | "name": "python", 295 | "nbconvert_exporter": "python", 296 | "pygments_lexer": "ipython3", 297 | "version": "3.9.10" 298 | } 299 | }, 300 | "nbformat": 4, 301 | "nbformat_minor": 1 302 | } -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020-2022 Mary Chester-Kadwell 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Text-Mining with Python 2 | 3 | [![Python](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-3911/) 4 | [![Python](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3103/) 5 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mchesterkadwell/intro-to-text-mining-with-python-2020/master) 6 | 7 | ## Introduction 8 | 9 | This repository contains Jupyter notebooks used for teaching *'Introduction to Text-Mining with Python'*, 10 | a course in the [Cambridge 11 | Digital Humanities](https://www.cdh.cam.ac.uk) (CDH) Learning programme (from 2020 onwards). 12 | 13 | The notebooks are designed to be worked on as self-paced materials in a 'flipped classroom' approach. They 14 | are also written as stand-alone notebooks for anyone to follow and use as they wish. 15 | 16 | The aim is to teach the basic concepts of a text-mining workflow to a wide audience. The level of Python and 17 | text-mining is aimed at beginners. As such, the notebooks are designed to be run as a teaching aid, not as a 18 | serious text analysis tool. 19 | 20 | ### Content 21 | 22 | The notebooks cover: 23 | 24 | * Basic Python (strings, lists, list comprehensions, imports, functions, opening/reading/saving files) 25 | * Steps in a text-mining pipeline for research 26 | * Basic text-mining concepts (tokenising, normalising, cleaning, stopwords) 27 | * Creating a frequency distribution and plotting the results 28 | 29 | ### License 30 | 31 | The code is released under an [MIT License](LICENSE). The text is licensed under Creative Commons 32 | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). 
33 | 34 | ## Quick Start: Launch Notebooks Online 35 | 36 | ### For a Quick Look: Run on Binder 37 | 38 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/mchesterkadwell/intro-to-text-mining-with-python-2020/master) 39 | 40 | If you just want to play quickly with the notebooks and see what they show, click on the "launch binder" button. 41 | Binder will launch a virtual environment in your browser where you can open and run the notebooks without installing 42 | anything. 43 | 44 | Limitations of Binder: 45 | 46 | * Some cells in the notebooks may use more memory than Binder allows, causing the notebook's kernel to crash. After it 47 | has restarted, try modifying the code to process fewer documents. 48 | * Binder may shut down after about 10 minutes of inactivity e.g. if you don't keep the window open. You can simply 49 | open a new Binder to start again. 50 | * Binder will not save any changes you make to the notebooks. 51 | 52 | ### Run in the Cloud without any Installation: Run on Deepnote 53 | 54 | [](https://deepnote.com/launch?url=https%3A%2F%2Fgithub.com%2Fmchesterkadwell%2Fintro-to-text-mining-with-python-2020) 55 | 56 | To run and keep a copy of the notebooks for yourself, click on the "Launch in Deepnote" button. Deepnote will create 57 | a project based on this repository automatically and run in the cloud, so you don't have to install anything on your 58 | local computer. 59 | 60 | After the project has started, go to the buttons on the left-hand side and click on the **Environment** button to open 61 | the Environment tab. In the Environment tab, pick **Python 3.9** from the dropdown. 62 | 63 | ![](assets/deepnote-python39.png) 64 | 65 | After a minute or two, Deepnote will start the new machine and run the install steps. Then the notebooks are ready to 66 | use. Click on the Folder icon to open the list of notebooks. 67 | 68 | Limitations of Deepnote: 69 | 70 | * Deepnote requires you to sign up for an account. 
71 | * Deepnote has a (generous) limit on the number of free hours you can use each month. 72 | * On the free tier, the notebooks will likely run slower than on your own computer. 73 | 74 | ## Running Notebooks on Your Own Computer (Beginners) 75 | 76 | These instructions are suitable if you have never installed Jupyter Notebooks 77 | or Python on your own computer before. 78 | 79 | ### Install Jupyter Notebooks and Python with Anaconda 80 | 81 | [Install Anaconda (Python 3.9)](https://www.anaconda.com/products/individual). 82 | 83 | Once it has installed, [open Anaconda Navigator](http://docs.anaconda.com/anaconda/user-guide/getting-started/#open-navigator). 84 | 85 | ### Download the Notebooks from GitHub 86 | 87 | Go to the [GitHub page](https://github.com/mchesterkadwell/intro-to-text-mining-with-python-2020) 88 | where this code repository is kept. Click the green "Code" button to the top-right of the page. 89 | 90 | ![](assets/download-or-clone.png) 91 | 92 | If you have never used `git` version control before I recommend you simply download the notebooks with the 93 | "Download ZIP" option. In most operating systems this will automatically unzip it back into individual files. Move 94 | the folder to somewhere you want to keep it, such as "My Documents". 95 | 96 | If you have used `git` before, then you can clone the repo with this command: 97 | 98 | `git clone https://github.com/mchesterkadwell/intro-to-text-mining-with-python-2020.git` 99 | 100 | ### Run Notebooks in a Dedicated Environment 101 | 102 | In simple terms, an environment is like an isolated box in which to run a 103 | notebook safe from interference by other notebooks. Anaconda provides one 104 | default environment, called ‘root’, in which to get up and running quickly. 105 | However, you should really make a new environment for each project (which may 106 | have one or more related notebooks). 
107 | 108 | In **Anaconda Navigator > Environments** click on the ‘Create’ button in the 109 | bottom left of the Environments list. 110 | 111 | ![](assets/create.png) 112 | 113 | Type a name e.g. 'intro-to-text-mining', make sure that 'Python' is _checked_ 114 | and under the dropdown pick **'3.9'**. Make sure that 'R' is left _unchecked_. 115 | 116 | Then click the ‘Create’ button. 117 | 118 | ![](assets/new-env.png) 119 | 120 | It will take a few seconds to set up... 121 | 122 | Then in **Anaconda Navigator > Environments** make sure you have selected your 123 | new environment. 124 | 125 | On the right of the environment name is a small green play arrow. Click on it and pick ‘Open Terminal’ from the 126 | dropdown. 127 | 128 | In the Terminal that opens type the following, and press return: 129 | 130 | `conda install pip` 131 | 132 | ![](assets/conda-install-pip.png) 133 | 134 | If you do not already have pip installed, it will install it. Otherwise it will give a message: 135 | 136 | `# All requested packages already installed.` 137 | 138 | Then change directory to wherever you saved the notebooks folder by typing something like: 139 | 140 | `cd path\to\notebooks` 141 | 142 | where `path\to\notebooks` is the filepath to wherever you’ve put the notebooks folder. 143 | 144 | If you are on a **Mac**, make sure to use forward slashes in the filepath instead e.g. `path/to/notebooks` 145 | 146 | ![](assets/cd-directory.png) 147 | 148 | Then install all the dependencies by typing: 149 | 150 | `pip install -r requirements.txt` 151 | 152 | Then: 153 | 154 | `pip install jupyter` 155 | 156 | This should initiate a big list of downloads and will take a while to finish. Please be patient. 
157 | 158 | Finally, to launch the Jupyter notebook server type: 159 | 160 | `jupyter notebook` 161 | 162 | This opens a web page showing the project: 163 | 164 | ![](assets/jupyter-notebooks.png) 165 | 166 | If not, you can copy and paste one of the URLs in the Terminal window into your browser e.g. 167 | http://localhost:8888/?token=ddb27d2a1a6cb29a3483c24d6ff9f7263eb9676f02d71075 168 | (This one will not work on your machine, as the token is unique every time.) 169 | 170 | When you are finished with the notebook, press **ctrl+c** to stop the notebook server. 171 | 172 | You can close the Terminal window. 173 | 174 | ### Starting the Notebook Server Again 175 | 176 | Next time you want to start the notebook server: 177 | 178 | In **Anaconda Navigator > Environments** make sure you have selected your new environment. 179 | 180 | On the right of the environment name is a small green play arrow. Click on it and pick ‘Open Terminal’ from the 181 | dropdown. 182 | To launch the Jupyter notebook server type: 183 | 184 | `jupyter notebook` 185 | 186 | When you are finished with the notebook, press **ctrl+c** to stop the notebook server. You can close the Terminal window. 
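For readers comfortable with a terminal, the Anaconda Navigator steps above can be condensed into a short command-line sequence. This is a sketch, not part of the original instructions: it assumes `conda` is on your PATH and that the notebook folder is named `intro-to-text-mining-with-python-2020` (adjust the environment name and path to suit your setup).

```shell
# Create and activate a dedicated environment with Python 3.9
conda create -n intro-to-text-mining python=3.9
conda activate intro-to-text-mining

# Change into the folder where you saved the notebooks (example path)
cd intro-to-text-mining-with-python-2020

# Install the course dependencies and Jupyter
pip install -r requirements.txt
pip install jupyter

# Launch the server; press ctrl+c in the terminal to stop it when finished
jupyter notebook
```

As with the graphical route, the server prints a localhost URL containing a one-time token, which you can paste into a browser if a tab does not open automatically.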
187 | 188 | 189 | -------------------------------------------------------------------------------- /assets/cd-directory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/cd-directory.png -------------------------------------------------------------------------------- /assets/conda-install-pip.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/conda-install-pip.png -------------------------------------------------------------------------------- /assets/create.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/create.png -------------------------------------------------------------------------------- /assets/deepnote-python39.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/deepnote-python39.png -------------------------------------------------------------------------------- /assets/download-or-clone.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/download-or-clone.png -------------------------------------------------------------------------------- /assets/download-zip.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/download-zip.png -------------------------------------------------------------------------------- /assets/function-black-box.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/function-black-box.png -------------------------------------------------------------------------------- /assets/jupyter-notebooks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/jupyter-notebooks.png -------------------------------------------------------------------------------- /assets/list-comprehension-with-condition.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/list-comprehension-with-condition.png -------------------------------------------------------------------------------- /assets/list-comprehension.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/list-comprehension.png -------------------------------------------------------------------------------- /assets/new-env.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/new-env.png 
-------------------------------------------------------------------------------- /assets/pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchesterkadwell/intro-to-text-mining-with-python-2020/89b526940f3ca0a837ffb10224a679a2749280b4/assets/pipeline.png -------------------------------------------------------------------------------- /exercises/1-reversing-strings-solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Exercise 1: Reversing Strings \n", 10 | "# Solution\n", 11 | "\n", 12 | "---\n", 13 | "---\n", 14 | "\n", 15 | "There are various possible ways to do this exercise. Compare yours to this particular solution." 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "# Create a new string of the text\n", 25 | "darwin_text = \"Cambridge University Library houses the world's largest and most significant collection of the personal papers of Charles Robert Darwin (1809-1882). Darwin corresponded with around 2,000 people from all around the world and all walks of life. His correspondents included other leading scientists and thinkers, such as the geologist Charles Lyell. 
They bring into sharp focus every aspect of Darwin's scientific work throughout that period, and illuminate the mutual friendships he shared with other scientists.\"\n", 26 | "darwin_text" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Split the text into a new list of sentences\n", 36 | "# Hint: use the string method split('.') which will split the text up at each full stop\n", 37 | "darwin_list = darwin_text.split('.')\n", 38 | "darwin_list" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "We don't really want the last sentence as it is just empty." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# Create a new list that contains just the first 4 sentences\n", 55 | "# Hint: use list slicing\n", 56 | "darwin_list_trimmed = darwin_list[0:4]\n", 57 | "darwin_list_trimmed" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Create a new list with all the sentences reversed\n", 67 | "# Hint: use list slicing\n", 68 | "darwin_list_reversed = darwin_list_trimmed[::-1]\n", 69 | "darwin_list_reversed" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Create a new string with all the reversed sentences joined by a full stop\n", 79 | "# Hint: start with the string of a full stop and join the list to it like this '.'.join(my_list)\n", 80 | "darwin_reversed = '.'.join(darwin_list_reversed)\n", 81 | "darwin_reversed" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "# Remove the space at the front of the string\n", 91 | "# Hint: use string slicing or the string method
lstrip()\n", 92 | "darwin_tidy = darwin_reversed.lstrip()\n", 93 | "darwin_tidy" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "# Add a full stop at the end of the string\n", 103 | "darwin_tidy = darwin_tidy + \".\"\n", 104 | "darwin_tidy" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# Reverse all the characters\n", 114 | "# Hint: use string slicing\n", 115 | "darwin_reversed = darwin_tidy[::-1]\n", 116 | "darwin_reversed" 117 | ] 118 | } 119 | ], 120 | "metadata": { 121 | "kernelspec": { 122 | "display_name": "Python 3", 123 | "language": "python", 124 | "name": "python3" 125 | }, 126 | "language_info": { 127 | "codemirror_mode": { 128 | "name": "ipython", 129 | "version": 3 130 | }, 131 | "file_extension": ".py", 132 | "mimetype": "text/x-python", 133 | "name": "python", 134 | "nbconvert_exporter": "python", 135 | "pygments_lexer": "ipython3", 136 | "version": "3.8.2" 137 | } 138 | }, 139 | "nbformat": 4, 140 | "nbformat_minor": 1 141 | } 142 | -------------------------------------------------------------------------------- /exercises/1-reversing-strings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# Exercise 1: Reversing Strings\n", 10 | "\n", 11 | "---\n", 12 | "---\n", 13 | "\n", 14 | "In this optional exercise, we are going to put everything together from [1-intro-to-python-and-text](/notebooks/notebooks/1-intro-to-python-and-text.ipynb) and write some code.
The idea is to get used to how code can be used to manipulate data, and how to think about doing this, step by step.\n", 15 | "\n", 16 | "The aim of the exercise is to take the text below, reverse all the sentences and then reverse all the characters.\n", 17 | "\n", 18 | "Text:\n", 19 | "\n", 20 | "> Cambridge University Library houses the world's largest and most significant collection of the personal papers of Charles Robert Darwin (1809-1882). Darwin corresponded with around 2,000 people from all around the world and all walks of life. His correspondents included other leading scientists and thinkers, such as the geologist Charles Lyell. They bring into sharp focus every aspect of Darwin's scientific work throughout that period, and illuminate the mutual friendships he shared with other scientists.\n", 21 | "\n", 22 | "Reversed text:\n", 23 | "\n", 24 | "> .)2881-9081( niwraD treboR selrahC fo srepap lanosrep eht fo noitcelloc tnacifingis tsom dna tsegral s'dlrow eht sesuoh yrarbiL ytisrevinU egdirbmaC.efil fo sklaw lla dna dlrow eht dnuora lla morf elpoep 000,2 dnuora htiw dednopserroc niwraD .lleyL selrahC tsigoloeg eht sa hcus ,srekniht dna stsitneics gnidael rehto dedulcni stnednopserroc siH .stsitneics rehto htiw derahs eh spihsdneirf lautum eht etanimulli dna ,doirep taht tuohguorht krow cifitneics s'niwraD fo tcepsa yreve sucof prahs otni gnirb yehT" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "# Create a new string of the text" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "# Split the text into a new list of sentences\n", 43 | "# Hint: use the string method split('.') which will split the text up at each full stop" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "We don't really want the last sentence as it is just 
empty." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# Create a new list that contains just the first 4 sentences\n", 60 | "# Hint: use list slicing" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# Create a new list with all the sentences reversed\n", 70 | "# Hint: use list slicing" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Create a new string with all the reversed sentences joined by a full stop\n", 80 | "# Hint: start with the string of a full stop and join the list to it like this '.'.join(darwin_list_reversed)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# Remove the space at the front of the string and give it a new name `darwin_tidy`\n", 90 | "# Hint: use string slicing or the string method lstrip()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# Add a full stop at the end of the string" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# Reverse all the characters\n", 109 | "# Hint: use string slicing" 110 | ] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.8.2" 130 | } 131 | }, 132 | 
"nbformat": 4, 133 | "nbformat_minor": 1 134 | } 135 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "intro-to-text-mining-with-python-2020" 3 | version = "0.1.0" 4 | description = "Cambridge Digital Humanities Learning: Introduction to Text-Mining with Python" 5 | authors = ["Mary Chester-Kadwell "] 6 | 7 | [tool.poetry.dependencies] 8 | python = ">=3.9,<3.11" 9 | beautifulsoup4 = "^4.10.0" 10 | matplotlib = "^3.5.1" 11 | spacy = "3.2.3" 12 | en-core-web-sm = {url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz"} 13 | jupyter = "^1.0.0" 14 | 15 | [tool.poetry.dev-dependencies] 16 | 17 | [build-system] 18 | requires = ["poetry-core>=1.0.0"] 19 | build-backend = "poetry.core.masonry.api" 20 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Uncomment the following line if you wish to install jupyter in your virtual environment 2 | # jupyter 3 | beautifulsoup4 4 | matplotlib 5 | requests 6 | spacy==3.2.3 7 | en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz 8 | -------------------------------------------------------------------------------- /runtime.txt: -------------------------------------------------------------------------------- 1 | python-3.10 --------------------------------------------------------------------------------