├── .gitignore ├── LICENSE ├── Python syntax cheat sheet.ipynb ├── Python syntax cheat sheet.pdf ├── README.md ├── ire-board ├── IRE Board members - complete.ipynb ├── IRE Board members - working.ipynb ├── ire-board.html └── ire_board_scrape.py ├── md-warn-notices ├── Maryland WARN Notices - multiple pages.ipynb └── Maryland WARN Notices.ipynb ├── requirements.txt ├── sd-lobbyists ├── data │ └── .gitkeep └── download_lobbyist_data.py ├── tx-railroad-commission ├── dl_pages_details.py ├── dl_pages_results.py ├── main.py ├── pages-detail │ └── .gitkeep ├── pages-results │ └── .gitkeep ├── scrape_detail_pages.py └── tx-railroad-commission-data.csv └── us-senate-press-gallery ├── U.S. Senate Press Gallery - complete.ipynb └── U.S. Senate Press Gallery - working.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | # Created by https://www.toptal.com/developers/gitignore/api/osx 131 | # Edit at https://www.toptal.com/developers/gitignore?templates=osx 132 | 133 | ### OSX ### 134 | # General 135 | .DS_Store 136 | .AppleDouble 137 | .LSOverride 138 | 139 | # Icon must end with two \r 140 | Icon 141 | 142 | 143 | # Thumbnails 144 | ._* 145 | 146 | # Files that might appear in the root of a volume 147 | .DocumentRevisions-V100 148 | .fseventsd 149 | .Spotlight-V100 150 | .TemporaryItems 151 | .Trashes 152 | .VolumeIcon.icns 153 | .com.apple.timemachine.donotpresent 154 | 155 | # Directories potentially created on remote AFP share 156 | .AppleDB 157 | .AppleDesktop 158 | Network Trash Folder 159 | Temporary Items 160 | .apdisk 161 | 162 | # End of https://www.toptal.com/developers/gitignore/api/osx 163 | 164 | session-notes 165 | tx-railroad-commission/*/*.html 166 | sd-lobbyists/data/*.zip 167 | sd-lobbyists/data/*.csv 168 | ire-board/*.csv -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 IRE & NICAR 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation 
the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Python syntax cheat sheet.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python syntax cheat sheet\n", 8 | "\n", 9 | "This notebook demonstrates some basic syntax rules of the Python programming language.\n", 10 | "\n", 11 | "- [Basic data types](#Basic-data-types)\n", 12 | " - [Strings](#Strings)\n", 13 | " - [Numbers and math](#Numbers-and-math)\n", 14 | " - [Booleans](#Booleans)\n", 15 | "- [Variable assignment](#Variable-assignment)\n", 16 | "- [String methods](#String-methods)\n", 17 | "- [Comments](#Comments)\n", 18 | "- [The print() function](#The-print()-function)\n", 19 | "- [Collections of data](#Collections-of-data)\n", 20 | " - [Lists](#Lists)\n", 21 | " - [Dictionaries](#Dictionaries)\n", 22 | "- [`for` loops](#for-loops)\n", 23 | "- [`if` statements](#if-statements)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Basic data 
types\n", 31 | "Just like Excel and other data processing software, Python recognizes a variety of data types, including three we'll focus on here:\n", 32 | "- Strings (text)\n", 33 | "- Numbers (integers, numbers with decimals and more)\n", 34 | "- Booleans (`True` and `False`).\n", 35 | "\n", 36 | "You can use the built-in [`type()`](https://docs.python.org/3/library/functions.html#type) function to check the data type of a value." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "#### Strings\n", 44 | "\n", 45 | "A string is a group of characters -- letters, numbers, whatever -- enclosed within single or double quotes (doesn't matter as long as they match). The code in these notebooks uses single quotes. (The Python style guide doesn't recommend one over the other: [\"Pick a rule and stick to it.\"](https://www.python.org/dev/peps/pep-0008/#string-quotes))\n", 46 | "\n", 47 | "If your string _contains_ apostrophes or quotes, you have two options: _Escape_ the offending character with a backslash `\\`:\n", 48 | "\n", 49 | "```python\n", 50 | "'Isn\\'t it nice here?'\n", 51 | "```\n", 52 | "\n", 53 | "... or change the surrounding punctuation:\n", 54 | "\n", 55 | "```python\n", 56 | "\"Isn't it nice here?\"\n", 57 | "```\n", 58 | "\n", 59 | "The style guide recommends the latter over the former.\n", 60 | "\n", 61 | "When you call the `type()` function on a string, Python will return `str`.\n", 62 | "\n", 63 | "Calling the [`str()` function](https://docs.python.org/3/library/stdtypes.html#str) on a value will return the string version of that value (see examples below)."
64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "'Investigative Reporters and Editors'" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "type('hello!')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "45" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "type(45)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "str(45)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "type(str(45))" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "str(True)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "If you \"add\" strings together with a plus sign `+`, it will concatenate them:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "'IRE' + '/' + 'NICAR'" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "#### Numbers and math\n", 150 | "\n", 151 | "Python recognizes a variety of numeric data types. Two of the most common are integers (whole numbers) and floats (numbers with decimals).\n", 152 | "\n", 153 | "Calling `int()` on a piece of numeric data (even if it's being stored as a string) will attempt to coerce it to an integer; calling `float()` will try to convert it to a float." 
154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "12" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "12.4" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "type(12)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "type(12.4)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "int(35.6)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "int('45')" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "float(46)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "float('45')" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "You can do [basic math](https://www.digitalocean.com/community/tutorials/how-to-do-math-in-python-3-with-operators) in Python. You can also do [more advanced math](https://docs.python.org/3/library/math.html)." 
233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "4+2" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "10-9" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "5*10" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "1000/10" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "# ** raises a number to the power of another number\n", 278 | "5**2" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "#### Booleans\n", 286 | "\n", 287 | "Just like in Excel, which has `TRUE` and `FALSE` data types, Python has boolean data types. 
They are `True` and `False` -- note that only the first letter is capitalized, and they are not sandwiched between quotes.\n", 288 | "\n", 289 | "Boolean values are typically returned when you're evaluating some sort of conditional statement -- comparing values, checking to see if a string is inside another string or if a value is in a list, etc.\n", 290 | "\n", 291 | "[Python's comparison operators](https://docs.python.org/3/reference/expressions.html#comparisons) include:\n", 292 | "\n", 293 | "- `>` greater than\n", 294 | "- `<` less than\n", 295 | "- `>=` greater than or equal to\n", 296 | "- `<=` less than or equal to\n", 297 | "- `==` equal to\n", 298 | "- `!=` not equal to" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "True" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "False" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "4 > 6" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "10 == 10" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "'crapulence' == 'Crapulence'" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "type(True)" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "### Variable assignment\n", 360 | "\n", 361 | "The `=` sign assigns a value to a variable name that you choose. Later, you can retrieve that value by referencing its variable name. 
Variable names can be pretty much anything you want ([as long as you follow some basic rules](https://thehelloworldprogram.com/python/python-variable-assignment-statements-rules-conventions-naming/)).\n", 362 | "\n", 363 | "This can be a tricky concept at first! For more detail, [here's a pretty good explainer from Digital Ocean](https://www.digitalocean.com/community/tutorials/how-to-use-variables-in-python-3)." 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "my_name = 'Frank'" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "my_name" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "You can also _reassign_ a different value to a variable name, though it's usually better practice to create a new variable." 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "my_name = 'Susan'" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "my_name" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "A common thing to do is to \"save\" the results of an expression by assigning the result to a variable." 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "my_fav_number = 10 + 3" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "my_fav_number" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "It's also common to refer to previously defined variables in an expression: " 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "nfl_teams = 32\n", 448 | "mlb_teams = 30\n", 449 | "nba_teams = 30\n", 450 | "nhl_teams = 31\n", 451 | "\n", 452 | "number_of_pro_sports_teams = nfl_teams + mlb_teams + nba_teams + nhl_teams" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "number_of_pro_sports_teams" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "### String methods\n", 469 | "\n", 470 | "Let's go back to strings for a second. String objects have a number of useful [methods](https://docs.python.org/3/library/stdtypes.html#string-methods) -- let's use an example string to demonstrate a few common ones." 
471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "my_cool_string = ' Hello, friends!'" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "`upper()` converts the string to uppercase:" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "my_cool_string.upper()" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "`lower()` converts to lowercase:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "my_cool_string.lower()" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "`replace()` will replace a piece of text with other text that you specify:" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "my_cool_string.replace('friends', 'enemies')" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "`count()` will count the number of occurrences of a character or group of characters: " 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "my_cool_string.count('H')" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "Note that `count()` is case-sensitive. 
If your task is \"count all the e's,\" convert your original string to upper or lowercase first:" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "my_cool_string.upper().count('E')" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "[`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) will split the string into a [_list_](#Lists) (more on these in a second) on a given delimiter (if you don't specify a delimiter, it'll default to splitting on any run of whitespace and will ignore leading and trailing whitespace):" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "my_cool_string.split()" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "my_cool_string.split(',')" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "metadata": {}, 591 | "outputs": [], 592 | "source": [ 593 | "my_cool_string.split('Pitt')" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "`strip()` removes whitespace from either side of your string (but not internal whitespace):" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "my_cool_string.strip()" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "You can use a cool thing called \"method chaining\" to combine methods -- just tack 'em onto the end. 
Let's say we wanted to strip whitespace from our string _and_ make it uppercase:" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "metadata": {}, 623 | "outputs": [], 624 | "source": [ 625 | "my_cool_string.strip().upper()" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "Notice, however, that our original string is unchanged:" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "my_cool_string" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "Why? Because we haven't assigned the results of anything we've done to a variable. A common thing to do, especially when you're cleaning data, would be to assign the results to a new variable:" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "my_cool_string_clean = my_cool_string.strip().upper()" 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "my_cool_string_clean" 667 | ] 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "metadata": {}, 672 | "source": [ 673 | "### Comments\n", 674 | "A line with a comment -- a note that you don't want Python to interpret -- starts with a `#` sign. 
These are notes to collaborators and to your future self about what's happening at this point in your script, and why.\n", 675 | "\n", 676 | "Typically you'd put this on the line right above the line of code you're commenting on:" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": {}, 683 | "outputs": [], 684 | "source": [ 685 | "avg_settlement = 40827348.34328237\n", 686 | "\n", 687 | "# coercing this to an int because we don't need any decimal precision\n", 688 | "int(avg_settlement)" 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 | "source": [ 695 | "Python has no dedicated multi-line comment syntax, but a string sandwiched between triple quotes (or triple apostrophes) that isn't assigned to a variable is commonly used as one:\n", 696 | "\n", 697 | "`'''\n", 698 | "this\n", 699 | "is a long\n", 700 | "comment\n", 701 | "'''`\n", 702 | "\n", 703 | "or\n", 704 | "\n", 705 | "`\"\"\"\n", 706 | "this\n", 707 | "is a long\n", 708 | "comment\n", 709 | "\"\"\"`" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "### The `print()` function\n", 717 | "\n", 718 | "So far, we've just been running the notebook cells to get the last value returned by the code we write. Using the [`print()`](https://docs.python.org/3/library/functions.html#print) function is a way to print specific things in your script to the screen. This function is handy for debugging.\n", 719 | "\n", 720 | "To print multiple things on the same line, separate them with a comma."
721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "metadata": {}, 727 | "outputs": [], 728 | "source": [ 729 | "print('Hello!')" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "print(my_name)" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "print('Hello,', my_name)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "## Collections of data\n", 755 | "\n", 756 | "Now we're going to talk about two ways you can use Python to group data into a collection: lists and dictionaries." 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "### Lists\n", 764 | "\n", 765 | "A _list_ is a comma-separated list of items inside square brackets: `[]`.\n", 766 | "\n", 767 | "Here's a list of ingredients, each one a string, that together makes up a salsa recipe." 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "salsa_ingredients = ['tomato', 'onion', 'jalapeño', 'lime', 'cilantro']" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "To get an item out of a list, you'd refer to its numerical position in the list -- its _index_ (1, 2, 3, etc.) -- inside square brackets immediately following your reference to that list. In Python, as in many other programming languages, counting starts at 0. That means the first item in a list is item `0`." 
784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": null, 789 | "metadata": {}, 790 | "outputs": [], 791 | "source": [ 792 | "salsa_ingredients[0]" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": null, 798 | "metadata": {}, 799 | "outputs": [], 800 | "source": [ 801 | "salsa_ingredients[1]" 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": {}, 807 | "source": [ 808 | "You can use _negative indexing_ to grab things from the right-hand side of the list -- and in fact, `[-1]` is a common idiom for getting \"the last item in a list\" when it's not clear how many items are in your list." 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "metadata": {}, 815 | "outputs": [], 816 | "source": [ 817 | "salsa_ingredients[-1]" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "If you wanted to get a slice of multiple items out of your list, you'd use colons (just like in Excel, kind of!).\n", 825 | "\n", 826 | "If you wanted to get the first three items, you'd do this:" 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": null, 832 | "metadata": {}, 833 | "outputs": [], 834 | "source": [ 835 | "salsa_ingredients[0:3]" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "You could also have left off the initial 0 -- when you leave out the first number, Python defaults to \"the first item in the list.\" In the same way, if you leave off the last number, Python defaults to \"the last item in the list.\"" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": {}, 849 | "outputs": [], 850 | "source": [ 851 | "salsa_ingredients[:3]" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "Note, too, that this slice is giving us items 0, 1 and 2. 
The `3` in our slice is the first item we _don't_ want. That can be kind of confusing at first. Let's try a few more:" 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": null, 864 | "metadata": {}, 865 | "outputs": [], 866 | "source": [ 867 | "# everything in the list except the first item\n", 868 | "salsa_ingredients[1:]" 869 | ] 870 | }, 871 | { 872 | "cell_type": "code", 873 | "execution_count": null, 874 | "metadata": {}, 875 | "outputs": [], 876 | "source": [ 877 | "# the second, third and fourth items\n", 878 | "salsa_ingredients[1:4]" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": null, 884 | "metadata": {}, 885 | "outputs": [], 886 | "source": [ 887 | "# the last two items\n", 888 | "salsa_ingredients[-2:]" 889 | ] 890 | }, 891 | { 892 | "cell_type": "markdown", 893 | "metadata": {}, 894 | "source": [ 895 | "To see how many items are in a list, use the `len()` function:" 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": null, 901 | "metadata": {}, 902 | "outputs": [], 903 | "source": [ 904 | "len(salsa_ingredients)" 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": {}, 910 | "source": [ 911 | "To add an item to a list, use the [`append()`](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) method:" 912 | ] 913 | }, 914 | { 915 | "cell_type": "code", 916 | "execution_count": null, 917 | "metadata": {}, 918 | "outputs": [], 919 | "source": [ 920 | "salsa_ingredients" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": {}, 927 | "outputs": [], 928 | "source": [ 929 | "salsa_ingredients.append('mayonnaise')" 930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": null, 935 | "metadata": {}, 936 | "outputs": [], 937 | "source": [ 938 | "salsa_ingredients" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "Haha _gross_. 
To remove an item from a list, use the `pop()` method. If you don't specify the index number of the item you want to pop out, it will default to \"the last item.\"" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "salsa_ingredients.pop()" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": null, 960 | "metadata": { 961 | "scrolled": true 962 | }, 963 | "outputs": [], 964 | "source": [ 965 | "salsa_ingredients" 966 | ] 967 | }, 968 | { 969 | "cell_type": "markdown", 970 | "metadata": {}, 971 | "source": [ 972 | "You can use the [`in` and `not in`](https://docs.python.org/3/reference/expressions.html#membership-test-operations) expressions to test membership in a list (will return a boolean):" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": null, 978 | "metadata": {}, 979 | "outputs": [], 980 | "source": [ 981 | "'lime' in salsa_ingredients" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": null, 987 | "metadata": {}, 988 | "outputs": [], 989 | "source": [ 990 | "'cilantro' not in salsa_ingredients" 991 | ] 992 | }, 993 | { 994 | "cell_type": "markdown", 995 | "metadata": {}, 996 | "source": [ 997 | "### Dictionaries\n", 998 | "\n", 999 | "A _dictionary_ is a comma-separated list of key/value pairs inside curly brackets: `{}`. 
Let's make an entire salsa recipe:" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": null, 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "salsa = {\n", 1009 | "    'ingredients': salsa_ingredients,\n", 1010 | "    'instructions': 'Chop up all the ingredients and cook them for a while.',\n", 1011 | "    'oz_made': 12\n", 1012 | "}" 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "markdown", 1017 | "metadata": {}, 1018 | "source": [ 1019 | "To retrieve a value from a dictionary, you'd refer to the name of its key inside square brackets `[]` immediately after your reference to the dictionary:" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "salsa['oz_made']" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "code", 1033 | "execution_count": null, 1034 | "metadata": {}, 1035 | "outputs": [], 1036 | "source": [ 1037 | "salsa['ingredients']" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "markdown", 1042 | "metadata": {}, 1043 | "source": [ 1044 | "To add a new key/value pair to a dictionary, assign a new key to the dictionary inside square brackets and set the value of that key with `=`:" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "salsa['tastes_great'] = True" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": null, 1059 | "metadata": {}, 1060 | "outputs": [], 1061 | "source": [ 1062 | "salsa" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "markdown", 1067 | "metadata": {}, 1068 | "source": [ 1069 | "To delete a key/value pair out of a dictionary, use the `del` command and reference the key:" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": null, 1075 | "metadata": {}, 1076 | "outputs": [], 1077 | "source": [ 1078 | "del 
salsa['tastes_great']" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "execution_count": null, 1084 | "metadata": {}, 1085 | "outputs": [], 1086 | "source": [ 1087 | "salsa" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 1092 | "metadata": {}, 1093 | "source": [ 1094 | "### Indentation\n", 1095 | "\n", 1096 | "Whitespace matters in Python. Sometimes you'll need to indent bits of code to make things work. This can be confusing! `IndentationError`s are common even for experienced programmers. (FWIW, Jupyter will try to be helpful and insert the correct amount of \"significant whitespace\" for you.)\n", 1097 | "\n", 1098 | "You can use tabs or spaces, just don't mix them. [The Python style guide](https://www.python.org/dev/peps/pep-0008/) recommends indenting your code in groups of four spaces, so that's what we'll use." 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "markdown", 1103 | "metadata": {}, 1104 | "source": [ 1105 | "### `for` loops\n", 1106 | "\n", 1107 | "You would use a `for` loop to iterate over a collection of things. The statement begins with the keyword `for` (lowercase), then a temporary `variable_name` of your choice to represent each item as you loop through the collection, then the Python keyword `in`, then the collection you're looping over (or its variable name), then a colon, then the indented block of code with instructions about what to do with each item in the collection.\n", 1108 | "\n", 1109 | "Let's say we have a list of numbers that we assign to the variable `list_of_numbers`." 
1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": null, 1115 | "metadata": {}, 1116 | "outputs": [], 1117 | "source": [ 1118 | "list_of_numbers = [1, 2, 3, 4, 5, 6]" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "markdown", 1123 | "metadata": {}, 1124 | "source": [ 1125 | "We could loop over the list and print out each number:" 1126 | ] 1127 | }, 1128 | { 1129 | "cell_type": "code", 1130 | "execution_count": null, 1131 | "metadata": {}, 1132 | "outputs": [], 1133 | "source": [ 1134 | "for number in list_of_numbers:\n", 1135 | " print(number)" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "markdown", 1140 | "metadata": {}, 1141 | "source": [ 1142 | "We could print out each number _times 6_:" 1143 | ] 1144 | }, 1145 | { 1146 | "cell_type": "code", 1147 | "execution_count": null, 1148 | "metadata": {}, 1149 | "outputs": [], 1150 | "source": [ 1151 | "for number in list_of_numbers:\n", 1152 | " print(number*6)" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "markdown", 1157 | "metadata": {}, 1158 | "source": [ 1159 | "... whatever you need to do in your loop. Note that the variable name `number` in our loop is totally arbitrary. This also would work:" 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "execution_count": null, 1165 | "metadata": {}, 1166 | "outputs": [], 1167 | "source": [ 1168 | "for banana in list_of_numbers:\n", 1169 | " print(banana)" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": {}, 1175 | "source": [ 1176 | "It can be hard, at first, to figure out what's a \"Python word\" and what's a variable name that you get to define. This comes with practice." 1177 | ] 1178 | }, 1179 | { 1180 | "cell_type": "markdown", 1181 | "metadata": {}, 1182 | "source": [ 1183 | "Strings are iterable, too. 
Let's loop over the letters in a sentence:" 1184 | ] 1185 | }, 1186 | { 1187 | "cell_type": "code", 1188 | "execution_count": null, 1189 | "metadata": {}, 1190 | "outputs": [], 1191 | "source": [ 1192 | "sentence = 'Hello, IRE/NICAR!'\n", 1193 | "\n", 1194 | "for letter in sentence:\n", 1195 | " print(letter)" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "markdown", 1200 | "metadata": {}, 1201 | "source": [ 1202 | "To this point: Strings are sequences, like lists, so you can use many of the same operations:" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "code", 1207 | "execution_count": null, 1208 | "metadata": {}, 1209 | "outputs": [], 1210 | "source": [ 1211 | "# get the first five characters\n", 1212 | "sentence[:5]" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "code", 1217 | "execution_count": null, 1218 | "metadata": {}, 1219 | "outputs": [], 1220 | "source": [ 1221 | "# get the length of the sentence\n", 1222 | "len(sentence)" 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "code", 1227 | "execution_count": null, 1228 | "metadata": {}, 1229 | "outputs": [], 1230 | "source": [ 1231 | "'Hello' in sentence" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "markdown", 1236 | "metadata": {}, 1237 | "source": [ 1238 | "You can iterate over dictionaries, too -- as of Python 3.7, dictionaries keep their items in the order they were added (older versions made no such guarantee).\n", 1239 | "\n", 1240 | "When you're looping over a dictionary, the variable name in your `for` loop will refer to the keys. Let's loop over our `salsa` dictionary from up above to see what I mean." 
1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "execution_count": null, 1246 | "metadata": {}, 1247 | "outputs": [], 1248 | "source": [ 1249 | "for key in salsa:\n", 1250 | " print(key)" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "markdown", 1255 | "metadata": {}, 1256 | "source": [ 1257 | "To get the _value_ of a dictionary item in a for loop, you'd need to use the key to retrieve it from the dictionary:" 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "code", 1262 | "execution_count": null, 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | "for key in salsa:\n", 1267 | " print(salsa[key])" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "markdown", 1272 | "metadata": {}, 1273 | "source": [ 1274 | "### `if` statements\n", 1275 | "Just like in Excel, you can use the \"if\" keyword to handle conditional logic.\n", 1276 | "\n", 1277 | "These statements begin with the keyword `if` (lowercase), then the condition to evaluate, then a colon, then a new line with a block of indented code to execute if the condition resolves to `True`." 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "code", 1282 | "execution_count": null, 1283 | "metadata": {}, 1284 | "outputs": [], 1285 | "source": [ 1286 | "if 4 < 6:\n", 1287 | " print('4 is less than 6')" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "markdown", 1292 | "metadata": {}, 1293 | "source": [ 1294 | "You can also add an `else` statement (and a colon) with an indented block of code you want to run if the condition resolves to `False`." 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "code", 1299 | "execution_count": null, 1300 | "metadata": {}, 1301 | "outputs": [], 1302 | "source": [ 1303 | "if 4 > 6:\n", 1304 | " print('4 is greater than 6?!')\n", 1305 | "else:\n", 1306 | " print('4 is not greater than 6.')" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "markdown", 1311 | "metadata": {}, 1312 | "source": [ 1313 | "If you need to, you can add multiple conditions with `elif`." 
1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": null, 1319 | "metadata": {}, 1320 | "outputs": [], 1321 | "source": [ 1322 | "HOME_SCORE = 6\n", 1323 | "AWAY_SCORE = 8\n", 1324 | "\n", 1325 | "if HOME_SCORE > AWAY_SCORE:\n", 1326 | " print('we won!')\n", 1327 | "elif HOME_SCORE == AWAY_SCORE:\n", 1328 | " print('we tied!')\n", 1329 | "else:\n", 1330 | " print('we lost!')" 1331 | ] 1332 | } 1333 | ], 1334 | "metadata": { 1335 | "kernelspec": { 1336 | "display_name": "Python 3 (ipykernel)", 1337 | "language": "python", 1338 | "name": "python3" 1339 | }, 1340 | "language_info": { 1341 | "codemirror_mode": { 1342 | "name": "ipython", 1343 | "version": 3 1344 | }, 1345 | "file_extension": ".py", 1346 | "mimetype": "text/x-python", 1347 | "name": "python", 1348 | "nbconvert_exporter": "python", 1349 | "pygments_lexer": "ipython3", 1350 | "version": "3.10.9" 1351 | } 1352 | }, 1353 | "nbformat": 4, 1354 | "nbformat_minor": 2 1355 | } 1356 | -------------------------------------------------------------------------------- /Python syntax cheat sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cjwinchester/nicar23-python-scraping/06b9e729075e6c04c7f0c777d3d99c317332c95a/Python syntax cheat sheet.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NICAR 2023: Web scraping with Python 2 | 3 | ### 🔗 [bit.ly/nicar23-scraping](https://bit.ly/nicar23-scraping) 4 | 5 | This repo contains materials for a half-day workshop at the NICAR 2023 data journalism conference in Nashville on using Python to scrape data from websites. 6 | 7 | The session is scheduled for Sunday, March 5, from 9 a.m. - 12:30 p.m. in room `Midtown 3` on Meeting Space Level 2. 8 | 9 | ### First step 10 | 11 | Open the Terminal application. 
Copy and paste this text into the Terminal and hit enter: 12 | 13 | ```bat 14 | cd Desktop/hands_on_classes/20230305-sunday-web-scraping-with-python--preregistered-attendees-only & .\env\Scripts\activate 15 | ``` 16 | 17 | ### Course outline 18 | - Do you really need to scrape this? 19 | - Process overview: 20 | - Fetch, parse, write data to file 21 | - Some best practices 22 | - Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable 23 | - Don't DDOS your target server 24 | - When feasible, save copies of pages locally, then scrape from those files 25 | - [Rotate user-agent strings](https://www.useragents.me/) and other headers if necessary to avoid bot detection 26 | - Use your favorite browser's inspection tools to deconstruct the target page(s) 27 | - See if the data is delivered to the page in a ready-to-use format, such as JSON ([example](https://sdlegislature.gov/Session/Archived)) 28 | - Is the HTML part of the actual page structure, or is it built on the fly when the page loads? ([example](https://rrctx.force.com/s/complaints)) 29 | - Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? ([example](https://rrctx.force.com/s/ietrs-complaint/a0ct0000000mOmhAAE/complaint0000000008)) 30 | - Are there [URL query parameters](https://en.wikipedia.org/wiki/Query_string) that you can tweak to get different results? 
([example](https://www.worksafe.qld.gov.au/news-and-events/alerts)) 31 | - Choose tools that make the most sense for your target page(s) -- a few popular options: 32 | - [`requests`](https://requests.readthedocs.io/en/latest/) and [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 33 | - [`playwright`](https://playwright.dev/python) (optionally using `BeautifulSoup` for the HTML parsing) 34 | - [`scrapy`](https://scrapy.org/) for larger spidering/crawling tasks 35 | - Overview of our Python setup today 36 | - Activating the virtual environment 37 | - Jupyter notebooks 38 | - Running `.py` files from the command line 39 | - Our projects today: 40 | - [Maryland WARN notices](md-warn-notices) 41 | - [U.S. Senate press gallery](us-senate-press-gallery) 42 | - [IRE board members](ire-board) 43 | - [South Dakota lobbyist registration data](sd-lobbyists) 44 | - [Texas Railroad Commission complaints](tx-railroad-commission) 45 | 46 | ### Additional resources 47 | - Need to scrape on a timer? [Try GitHub Actions](https://palewi.re/docs/first-github-scraper) (Other options: Using your computer's scheduler tools, putting your script on a remote server with a [`crontab` configuration](https://en.wikipedia.org/wiki/Cron), [switching to Google Apps Script and setting up time-based triggers](https://developers.google.com/apps-script/guides/triggers), etc.) 
48 | - [A neat technique for copying data to your clipboard while scraping a Flourish visualization](https://til.simonwillison.net/shot-scraper/scraping-flourish) 49 | - [Walkthrough: Class-based scraping](https://blog.apps.npr.org/2016/06/17/scraping-tips.html) 50 | 51 | 52 | ### Running this code at home 53 | - Install Python, if you haven't already ([here's our guide](https://docs.google.com/document/d/1cYmpfZEZ8r-09Q6Go917cKVcQk_d0P61gm0q8DAdIdg/edit)) 54 | - Clone or download this repo 55 | - `cd` into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: `pip install -r requirements.txt` 56 | - `playwright install` 57 | - `jupyter notebook` to launch the notebook server 58 | -------------------------------------------------------------------------------- /ire-board/IRE Board members - complete.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4917bfba", 6 | "metadata": {}, 7 | "source": [ 8 | "# IRE Board members\n", 9 | "\n", 10 | "The goal: Scrape [this list of IRE board members](https://www.ire.org/about-ire/past-ire-board-members/) into a CSV.\n", 11 | "\n", 12 | "This project introduces a few new concepts:\n", 13 | "- Scraping data that's not part of a table\n", 14 | "- Specifying custom request headers to evade a bot detection rule on our server\n", 15 | "- Using string methods and default values when parsing out the data" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "id": "bfd3d8c7", 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "# stdlib library we'll use to write the CSV file\n", 26 | "import csv\n", 27 | "\n", 28 | "# installed library to handle the HTTP traffic\n", 29 | "import requests\n", 30 | "\n", 31 | "# installed library to parse the HTML\n", 32 | "from bs4 import BeautifulSoup" 33 | ] 34 | }, 35 | { 36 | "cell_type": 
"code", 37 | "execution_count": null, 38 | "id": "1acd7756", 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "URL = 'https://www.ire.org/about-ire/past-ire-board-members/'" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "id": "accded42", 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# set up request headers\n", 53 | "# the IRE website rejects incoming requests with the\n", 54 | "# `requests` library's default user-agent, so we\n", 55 | "# need to pretend to be a browser -- we can do that by\n", 56 | "# setting the `User-Agent` value to mimic a value that\n", 57 | "# a browser would send, and add this to the headers\n", 58 | "# of the request before it's sent\n", 59 | "# read more: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent\n", 60 | "headers = {\n", 61 | " 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'\n", 62 | "}" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "id": "03294e7e", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "# send a GET request to fetch the page using the headers we just created\n", 73 | "r = requests.get(\n", 74 | " 'https://www.ire.org/about-ire/past-ire-board-members/',\n", 75 | " headers=headers\n", 76 | ")\n", 77 | "\n", 78 | "# raise an error if the HTTP request returns an error code\n", 79 | "# HTTP codes: https://http.cat\n", 80 | "r.raise_for_status()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "id": "e5c65871", 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "# use the BeautifulSoup object to parse the response text\n", 91 | "# -- r.text -- with the default HTML parser\n", 92 | "# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use\n", 93 | "soup = BeautifulSoup(r.text, 'html.parser')" 94 | ] 95 | }, 96 | { 97 | "cell_type": 
"code", 98 | "execution_count": null, 99 | "id": "400f25c3", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "print(soup)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "73db6014", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# search the HTML tree to find the div\n", 114 | "# with the `id` attribute of \"past-ire-board-members\"\n", 115 | "target_div = soup.find(\n", 116 | " 'div',\n", 117 | " {'id': 'past-ire-board-members'}\n", 118 | ")" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "id": "df88000b", 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "print(target_div)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "4ad3f74f", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "# within that div, find all the paragraph tags\n", 139 | "members = target_div.find_all('p')" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "id": "7b51b34b", 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "members" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "id": "cb711ee3", 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# set up the CSV headers to write to file\n", 160 | "csv_headers = [\n", 161 | " 'name',\n", 162 | " 'terms',\n", 163 | " 'was_president',\n", 164 | " 'is_deceased'\n", 165 | "]" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "id": "787cb02f", 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# next, set up the file to write the CSV data into\n", 176 | "# https://docs.python.org/3/library/csv.html#csv.writer\n", 177 | "\n", 178 | "# open the CSV file in write ('w') mode, specifying newline='' to deal with\n", 179 | "# potential PC-only line ending problem\n", 180 | "with 
open('ire-board.csv', 'w', newline='') as outfile:\n", 181 | "\n", 182 | " # set up a csv.writer object tied to the file we just opened\n", 183 | " writer = csv.writer(outfile)\n", 184 | "\n", 185 | " # write the list of headers\n", 186 | " writer.writerow(csv_headers)\n", 187 | "\n", 188 | " # loop over the list of paragraphs we targeted above\n", 189 | " for member in members:\n", 190 | "\n", 191 | " # we don't want the entire Tag object, just the text\n", 192 | " text = member.text\n", 193 | "\n", 194 | " # set up some default values -- the member was not president\n", 195 | " was_president = False\n", 196 | "\n", 197 | " # and is not deceased\n", 198 | " is_deceased = False\n", 199 | "\n", 200 | " # IRE denotes past presidents with a leading asterisk\n", 201 | " # so check to see if the string startswith '*'\n", 202 | " # https://docs.python.org/3/library/stdtypes.html?highlight=startswith#str.startswith\n", 203 | " if text.startswith('*'):\n", 204 | "\n", 205 | " # if so, switch the value for the `was_president` variable to True\n", 206 | " was_president = True\n", 207 | "\n", 208 | " # check to see if \"(dec)\" is anywhere in the text, which\n", 209 | " # indicates this person is deceased\n", 210 | " # https://docs.python.org/3/reference/expressions.html#in\n", 211 | " if '(dec)' in text:\n", 212 | " is_deceased = True\n", 213 | "\n", 214 | " # next, start parsing out the pieces\n", 215 | " # separate the name from the terms by splitting on \"(\"\n", 216 | " text_split = text.split('(')\n", 217 | "\n", 218 | " # the name will be the first ([0]) item in the resulting list\n", 219 | " # while we're at it, strip off any leading asterisks\n", 220 | " # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.lstrip\n", 221 | " # and strip() off any leading or trailing whitespace\n", 222 | " # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.strip\n", 223 | " name = text_split[0].lstrip('*').strip()\n", 224 | "\n", 225 | " # the 
term(s) of service will be the second item ([1]) in that list\n", 226 | " # and the term text is always terminated with a closing parens\n", 227 | " # so splitting on that closing parens and taking the first ([0])\n", 228 | " # item in the list will give us the term(s)\n", 229 | " terms = text_split[1].split(')')[0]\n", 230 | "\n", 231 | " # put the collected data into a list\n", 232 | " data = [\n", 233 | " name,\n", 234 | " terms,\n", 235 | " was_president,\n", 236 | " is_deceased\n", 237 | " ]\n", 238 | "\n", 239 | " # and write this row of data into the CSV file\n", 240 | " writer.writerow(data)" 241 | ] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python 3 (ipykernel)", 247 | "language": "python", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 3 254 | }, 255 | "file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython3", 260 | "version": "3.10.9" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 5 265 | } 266 | -------------------------------------------------------------------------------- /ire-board/IRE Board members - working.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4917bfba", 6 | "metadata": {}, 7 | "source": [ 8 | "# IRE Board members\n", 9 | "\n", 10 | "The goal: Scrape [this list of IRE board members](https://www.ire.org/about-ire/past-ire-board-members/) into a CSV.\n", 11 | "\n", 12 | "This project introduces a few new concepts:\n", 13 | "- Scraping data that's not part of a table\n", 14 | "- Specifying custom request headers to evade a bot detection rule on our server\n", 15 | "- Using string methods and default values when parsing out the data\n", 16 | "\n", 17 | "[The completed version is 
here](IRE%20Board%20members%20-%20complete.ipynb).\n", 18 | "\n", 19 | "([See also this standalone version featuring a few more advanced techniques](/edit/ire-board/ire_board_scrape.py).)" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "id": "bfd3d8c7", 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# stdlib library we'll use to write the CSV file\n", 30 | "import csv\n", 31 | "\n", 32 | "# installed library to handle the HTTP traffic\n", 33 | "import requests\n", 34 | "\n", 35 | "# installed library to parse the HTML\n", 36 | "from bs4 import BeautifulSoup" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "1acd7756", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "URL = 'https://www.ire.org/about-ire/past-ire-board-members/'" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "434e47d8", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# make the request\n", 57 | "\n", 58 | "# check for HTTP errors" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "id": "accded42", 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "# set up request headers with a custom user-agent string\n" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "03294e7e", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# try the request again, with the new headers\n", 79 | "\n", 80 | "\n", 81 | "# and raise for errors\n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "id": "e5c65871", 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "# parse the HTML into soup\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "7c3f3e35", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 
null, 105 | "id": "73db6014", 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "# search the HTML tree to find the div\n", 110 | "# with the `id` attribute of \"past-ire-board-members\"\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "id": "df88000b", 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "id": "4ad3f74f", 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# within that div, find all the paragraph tags\n" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "7b51b34b", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "id": "c0058f3a", 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# noodle around here to isolate the pieces of data for export" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "id": "6ec1c43b", 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "id": "a8373c6c", 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "id": "7cfb1966", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "f35f134d", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "id": "cb711ee3", 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "# set up the CSV headers to write to file\n" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "787cb02f", 195 | "metadata": {}, 196 | 
"outputs": [], 197 | "source": [ 198 | "# next, set up the file to write the CSV data into\n", 199 | "# https://docs.python.org/3/library/csv.html#csv.writer\n", 200 | "\n", 201 | "# open the CSV file in write ('w') mode, specifying newline='' to deal with\n", 202 | "# potential PC-only line ending problem\n", 203 | "\n", 204 | "\n", 205 | " # set up a csv.writer object tied to the file we just opened\n", 206 | "\n", 207 | "\n", 208 | " # write the list of headers\n", 209 | "\n", 210 | "\n", 211 | " # loop over the list of paragraphs we targeted above\n", 212 | "\n", 213 | "\n", 214 | " # we don't want the entire Tag object, just the text\n", 215 | "\n", 216 | "\n", 217 | " # set up some default values -- the member was not president\n", 218 | "\n", 219 | "\n", 220 | " # and is not deceased\n", 221 | "\n", 222 | "\n", 223 | " # IRE denotes past presidents with a leading asterisk\n", 224 | " # so check to see if the string startswith '*'\n", 225 | " # https://docs.python.org/3/library/stdtypes.html?highlight=startswith#str.startswith\n", 226 | "\n", 227 | "\n", 228 | " # if so, switch the value for the `was_president` variable to True\n", 229 | "\n", 230 | "\n", 231 | " # check to see if \"(dec)\" is anywhere in the text, which\n", 232 | " # indicates this person is deceased\n", 233 | " # https://docs.python.org/3/reference/expressions.html#in\n", 234 | "\n", 235 | "\n", 236 | " # next, start parsing out the pieces\n", 237 | " # separate the name from the terms by splitting on \"(\"\n", 238 | "\n", 239 | "\n", 240 | " # the name will be the first ([0]) item in the resulting list\n", 241 | " # while we're at it, strip off any leading asterisks\n", 242 | " # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.lstrip\n", 243 | " # and strip() off any leading or trailing whitespace\n", 244 | " # https://docs.python.org/3/library/stdtypes.html?highlight=lstrip#str.strip\n", 245 | "\n", 246 | "\n", 247 | " # the term(s) of service will be the second item 
([1]) in that list\n", 248 | " # and the term text is always terminated with a closing parens\n", 249 | " # so splitting on that closing parens and taking the first ([0])\n", 250 | " # item in the list will give us the term(s)\n", 251 | "\n", 252 | "\n", 253 | " # put the collected data into a list\n", 254 | "\n", 255 | "\n", 256 | " # and write this row of data into the CSV file\n" 257 | ] 258 | } 259 | ], 260 | "metadata": { 261 | "kernelspec": { 262 | "display_name": "Python 3 (ipykernel)", 263 | "language": "python", 264 | "name": "python3" 265 | }, 266 | "language_info": { 267 | "codemirror_mode": { 268 | "name": "ipython", 269 | "version": 3 270 | }, 271 | "file_extension": ".py", 272 | "mimetype": "text/x-python", 273 | "name": "python", 274 | "nbconvert_exporter": "python", 275 | "pygments_lexer": "ipython3", 276 | "version": "3.10.9" 277 | } 278 | }, 279 | "nbformat": 4, 280 | "nbformat_minor": 5 281 | } 282 | -------------------------------------------------------------------------------- /ire-board/ire_board_scrape.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This version demonstrates a few more advanced techniques -- inline comments are mainly for stuff not covered in the basic script: 3 | - Separation of concerns: Writing a function to handle each task -- downloading the page and scraping the data -- and setting up the script to allow those functions to be imported into other scripts, if that need should ever arise 4 | - Doing a little more text processing to break the name into last/rest components, and to separate out terms of service, so now the atomic observation being written to file is a term of service, not a board member 5 | - Using csv.DictWriter instead of csv.writer 6 | - Demonstrating a few other useful Python techniques, such as list comprehensions, multiple assignment, star unpacking and custom list sorting 7 | ''' 8 | 9 | import os 10 | import csv 11 | 12 | import requests 13 | from bs4 
import BeautifulSoup 14 | 15 | 16 | def download_page(url, html_file_out): 17 | 18 | if not os.path.exists(html_file_out): 19 | 20 | headers = { 21 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' # noqa 22 | } 23 | 24 | r = requests.get( 25 | url, 26 | headers=headers 27 | ) 28 | 29 | r.raise_for_status() 30 | 31 | with open(html_file_out, 'w') as outfile: 32 | outfile.write(r.text) 33 | 34 | print(f'Downloaded {html_file_out}') 35 | 36 | return html_file_out 37 | 38 | 39 | def parse_data(html_file_in, csv_file_out): 40 | with open(html_file_in, 'r') as infile: 41 | html = infile.read() 42 | 43 | soup = BeautifulSoup( 44 | html, 45 | 'html.parser' 46 | ) 47 | 48 | target_div = soup.find( 49 | 'div', 50 | {'id': 'past-ire-board-members'} 51 | ) 52 | 53 | # https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions 54 | members = [x.text.strip() for x in target_div.find_all('p')] 55 | 56 | csv_headers = [ 57 | 'name_last', 58 | 'name_rest', 59 | 'term_start', 60 | 'term_end', 61 | 'was_president', 62 | 'is_deceased' 63 | ] 64 | 65 | # start an empty list to hold records to write 66 | parsed_member_data = [] 67 | 68 | # loop over member text 69 | for member in members: 70 | 71 | was_president = False 72 | is_deceased = False 73 | 74 | if member.startswith('*'): 75 | was_president = True 76 | 77 | if '(dec)' in member: 78 | is_deceased = True 79 | 80 | # https://exercism.org/tracks/python/concepts/unpacking-and-multiple-assignment 81 | # https://docs.python.org/3/tutorial/controlflow.html?highlight=unpack#unpacking-argument-lists 82 | # here, the value attached to the `rest` var is ignored 83 | name, terms, *rest = member.split('(') 84 | 85 | name_clean = name.lstrip('*').strip() 86 | terms_clean = terms.split(')')[0] 87 | 88 | # split the name into last, rest 89 | name_split = name_clean.rsplit(' ', 1) 90 | 91 | # handle generational suffixes 92 | if name_split[-1] == 
'Jr.': 93 | name_split = name_split[0].rsplit(' ', 1) 94 | name_split[0] += ' Jr.' 95 | 96 | rest, last = name_split 97 | 98 | # loop over the terms of service 99 | for term in terms_clean.split(','): 100 | term_start, term_end = term.strip().split('-') 101 | 102 | # create a dict by zipping together the headers with the list of data 103 | data = dict(zip(csv_headers, [ 104 | last, 105 | rest, 106 | term_start, 107 | term_end, 108 | was_president, 109 | is_deceased 110 | ])) 111 | 112 | # add the dict to the main list 113 | parsed_member_data.append(data) 114 | 115 | # sort member data by last name, then first name, then term start 116 | data_sorted = sorted( 117 | parsed_member_data, 118 | key=lambda x: ( 119 | x['name_last'], 120 | x['name_rest'], 121 | x['term_start'] 122 | ) 123 | ) 124 | 125 | # write to file, specifying the encoding and 126 | # dealing with a Windows-specific problem that 127 | # sometimes pops up when writing to file 128 | with open(csv_file_out, 'w', encoding='utf-8', newline='') as outfile: 129 | writer = csv.DictWriter( 130 | outfile, 131 | fieldnames=csv_headers 132 | ) 133 | writer.writeheader() 134 | writer.writerows(data_sorted) 135 | 136 | print(f'Wrote {csv_file_out}') 137 | 138 | 139 | # https://realpython.com/if-name-main-python/ 140 | if __name__ == '__main__': 141 | 142 | url = 'https://www.ire.org/about-ire/past-ire-board-members/' 143 | 144 | # https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals 145 | files_name = 'ire-board' 146 | filename_page = f'{files_name}.html' 147 | filename_csv = f'{files_name}-terms.csv' 148 | 149 | # call the functions 150 | download_page(url, filename_page) 151 | parse_data(filename_page, filename_csv) 152 | -------------------------------------------------------------------------------- /md-warn-notices/Maryland WARN Notices - multiple pages.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": 
"markdown", 5 | "id": "8fa7c11c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Maryland WARN Notices - multiple pages\n", 9 | "\n", 10 | "Extra credit: Figure out how to target and extract WARN data for multiple years. The process:\n", 11 | "- Using `requests`, fetch the main page\n", 12 | "- Using `bs4`, target the list of links to pages with data for previous years\n", 13 | "- Using a `for` loop, iterate over each link\n", 14 | " - Fetch the page\n", 15 | " - Turn the contents into `soup`\n", 16 | " - Target the elements to extract\n", 17 | " - Add the parsed data to your list" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "id": "7b6c6fb2", 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [] 27 | } 28 | ], 29 | "metadata": { 30 | "kernelspec": { 31 | "display_name": "Python 3 (ipykernel)", 32 | "language": "python", 33 | "name": "python3" 34 | }, 35 | "language_info": { 36 | "codemirror_mode": { 37 | "name": "ipython", 38 | "version": 3 39 | }, 40 | "file_extension": ".py", 41 | "mimetype": "text/x-python", 42 | "name": "python", 43 | "nbconvert_exporter": "python", 44 | "pygments_lexer": "ipython3", 45 | "version": "3.10.9" 46 | } 47 | }, 48 | "nbformat": 4, 49 | "nbformat_minor": 5 50 | } 51 | -------------------------------------------------------------------------------- /md-warn-notices/Maryland WARN Notices.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Maryland WARN Notices\n", 8 | "\n", 9 | "The goal: Scrape the main table on [the first page of Maryland's list of WARN letters](https://www.dllr.state.md.us/employment/warn.shtml) and, if time, write the data to a CSV.\n", 10 | "\n", 11 | "### Table of contents\n", 12 | "\n", 13 | "- [Using Jupyter notebooks](#Using-Jupyter-notebooks)\n", 14 | "- [What _is_ a web page, anyway?](#What-is-a-web-page,-anyway?)\n", 15 | "- [Inspect 
the source](#Inspect-the-source)\n", 16 | "- [Import libraries](#Import-libraries)\n", 17 | "- [Request the page](#Request-the-page)\n", 18 | "- [Turn your HTML into soup](#Turn-your-HTML-into-soup)\n", 19 | "- [Targeting and extracting data](#Targeting-and-extracting-data)\n", 20 | "- [Write the results to file](#Write-the-results-to-file)" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### Using Jupyter notebooks\n", 28 | "\n", 29 | "There are several ways to write and run Python code on your computer. One way -- the method we're using today -- is to use [Jupyter notebooks](https://jupyter.org/), which run in your browser and allow you to intersperse documentation with your code. They're handy for bundling your code with a human-readable explanation of what's happening at each step. Check out some examples from the [L.A. Times](https://github.com/datadesk/notebooks) and [BuzzFeed News](https://github.com/BuzzFeedNews/everything#data-and-analyses).\n", 30 | "\n", 31 | "**To add a new cell to your notebook**: Click the + button in the menu or press the `b` key on your keyboard.\n", 32 | "\n", 33 | "**To run a cell of code**: Select the cell and click the \"Run\" button in the menu, or press Shift+Enter.\n", 34 | "\n", 35 | "**One common gotcha**: The notebook doesn't \"know\" about code you've written until you've _run_ the cell containing it. For example, if you define a variable called `my_name` in one cell and later try to access it in another cell, but get an error that says `NameError: name 'my_name' is not defined`, the most likely solution is to run (or re-run) the cell in which you defined `my_name`."
36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### What _is_ a web page, anyway?\n", 43 | "\n", 44 | "Generally, a web page consists of a bunch of specially formatted text files stored on a computer (a _server_) that's probably sitting on a rack in a giant data center somewhere.\n", 45 | "\n", 46 | "Mostly you'll be dealing with `.html` (HyperText Markup Language) files that might include references to `.css` (Cascading Style Sheets) files, which determine how the page looks, and/or `.js` (JavaScript) files, which add interactivity, and other specially formatted text files.\n", 47 | "\n", 48 | "Today, we'll focus on the HTML, which gives structure to the page.\n", 49 | "\n", 50 | "Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag.\n", 51 | "\n", 52 | "A table, for example, starts with `