├── .gitignore ├── 0_python_basics.ipynb ├── 1_opening_files.ipynb ├── 2_handling_data.ipynb ├── 3_visualizing_data.ipynb ├── 4_web_scraping.ipynb ├── README.md ├── environment.yml ├── example_data ├── 311-service-requests.csv ├── auto_df.csv ├── csv_sample.csv ├── dd_example.h5 ├── excel_sample.xlsx ├── hdf_sample.h5 ├── hkl_example.h5 ├── json_sample.json ├── stata_sample.dta └── text_sample.txt ├── images ├── API_screenshot.PNG ├── CSSInspector_screenshot.PNG ├── CloneRepo.PNG ├── DIV_screenshot.PNG ├── DevTools_screenshot.PNG ├── Earnings_console_screenshot.PNG ├── Earnings_graph_screenshot.PNG ├── SSRN_screenshot.png ├── bannerimage.png └── jupyterimage.png └── postBuild /.gitignore: -------------------------------------------------------------------------------- 1 | /* 2 | !.gitignore 3 | 4 | **untitled*.py 5 | **untitled*.txt 6 | **/.ipynb_checkpoints 7 | **/desktop.ini 8 | 9 | !images/ 10 | !example_data/ 11 | 12 | !*.ipynb 13 | !*.md 14 | !*.yml 15 | 16 | !postBuild 17 | 18 | exercises.ipynb 19 | exercises_answers.ipynb 20 | sandbox.ipynb -------------------------------------------------------------------------------- /0_python_basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basics of the Python syntax" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com)) \n", 15 | "**Last updated:** June 2020 \n", 16 | "**Conda Environment:** `LearnPythonForResearch` \n", 17 | "**Python version:** Python 3.7 \n", 18 | "**License:** MIT License " 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "**Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link: \n", 26 | 
"https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# *Introduction*" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "This notebook contains an overview of basic Python functionality that you might come across using Python for Social Science Research. \n", 41 | "\n", 42 | "**Note:** this notebook is deliberately skipping over some concepts (e.g. classes) to focus on the main things you need to know to get started. " 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "# *Table of Contents* " 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "* [Variables](#variables) \n", 57 | "* [Displaying something](#display)\n", 58 | "* [Numerical operations](#num-operations) \n", 59 | "* [String operations](#string-operations) \n", 60 | "* [Data structures](#data-structures) \n", 61 | "* [Slicing](#slicing) \n", 62 | "* [Functions](#functions) \n", 63 | "* [Whitespaces](#whitespace) \n", 64 | "* [Conditionals](#conditionals) \n", 65 | "* [Looping](#looping) \n", 66 | "* [Comprehensions](#comprehensions) \n", 67 | "* [Catching exceptions](#catching-exceptions) \n", 68 | "* [Importing libraries](#importing) \n", 69 | "* [OS operations](#os-operations) \n", 70 | "* [File Input / Output](#files) " 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "## Variables [(to top)](#toc)" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "Basic numeric types in Python are `int` for integers and `float` for floating point numbers. \n", 85 | "Strings are represented by `str`, in Python 3.x this implies a sequence of Unicode characters." 
86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 1, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "a = 5\n", 95 | "b = 3.5\n", 96 | "c = 'A string'" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 2, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/plain": [ 107 | "(int, float, str)" 108 | ] 109 | }, 110 | "execution_count": 2, 111 | "metadata": {}, 112 | "output_type": "execute_result" 113 | } 114 | ], 115 | "source": [ 116 | "type(a), type(b), type(c)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Converting types:" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "(3, '5')" 135 | ] 136 | }, 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "int(3.6), str(5)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "Checking types:" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 4, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "(int, float, str)" 162 | ] 163 | }, 164 | "execution_count": 4, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "type(a), type(b), type(c)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 5, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "data": { 180 | "text/plain": [ 181 | "False" 182 | ] 183 | }, 184 | "execution_count": 5, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "isinstance(a, float)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "## Displaying something [(to 
top)](#toc)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 6, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "Hello\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "print('Hello')" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "*Note:* `print 'Hello'` does not work in Python 3" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 7, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "Hello World\n" 234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "print('Hello ' + 'World')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 8, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "name": "stdout", 248 | "output_type": "stream", 249 | "text": [ 250 | "I have 2 apples\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "apples = 'apples'\n", 256 | "print('I have', 2, apples)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "## Numerical operations [(to top)](#toc)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 9, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "data": { 273 | "text/plain": [ 274 | "4" 275 | ] 276 | }, 277 | "execution_count": 9, 278 | "metadata": {}, 279 | "output_type": "execute_result" 280 | } 281 | ], 282 | "source": [ 283 | "2+2" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 10, 289 | "metadata": {}, 290 | "outputs": [ 291 | { 292 | "data": { 293 | "text/plain": [ 294 | "0.75" 295 | ] 296 | }, 297 | "execution_count": 10, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "3 / 4" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | 
"*Note:* Python 3 (unlike Python 2) will automatically create a `float` value if you divide two integers." 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "## String operations [(to top)](#toc)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "Define strings with single, double or triple quotes (for multi-line)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 11, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "hello = 'world'\n", 334 | "saying = \"hello world\"\n", 335 | "paragraph = \"\"\" This is\n", 336 | "a paragraph\n", 337 | "\"\"\"" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "*Note:* you can also create a `raw string` by prefixing the string with \"r\" (`r\"string\"`) which will be interpreted literally:" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 12, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "name": "stdout", 354 | "output_type": "stream", 355 | "text": [ 356 | "This is a regular string with a special character\n", 357 | "This is on a new line\n" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "print(\"This is a regular string with a special character\\nThis is on a new line\")" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 13, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "name": "stdout", 372 | "output_type": "stream", 373 | "text": [ 374 | "This is a raw string with a special character\\nThis is on a new line\n" 375 | ] 376 | } 377 | ], 378 | "source": [ 379 | "print(r\"This is a raw string with a special character\\nThis is on a new line\")" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "### Variables in strings" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | 
"source": [ 393 | "Using the `.format()` syntax:" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 14, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "data": { 403 | "text/plain": [ 404 | "'20 0.3333333333333333'" 405 | ] 406 | }, 407 | "execution_count": 14, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "'{} {}'.format(20, 1/3)" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 15, 419 | "metadata": {}, 420 | "outputs": [ 421 | { 422 | "data": { 423 | "text/plain": [ 424 | "'0.3333333333333333 20'" 425 | ] 426 | }, 427 | "execution_count": 15, 428 | "metadata": {}, 429 | "output_type": "execute_result" 430 | } 431 | ], 432 | "source": [ 433 | "'{1} {0}'.format(20, 1/3)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 16, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/plain": [ 444 | "'20.000 0.33'" 445 | ] 446 | }, 447 | "execution_count": 16, 448 | "metadata": {}, 449 | "output_type": "execute_result" 450 | } 451 | ], 452 | "source": [ 453 | "'{:.3f} {:.2f}'.format(20, 1/3)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "*Note 1:* using `.format()` is recommended over the legacy `%` method (`\"The number is: %s\" % 10` --> `'The number is: 10'`) \n", 461 | "*Note 2:* This webpage has a great overview of the various formatting options: https://pyformat.info/ " 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "**Note:** starting from Python 3.6 you can also use so-called \"F-strings\": \n", 469 | "F-strings are great but will not work without Python 3.6+, so `.format()` tends to have better compatibility if you share your code with others." 
470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 17, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "'The year 2018 is pretty awesome with F-strings from Python 3.6'" 481 | ] 482 | }, 483 | "execution_count": 17, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "year, p_version = 2018, '3.6'\n", 490 | "f'The year {year} is pretty awesome with F-strings from Python {p_version}'" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "## Data structures [(to top)](#toc)" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "There are 4 basic data structures: \n", 505 | "* lists (`list`) \n", 506 | "* tuples (`tuple`) \n", 507 | "* dictionaries (`dict`) \n", 508 | "* sets (`set`) " 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "### Lists" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "Lists are enclosed in brackets" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 18, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "data": { 532 | "text/plain": [ 533 | "['dogs', 'cat', 'bird', 'lizard']" 534 | ] 535 | }, 536 | "execution_count": 18, 537 | "metadata": {}, 538 | "output_type": "execute_result" 539 | } 540 | ], 541 | "source": [ 542 | "pets = ['dogs', 'cat', 'bird'] \n", 543 | "pets.append('lizard') \n", 544 | "pets" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### Tuple" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "Tuples are enclosed in parentheses \n", 559 | "*Note:* You cannot add or remove elements from a tuple but they are faster and consume less memory" 560 | ] 561 | }, 562 | { 563 | 
"cell_type": "code", 564 | "execution_count": 19, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/plain": [ 570 | "('dogs', 'cat', 'bird')" 571 | ] 572 | }, 573 | "execution_count": 19, 574 | "metadata": {}, 575 | "output_type": "execute_result" 576 | } 577 | ], 578 | "source": [ 579 | "pets = ('dogs', 'cat', 'bird')\n", 580 | "pets" 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": {}, 586 | "source": [ 587 | "### Dictionaries" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "Dictionaries are built using curly brackets \n", 595 | "*Note:* It is best to treat dictionaries as if they are unordered. You retrieve values based on their keys." 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 20, 601 | "metadata": {}, 602 | "outputs": [ 603 | { 604 | "name": "stdout", 605 | "output_type": "stream", 606 | "text": [ 607 | "fred 29\n" 608 | ] 609 | } 610 | ], 611 | "source": [ 612 | "person = {'name': 'fred', 'age': 29}\n", 613 | "print(person['name'], person['age'])" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "You add items to a dictionary by specifying a key and value:" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": 21, 626 | "metadata": {}, 627 | "outputs": [ 628 | { 629 | "data": { 630 | "text/plain": [ 631 | "{'name': 'fred', 'age': 29, 'money': 50}" 632 | ] 633 | }, 634 | "execution_count": 21, 635 | "metadata": {}, 636 | "output_type": "execute_result" 637 | } 638 | ], 639 | "source": [ 640 | "person['money'] = 50\n", 641 | "person" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 22, 647 | "metadata": {}, 648 | "outputs": [ 649 | { 650 | "data": { 651 | "text/plain": [ 652 | "{'name': 'fred', 'money': 50}" 653 | ] 654 | }, 655 | "execution_count": 22, 656 | "metadata": {}, 657 | "output_type": 
"execute_result" 658 | } 659 | ], 660 | "source": [ 661 | "del person['age']\n", 662 | "person" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "### Set" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "A set is like a list but it can only hold unique values. " 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": 23, 682 | "metadata": {}, 683 | "outputs": [ 684 | { 685 | "data": { 686 | "text/plain": [ 687 | "{'dogs', 'horse', 'zebra'}" 688 | ] 689 | }, 690 | "execution_count": 23, 691 | "metadata": {}, 692 | "output_type": "execute_result" 693 | } 694 | ], 695 | "source": [ 696 | "pets_1 = set(['dogs', 'cat', 'bird'])\n", 697 | "pets_2 = set(['dogs', 'horse', 'zebra', 'zebra'])\n", 698 | "pets_2" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "There are many useful operations that you can perform using sets" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 24, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "data": { 715 | "text/plain": [ 716 | "{'bird', 'cat', 'dogs', 'horse', 'zebra'}" 717 | ] 718 | }, 719 | "execution_count": 24, 720 | "metadata": {}, 721 | "output_type": "execute_result" 722 | } 723 | ], 724 | "source": [ 725 | "pets_1.union(pets_2)" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 25, 731 | "metadata": {}, 732 | "outputs": [ 733 | { 734 | "data": { 735 | "text/plain": [ 736 | "{'dogs'}" 737 | ] 738 | }, 739 | "execution_count": 25, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "pets_1.intersection(pets_2)" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 26, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "data": { 755 | "text/plain": [ 756 | "{'bird', 'cat'}" 757 | ] 758 | }, 759 | "execution_count": 26, 760 
| "metadata": {}, 761 | "output_type": "execute_result" 762 | } 763 | ], 764 | "source": [ 765 | "pets_1.difference(pets_2)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "### Combinations" 773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "Data structures can hold any Python object!" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": 27, 785 | "metadata": {}, 786 | "outputs": [ 787 | { 788 | "data": { 789 | "text/plain": [ 790 | "('apple', 'orange')" 791 | ] 792 | }, 793 | "execution_count": 27, 794 | "metadata": {}, 795 | "output_type": "execute_result" 796 | } 797 | ], 798 | "source": [ 799 | "combo = ('apple', 'orange')\n", 800 | "mix = {'fruit' : [combo, ('banana', 'pear')]}\n", 801 | "mix['fruit'][0]" 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": {}, 807 | "source": [ 808 | "## Slicing [(to top)](#toc)" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "If an object is ordered (such as a list or tuple) you can select on index \n", 816 | "**Note:** Python starts counting at 0. So the first value is at index 0." 
817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 28, 822 | "metadata": {}, 823 | "outputs": [], 824 | "source": [ 825 | "pets = ['dogs', 'cat', 'bird', 'lizard']" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 29, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "data": { 835 | "text/plain": [ 836 | "'dogs'" 837 | ] 838 | }, 839 | "execution_count": 29, 840 | "metadata": {}, 841 | "output_type": "execute_result" 842 | } 843 | ], 844 | "source": [ 845 | "favorite_pet = pets[0]\n", 846 | "favorite_pet" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": 30, 852 | "metadata": {}, 853 | "outputs": [ 854 | { 855 | "data": { 856 | "text/plain": [ 857 | "'lizard'" 858 | ] 859 | }, 860 | "execution_count": 30, 861 | "metadata": {}, 862 | "output_type": "execute_result" 863 | } 864 | ], 865 | "source": [ 866 | "reptile = pets[-1]\n", 867 | "reptile" 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": 31, 873 | "metadata": {}, 874 | "outputs": [ 875 | { 876 | "data": { 877 | "text/plain": [ 878 | "['cat', 'bird']" 879 | ] 880 | }, 881 | "execution_count": 31, 882 | "metadata": {}, 883 | "output_type": "execute_result" 884 | } 885 | ], 886 | "source": [ 887 | "pets[1:3]" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": 32, 893 | "metadata": {}, 894 | "outputs": [ 895 | { 896 | "data": { 897 | "text/plain": [ 898 | "['dogs', 'cat']" 899 | ] 900 | }, 901 | "execution_count": 32, 902 | "metadata": {}, 903 | "output_type": "execute_result" 904 | } 905 | ], 906 | "source": [ 907 | "pets[:2]" 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "*Note:* this also works on strings:" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": 33, 920 | "metadata": {}, 921 | "outputs": [ 922 | { 923 | "data": { 924 | "text/plain": [ 925 | "'ba'" 926 | ] 927 | }, 928 | 
"execution_count": 33, 929 | "metadata": {}, 930 | "output_type": "execute_result" 931 | } 932 | ], 933 | "source": [ 934 | "fruit = 'banana'\n", 935 | "fruit[:2]" 936 | ] 937 | }, 938 | { 939 | "cell_type": "markdown", 940 | "metadata": {}, 941 | "source": [ 942 | "## Functions [(to top)](#toc)" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": {}, 948 | "source": [ 949 | "A Python function takes arguments as input and defines logic to process these inputs (and possibly returns something)." 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": 34, 955 | "metadata": {}, 956 | "outputs": [], 957 | "source": [ 958 | "def add_5(number):\n", 959 | " return number + 5" 960 | ] 961 | }, 962 | { 963 | "cell_type": "markdown", 964 | "metadata": {}, 965 | "source": [ 966 | "The action of defining a function does not execute the code! It will only execute once you call the function:" 967 | ] 968 | }, 969 | { 970 | "cell_type": "code", 971 | "execution_count": 35, 972 | "metadata": {}, 973 | "outputs": [ 974 | { 975 | "data": { 976 | "text/plain": [ 977 | "15" 978 | ] 979 | }, 980 | "execution_count": 35, 981 | "metadata": {}, 982 | "output_type": "execute_result" 983 | } 984 | ], 985 | "source": [ 986 | "add_5(10)" 987 | ] 988 | }, 989 | { 990 | "cell_type": "markdown", 991 | "metadata": {}, 992 | "source": [ 993 | "You can also add arguments with default values:" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": 36, 999 | "metadata": {}, 1000 | "outputs": [], 1001 | "source": [ 1002 | "def add(number, add=5):\n", 1003 | " return number + add" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": 37, 1009 | "metadata": {}, 1010 | "outputs": [ 1011 | { 1012 | "data": { 1013 | "text/plain": [ 1014 | "15" 1015 | ] 1016 | }, 1017 | "execution_count": 37, 1018 | "metadata": {}, 1019 | "output_type": "execute_result" 1020 | } 1021 | ], 1022 | "source": [ 1023 | "add(10)" 1024 | ] 
1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": 38, 1029 | "metadata": {}, 1030 | "outputs": [ 1031 | { 1032 | "data": { 1033 | "text/plain": [ 1034 | "13" 1035 | ] 1036 | }, 1037 | "execution_count": 38, 1038 | "metadata": {}, 1039 | "output_type": "execute_result" 1040 | } 1041 | ], 1042 | "source": [ 1043 | "add(10, add=3)" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "### Python also has unnamed functions for one-time use called \"lambda functions\"" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 39, 1056 | "metadata": {}, 1057 | "outputs": [ 1058 | { 1059 | "name": "stdout", 1060 | "output_type": "stream", 1061 | "text": [ 1062 | "[('one', 1), ('two', 2), ('three', 3), ('four', 4)]\n" 1063 | ] 1064 | } 1065 | ], 1066 | "source": [ 1067 | "pairs = [('three', 3), ('four', 4), ('one', 1), ('two', 2)]\n", 1068 | "pairs.sort(key=lambda pair: pair[1])\n", 1069 | "print(pairs)" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "**Note:** don't worry if these don't make sense. Just remember that they are simply functions without a name." 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "metadata": {}, 1082 | "source": [ 1083 | "## Whitespace (blocks) [(to top)](#toc)" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "markdown", 1088 | "metadata": {}, 1089 | "source": [ 1090 | "Indentations are **required** by Python to sub-set blocks of code. 
\n", 1091 | "*Note:* function blocks have their own local scope (plain `if`/`for` blocks do not), notice variable `a`:" 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "code", 1096 | "execution_count": 40, 1097 | "metadata": {}, 1098 | "outputs": [], 1099 | "source": [ 1100 | "def example():\n", 1101 | " a = 'Layer 1'\n", 1102 | " print('First indentation scope:', a)\n", 1103 | " \n", 1104 | " def layer_2():\n", 1105 | " a = 'Layer 2'\n", 1106 | " print('Second indentation scope:', a)\n", 1107 | " \n", 1108 | " layer_2()" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "code", 1113 | "execution_count": 41, 1114 | "metadata": {}, 1115 | "outputs": [ 1116 | { 1117 | "name": "stdout", 1118 | "output_type": "stream", 1119 | "text": [ 1120 | "First indentation scope: Layer 1\n", 1121 | "Second indentation scope: Layer 2\n" 1122 | ] 1123 | } 1124 | ], 1125 | "source": [ 1126 | "example()" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "markdown", 1131 | "metadata": {}, 1132 | "source": [ 1133 | "## Conditionals [(to top)](#toc)" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 42, 1139 | "metadata": {}, 1140 | "outputs": [ 1141 | { 1142 | "name": "stdout", 1143 | "output_type": "stream", 1144 | "text": [ 1145 | "C\n" 1146 | ] 1147 | } 1148 | ], 1149 | "source": [ 1150 | "grade = 95\n", 1151 | "if grade == 90:\n", 1152 | " print('A')\n", 1153 | "elif grade < 90:\n", 1154 | " print('B')\n", 1155 | "elif grade >= 80:\n", 1156 | " print('C')\n", 1157 | "else:\n", 1158 | " print('D')" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "markdown", 1163 | "metadata": {}, 1164 | "source": [ 1165 | "## Looping [(to top)](#toc)" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": {}, 1171 | "source": [ 1172 | "*Note:* the `range` function generates a range of numbers ([Documentation Link](https://docs.python.org/3.7/library/stdtypes.html?highlight=range#range))" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": 43, 1178 | 
"metadata": {}, 1179 | "outputs": [ 1180 | { 1181 | "name": "stdout", 1182 | "output_type": "stream", 1183 | "text": [ 1184 | "0\n", 1185 | "2\n", 1186 | "4\n" 1187 | ] 1188 | } 1189 | ], 1190 | "source": [ 1191 | "for num in range(0, 6, 2): \n", 1192 | " print(num)" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": 44, 1198 | "metadata": {}, 1199 | "outputs": [ 1200 | { 1201 | "name": "stdout", 1202 | "output_type": "stream", 1203 | "text": [ 1204 | "Apple\n", 1205 | "Banana\n", 1206 | "Orange\n" 1207 | ] 1208 | } 1209 | ], 1210 | "source": [ 1211 | "list_fruit = ['Apple', 'Banana', 'Orange']\n", 1212 | "for fruit in list_fruit:\n", 1213 | " print(fruit)" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "markdown", 1218 | "metadata": {}, 1219 | "source": [ 1220 | "You can break a loop prematurely by including a `break` statement" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 45, 1226 | "metadata": {}, 1227 | "outputs": [ 1228 | { 1229 | "name": "stdout", 1230 | "output_type": "stream", 1231 | "text": [ 1232 | "0\n", 1233 | "1\n", 1234 | "2\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "for num in range(100):\n", 1240 | " print(num)\n", 1241 | " if num == 2:\n", 1242 | " break" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "markdown", 1247 | "metadata": {}, 1248 | "source": [ 1249 | "You can also loop until a condition is met using `while` \n", 1250 | "*Note:* running `while True` will loop indefinitely until a `break` statement is called" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "code", 1255 | "execution_count": 46, 1256 | "metadata": {}, 1257 | "outputs": [ 1258 | { 1259 | "name": "stdout", 1260 | "output_type": "stream", 1261 | "text": [ 1262 | "0\n", 1263 | "1\n", 1264 | "2\n", 1265 | "3\n" 1266 | ] 1267 | } 1268 | ], 1269 | "source": [ 1270 | "count = 0\n", 1271 | "while count < 4:\n", 1272 | " print(count)\n", 1273 | " count += 1" 1274 | ] 1275 | }, 1276 | { 1277 | 
"cell_type": "markdown", 1278 | "metadata": {}, 1279 | "source": [ 1280 | "Looping over a tuple in a list:" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "code", 1285 | "execution_count": 47, 1286 | "metadata": {}, 1287 | "outputs": [ 1288 | { 1289 | "name": "stdout", 1290 | "output_type": "stream", 1291 | "text": [ 1292 | "3\n", 1293 | "7\n" 1294 | ] 1295 | } 1296 | ], 1297 | "source": [ 1298 | "tuple_in_list = [(1, 2), (3, 4)]\n", 1299 | "for a, b in tuple_in_list:\n", 1300 | " print(a + b)" 1301 | ] 1302 | }, 1303 | { 1304 | "cell_type": "markdown", 1305 | "metadata": {}, 1306 | "source": [ 1307 | "Looping over a dictionary: " 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 48, 1313 | "metadata": {}, 1314 | "outputs": [ 1315 | { 1316 | "name": "stdout", 1317 | "output_type": "stream", 1318 | "text": [ 1319 | "one 11\n", 1320 | "two 12\n", 1321 | "three 13\n" 1322 | ] 1323 | } 1324 | ], 1325 | "source": [ 1326 | "dictionary = {'one' : 1, 'two' : 2, 'three' : 3}\n", 1327 | "for key, value in dictionary.items():\n", 1328 | " print(key, value + 10)" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "markdown", 1333 | "metadata": {}, 1334 | "source": [ 1335 | "## Comprehensions [(to top)](#toc)" 1336 | ] 1337 | }, 1338 | { 1339 | "cell_type": "markdown", 1340 | "metadata": {}, 1341 | "source": [ 1342 | "A comprehension makes it easier to generate a list or dictionary using a loop. 
" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "**List comprehension:**" 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "execution_count": 49, 1355 | "metadata": {}, 1356 | "outputs": [ 1357 | { 1358 | "data": { 1359 | "text/plain": [ 1360 | "[5, 6, 7, 8, 9, 10]" 1361 | ] 1362 | }, 1363 | "execution_count": 49, 1364 | "metadata": {}, 1365 | "output_type": "execute_result" 1366 | } 1367 | ], 1368 | "source": [ 1369 | "new_list = [x + 5 for x in range(0,6)]\n", 1370 | "new_list" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "metadata": {}, 1376 | "source": [ 1377 | "*Traditional way to achieve the same:*" 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": 50, 1383 | "metadata": {}, 1384 | "outputs": [ 1385 | { 1386 | "data": { 1387 | "text/plain": [ 1388 | "[5, 6, 7, 8, 9, 10]" 1389 | ] 1390 | }, 1391 | "execution_count": 50, 1392 | "metadata": {}, 1393 | "output_type": "execute_result" 1394 | } 1395 | ], 1396 | "source": [ 1397 | "new_list = []\n", 1398 | "for x in range(0,6):\n", 1399 | " new_list.append(x + 5)\n", 1400 | "new_list" 1401 | ] 1402 | }, 1403 | { 1404 | "cell_type": "markdown", 1405 | "metadata": {}, 1406 | "source": [ 1407 | "**Dictionary comprehension:**" 1408 | ] 1409 | }, 1410 | { 1411 | "cell_type": "code", 1412 | "execution_count": 51, 1413 | "metadata": {}, 1414 | "outputs": [ 1415 | { 1416 | "data": { 1417 | "text/plain": [ 1418 | "{'num_0': 5, 'num_1': 6, 'num_2': 7, 'num_3': 8, 'num_4': 9, 'num_5': 10}" 1419 | ] 1420 | }, 1421 | "execution_count": 51, 1422 | "metadata": {}, 1423 | "output_type": "execute_result" 1424 | } 1425 | ], 1426 | "source": [ 1427 | "new_dict = {'num_{}'.format(x) : x + 5 for x in range(0,6)}\n", 1428 | "new_dict" 1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "markdown", 1433 | "metadata": {}, 1434 | "source": [ 1435 | "*Traditional way to achieve the same:*" 1436 | ] 1437 | }, 1438 | { 
1439 | "cell_type": "code", 1440 | "execution_count": 52, 1441 | "metadata": {}, 1442 | "outputs": [ 1443 | { 1444 | "data": { 1445 | "text/plain": [ 1446 | "{'num_0': 5, 'num_1': 6, 'num_2': 7, 'num_3': 8, 'num_4': 9, 'num_5': 10}" 1447 | ] 1448 | }, 1449 | "execution_count": 52, 1450 | "metadata": {}, 1451 | "output_type": "execute_result" 1452 | } 1453 | ], 1454 | "source": [ 1455 | "new_dict = {}\n", 1456 | "for x in range(0,6):\n", 1457 | " new_dict['num_{}'.format(x)] = x + 5\n", 1458 | "new_dict" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "markdown", 1463 | "metadata": {}, 1464 | "source": [ 1465 | "## Catching Exceptions [(to top)](#toc)" 1466 | ] 1467 | }, 1468 | { 1469 | "cell_type": "markdown", 1470 | "metadata": {}, 1471 | "source": [ 1472 | "A Python exception looks like this:" 1473 | ] 1474 | }, 1475 | { 1476 | "cell_type": "code", 1477 | "execution_count": 53, 1478 | "metadata": {}, 1479 | "outputs": [ 1480 | { 1481 | "ename": "ValueError", 1482 | "evalue": "list.remove(x): x not in list", 1483 | "output_type": "error", 1484 | "traceback": [ 1485 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 1486 | "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", 1487 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mnum_list\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m3\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mnum_list\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mremove\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 1488 | "\u001b[1;31mValueError\u001b[0m: list.remove(x): x not in list" 1489 | ] 1490 | } 1491 | ], 1492 | "source": [ 1493 | "num_list = [1, 2, 3]\n", 
1494 | "num_list.remove(4)" 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "markdown", 1499 | "metadata": {}, 1500 | "source": [ 1501 | "You can catch exceptions using `try` and `except`:" 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "code", 1506 | "execution_count": 54, 1507 | "metadata": {}, 1508 | "outputs": [ 1509 | { 1510 | "name": "stdout", 1511 | "output_type": "stream", 1512 | "text": [ 1513 | "ERROR!\n" 1514 | ] 1515 | } 1516 | ], 1517 | "source": [ 1518 | "try:\n", 1519 | " num_list.remove(4)\n", 1520 | "except:\n", 1521 | " print('ERROR!')" 1522 | ] 1523 | }, 1524 | { 1525 | "cell_type": "markdown", 1526 | "metadata": {}, 1527 | "source": [ 1528 | "It is usually best practice to specify the error you expect. Running \"blind\" try/except blocks runs the risk of missing an error that you didn't expect, which would lead to unexpected behavior." 1529 | ] 1530 | }, 1531 | { 1532 | "cell_type": "code", 1533 | "execution_count": 55, 1534 | "metadata": {}, 1535 | "outputs": [ 1536 | { 1537 | "name": "stdout", 1538 | "output_type": "stream", 1539 | "text": [ 1540 | "Error: list.remove(x): x not in list\n", 1541 | "Done\n" 1542 | ] 1543 | } 1544 | ], 1545 | "source": [ 1546 | "try:\n", 1547 | " num_list.remove(4)\n", 1548 | "except ValueError as e:\n", 1549 | " print('Error: ', e)\n", 1550 | "except Exception as e:\n", 1551 | " print('Other error: ', e)\n", 1552 | "finally:\n", 1553 | " print('Done')" 1554 | ] 1555 | }, 1556 | { 1557 | "cell_type": "markdown", 1558 | "metadata": {}, 1559 | "source": [ 1560 | "## Importing Libraries [(to top)](#toc)" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "markdown", 1565 | "metadata": {}, 1566 | "source": [ 1567 | "*Note:* It is considered best practice to include all import statements at the top of your code." 
1568 | ] 1569 | }, 1570 | { 1571 | "cell_type": "code", 1572 | "execution_count": 56, 1573 | "metadata": {}, 1574 | "outputs": [ 1575 | { 1576 | "data": { 1577 | "text/plain": [ 1578 | "0.8414709848078965" 1579 | ] 1580 | }, 1581 | "execution_count": 56, 1582 | "metadata": {}, 1583 | "output_type": "execute_result" 1584 | } 1585 | ], 1586 | "source": [ 1587 | "import math\n", 1588 | "math.sin(1)" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "code", 1593 | "execution_count": 57, 1594 | "metadata": {}, 1595 | "outputs": [ 1596 | { 1597 | "data": { 1598 | "text/plain": [ 1599 | "0.8414709848078965" 1600 | ] 1601 | }, 1602 | "execution_count": 57, 1603 | "metadata": {}, 1604 | "output_type": "execute_result" 1605 | } 1606 | ], 1607 | "source": [ 1608 | "import math as math_lib\n", 1609 | "math_lib.sin(1)" 1610 | ] 1611 | }, 1612 | { 1613 | "cell_type": "code", 1614 | "execution_count": 58, 1615 | "metadata": {}, 1616 | "outputs": [ 1617 | { 1618 | "data": { 1619 | "text/plain": [ 1620 | "0.8414709848078965" 1621 | ] 1622 | }, 1623 | "execution_count": 58, 1624 | "metadata": {}, 1625 | "output_type": "execute_result" 1626 | } 1627 | ], 1628 | "source": [ 1629 | "from math import sin\n", 1630 | "sin(1)" 1631 | ] 1632 | }, 1633 | { 1634 | "cell_type": "markdown", 1635 | "metadata": {}, 1636 | "source": [ 1637 | "## OS operations [(to top)](#toc)" 1638 | ] 1639 | }, 1640 | { 1641 | "cell_type": "code", 1642 | "execution_count": 59, 1643 | "metadata": {}, 1644 | "outputs": [], 1645 | "source": [ 1646 | "import os" 1647 | ] 1648 | }, 1649 | { 1650 | "cell_type": "markdown", 1651 | "metadata": {}, 1652 | "source": [ 1653 | "### Get current working directory" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": 60, 1659 | "metadata": {}, 1660 | "outputs": [ 1661 | { 1662 | "data": { 1663 | "text/plain": [ 1664 | "'E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch'" 1665 | ] 1666 | }, 1667 | "execution_count": 60, 1668 | 
"metadata": {}, 1669 | "output_type": "execute_result" 1670 | } 1671 | ], 1672 | "source": [ 1673 | "os.getcwd()" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "markdown", 1678 | "metadata": {}, 1679 | "source": [ 1680 | "### List files/folders in directory" 1681 | ] 1682 | }, 1683 | { 1684 | "cell_type": "code", 1685 | "execution_count": 61, 1686 | "metadata": {}, 1687 | "outputs": [ 1688 | { 1689 | "data": { 1690 | "text/plain": [ 1691 | "['.git',\n", 1692 | " '.gitignore',\n", 1693 | " '.ipynb_checkpoints',\n", 1694 | " '0_python_basics.ipynb',\n", 1695 | " '1_opening_files.ipynb']" 1696 | ] 1697 | }, 1698 | "execution_count": 61, 1699 | "metadata": {}, 1700 | "output_type": "execute_result" 1701 | } 1702 | ], 1703 | "source": [ 1704 | "os.listdir()[:5]" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "*Note:* combine with simple comprehension to filter on file type!" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "code", 1716 | "execution_count": 62, 1717 | "metadata": {}, 1718 | "outputs": [ 1719 | { 1720 | "data": { 1721 | "text/plain": [ 1722 | "['0_python_basics.ipynb',\n", 1723 | " '1_opening_files.ipynb',\n", 1724 | " '2_handling_data.ipynb',\n", 1725 | " '3_visualizing_data.ipynb',\n", 1726 | " '4_web_scraping.ipynb']" 1727 | ] 1728 | }, 1729 | "execution_count": 62, 1730 | "metadata": {}, 1731 | "output_type": "execute_result" 1732 | } 1733 | ], 1734 | "source": [ 1735 | "[file for file in os.listdir() if file[-5:] == 'ipynb'][:5]" 1736 | ] 1737 | }, 1738 | { 1739 | "cell_type": "markdown", 1740 | "metadata": {}, 1741 | "source": [ 1742 | "### Change working directory" 1743 | ] 1744 | }, 1745 | { 1746 | "cell_type": "code", 1747 | "execution_count": 63, 1748 | "metadata": {}, 1749 | "outputs": [], 1750 | "source": [ 1751 | "os.chdir(r'E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch')" 1752 | ] 1753 | }, 1754 | { 1755 | "cell_type": "markdown", 1756 | "metadata": {}, 
1757 | "source": [ 1758 | "*Note:* `r'path'` indicates a raw string \n", 1759 | "A raw string does not see `\\` as a special character" 1760 | ] 1761 | }, 1762 | { 1763 | "cell_type": "markdown", 1764 | "metadata": {}, 1765 | "source": [ 1766 | "## File Input/Output [(to top)](#toc)" 1767 | ] 1768 | }, 1769 | { 1770 | "cell_type": "markdown", 1771 | "metadata": {}, 1772 | "source": [ 1773 | "You can open a file with different file modes: \n", 1774 | "`w` -> write only \n", 1775 | "`r` -> read only \n", 1776 | "`w+` -> read and write + completely overwrite file \n", 1777 | "`a+` -> read and write + append at the bottom\n" 1778 | ] 1779 | }, 1780 | { 1781 | "cell_type": "markdown", 1782 | "metadata": {}, 1783 | "source": [ 1784 | "*Note 1:* specifying your encoding when writing/reading files is generally considered best practice to avoid unexpected behavior. \n", 1785 | "*Note 2:* there are alternative ways to read/write files, but using `with` is strongly recommended as it will automatically close the file." 1786 | ] 1787 | }, 1788 | { 1789 | "cell_type": "code", 1790 | "execution_count": 64, 1791 | "metadata": {}, 1792 | "outputs": [], 1793 | "source": [ 1794 | "with open('new_file.txt', 'w', encoding='utf-8') as file:\n", 1795 | " file.write('Content of new file. \\nHi there!')" 1796 | ] 1797 | }, 1798 | { 1799 | "cell_type": "code", 1800 | "execution_count": 65, 1801 | "metadata": {}, 1802 | "outputs": [], 1803 | "source": [ 1804 | "with open('new_file.txt', 'r', encoding='utf-8') as file:\n", 1805 | " file_content = file.read()" 1806 | ] 1807 | }, 1808 | { 1809 | "cell_type": "code", 1810 | "execution_count": 66, 1811 | "metadata": {}, 1812 | "outputs": [ 1813 | { 1814 | "data": { 1815 | "text/plain": [ 1816 | "'Content of new file. 
\\nHi there!'" 1817 | ] 1818 | }, 1819 | "execution_count": 66, 1820 | "metadata": {}, 1821 | "output_type": "execute_result" 1822 | } 1823 | ], 1824 | "source": [ 1825 | "file_content" 1826 | ] 1827 | }, 1828 | { 1829 | "cell_type": "code", 1830 | "execution_count": 67, 1831 | "metadata": {}, 1832 | "outputs": [ 1833 | { 1834 | "name": "stdout", 1835 | "output_type": "stream", 1836 | "text": [ 1837 | "Content of new file. \n", 1838 | "Hi there!\n" 1839 | ] 1840 | } 1841 | ], 1842 | "source": [ 1843 | "print(file_content)" 1844 | ] 1845 | }, 1846 | { 1847 | "cell_type": "markdown", 1848 | "metadata": {}, 1849 | "source": [ 1850 | "You can also append lines to an existing file" 1851 | ] 1852 | }, 1853 | { 1854 | "cell_type": "code", 1855 | "execution_count": 68, 1856 | "metadata": {}, 1857 | "outputs": [], 1858 | "source": [ 1859 | "with open('new_file.txt', 'a+', encoding='utf-8') as file:\n", 1860 | " file.write('\\n' + 'New line')" 1861 | ] 1862 | }, 1863 | { 1864 | "cell_type": "code", 1865 | "execution_count": 69, 1866 | "metadata": {}, 1867 | "outputs": [ 1868 | { 1869 | "name": "stdout", 1870 | "output_type": "stream", 1871 | "text": [ 1872 | "Content of new file. 
\n", 1873 | "Hi there!\n", 1874 | "New line\n" 1875 | ] 1876 | } 1877 | ], 1878 | "source": [ 1879 | "with open('new_file.txt', 'r', encoding='utf-8') as file:\n", 1880 | " print(file.read())" 1881 | ] 1882 | } 1883 | ], 1884 | "metadata": { 1885 | "kernelspec": { 1886 | "display_name": "Python 3", 1887 | "language": "python", 1888 | "name": "python3" 1889 | }, 1890 | "language_info": { 1891 | "codemirror_mode": { 1892 | "name": "ipython", 1893 | "version": 3 1894 | }, 1895 | "file_extension": ".py", 1896 | "mimetype": "text/x-python", 1897 | "name": "python", 1898 | "nbconvert_exporter": "python", 1899 | "pygments_lexer": "ipython3", 1900 | "version": "3.7.6" 1901 | } 1902 | }, 1903 | "nbformat": 4, 1904 | "nbformat_minor": 4 1905 | } 1906 | -------------------------------------------------------------------------------- /1_opening_files.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Opening files with python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com)) \n", 15 | "**Last updated:** June 2020 \n", 16 | "**Conda Environment:** `LearnPythonForResearch` \n", 17 | "**Python version:** Python 3.7 \n", 18 | "**License:** MIT License " 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "**Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link: \n", 26 | "https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/1_opening_files.ipynb" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# *Introduction*" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "With Python you can open and save a wide 
variety of files. \n", 41 | "There are often multiple ways to open a particular file format; the examples below are, in my experience, the most convenient.\n", 42 | "\n", 43 | "**Note:** All the sample files are in the `example_data` folder. The code to generate these files is at the end of this notebook." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "# *Table of Contents* " 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "* [Indexing a folder](#indexing) \n", 58 | "* [Text files](#text-files)\n", 59 | "* [Excel files](#excel-files) \n", 60 | "* [CSV files](#csv-files) \n", 61 | "* [Stata files](#stata-files) \n", 62 | "* [SAS files](#sas-files) \n", 63 | "* [JSON files](#json-files) \n", 64 | "* [HDF files: Pandas](#hdf-pandas-files) \n", 65 | "* [HDF files: General Python objects](#hdf-general-files)\n", 66 | "* [Code to generate example files](#example-files) " 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Imports" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 1, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "import os\n", 83 | "import pandas as pd\n", 84 | "from glob import glob\n", 85 | "import json" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## Indexing a folder [(to top)](#toc)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "Irrespective of the type of file you are trying to open, it is useful to be able to index all files in a folder. \n", 100 | "This is necessary if you, for example, want to loop over all files in a folder." 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "There are multiple ways to go about this, but I will show `os.listdir`, `glob`, and `os.walk`."
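, "\n", "\n", "*Note:* the standard-library `pathlib` module offers a more object-oriented alternative to the approaches shown below; a minimal sketch:\n", "\n", "```python\n", "from pathlib import Path\n", "\n", "# List only the files (not the folders) in the example_data folder\n", "files = [p for p in Path('example_data').iterdir() if p.is_file()]\n", "```"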
108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "### First define the path to the folder that we want to index" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "Our example files are in the `example_data` folder, so we can set the directory as such:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 2, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "data_path = os.path.join(os.getcwd(), 'example_data')" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "### Get all the files in the root of the folder" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "*Note:* this will ignore files in sub-folders!" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 3, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/plain": [ 155 | "['311-service-requests.csv', 'auto_df.csv']" 156 | ] 157 | }, 158 | "execution_count": 3, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "filenames = os.listdir(data_path)\n", 165 | "filenames[:2] # show first two items" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 4, 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "data": { 175 | "text/plain": [ 176 | "['E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\example_data\\\\311-service-requests.csv',\n", 177 | " 'E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\example_data\\\\auto_df.csv']" 178 | ] 179 | }, 180 | "execution_count": 4, 181 | "metadata": {}, 182 | "output_type": "execute_result" 183 | } 184 | ], 185 | "source": [ 186 | "filepaths = [os.path.join(data_path, filename) for filename in filenames]\n", 187 | "filepaths[:2] # show first two items"
188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "We can alternatively use `glob`, as this directly allows us to include pathname matching. \n", 195 | "For example, if we only want Excel `.xlsx` files:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 5, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "['E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\example_data\\\\excel_sample.xlsx']" 207 | ] 208 | }, 209 | "execution_count": 5, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "glob(os.path.join(data_path, '*.xlsx'))" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Get all files, also those in sub-folders:" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "If the folder contains multiple levels, we need to use either `os.walk()` or `glob`:" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 6, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/plain": [ 240 | "['E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\.gitignore',\n", 241 | " 'E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\0_python_basics.ipynb']" 242 | ] 243 | }, 244 | "execution_count": 6, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "folder = os.getcwd()\n", 251 | "filepaths = []\n", 252 | "for root, dirs, files in os.walk(folder):\n", 253 | "    for i in files:\n", 254 | "        filepaths.append(os.path.join(root, i))\n", 255 | "filepaths[:2]" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "Personally, I find that `glob` yields cleaner code, although it is a bit harder to understand: \n", 263 | 
"*Note* `glob` will ignore files/folders that start with a dot, which is why the `.gitignore` file is not included" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 7, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "data": { 273 | "text/plain": [ 274 | "['E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\0_python_basics.ipynb',\n", 275 | " 'E:\\\\Dropbox\\\\Work\\\\Programming\\\\active\\\\LearnPythonforResearch\\\\1_opening_files.ipynb']" 276 | ] 277 | }, 278 | "execution_count": 7, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "filepaths_glob = glob(os.path.join(folder, '**/*'), recursive=True)\n", 285 | "filepaths_glob[:2]" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "## Text files [(to top)](#toc)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "Opening text files is done using the default Python library." 
300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "You can open a file with different file modes: \n", 307 | "`w` -> write only \n", 308 | "`r` -> read only \n", 309 | "`w+` -> read and write + completely overwrite file \n", 310 | "`a+` -> read and write + append at the bottom " 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### Opening a file" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 8, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "with open(os.path.join(data_path, 'text_sample.txt'), 'r', encoding='utf-8') as file:\n", 327 | "    file_content = file.read()" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 9, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | "Learning Python is great. \n", 340 | "Good luck!\n" 341 | ] 342 | } 343 | ], 344 | "source": [ 345 | "print(file_content)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "### Writing to a file" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 10, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "with open(os.path.join(data_path, 'text_sample.txt'), 'w+', encoding='utf-8') as file:\n", 362 | "    file.write('Learning Python is great. \\nGood luck!')" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "### Additional information" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "Note that I am using a `with` statement when opening files. 
\n", 377 | "Another method is to use `open` and `close`:" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "f = open(os.path.join(data_path, 'text_sample.txt'), 'r')\n", 387 | "file_content = f.read()\n", 388 | "f.close()" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "The `with` method is preferred as it automatically closes the file. \n", 396 | "This prevents the file from being 'in use' if you forget to use `.close()`" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "### Looping over indexed files" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 12, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "text_files = glob(os.path.join(data_path, '*.txt'))\n", 413 | "text_list = []\n", 414 | "\n", 415 | "for i in text_files:\n", 416 | " with open(i, 'r') as f:\n", 417 | " text_list.append(f.read())" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 13, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/plain": [ 428 | "['Learning Python is great. \\nGood luck!']" 429 | ] 430 | }, 431 | "execution_count": 13, 432 | "metadata": {}, 433 | "output_type": "execute_result" 434 | } 435 | ], 436 | "source": [ 437 | "text_list" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "## Excel files [(to top)](#toc)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "You can open `Excel`, `csv`, `Stata`, `SAS` files in many ways, but I strongly recommend to use `Pandas`." 
452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "### Open Excel file" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 14, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "excel_file = pd.read_excel(os.path.join(data_path, 'excel_sample.xlsx'))" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "This function has a lot of options, see: \n", 475 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html \n", 476 | "\n", 477 | "*Note:* You often want to specify the encoding to prevent errors, for example: `, encoding='utf-8'`" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "### Save Excel file" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 15, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "excel_file.to_excel(os.path.join(data_path, 'excel_sample.xlsx'))" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "This saves a `Pandas` dataframe object; see the data handling notebook. \n", 501 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html \n", 502 | "\n", 503 | "*Note:* You can save as `.xls` but also as `.xlsx`" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "## CSV files [(to top)](#toc)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "You can open `Excel`, `csv`, `Stata`, `SAS` files in many ways, but I strongly recommend using `Pandas`."
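, "\n", "\n", "*Note:* for CSV files that are too large to fit into memory, `pd.read_csv` can read the file in chunks; a minimal sketch (the chunk size is an arbitrary example):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# Each chunk is a regular DataFrame with at most 10,000 rows\n", "for chunk in pd.read_csv('example_data/csv_sample.csv', chunksize=10000):\n", "    print(len(chunk))  # replace with your own processing\n", "```"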
518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "### Open CSV file" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 16, 530 | "metadata": {}, 531 | "outputs": [], 532 | "source": [ 533 | "csv_file = pd.read_csv(os.path.join(data_path, 'csv_sample.csv'), sep=',')" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Save CSV file" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 17, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [ 556 | "csv_file.to_csv(os.path.join(data_path, 'csv_sample.csv'), sep=',')" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "## Stata files [(to top)](#toc)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "### Open Stata file" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 18, 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "stata_file = pd.read_stata(os.path.join(data_path, 'stata_sample.dta'))" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_stata.html " 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "### Save Stata file" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 19, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | 
"stata_file.to_stata(os.path.join(data_path, 'stata_sample.dta'), write_index=False)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_stata.html" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "## SAS files [(to top)](#toc)" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "Pandas can only read SAS files but cannot write them: \n", 631 | "\n", 632 | "```\n", 633 | "sas_file = pd.read_sas(r'C:\\file.sas7bdat', format='sas7bdat')\n", 634 | "```" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html \n", 642 | "This function works in most cases but files with text are likely to throw hard to fix encoding errors." 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "## JSON files [(to top)](#toc)" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "### JSON files using pandas" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html \n", 664 | "\n", 665 | "*Note:* The path can also be a url" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "#### Read JSON file to dataframe" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 20, 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "json_df = pd.read_json(os.path.join(data_path, 'json_sample.json'))" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": {}, 687 | "source": [ 688 | "#### Save dataframe to JSON file" 689 | ] 690 | }, 691 | 
{ 692 | "cell_type": "code", 693 | "execution_count": 21, 694 | "metadata": {}, 695 | "outputs": [], 696 | "source": [ 697 | "json_df.to_json(os.path.join(data_path, 'json_sample.json'))" 698 | ] 699 | }, 700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "### JSON files using the `JSON` module" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "**Read JSON:**" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 22, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "with open(os.path.join(data_path, 'json_sample.json'), 'r', encoding='utf-8') as f:\n", 721 | " json_data = json.load(f)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "**Write JSON:**" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": 23, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "with open(os.path.join(data_path, 'json_sample.json'), 'w', encoding='utf-8') as f:\n", 738 | " json.dump(json_data, f)" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "metadata": {}, 744 | "source": [ 745 | "## HDF files for Pandas [(to top)](#toc)" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "metadata": {}, 751 | "source": [ 752 | "The traditional formats such as .csv are not very efficient as big-data file formats. \n", 753 | "\n", 754 | "For larger datasets I recommend to use the `Hierarchical Data Format` or `HDF` in short.\n", 755 | "This `.hdf` file format is designed to store and organize large amounts of data. 
\n", 756 | "\n", 757 | "Writing and reading `.hdf` files is extremely fast compared to `.csv`:\n", 758 | "\n", 759 | "**Writing:**\n", 760 | "\n", 761 | "```\n", 762 | "%timeit test_hdf_fixed_write(df)\n", 763 | "1 loops, best of 3: 237 ms per loop\n", 764 | "\n", 765 | "%timeit test_hdf_table_write(df)\n", 766 | "1 loops, best of 3: 901 ms per loop\n", 767 | "\n", 768 | "%timeit test_csv_write(df)\n", 769 | "1 loops, best of 3: 3.44 s per loop\n", 770 | "```\n", 771 | "\n", 772 | "**Reading:**\n", 773 | "\n", 774 | "```\n", 775 | "%timeit test_hdf_fixed_read()\n", 776 | "10 loops, best of 3: 19.1 ms per loop\n", 777 | "\n", 778 | "%timeit test_hdf_table_read()\n", 779 | "10 loops, best of 3: 39 ms per loop\n", 780 | "\n", 781 | "%timeit test_csv_read()\n", 782 | "1 loops, best of 3: 620 ms per loop\n", 783 | "```\n", 784 | "\n", 785 | "The downside of `HDF` is that the file sizes tend to be larger. \n", 786 | "They can also not be easily inspected as you will need to load them into Python (or some other tool that can read HDF files)." 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": {}, 792 | "source": [ 793 | "### Read HDF files using Pandas" 794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": 24, 799 | "metadata": {}, 800 | "outputs": [], 801 | "source": [ 802 | "hdf_df = pd.read_hdf(os.path.join(data_path, 'hdf_sample.h5'))" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html " 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "metadata": {}, 815 | "source": [ 816 | "### Write HDF files using Pandas" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "*Note*: Pandas requires to set a key. You can give it any `key` you like. 
I recommend using the filename without `.h5` as the `key`" 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": 25, 829 | "metadata": {}, 830 | "outputs": [], 831 | "source": [ 832 | "hdf_df.to_hdf(os.path.join(data_path, 'hdf_sample.h5'), 'hdf_sample')" 833 | ] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": {}, 838 | "source": [ 839 | "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html" 840 | ] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "## HDF files for general Python objects [(to top)](#toc)" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "metadata": {}, 852 | "source": [ 853 | "The above works great for Pandas dataframes, but sometimes you want to save general Python objects to HDF files as well. \n", 854 | "\n", 855 | "Traditionally, one would use `pickle` for this; however, in my experience, HDF files work a lot better. \n", 856 | "\n", 857 | "There are a couple of options, but I recommend the `hickle` library (https://github.com/telegraphic/hickle)." 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 26, 863 | "metadata": {}, 864 | "outputs": [], 865 | "source": [ 866 | "import hickle as hkl" 867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": 27, 872 | "metadata": {}, 873 | "outputs": [], 874 | "source": [ 875 | "test_data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": {}, 881 | "source": [ 882 | "Save data using `hickle`:" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 28, 888 | "metadata": {}, 889 | "outputs": [], 890 | "source": [ 891 | "hkl.dump(test_data, os.path.join(data_path, 'hkl_example.h5'), mode='w')" 892 | ] 893 | }, 894 | { 895 | "cell_type": "markdown", 896 | "metadata": {}, 897 | "source": [ 898 | "Load data using `hickle`:" 899 | ] 900 | }, 901 | { 
902 | "cell_type": "code", 903 | "execution_count": 29, 904 | "metadata": {}, 905 | "outputs": [ 906 | { 907 | "data": { 908 | "text/plain": [ 909 | "{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}" 910 | ] 911 | }, 912 | "execution_count": 29, 913 | "metadata": {}, 914 | "output_type": "execute_result" 915 | } 916 | ], 917 | "source": [ 918 | "test_data = hkl.load(os.path.join(data_path, 'hkl_example.h5'))\n", 919 | "test_data" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "metadata": {}, 925 | "source": [ 926 | "# Code used to generate example files [(to top)](#toc)" 927 | ] 928 | }, 929 | { 930 | "cell_type": "markdown", 931 | "metadata": {}, 932 | "source": [ 933 | "Dictionary with random data:" 934 | ] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "execution_count": 30, 939 | "metadata": {}, 940 | "outputs": [], 941 | "source": [ 942 | "raw_data = {'foreign':{1:'Domestic',2:'Domestic',3:'Domestic',6:'Domestic',7:'Domestic',8:'Domestic',9:'Domestic',14:'Domestic',21:'Domestic',23:'Domestic',24:'Domestic',30:'Domestic',31:'Domestic',33:'Domestic',37:'Domestic',38:'Domestic',43:'Domestic',48:'Domestic',50:'Domestic',51:'Domestic',53:'Foreign',56:'Foreign',57:'Foreign',66:'Foreign',70:'Foreign'},\n", 943 | " 'make':{1:'AMCPacer',2:'AMCSpirit',3:'BuickCentury',6:'BuickOpel',7:'BuickRegal',8:'BuickRiviera',9:'BuickSkylark',14:'Chev.Impala',21:'DodgeMagnum',23:'FordFiesta',24:'FordMustang',30:'Merc.Marquis',31:'Merc.Monarch',33:'Merc.Zephyr',37:'OldsDelta88',38:'OldsOmega',43:'Plym.Horizon',48:'Pont.GrandPrix',50:'Pont.Phoenix',51:'Pont.Sunbird',53:'AudiFox',56:'Datsun210',57:'Datsun510',66:'ToyotaCelica',70:'VWDiesel'},\n", 944 | " 'price': {1:4749,2:3799,3:4816,6:4453,7:5189,8:10372,9:4082,14:5705,21:5886,23:4389,24:4187,30:6165,31:4516,33:3291,37:4890,38:4181,43:4482,48:5222,50:4424,51:4172,53:6295,56:4589,57:5079,66:5899,70:5397},\n", 945 | " 
'weight':{1:3350,2:2640,3:3250,6:2230,7:3280,8:3880,9:3400,14:3690,21:3600,23:1800,24:2650,30:3720,31:3370,33:2830,37:3690,38:3370,43:2200,48:3210,50:3420,51:2690,53:2070,56:2020,57:2280,66:2410,70:2040}}" 946 | ] 947 | }, 948 | { 949 | "cell_type": "markdown", 950 | "metadata": {}, 951 | "source": [ 952 | "Convert dictionary to Pandas dataframe for easy saving" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 31, 958 | "metadata": {}, 959 | "outputs": [ 960 | { 961 | "data": { 962 | "text/html": [ 963 | "
<div>\n", 964 | "<table border=\"1\" class=\"dataframe\">\n", 965 | "  <thead>\n", 966 | "    <tr><th></th><th>foreign</th><th>make</th><th>price</th><th>weight</th></tr>\n", 967 | "  </thead>\n", 968 | "  <tbody>\n", 969 | "    <tr><th>1</th><td>Domestic</td><td>AMCPacer</td><td>4749</td><td>3350</td></tr>\n", 970 | "    <tr><th>2</th><td>Domestic</td><td>AMCSpirit</td><td>3799</td><td>2640</td></tr>\n", 971 | "    <tr><th>3</th><td>Domestic</td><td>BuickCentury</td><td>4816</td><td>3250</td></tr>\n", 972 | "    <tr><th>6</th><td>Domestic</td><td>BuickOpel</td><td>4453</td><td>2230</td></tr>\n", 973 | "    <tr><th>7</th><td>Domestic</td><td>BuickRegal</td><td>5189</td><td>3280</td></tr>\n", 974 | "  </tbody>\n", 975 | "</table>\n", 976 | "</div>
" 1026 | ], 1027 | "text/plain": [ 1028 | " foreign make price weight\n", 1029 | "1 Domestic AMCPacer 4749 3350\n", 1030 | "2 Domestic AMCSpirit 3799 2640\n", 1031 | "3 Domestic BuickCentury 4816 3250\n", 1032 | "6 Domestic BuickOpel 4453 2230\n", 1033 | "7 Domestic BuickRegal 5189 3280" 1034 | ] 1035 | }, 1036 | "execution_count": 31, 1037 | "metadata": {}, 1038 | "output_type": "execute_result" 1039 | } 1040 | ], 1041 | "source": [ 1042 | "df_data = pd.DataFrame(raw_data)\n", 1043 | "df_data.head()" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "Save the different files" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 32, 1056 | "metadata": {}, 1057 | "outputs": [], 1058 | "source": [ 1059 | "data_path = os.path.join(os.getcwd(), 'example_data')" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": 33, 1065 | "metadata": {}, 1066 | "outputs": [], 1067 | "source": [ 1068 | "with open(os.path.join(data_path, 'text_sample.txt'), 'w+') as file:\n", 1069 | " file.write('Learning Python is great. 
\\nGood luck!')" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": 34, 1075 | "metadata": {}, 1076 | "outputs": [], 1077 | "source": [ 1078 | "df_data.to_excel(os.path.join(data_path, 'excel_sample.xlsx'))\n", 1079 | "df_data.to_csv(os.path.join(data_path, 'csv_sample.csv'))\n", 1080 | "df_data.to_stata(os.path.join(data_path, 'stata_sample.dta'))\n", 1081 | "df_data.to_hdf(os.path.join(data_path, 'hdf_sample.h5'), 'hdf_sample')" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": 35, 1087 | "metadata": {}, 1088 | "outputs": [], 1089 | "source": [ 1090 | "df_data.to_json(os.path.join(data_path, 'json_sample.json'))" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "metadata": {}, 1096 | "source": [ 1097 | "**Note:** pandas does not have a `.to_sas()` function" 1098 | ] 1099 | } 1100 | ], 1101 | "metadata": { 1102 | "kernelspec": { 1103 | "display_name": "Python 3", 1104 | "language": "python", 1105 | "name": "python3" 1106 | }, 1107 | "language_info": { 1108 | "codemirror_mode": { 1109 | "name": "ipython", 1110 | "version": 3 1111 | }, 1112 | "file_extension": ".py", 1113 | "mimetype": "text/x-python", 1114 | "name": "python", 1115 | "nbconvert_exporter": "python", 1116 | "pygments_lexer": "ipython3", 1117 | "version": "3.7.6" 1118 | } 1119 | }, 1120 | "nbformat": 4, 1121 | "nbformat_minor": 4 1122 | } 1123 | -------------------------------------------------------------------------------- /4_web_scraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Web scraping with python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com)) \n", 15 | "**Last updated:** June 2020 \n", 16 | "**Conda Environment:** `LearnPythonForResearch` \n", 17 | 
"**Python version:** Python 3.7 \n", 18 | "**License:** MIT License " 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "**Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link: \n", 26 | "https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# *Introduction*" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Depending on the website, it can be very easy or very hard to extract the information you need. \n", 41 | "\n", 42 | "Websites can be classified into roughly two categories:\n", 43 | "1. Computer-oriented webpage: API (Application Programming Interface)\n", 44 | "2. Human-oriented webpage: regular website\n", 45 | "\n", 46 | "Option 1 (an API) is designed to be approached programmatically, so extracting the data you need is usually easy. However, in many cases there is no API available, so you might have to resort to scraping the regular website (option 2). \n", 47 | "\n", 48 | "It is worth noting that option 2 can put a strain on the server of the website. Therefore, only resort to option 2 if there is no API available, and if you do decide to scrape the regular website, make sure to do so as politely as possible!\n", 49 | "\n", 50 | "**This notebook is structured as follows:**\n", 51 | "\n", 52 | "1. Using the `requests` package to interact with a website or API\n", 53 | "2. Extract data using an API\n", 54 | "3. Extract data from a regular website using regular expressions\n", 55 | "4. Extract data from a regular website by parsing the HTML\n", 56 | "5. Extract data from JavaScript-heavy websites using Selenium\n", 57 | "6. 
Advanced web scraping using Scrapy" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "**Note 1:** In this notebook I will often build upon chapter 11 of 'automate the boring stuff', which is available here: \n", 65 | "https://automatetheboringstuff.com/chapter11/" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "**Note 2:** In this notebook I focus primarily on extracting information from webpages (i.e. `web scraping`) and very little on programming a bot to automatically traverse the web (i.e. `web crawling`)." 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "**Note 3:** I recommend reading this blog post on the legality of web scraping/crawling: \n", 80 | "https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/\n", 81 | "\n", 82 | "**2019 update:** I also recommend reading about the \"HIQ vs. Linkedin Case\": \n", 83 | "e.g. https://www.natlawreview.com/article/data-scraping-survives-least-now-key-takeaways-9th-circuit-ruling-hiq-vs-linkedin" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "# *Table of Contents* " 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "* [Requests package](#requests) \n", 98 | "* [Extract data using an API](#api)\n", 99 | "* [Extract data from a regular website using regular expressions](#ws-re) \n", 100 | "* [Extract data from a regular website by parsing the HTML](#ws-lxml)\n", 101 | "* [Extract data from JavaScript-heavy websites (Headless browsers / Selenium)](#selenium) " 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Requests package [(to top)](#toc)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "We will use the `requests` module. 
I like the description mentioned in the book 'automate the boring stuff':\n", 116 | "> The requests module lets you easily download files from the Web without having to worry about complicated issues such as network errors, connection problems, and data compression. " 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 1, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "import requests" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "*Note:* If you google around on web scraping with Python, you will probably also find mentions of the `urllib2` package. I highly recommend using `requests`, as it will make your life a lot easier for most tasks. " 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "### Basics of the `requests` package" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "The `requests` package takes a URL and allows you to interact with the contents. For example:" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 2, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "res = requests.get('https://automatetheboringstuff.com/files/rj.txt')" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare\n", 168 | "\n", 169 | "This eBook is for the use of anyone anywhere at no cost and with\n", 170 | "almost no restrictions whatsoever. 
You may copy it, give it away or\n", 171 | "re-use it under the terms of the Projec\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "print(res.text[4:250])" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "The `requests` package is incredibly useful because it deals with a lot of connection-related issues automatically. We can, for example, check whether the webpage returned any errors relatively easily:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "200" 195 | ] 196 | }, 197 | "execution_count": 6, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "res.status_code " 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 7, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "404" 215 | ] 216 | }, 217 | "execution_count": 7, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "requests.get('https://automatetheboringstuff.com/thisdoesnotexist.txt').status_code" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "You can find a list of the most common HTTP status codes here: \n", 231 | "https://www.smartlabsoftware.com/ref/http-status-codes.htm" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "## Extract data using an API [(to top)](#toc)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "APIs are designed to be approached and 'read' by computers, whereas regular webpages are designed for humans, not computers. \n", 246 | "\n", 247 | "An API, in a simplified sense, has two characteristics:\n", 248 | 1. 
A request is made using a URL that contains parameters specifying the information requested\n", 249 | "2. A response is returned by the server in a machine-readable format. \n", 250 | "\n", 251 | "The machine-readable formats are usually either:\n", 252 | "- JSON\n", 253 | "- XML\n", 254 | "- (sometimes plain text)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Demonstration using an example" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "Let's say, for the sake of an example, that we are interested in retrieving current and historical Bitcoin prices. \n", 269 | "\n", 270 | "After a quick Google search, we find that this information is available on https://www.coindesk.com/price/.\n", 271 | "\n", 272 | "We could go ahead and scrape this webpage directly, but as responsible web scrapers we look around and notice that CoinDesk fortunately offers an API that we can use to retrieve the information that we need. The details of the API are here:\n", 273 | "\n", 274 | "https://www.coindesk.com/api/\n", 275 | "\n", 276 | "There appear to be two API calls that we are interested in:\n", 277 | "\n", 278 | "1) We can retrieve the current Bitcoin price using: https://api.coindesk.com/v1/bpi/currentprice.json \n", 279 | "2) We can retrieve historical Bitcoin prices using: https://api.coindesk.com/v1/bpi/historical/close.json\n", 280 | "\n", 281 | "Clicking on either of these links will show the response of the server. 
If you click the first link it will look something like this:\n", 282 | "\n", 283 | "![](https://i.imgur.com/CpzgsTo.png)\n", 284 | "\n", 285 | "Not very readable for humans, but easily processed by a machine!\n", 286 | "\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "### Task 1: get the current Bitcoin price" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "As discussed above, we can retrieve the current Bitcoin price by \"opening\" the following URL: \n", 301 | "https://api.coindesk.com/v1/bpi/currentprice.json\n", 302 | "\n", 303 | "Using the `requests` library we can easily \"open\" this url and retrieve the response." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 8, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "res = requests.get('https://api.coindesk.com/v1/bpi/currentprice.json')" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "An important observation is that this API returns information in the so-called `JSON` format. \n", 320 | "\n", 321 | "You can learn more about the JSON format here: https://www.w3schools.com/js/js_json_syntax.asp.\n", 322 | "\n", 323 | "We could, as before, return this results as plain text:" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 9, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/plain": [ 334 | "'{\"time\":{\"updated\":\"Jun 3, 2020 02:14:00 UTC\",\"updatedISO\":\"2020-06-03T02:14:00+00:00\",\"updateduk\":\"Jun 3, 2020 at 03:14 BST\"},\"disclaimer\":\"This data was produced from the CoinDesk Bitcoin Price Index (USD). 
Non-USD currency data converted using hourly conversion rate from openexchangerates.org\",\"chartName\":\"Bitcoin\",\"bpi\":{\"USD\":{\"code\":\"USD\",\"symbol\":\"$\",\"rate\":\"9,494.8652\",\"description\":\"United States Dollar\",\"rate_float\":9494.8652},\"GBP\":{\"code\":\"GBP\",\"symbol\":\"£\",\"rate\":\"7,558.3400\",\"description\":\"British Pound Sterling\",\"rate_float\":7558.34},\"EUR\":{\"code\":\"EUR\",\"symbol\":\"€\",\"rate\":\"8,500.6484\",\"description\":\"Euro\",\"rate_float\":8500.6484}}}'" 335 | ] 336 | }, 337 | "execution_count": 9, 338 | "metadata": {}, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "text_res = res.text\n", 344 | "text_res" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "This is not desirable: we can see the prices we want, but we have no way to easily and reliably extract them from this string.\n", 352 | "\n", 353 | "We can, however, achieve this by telling `requests` that the response is in the JSON format:" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 10, 359 | "metadata": {}, 360 | "outputs": [ 361 | { 362 | "data": { 363 | "text/plain": [ 364 | "{'time': {'updated': 'Jun 3, 2020 02:14:00 UTC',\n", 365 | " 'updatedISO': '2020-06-03T02:14:00+00:00',\n", 366 | " 'updateduk': 'Jun 3, 2020 at 03:14 BST'},\n", 367 | " 'disclaimer': 'This data was produced from the CoinDesk Bitcoin Price Index (USD). 
Non-USD currency data converted using hourly conversion rate from openexchangerates.org',\n", 368 | " 'chartName': 'Bitcoin',\n", 369 | " 'bpi': {'USD': {'code': 'USD',\n", 370 | " 'symbol': '$',\n", 371 | " 'rate': '9,494.8652',\n", 372 | " 'description': 'United States Dollar',\n", 373 | " 'rate_float': 9494.8652},\n", 374 | " 'GBP': {'code': 'GBP',\n", 375 | " 'symbol': '£',\n", 376 | " 'rate': '7,558.3400',\n", 377 | " 'description': 'British Pound Sterling',\n", 378 | " 'rate_float': 7558.34},\n", 379 | " 'EUR': {'code': 'EUR',\n", 380 | " 'symbol': '€',\n", 381 | " 'rate': '8,500.6484',\n", 382 | " 'description': 'Euro',\n", 383 | " 'rate_float': 8500.6484}}}" 384 | ] 385 | }, 386 | "execution_count": 10, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "json_res = res.json()\n", 393 | "json_res" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "All that is left now is to extract the Bitcoin prices. This is now easy because `res.json()` returns a Python dictionary." 
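Since `res.json()` hands us an ordinary nested dictionary, we can also reshape it with plain Python before digging in. A minimal sketch (the `json_res` sample below is hardcoded from the response shown above so it runs offline; the flattening step is my own illustration, not part of the CoinDesk API):

```python
# Hardcoded sample mirroring the structure of res.json() shown above
json_res = {
    "bpi": {
        "USD": {"code": "USD", "rate_float": 9494.8652},
        "GBP": {"code": "GBP", "rate_float": 7558.34},
        "EUR": {"code": "EUR", "rate_float": 8500.6484},
    }
}

# Flatten the nested structure into a simple {currency: price} mapping
prices = {cur: info["rate_float"] for cur, info in json_res["bpi"].items()}
print(prices["EUR"])  # 8500.6484
```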
401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 11, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "{'code': 'EUR',\n", 412 | " 'symbol': '€',\n", 413 | " 'rate': '8,500.6484',\n", 414 | " 'description': 'Euro',\n", 415 | " 'rate_float': 8500.6484}" 416 | ] 417 | }, 418 | "execution_count": 11, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "json_res['bpi']['EUR']" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 12, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/plain": [ 435 | "'8,500.6484'" 436 | ] 437 | }, 438 | "execution_count": 12, 439 | "metadata": {}, 440 | "output_type": "execute_result" 441 | } 442 | ], 443 | "source": [ 444 | "json_res['bpi']['EUR']['rate']" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "### Task 2: write a function to retrieve historical Bitcoin prices" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "We can retrieve historical Bitcoin prices through the following API URL: \n", 459 | "https://api.coindesk.com/v1/bpi/historical/close.json\n", 460 | "\n", 461 | "Looking at https://www.coindesk.com/api/ tells us that we can pass the following parameters to this URL: \n", 462 | "* `index` -> to specify the index\n", 463 | "* `currency` -> to specify the currency \n", 464 | "* `start` -> to specify the start date of the interval\n", 465 | "* `end` -> to specify the end date of the interval \n", 466 | "\n", 467 | "We are primarily interested in the `start` and `end` parameters.\n", 468 | "\n", 469 | "As illustrated in the example, if we want to get the prices between 2013-09-01 and 2013-09-05, we would construct our URL as follows:\n", 470 | "\n", 471 | "https://api.coindesk.com/v1/bpi/historical/close.json?start=2013-09-01&end=2013-09-05\n", 472 | 
473 | "**But how do we do this using Python?**\n", 474 | "\n", 475 | "Fortunately, the `requests` library makes it very easy to pass parameters to a URL, as illustrated below. \n", 476 | "For more info, see: http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 13, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "API_endpoint = 'https://api.coindesk.com/v1/bpi/historical/close.json'\n", 486 | "payload = {'start' : '2013-09-01', 'end' : '2013-09-05'}" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": 14, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "res = requests.get(API_endpoint, params=payload)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "We can print the resulting URL (for manual inspection, for example) using `res.url`:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 15, 508 | "metadata": {}, 509 | "outputs": [ 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "https://api.coindesk.com/v1/bpi/historical/close.json?start=2013-09-01&end=2013-09-05\n" 515 | ] 516 | } 517 | ], 518 | "source": [ 519 | "print(res.url)" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "Again, the result is in the JSON format, so we can easily process it:" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 16, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "data": { 536 | "text/plain": [ 537 | "{'2013-09-01': 128.2597,\n", 538 | " '2013-09-02': 127.3648,\n", 539 | " '2013-09-03': 127.5915,\n", 540 | " '2013-09-04': 120.5738,\n", 541 | " '2013-09-05': 120.5333}" 542 | ] 543 | }, 544 | "execution_count": 16, 545 | "metadata": {}, 546 | "output_type": "execute_result" 547 | } 548 | ], 549 | "source": [ 550 | 
"bitcoin_2013 = res.json()\n", 551 | "bitcoin_2013['bpi']" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "### Wrap the above into a function" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "In the example above we hardcode the parameter values (the interval dates); if we want to change the dates, we have to manually alter the string values. This is not very convenient, so it is easier to wrap everything into a function:" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 17, 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "API_endpoint = 'https://api.coindesk.com/v1/bpi/historical/close.json'\n", 575 | "\n", 576 | "def get_bitcoin_prices(start_date, end_date, API_endpoint = API_endpoint):\n", 577 | " payload = {'start' : start_date, 'end' : end_date}\n", 578 | " res = requests.get(API_endpoint, params=payload)\n", 579 | " json_res = res.json()\n", 580 | " return json_res['bpi']" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 18, 586 | "metadata": {}, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/plain": [ 591 | "{'2016-01-01': 434.463,\n", 592 | " '2016-01-02': 433.586,\n", 593 | " '2016-01-03': 430.361,\n", 594 | " '2016-01-04': 433.493,\n", 595 | " '2016-01-05': 432.253,\n", 596 | " '2016-01-06': 429.464,\n", 597 | " '2016-01-07': 458.28,\n", 598 | " '2016-01-08': 453.37,\n", 599 | " '2016-01-09': 449.143,\n", 600 | " '2016-01-10': 448.964}" 601 | ] 602 | }, 603 | "execution_count": 18, 604 | "metadata": {}, 605 | "output_type": "execute_result" 606 | } 607 | ], 608 | "source": [ 609 | "get_bitcoin_prices('2016-01-01', '2016-01-10')" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "## Extract data from a regular website (i.e. 
webscraping) [(to top)](#toc)" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "To extract information from a regular webpage, you first have to: \n", 624 | "1. Construct or retrieve the URL\n", 625 | "2. Retrieve the page returned by that URL and load it into memory (usually HTML)\n", 626 | "\n", 627 | "**From here you have a choice:**\n", 628 | " \n", 629 | "* Treat the HTML source as text and use regular expressions to extract the information.\n", 630 | "\n", 631 | " *Or* \n", 632 | " \n", 633 | "* Process the HTML using its native structure to extract information (using `LXML` or `Requests-HTML`)\n", 634 | "\n", 635 | "I will discuss both methods below. However, **I strongly recommend going with the second option**. HTML is machine readable by nature, which means that in 95% of cases you are better off parsing the HTML rather than trying to write complicated regular expressions. " 636 | ] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": {}, 641 | "source": [ 642 | "## Extract data from a regular website using regular expressions [(to top)](#toc)" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "### Regular expressions" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 661 | "execution_count": 22, 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "import re" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "## Demonstration" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "metadata": {}, 678 | "source": [ 679 | "*Reminder:* You usually only want to use regular expressions for something quick-and-dirty; using LXML is nearly always a better 
solution!" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "Let's say our goal is to get the number of abstract views for a particular paper on SSRN: \n", 687 | "For example, this one: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1968579" 688 | ] 689 | }, 690 | { 691 | "cell_type": "markdown", 692 | "metadata": {}, 693 | "source": [ 694 | "### Step 1: download the source of the page" 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": 31, 700 | "metadata": {}, 701 | "outputs": [], 702 | "source": [ 703 | "ssrn_url = r'https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1968579'\n", 704 | "page_source = requests.get(ssrn_url, headers={'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'})" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "*Note:* Some websites will block any visits from a client without a user agent; this is why we add one above." 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "### Step 2: convert source to a string (i.e. text)" 719 | ] 720 | }, 721 | { 722 | "cell_type": "markdown", 723 | "metadata": {}, 724 | "source": [ 725 | "*Note:* by doing so we essentially ignore the inherent structure of an HTML file; we just treat it as a very large string."
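Before we touch the live SSRN page, here is a tiny self-contained illustration of what "one very large string" implies (the HTML fragment below is made up for demonstration; the real page is of course much longer):

```python
import re

# A made-up HTML fragment, treated purely as one big string
source_text = '<div>\n\t\tAbstract Views\n\t\t12,345\n</div>'

# The whitespace (\n, \t) is part of the string, so the pattern
# has to account for it (here via \s+)
found = re.findall(r'Abstract Views\s+([\d,]+)', source_text)
print(found)  # ['12,345']
```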
726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 34, 731 | "metadata": {}, 732 | "outputs": [], 733 | "source": [ 734 | "source_text = page_source.text" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "### Step 3: use a regular expression to extract the number of views" 742 | ] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "Using the Chrome browser we can, for example, right-click on the number and select 'inspect' to bring up this screen:" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "![](https://i.imgur.com/NcClhwO.png)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "markdown", 760 | "metadata": {}, 761 | "source": [ 762 | "Based on this we can construct a regular expression to capture the value that we want. \n", 763 | "Note: we have to account for any spaces, tabs, and newlines; otherwise the regular expression will not capture what we want. This can be very tricky. \n", 764 | "\n", 765 | "Once we have identified the appropriate regular expression (it can help to use tools like www.pythex.org), we can use `re.findall()`:" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 35, 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "data": { 775 | "text/plain": [ 776 | "[' 434,321']" 777 | ] 778 | }, 779 | "execution_count": 35, 780 | "metadata": {}, 781 | "output_type": "execute_result" 782 | } 783 | ], 784 | "source": [ 785 | "found_values = re.findall('Abstract Views\r\n\t\t\t\t
(.*?)
', source_text)\n", 786 | "found_values" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": {}, 792 | "source": [ 793 | "After cleaning the value up a bit (removing the spaces and the comma), we can convert it to an integer so that Python handles it as a number:" 794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": 36, 799 | "metadata": {}, 800 | "outputs": [ 801 | { 802 | "data": { 803 | "text/plain": [ 804 | "434321" 805 | ] 806 | }, 807 | "execution_count": 36, 808 | "metadata": {}, 809 | "output_type": "execute_result" 810 | } 811 | ], 812 | "source": [ 813 | "int(found_values[0].strip().replace(',', ''))" 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": {}, 819 | "source": [ 820 | "**As you can see, regular expressions are rarely convenient for web scraping and, if possible, should be avoided!**" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "## Extract data from a regular website by parsing the HTML [(to top)](#toc)" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": {}, 833 | "source": [ 834 | "**Note:** I will show both the higher-level `Requests-HTML` and the lower-level `LXML`" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "In the example above we treat an HTML page as plain text and ignore the inherent format of HTML. \n", 842 | "A better alternative is to utilize the inherent structure of HTML to extract the information that we need. " 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "A quick refresher on HTML from 'automate the boring stuff':\n", 850 | "\n", 851 | "> In case it’s been a while since you’ve looked at any HTML, here’s a quick overview of the basics. An HTML file is a plaintext file with the .html file extension. 
The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags. For example, the following HTML will display Hello world! in the browser, with Hello in bold:\n", 852 | "\n", 853 | "    <strong>Hello</strong> world!" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": {}, 859 | "source": [ 860 | "You can view the HTML source by right-clicking a page and selecting `view page source`:\n", 861 | "![](https://automatetheboringstuff.com/images/000009.jpg)" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "## Demonstration" 869 | ] 870 | }, 871 | { 872 | "cell_type": "markdown", 873 | "metadata": {}, 874 | "source": [ 875 | "**Requests-HTML** \n", 876 | "\n", 877 | "`Requests-HTML` is a convenient library that extends the functionality of `requests` by allowing HTML parsing. \n", 878 | "\n", 879 | "You can find the documentation here: https://github.com/kennethreitz/requests-html\n", 880 | "\n", 881 | "\n", 882 | "**LXML**\n", 883 | "\n", 884 | "`LXML` is a powerful XML parser that is used by many packages. However, you can also use it directly in combination with the `requests` package. 
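To make this concrete before loading a real page, here is a minimal sketch of the `LXML` approach (the HTML snippet and element names below are made up for illustration):

```python
import lxml.html

# Illustrative snippet, not taken from a real website.
html_snippet = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"

# Parse the string into a searchable element tree.
tree = lxml.html.fromstring(html_snippet)

# Basic ElementTree-style traversal (CSS selectors, covered below,
# make this kind of searching even more convenient):
print(tree.findtext('.//h1'))            # Title
print([p.text for p in tree.iter('p')])  # ['First', 'Second']
```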
\n", 885 | " \n", 886 | "You can find the documentation for `LXML` here: http://lxml.de/\n", 887 | "\n", 888 | "*Note:* an alternative to LXML is Beautifulsoup but nowadays (in my experience) it is better to use LXML.\n", 889 | "\n", 890 | "---\n", 891 | "\n", 892 | "\n" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 4, 898 | "metadata": {}, 899 | "outputs": [], 900 | "source": [ 901 | "import requests_html\n", 902 | "import lxml.html" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "Create a session object for `requests_html`:" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 5, 915 | "metadata": {}, 916 | "outputs": [], 917 | "source": [ 918 | "session = requests_html.HTMLSession()" 919 | ] 920 | }, 921 | { 922 | "cell_type": "markdown", 923 | "metadata": {}, 924 | "source": [ 925 | "### Example introduction" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": {}, 931 | "source": [ 932 | "Let's say we want to extract information (title, description, speakers) about talks from the jupytercon conference. \n", 933 | "\n", 934 | "We have identified that this information is available on this URL: \n", 935 | "https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/proceedings\n", 936 | "\n", 937 | "**NOTE: I would normally not recommend scraping these types of websites. However, JupyterCon is awesome so I my hope is that you encounter some interesting talks while looking through the proceedings! 
:)**" 938 | ] 939 | }, 940 | { 941 | "cell_type": "markdown", 942 | "metadata": {}, 943 | "source": [ 944 | "## Using `Requests-HTML`:" 945 | ] 946 | }, 947 | { 948 | "cell_type": "markdown", 949 | "metadata": {}, 950 | "source": [ 951 | "### Part 1 + Part 2: Load the source from the URL + parse HTML" 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "execution_count": 6, 957 | "metadata": {}, 958 | "outputs": [], 959 | "source": [ 960 | "JC_URL = 'https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/proceedings'\n", 961 | "res = session.get(JC_URL)" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": 7, 967 | "metadata": {}, 968 | "outputs": [ 969 | { 970 | "name": "stdout", 971 | "output_type": "stream", 972 | "text": [ 973 | "\n" 974 | ] 975 | } 976 | ], 977 | "source": [ 978 | "print(type(res))" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "Note: as the names implies `requests-html` combines `requests` with the HTML parser (so we don't need to use `requests` first)" 986 | ] 987 | }, 988 | { 989 | "cell_type": "markdown", 990 | "metadata": {}, 991 | "source": [ 992 | "## Using `Requests` + `LXML`:" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": {}, 998 | "source": [ 999 | "### Part 1: Load the source from the URL" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": 43, 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "JC_URL = 'https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/proceedings'\n", 1009 | "jc_source = requests.get(JC_URL)" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "markdown", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "### Part 2: Process the result into an LXML object" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": 44, 1022 | "metadata": {}, 1023 | "outputs": [], 1024 | "source": [ 1025 | "tree = 
lxml.html.fromstring(jc_source.text)" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "The function `lxml.html.fromstring(jc_source.text)` takes the raw HTML (i.e. the string representation) and converts it into an `HtmlElement` that we can structurally search:" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 45, 1038 | "metadata": {}, 1039 | "outputs": [ 1040 | { 1041 | "data": { 1042 | "text/plain": [ 1043 | "lxml.html.HtmlElement" 1044 | ] 1045 | }, 1046 | "execution_count": 45, 1047 | "metadata": {}, 1048 | "output_type": "execute_result" 1049 | } 1050 | ], 1051 | "source": [ 1052 | "type(tree)" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "markdown", 1057 | "metadata": {}, 1058 | "source": [ 1059 | "## Part 3: extract the information from the HTML structure" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "markdown", 1064 | "metadata": {}, 1065 | "source": [ 1066 | "The beauty of an `HtmlElement` is that we can use the structure of the HTML document to our advantage to extract specific parts of the website. \n", 1067 | "\n", 1068 | "There are two ways to go about this: \n", 1069 | "1. Using a `css selector`\n", 1070 | "2. Using an `XPath`\n", 1071 | "\n", 1072 | "I recommend using only `css selectors` as they increasingly tend to be the superior option in nearly all cases. " 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "metadata": {}, 1078 | "source": [ 1079 | "### *What is a `css selector`?*\n", 1080 | "\n", 1081 | "CSS is a language that is used to define the style of an HTML document. \n", 1082 | "It does this by attaching some piece of styling (e.g. \"make text bold\") to a particular HTML object. 
\n", 1083 | "This attaching is achieved by defining patterns that select the appropriate HTML elements: these patterns are called `CSS selectors`.\n", 1084 | "\n", 1085 | "To illustrate, let's say that we have this piece of HTML:\n", 1086 | "\n", 1087 | " \n", 1088 | " \n", 1089 | "\n", 1090 | "

<h1>Python is great!</h1>

\n", 1091 | "\n", 1092 | " \n", 1093 | " \n", 1094 | "\n", 1095 | "We can change the color of the title text to blue through this piece of CSS code:\n", 1096 | "\n", 1097 | " h1 {\n", 1098 | " color: Blue;\n", 1099 | " }\n", 1100 | "\n", 1101 | "The `h1` is the `css selector` and it essentially tells the browser that everything between `
<h1>` and `</h1>
` should have `color: Blue`.\n", 1102 | "\n", 1103 | "Now, the cool thing is that we can also use these `css selectors` to select the HTML elements that we want to extract! " 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "### *Syntax of a `css selector`*\n", 1111 | "\n", 1112 | "Below are the most frequent ways to select a particular HTML element:\n", 1113 | "\n", 1114 | "1. Use a dot to select HTML elements based on the **class**: `.classname`\n", 1115 | "2. Use a hash symbol (#) to select HTML elements based on the **id**: `#idname`\n", 1116 | "3. Directly put the name of an element to select HTML elements based on the **element**: `p`, `span`, `h1` \n", 1117 | "\n", 1118 | "You can also chain multiple conditions together using `>`, `+`, and `~`. \n", 1119 | "If we want to get all `
<p>
` elements with a `
<div>
` parent we can do `div > p` for example.\n", 1120 | "\n", 1121 | "For a full overview I recommend checking this page: \n", 1122 | "https://www.w3schools.com/cssref/css_selectors.asp" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "markdown", 1127 | "metadata": {}, 1128 | "source": [ 1129 | "### *A pragmatic way to generate the right `css selector`*\n", 1130 | "\n", 1131 | "If you are unfamiliar with web development then it might be hard to wrap your head around CSS selectors. \n", 1132 | "Fortunately, there are tools out there that can make it very easy to generate the css selector that you need! \n", 1133 | "\n", 1134 | "***Option 1:*** \n", 1135 | "\n", 1136 | "If you want just one element you can use the built-in Chrome DevTools (Firefox has something similar). \n", 1137 | "You achieve this by right-clicking on the element you want and then clicking `\"inspect\"`; this should bring up the Dev console. \n", 1138 | "\n", 1139 | "If you then right-click on the element you want to extract, you can have DevTools generate a `css selector`:\n", 1140 | "\n", 1141 | "\n", 1142 | "\n", 1143 | "\n", 1144 | "This will result in the following `css selector`:\n", 1145 | "\n", 1146 | "`#en_proceedings > div:nth-child(1) > div.en_session_title > a`\n", 1147 | "\n", 1148 | "***Option 2:***\n", 1149 | "\n", 1150 | "The above can be limiting if you want to select multiple elements. \n", 1151 | "Another option that makes this easier is to use an awesome Chrome extension called `SelectorGadget`. \n", 1152 | "\n", 1153 | "You can install it here: \n", 1154 | "https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb\n", 1155 | "\n", 1156 | "\n", 1157 | "There is more information available here as well: \n", 1158 | "http://selectorgadget.com/\n", 1159 | "\n", 1160 | "With this extension you can simply highlight what you do / do not want to select and it will generate the `css selector` that you need. 
For example, if we want all the titles:\n", 1161 | "\n", 1162 | "\n", 1163 | "\n", 1164 | "\n", 1165 | "This yields the following `css selector`: \n", 1166 | "\n", 1167 | "`'.en_session_title a'`\n", 1168 | "\n", 1169 | "\n", 1170 | "*Note:* The number between brackets after 'Clear' indicates the number of elements selected." 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "markdown", 1175 | "metadata": {}, 1176 | "source": [ 1177 | "## CSS Selectors with `Requests-HTML`:" 1178 | ] 1179 | }, 1180 | { 1181 | "cell_type": "markdown", 1182 | "metadata": {}, 1183 | "source": [ 1184 | "### Generate a list of all titles" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "execution_count": 46, 1190 | "metadata": {}, 1191 | "outputs": [], 1192 | "source": [ 1193 | "title_elements = res.html.find('.en_session_title a')" 1194 | ] 1195 | }, 1196 | { 1197 | "cell_type": "code", 1198 | "execution_count": 47, 1199 | "metadata": {}, 1200 | "outputs": [ 1201 | { 1202 | "data": { 1203 | "text/plain": [ 1204 | "48" 1205 | ] 1206 | }, 1207 | "execution_count": 47, 1208 | "metadata": {}, 1209 | "output_type": "execute_result" 1210 | } 1211 | ], 1212 | "source": [ 1213 | "len(title_elements)" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "markdown", 1218 | "metadata": {}, 1219 | "source": [ 1220 | "#### Get text of first element:" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 48, 1226 | "metadata": {}, 1227 | "outputs": [ 1228 | { 1229 | "data": { 1230 | "text/plain": [ 1231 | "'Containerizing notebooks for serverless execution (sponsored by AWS)'" 1232 | ] 1233 | }, 1234 | "execution_count": 48, 1235 | "metadata": {}, 1236 | "output_type": "execute_result" 1237 | } 1238 | ], 1239 | "source": [ 1240 | "title_elements[0].text" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "markdown", 1245 | "metadata": {}, 1246 | "source": [ 1247 | "*Note:* if you are only interested in the first (or only) object you can add `first=True` to 
`res.html.find()` and it will only return one result" 1248 | ] 1249 | }, 1250 | { 1251 | "cell_type": "markdown", 1252 | "metadata": {}, 1253 | "source": [ 1254 | "#### Get text of all elements:" 1255 | ] 1256 | }, 1257 | { 1258 | "cell_type": "code", 1259 | "execution_count": 49, 1260 | "metadata": {}, 1261 | "outputs": [ 1262 | { 1263 | "data": { 1264 | "text/plain": [ 1265 | "['Containerizing notebooks for serverless execution (sponsored by AWS)',\n", 1266 | " 'Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks',\n", 1267 | " 'All the cool kids are doing it; maybe we should too? Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations']" 1268 | ] 1269 | }, 1270 | "execution_count": 49, 1271 | "metadata": {}, 1272 | "output_type": "execute_result" 1273 | } 1274 | ], 1275 | "source": [ 1276 | "[element.text for element in title_elements][:3]" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "markdown", 1281 | "metadata": {}, 1282 | "source": [ 1283 | "### Extract the hyperlink that leads to the talk page" 1284 | ] 1285 | }, 1286 | { 1287 | "cell_type": "markdown", 1288 | "metadata": {}, 1289 | "source": [ 1290 | "Above we extract the text, but we can also add `.attrs` to access any attributes of the element:" 1291 | ] 1292 | }, 1293 | { 1294 | "cell_type": "code", 1295 | "execution_count": 50, 1296 | "metadata": {}, 1297 | "outputs": [ 1298 | { 1299 | "data": { 1300 | "text/plain": [ 1301 | "{'href': '/jupyter/jup-ny/public/schedule/detail/71980'}" 1302 | ] 1303 | }, 1304 | "execution_count": 50, 1305 | "metadata": {}, 1306 | "output_type": "execute_result" 1307 | } 1308 | ], 1309 | "source": [ 1310 | "title_elements[0].attrs" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": {}, 1316 | "source": [ 1317 | "As you can see, there is a `href` attribute with the url. 
\n", 1318 | "So we can create a list with both the text and the url:" 1319 | ] 1320 | }, 1321 | { 1322 | "cell_type": "code", 1323 | "execution_count": 51, 1324 | "metadata": {}, 1325 | "outputs": [], 1326 | "source": [ 1327 | "talks = []\n", 1328 | "for element in title_elements:\n", 1329 | " talks.append((element.text, \n", 1330 | " element.attrs['href']))" 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "code", 1335 | "execution_count": 52, 1336 | "metadata": {}, 1337 | "outputs": [ 1338 | { 1339 | "data": { 1340 | "text/plain": [ 1341 | "[('Containerizing notebooks for serverless execution (sponsored by AWS)',\n", 1342 | " '/jupyter/jup-ny/public/schedule/detail/71980'),\n", 1343 | " ('Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks',\n", 1344 | " '/jupyter/jup-ny/public/schedule/detail/68407'),\n", 1345 | " ('All the cool kids are doing it; maybe we should too? Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations',\n", 1346 | " '/jupyter/jup-ny/public/schedule/detail/71345')]" 1347 | ] 1348 | }, 1349 | "execution_count": 52, 1350 | "metadata": {}, 1351 | "output_type": "execute_result" 1352 | } 1353 | ], 1354 | "source": [ 1355 | "talks[:3]" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "markdown", 1360 | "metadata": {}, 1361 | "source": [ 1362 | "### Extract the title, hyperlink, description, and authors for each talk" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "We can use the above approach and do also get a list of all the authors and the descriptions. \n", 1370 | "It, however, becomes a little bit tricky to combine everything given that one talk might have multiple authors. \n", 1371 | "\n", 1372 | "To deal with this (common) problem it is best to loop over each talk element separately and only then extract the information for that talk, that way it is easy to keep everything linked to a specific talk. 
\n", 1373 | "\n", 1374 | "If we look in the Chrome DevTools element viewer, we can observe that each talk is a separate `
` with the `en_session` class:\n", 1375 | "\n", 1376 | "" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "markdown", 1381 | "metadata": {}, 1382 | "source": [ 1383 | "We first select all the `divs` with the `en_session` class that have a parent with `en_proceedings` as id:" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "code", 1388 | "execution_count": 53, 1389 | "metadata": {}, 1390 | "outputs": [ 1391 | { 1392 | "data": { 1393 | "text/plain": [ 1394 | "[,\n", 1395 | " ,\n", 1396 | " ]" 1397 | ] 1398 | }, 1399 | "execution_count": 53, 1400 | "metadata": {}, 1401 | "output_type": "execute_result" 1402 | } 1403 | ], 1404 | "source": [ 1405 | "talk_elements = res.html.find('#en_proceedings > .en_session')\n", 1406 | "talk_elements[:3]" 1407 | ] 1408 | }, 1409 | { 1410 | "cell_type": "markdown", 1411 | "metadata": {}, 1412 | "source": [ 1413 | "Now we can loop over each of these elements and extract the information we want:" 1414 | ] 1415 | }, 1416 | { 1417 | "cell_type": "code", 1418 | "execution_count": 54, 1419 | "metadata": {}, 1420 | "outputs": [], 1421 | "source": [ 1422 | "talk_details = []\n", 1423 | "for talk in talk_elements:\n", 1424 | " title = talk.find('.en_session_title a', first=True).text\n", 1425 | " href = talk.find('.en_session_title a', first=True).attrs['href']\n", 1426 | " description = talk.find('.en_session_description', first=True).text.strip()\n", 1427 | " speakers = [speaker.text for speaker in talk.find('.speaker_names > a')]\n", 1428 | " talk_details.append((title, href, description, speakers))" 1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "markdown", 1433 | "metadata": {}, 1434 | "source": [ 1435 | "For the sake of the example, below a prettified inspection of the data we gathered:" 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "code", 1440 | "execution_count": 56, 1441 | "metadata": {}, 1442 | "outputs": [ 1443 | { 1444 | "name": "stdout", 1445 | "output_type": "stream", 1446 | "text": [ 1447 | "The title is: Containerizing 
notebooks for serverless execution (sponsored by AWS)\n", 1448 | "Speakers: ['Kevin McCormick', 'Vladimir Zhukov'] \n", 1449 | "\n", 1450 | "Description: \n", 1451 | " Kevin McCormick explains the story of two approaches which were used internally at AWS to accelerate new ML algorithm development, and easily package Jupyter notebooks for scheduled execution, by creating custom Jupyter kernels that automatically create Docker containers, and dispatch them to either a distributed training service or job execution environment. \n", 1452 | "\n", 1453 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/71980\n", 1454 | "---------------------------------------------------------------------------------------------------- \n", 1455 | "\n", 1456 | "The title is: Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks\n", 1457 | "Speakers: ['Matt Brems'] \n", 1458 | "\n", 1459 | "Description: \n", 1460 | " Missing data plagues nearly every data science problem. Often, people just drop or ignore missing data. However, this usually ends up with bad results. Matt Brems explains how bad dropping or ignoring missing data can be and teaches you how to handle missing data the right way by leveraging Jupyter notebooks to properly reweight or impute your data. \n", 1461 | "\n", 1462 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/68407\n", 1463 | "---------------------------------------------------------------------------------------------------- \n", 1464 | "\n", 1465 | "The title is: All the cool kids are doing it; maybe we should too? 
Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations\n", 1466 | "Speakers: ['Will M Farr'] \n", 1467 | "\n", 1468 | "Description: \n", 1469 | " Will Farr shares examples of Jupyter use within the LIGO and Virgo Scientific Collaborations and offers lessons about the (many) advantages and (few) disadvantages of Jupyter for large, global scientific collaborations. Along the way, Will speculates on Jupyter's future role in gravitational wave astronomy. \n", 1470 | "\n", 1471 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/71345\n", 1472 | "---------------------------------------------------------------------------------------------------- \n", 1473 | "\n" 1474 | ] 1475 | } 1476 | ], 1477 | "source": [ 1478 | "for title, href, description, speakers in talk_details[:3]:\n", 1479 | " print('The title is: ', title)\n", 1480 | " print('Speakers: ', speakers, '\\n')\n", 1481 | " print('Description: \\n', description, '\\n')\n", 1482 | " print('For details see: ', 'https://conferences.oreilly.com/' + href)\n", 1483 | " print('-'*100, '\\n')" 1484 | ] 1485 | }, 1486 | { 1487 | "cell_type": "markdown", 1488 | "metadata": {}, 1489 | "source": [ 1490 | "## CSS Selectors with `LXML`:" 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "markdown", 1495 | "metadata": {}, 1496 | "source": [ 1497 | "**Note:** In order to use css selectors with LXML you might have to install `cssselect` by running this in your command prompt: \n", 1498 | "`pip install cssselect`" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "### Generate a list of all titles:" 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "markdown", 1510 | "metadata": {}, 1511 | "source": [ 1512 | "We can use the css selector that we generated earlier with the SelectorGadget extension:" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": 57, 1518 | "metadata": {}, 1519 | 
"outputs": [], 1520 | "source": [ 1521 | "title_elements = tree.cssselect('.en_session_title a')" 1522 | ] 1523 | }, 1524 | { 1525 | "cell_type": "code", 1526 | "execution_count": 58, 1527 | "metadata": {}, 1528 | "outputs": [ 1529 | { 1530 | "data": { 1531 | "text/plain": [ 1532 | "48" 1533 | ] 1534 | }, 1535 | "execution_count": 58, 1536 | "metadata": {}, 1537 | "output_type": "execute_result" 1538 | } 1539 | ], 1540 | "source": [ 1541 | "len(title_elements)" 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "markdown", 1546 | "metadata": {}, 1547 | "source": [ 1548 | "If we select the first title element we see that it doesn't return the text:" 1549 | ] 1550 | }, 1551 | { 1552 | "cell_type": "code", 1553 | "execution_count": 59, 1554 | "metadata": {}, 1555 | "outputs": [ 1556 | { 1557 | "data": { 1558 | "text/plain": [ 1559 | "" 1560 | ] 1561 | }, 1562 | "execution_count": 59, 1563 | "metadata": {}, 1564 | "output_type": "execute_result" 1565 | } 1566 | ], 1567 | "source": [ 1568 | "title_elements[0]" 1569 | ] 1570 | }, 1571 | { 1572 | "cell_type": "markdown", 1573 | "metadata": {}, 1574 | "source": [ 1575 | "In order to extract the text we have to add `.text` to the end:" 1576 | ] 1577 | }, 1578 | { 1579 | "cell_type": "code", 1580 | "execution_count": 60, 1581 | "metadata": {}, 1582 | "outputs": [ 1583 | { 1584 | "data": { 1585 | "text/plain": [ 1586 | "' Containerizing notebooks for serverless execution (sponsored by AWS)'" 1587 | ] 1588 | }, 1589 | "execution_count": 60, 1590 | "metadata": {}, 1591 | "output_type": "execute_result" 1592 | } 1593 | ], 1594 | "source": [ 1595 | "title_elements[0].text" 1596 | ] 1597 | }, 1598 | { 1599 | "cell_type": "markdown", 1600 | "metadata": {}, 1601 | "source": [ 1602 | "We can do this for all titles to get a list with all the title texts:" 1603 | ] 1604 | }, 1605 | { 1606 | "cell_type": "code", 1607 | "execution_count": 61, 1608 | "metadata": {}, 1609 | "outputs": [ 1610 | { 1611 | "data": { 1612 | "text/plain": [ 1613 | 
"[' Containerizing notebooks for serverless execution (sponsored by AWS)',\n", 1614 | " 'Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks',\n", 1615 | " 'All the cool kids are doing it; maybe we should too? Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations']" 1616 | ] 1617 | }, 1618 | "execution_count": 61, 1619 | "metadata": {}, 1620 | "output_type": "execute_result" 1621 | } 1622 | ], 1623 | "source": [ 1624 | "title_texts = [x.text for x in title_elements]\n", 1625 | "title_texts[:3]" 1626 | ] 1627 | }, 1628 | { 1629 | "cell_type": "markdown", 1630 | "metadata": {}, 1631 | "source": [ 1632 | "### Extract the hyperlink that leads to the talk page" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "markdown", 1637 | "metadata": {}, 1638 | "source": [ 1639 | "Above we extract the text, but we can also add `.attrib` to access any attributes of the element:" 1640 | ] 1641 | }, 1642 | { 1643 | "cell_type": "code", 1644 | "execution_count": 62, 1645 | "metadata": {}, 1646 | "outputs": [ 1647 | { 1648 | "data": { 1649 | "text/plain": [ 1650 | "{'href': '/jupyter/jup-ny/public/schedule/detail/71980'}" 1651 | ] 1652 | }, 1653 | "execution_count": 62, 1654 | "metadata": {}, 1655 | "output_type": "execute_result" 1656 | } 1657 | ], 1658 | "source": [ 1659 | "title_elements[0].attrib" 1660 | ] 1661 | }, 1662 | { 1663 | "cell_type": "markdown", 1664 | "metadata": {}, 1665 | "source": [ 1666 | "As you can see, there is a `href` attribute with the url. 
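Note that the `href` is relative. When building the absolute URL, plain string concatenation with the site root easily produces a doubled slash; the standard library's `urllib.parse.urljoin` is a more robust option. A small sketch using the `href` shown above:

```python
from urllib.parse import urljoin

base = 'https://conferences.oreilly.com/'
href = '/jupyter/jup-ny/public/schedule/detail/71980'

# Plain concatenation keeps both slashes:
print(base + href)          # https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/71980
# urljoin resolves the relative path against the root instead:
print(urljoin(base, href))  # https://conferences.oreilly.com/jupyter/jup-ny/public/schedule/detail/71980
```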
\n", 1667 | "So we can create a list with both the text and the url:" 1668 | ] 1669 | }, 1670 | { 1671 | "cell_type": "code", 1672 | "execution_count": 63, 1673 | "metadata": {}, 1674 | "outputs": [], 1675 | "source": [ 1676 | "talks = []\n", 1677 | "for element in title_elements:\n", 1678 | " talks.append((element.text, \n", 1679 | " element.attrib['href']))" 1680 | ] 1681 | }, 1682 | { 1683 | "cell_type": "code", 1684 | "execution_count": 64, 1685 | "metadata": {}, 1686 | "outputs": [ 1687 | { 1688 | "data": { 1689 | "text/plain": [ 1690 | "[(' Containerizing notebooks for serverless execution (sponsored by AWS)',\n", 1691 | " '/jupyter/jup-ny/public/schedule/detail/71980'),\n", 1692 | " ('Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks',\n", 1693 | " '/jupyter/jup-ny/public/schedule/detail/68407'),\n", 1694 | " ('All the cool kids are doing it; maybe we should too? Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations',\n", 1695 | " '/jupyter/jup-ny/public/schedule/detail/71345')]" 1696 | ] 1697 | }, 1698 | "execution_count": 64, 1699 | "metadata": {}, 1700 | "output_type": "execute_result" 1701 | } 1702 | ], 1703 | "source": [ 1704 | "talks[:3]" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "### Extract the title, hyperlink, description, and authors for each talk" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "markdown", 1716 | "metadata": {}, 1717 | "source": [ 1718 | "We can use the above approach and do also get a list of all the authors and the descriptions. \n", 1719 | "It, however, becomes a little bit tricky to combine everything given that one talk might have multiple authors. \n", 1720 | "\n", 1721 | "To deal with this (common) problem it is best to loop over each talk element separately and only then extract the information for that talk, that way it is easy to keep everything linked to a specific talk. 
\n", 1722 | "\n", 1723 | "If we look in the Chrome DevTools element viewer, we can observe that each talk is a separate `
` with the `en_session` class:\n", 1724 | "\n", 1725 | "" 1726 | ] 1727 | }, 1728 | { 1729 | "cell_type": "markdown", 1730 | "metadata": {}, 1731 | "source": [ 1732 | "We first select all the `divs` with the `en_session` class that have a parent with `en_proceedings` as id:" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "code", 1737 | "execution_count": 65, 1738 | "metadata": {}, 1739 | "outputs": [ 1740 | { 1741 | "data": { 1742 | "text/plain": [ 1743 | "[,\n", 1744 | " ,\n", 1745 | " ]" 1746 | ] 1747 | }, 1748 | "execution_count": 65, 1749 | "metadata": {}, 1750 | "output_type": "execute_result" 1751 | } 1752 | ], 1753 | "source": [ 1754 | "talk_elements = tree.cssselect('#en_proceedings > .en_session')\n", 1755 | "talk_elements[:3]" 1756 | ] 1757 | }, 1758 | { 1759 | "cell_type": "markdown", 1760 | "metadata": {}, 1761 | "source": [ 1762 | "Now we can loop over each of these elements and extract the information we want:" 1763 | ] 1764 | }, 1765 | { 1766 | "cell_type": "code", 1767 | "execution_count": 66, 1768 | "metadata": {}, 1769 | "outputs": [], 1770 | "source": [ 1771 | "talk_details = []\n", 1772 | "for talk in talk_elements:\n", 1773 | " title = talk.cssselect('.en_session_title a')[0].text\n", 1774 | " href = talk.cssselect('.en_session_title a')[0].attrib['href']\n", 1775 | " description = talk.cssselect('.en_session_description')[0].text.strip()\n", 1776 | " speakers = [speaker.text for speaker in talk.cssselect('.speaker_names > a')]\n", 1777 | " talk_details.append((title, href, description, speakers))" 1778 | ] 1779 | }, 1780 | { 1781 | "cell_type": "markdown", 1782 | "metadata": {}, 1783 | "source": [ 1784 | "For the sake of the example, below a prettified inspection of the data we gathered:" 1785 | ] 1786 | }, 1787 | { 1788 | "cell_type": "code", 1789 | "execution_count": 68, 1790 | "metadata": {}, 1791 | "outputs": [ 1792 | { 1793 | "name": "stdout", 1794 | "output_type": "stream", 1795 | "text": [ 1796 | "The title is: Containerizing notebooks 
for serverless execution (sponsored by AWS)\n", 1797 | "Speakers: ['Kevin McCormick', 'Vladimir Zhukov'] \n", 1798 | "\n", 1799 | "Description: \n", 1800 | " Kevin McCormick explains the story of two approaches which were used internally at AWS to accelerate new ML algorithm development, and easily package Jupyter notebooks for scheduled execution, by creating custom Jupyter kernels that automatically create Docker containers, and dispatch them to either a distributed training service or job execution environment. \n", 1801 | "\n", 1802 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/71980\n", 1803 | "---------------------------------------------------------------------------------------------------- \n", 1804 | "\n", 1805 | "The title is: Advanced data science, part 2: Five ways to handle missing data in Jupyter notebooks\n", 1806 | "Speakers: ['Matt Brems'] \n", 1807 | "\n", 1808 | "Description: \n", 1809 | " Missing data plagues nearly every data science problem. Often, people just drop or ignore missing data. However, this usually ends up with bad results. Matt Brems explains how bad dropping or ignoring missing data can be and teaches you how to handle missing data the right way by leveraging Jupyter notebooks to properly reweight or impute your data. \n", 1810 | "\n", 1811 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/68407\n", 1812 | "---------------------------------------------------------------------------------------------------- \n", 1813 | "\n", 1814 | "The title is: All the cool kids are doing it; maybe we should too? 
Jupyter, gravitational waves, and the LIGO and Virgo Scientific Collaborations\n", 1815 | "Speakers: ['Will M Farr'] \n", 1816 | "\n", 1817 | "Description: \n", 1818 | " Will Farr shares examples of Jupyter use within the LIGO and Virgo Scientific Collaborations and offers lessons about the (many) advantages and (few) disadvantages of Jupyter for large, global scientific collaborations. Along the way, Will speculates on Jupyter's future role in gravitational wave astronomy. \n", 1819 | "\n", 1820 | "For details see: https://conferences.oreilly.com//jupyter/jup-ny/public/schedule/detail/71345\n", 1821 | "---------------------------------------------------------------------------------------------------- \n", 1822 | "\n" 1823 | ] 1824 | } 1825 | ], 1826 | "source": [ 1827 | "for title, href, description, speakers in talk_details[:3]:\n", 1828 | " print('The title is: ', title)\n", 1829 | " print('Speakers: ', speakers, '\\n')\n", 1830 | " print('Description: \\n', description, '\\n')\n", 1831 | " print('For details see: ', 'https://conferences.oreilly.com/' + href)\n", 1832 | " print('-'*100, '\\n')\n", 1833 | " " 1834 | ] 1835 | }, 1836 | { 1837 | "cell_type": "markdown", 1838 | "metadata": {}, 1839 | "source": [ 1840 | "## Extract data from Javascript heavy websites (Headless browsers / Selenium) [(to top)](#toc)" 1841 | ] 1842 | }, 1843 | { 1844 | "cell_type": "markdown", 1845 | "metadata": {}, 1846 | "source": [ 1847 | "A lot of websites nowadays use Javascript elements that are difficult (or impossible) to crawl using `requests`.\n", 1848 | "\n", 1849 | "In these scenarios we can use an alternative method where we have Python interact with a browser that is capable of handling Javascript elements. \n", 1850 | "\n", 1851 | "There are essentially two ways to do this:\n", 1852 | "\n", 1853 | "1. Use a so-called `headless automated browsing` package that runs in the background (you don't see the browser).\n", 1854 | "2. 
Use the `Selenium Webdriver` to control a browser like Chrome (you do see the browser)." 1855 | ] 1856 | }, 1857 | { 1858 | "cell_type": "markdown", 1859 | "metadata": {}, 1860 | "source": [ 1861 | "## Headless automated browsing" 1862 | ] 1863 | }, 1864 | { 1865 | "cell_type": "markdown", 1866 | "metadata": {}, 1867 | "source": [ 1868 | "The goal of headless browser automation is to interact with a browser that is in the background (i.e. has no user interface). \n", 1869 | "They essentially render a website the same way a normal browser would, but they are more lightweight due to not having to spend resources on the user interface. \n", 1870 | "\n", 1871 | "There are many packages available: https://github.com/dhamaniasad/HeadlessBrowsers \n", 1872 | "\n", 1873 | "**The easiest solution is to use the `requests-html` package with `r.html.render()`, see here: [requests-html: javascript support](https://github.com/kennethreitz/requests-html#javascript-support)**\n", 1874 | "\n", 1875 | "Alternatives:\n", 1876 | "\n", 1877 | "1. Ghost.py (http://jeanphix.me/Ghost.py/)\n", 1878 | "2. Dryscrape (https://dryscrape.readthedocs.io/en/latest/)\n", 1879 | "3. Splinter (http://splinter.readthedocs.io/en/latest/index.html?highlight=headless)\n", 1880 | "\n", 1881 | "Setting up headless browsers can be tricky and they can also be hard to debug (given that they run in the background)" 1882 | ] 1883 | }, 1884 | { 1885 | "cell_type": "markdown", 1886 | "metadata": {}, 1887 | "source": [ 1888 | "#### Example using `requests-html`" 1889 | ] 1890 | }, 1891 | { 1892 | "cell_type": "markdown", 1893 | "metadata": {}, 1894 | "source": [ 1895 | "*Note:* if you get an error you might have to run `pyppeteer-install` in your terminal to install Chromium ." 
1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": 1, 1901 | "metadata": {}, 1902 | "outputs": [], 1903 | "source": [ 1904 | "import requests_html" 1905 | ] 1906 | }, 1907 | { 1908 | "cell_type": "code", 1909 | "execution_count": 6, 1910 | "metadata": {}, 1911 | "outputs": [ 1912 | { 1913 | "name": "stdout", 1914 | "output_type": "stream", 1915 | "text": [ 1916 | "Financial Accounting\n", 1917 | "Management Accounting\n", 1918 | "Computer Science\n", 1919 | "Data Engineering\n" 1920 | ] 1921 | } 1922 | ], 1923 | "source": [ 1924 | "asession = requests_html.AsyncHTMLSession()\n", 1925 | "URL = 'https://www.tiesdekok.com'\n", 1926 | "r = await asession.get(URL)\n", 1927 | "await r.html.arender()\n", 1928 | "for element in r.html.find('.ul-interests > li'):\n", 1929 | " print(element.text)" 1930 | ] 1931 | }, 1932 | { 1933 | "cell_type": "markdown", 1934 | "metadata": {}, 1935 | "source": [ 1936 | "## Selenium" 1937 | ] 1938 | }, 1939 | { 1940 | "cell_type": "markdown", 1941 | "metadata": {}, 1942 | "source": [ 1943 | "The `Selenium WebDriver` allows you to control a browser; this essentially automates / simulates a normal user interacting with the browser. \n", 1944 | "One of the most common ways to use the `Selenium WebDriver` is through the Python language bindings. 
\n", 1945 | "\n", 1946 | "Combining `Selenium` with Python makes it very easy to automate web browser interaction, allowing you to scrape essentially every webpage imaginable!\n", 1947 | "\n", 1948 | "**Note: if you can use `requests` + `LXML` then this is always preferred as it is much faster compared to using Selenium.**\n", 1949 | "\n", 1950 | "The package page for the Selenium Python bindings is here: https://pypi.python.org/pypi/selenium\n", 1951 | "\n", 1952 | "Running the command below will install the `selenium` package with its Python bindings:\n", 1953 | "> pip install selenium\n", 1954 | "\n", 1955 | "You will also need to install a driver to interface with a browser of your preference; I personally use the `ChromeDriver` to interact with the Chrome browser: \n", 1956 | "https://sites.google.com/a/chromium.org/chromedriver/downloads" 1957 | ] 1958 | }, 1959 | { 1960 | "cell_type": "markdown", 1961 | "metadata": {}, 1962 | "source": [ 1963 | "## Quick demonstration" 1964 | ] 1965 | }, 1966 | { 1967 | "cell_type": "markdown", 1968 | "metadata": {}, 1969 | "source": [ 1970 | "### Set up selenium" 1971 | ] 1972 | }, 1973 | { 1974 | "cell_type": "code", 1975 | "execution_count": 8, 1976 | "metadata": {}, 1977 | "outputs": [], 1978 | "source": [ 1979 | "import selenium, os\n", 1980 | "from selenium import webdriver" 1981 | ] 1982 | }, 1983 | { 1984 | "cell_type": "markdown", 1985 | "metadata": {}, 1986 | "source": [ 1987 | "Often `selenium` cannot automatically find the `ChromeDriver`, so it helps to find the location where it is installed and point `selenium` to it. 
\n", 1988 | "In my case it is here:" 1989 | ] 1990 | }, 1991 | { 1992 | "cell_type": "code", 1993 | "execution_count": 13, 1994 | "metadata": {}, 1995 | "outputs": [], 1996 | "source": [ 1997 | "CHROME = r\"C:\\chromedriver83.exe\"\n", 1998 | "os.environ[\"webdriver.chrome.driver\"] = CHROME" 1999 | ] 2000 | }, 2001 | { 2002 | "cell_type": "markdown", 2003 | "metadata": {}, 2004 | "source": [ 2005 | "### Start a selenium session" 2006 | ] 2007 | }, 2008 | { 2009 | "cell_type": "code", 2010 | "execution_count": 14, 2011 | "metadata": {}, 2012 | "outputs": [], 2013 | "source": [ 2014 | "driver = webdriver.Chrome(CHROME)" 2015 | ] 2016 | }, 2017 | { 2018 | "cell_type": "markdown", 2019 | "metadata": {}, 2020 | "source": [ 2021 | "After executing `driver = webdriver.Chrome(CHROME)` you should see a Chrome window pop up; this is the window that you can control with Python!" 2022 | ] 2023 | }, 2024 | { 2025 | "cell_type": "markdown", 2026 | "metadata": {}, 2027 | "source": [ 2028 | "### Load a page" 2029 | ] 2030 | }, 2031 | { 2032 | "cell_type": "markdown", 2033 | "metadata": {}, 2034 | "source": [ 2035 | "Let's say we want to extract something from the Yahoo Finance page for Tesla (TSLA): \n", 2036 | "https://finance.yahoo.com/quote/TSLA/" 2037 | ] 2038 | }, 2039 | { 2040 | "cell_type": "code", 2041 | "execution_count": 15, 2042 | "metadata": {}, 2043 | "outputs": [], 2044 | "source": [ 2045 | "Tesla_URL = r'https://finance.yahoo.com/quote/TSLA/'" 2046 | ] 2047 | }, 2048 | { 2049 | "cell_type": "code", 2050 | "execution_count": 16, 2051 | "metadata": {}, 2052 | "outputs": [], 2053 | "source": [ 2054 | "driver.get(Tesla_URL)" 2055 | ] 2056 | }, 2057 | { 2058 | "cell_type": "markdown", 2059 | "metadata": {}, 2060 | "source": [ 2061 | "If you open the Chrome window you should see that it now loaded the URL we gave it." 
2062 | ] 2063 | }, 2064 | { 2065 | "cell_type": "markdown", 2066 | "metadata": {}, 2067 | "source": [ 2068 | "### Navigate" 2069 | ] 2070 | }, 2071 | { 2072 | "cell_type": "markdown", 2073 | "metadata": {}, 2074 | "source": [ 2075 | "You can select an element in multiple ways (most frequent ones):\n", 2076 | "\n", 2077 | "> driver.find_element_by_name() \n", 2078 | "> driver.find_element_by_id() \n", 2079 | "> driver.find_element_by_class_name() \n", 2080 | "> driver.find_element_by_css_selector() \n", 2081 | "> driver.find_element_by_tag_name() \n" 2082 | ] 2083 | }, 2084 | { 2085 | "cell_type": "markdown", 2086 | "metadata": {}, 2087 | "source": [ 2088 | "Let's say we want to extract some values from the \"earnings\" interactive figure on the right side:\n", 2089 | "\n", 2090 | "" 2091 | ] 2092 | }, 2093 | { 2094 | "cell_type": "markdown", 2095 | "metadata": {}, 2096 | "source": [ 2097 | "This would be near-impossible using `requests` as it would simply not load the element; it only loads in an actual browser. \n", 2098 | "\n", 2099 | "We could extract this data in two ways:\n", 2100 | "\n", 2101 | "1. Programming Selenium to mouse-over the element we want, and using CSS selectors to extract the values from the mouse-over window.\n", 2102 | "2. 
Use the console to interact with the underlying Javascript data directly.\n", 2103 | "\n", 2104 | "The second method is far more convenient than the first so I will demonstrate that:" 2105 | ] 2106 | }, 2107 | { 2108 | "cell_type": "markdown", 2109 | "metadata": {}, 2110 | "source": [ 2111 | "### Retrieve data from Javascript directly\n", 2112 | "We can use a neat trick to find out which Javascript variable holds a certain value that we are looking for: \n", 2113 | "https://stackoverflow.com/questions/26796873/find-which-variable-holds-a-value-using-chrome-devtools\n", 2114 | "\n", 2115 | "After pasting the provided function into the dev console we can run `globalSearch(App, '-1.82')` in the Chrome Dev Console to get:\n", 2116 | "\n", 2117 | "> App.main.context.dispatcher.stores.QuoteSummaryStore.earnings.earningsChart.quarterly[3].estimate.fmt\n", 2118 | "\n", 2119 | "This is all the information that we need to extract all the data points:" 2120 | ] 2121 | }, 2122 | { 2123 | "cell_type": "code", 2124 | "execution_count": 17, 2125 | "metadata": {}, 2126 | "outputs": [], 2127 | "source": [ 2128 | "script = 'App.main.context.dispatcher.stores.QuoteSummaryStore.earnings.earningsChart.quarterly'" 2129 | ] 2130 | }, 2131 | { 2132 | "cell_type": "code", 2133 | "execution_count": 18, 2134 | "metadata": {}, 2135 | "outputs": [], 2136 | "source": [ 2137 | "quarterly_values = driver.execute_script('return {}'.format(script))" 2138 | ] 2139 | }, 2140 | { 2141 | "cell_type": "markdown", 2142 | "metadata": {}, 2143 | "source": [ 2144 | "*Note:* I add `return` in the beginning to get a JSON response. 
" 2145 | ] 2146 | }, 2147 | { 2148 | "cell_type": "code", 2149 | "execution_count": 19, 2150 | "metadata": {}, 2151 | "outputs": [ 2152 | { 2153 | "data": { 2154 | "text/plain": [ 2155 | "[{'actual': {'fmt': '-1.12', 'raw': -1.12},\n", 2156 | " 'date': '2Q2019',\n", 2157 | " 'estimate': {'fmt': '-0.36', 'raw': -0.36}},\n", 2158 | " {'actual': {'fmt': '1.86', 'raw': 1.86},\n", 2159 | " 'date': '3Q2019',\n", 2160 | " 'estimate': {'fmt': '-0.42', 'raw': -0.42}},\n", 2161 | " {'actual': {'fmt': '2.06', 'raw': 2.06},\n", 2162 | " 'date': '4Q2019',\n", 2163 | " 'estimate': {'fmt': '1.72', 'raw': 1.72}},\n", 2164 | " {'actual': {'fmt': '1.14', 'raw': 1.14},\n", 2165 | " 'date': '1Q2020',\n", 2166 | " 'estimate': {'fmt': '-0.25', 'raw': -0.25}}]" 2167 | ] 2168 | }, 2169 | "execution_count": 19, 2170 | "metadata": {}, 2171 | "output_type": "execute_result" 2172 | } 2173 | ], 2174 | "source": [ 2175 | "quarterly_values" 2176 | ] 2177 | }, 2178 | { 2179 | "cell_type": "markdown", 2180 | "metadata": {}, 2181 | "source": [ 2182 | "Using `driver.execute_script()` is essentially the programmatic way of executing it in the dev console: \n", 2183 | "\n", 2184 | "\n", 2185 | "" 2186 | ] 2187 | }, 2188 | { 2189 | "cell_type": "markdown", 2190 | "metadata": {}, 2191 | "source": [ 2192 | "If you are not familiar with Javascript and programming for the web then this might be very hard to wrap your head around, but if you are serious about web-scraping these kinds of tricks can save you days of work. 
" 2193 | ] 2194 | }, 2195 | { 2196 | "cell_type": "markdown", 2197 | "metadata": {}, 2198 | "source": [ 2199 | "### Close driver" 2200 | ] 2201 | }, 2202 | { 2203 | "cell_type": "code", 2204 | "execution_count": 20, 2205 | "metadata": {}, 2206 | "outputs": [], 2207 | "source": [ 2208 | "driver.close()" 2209 | ] 2210 | }, 2211 | { 2212 | "cell_type": "markdown", 2213 | "metadata": {}, 2214 | "source": [ 2215 | "## Web crawling with Scrapy" 2216 | ] 2217 | }, 2218 | { 2219 | "cell_type": "markdown", 2220 | "metadata": {}, 2221 | "source": [ 2222 | "In the examples above we always provide the URL directly. \n", 2223 | "We could program a loop (with any of the above methods) that takes a URL from the page and then goes to that page and extracts another URL, etc. \n", 2224 | "\n", 2225 | "This tends to get confusing pretty fast; if you really want to create a crawler you might be better off looking into the `scrapy` package. \n", 2226 | "\n", 2227 | "`Scrapy` allows you to create a `spider` that basically 'walks' through webpages and crawls the information. \n", 2228 | "\n", 2229 | "In my experience you don't need this for 95% of our use-cases, but feel free to try it out: http://scrapy.org/" 2230 | ] 2231 | } 2232 | ], 2233 | "metadata": { 2234 | "kernelspec": { 2235 | "display_name": "Python 3", 2236 | "language": "python", 2237 | "name": "python3" 2238 | }, 2239 | "language_info": { 2240 | "codemirror_mode": { 2241 | "name": "ipython", 2242 | "version": 3 2243 | }, 2244 | "file_extension": ".py", 2245 | "mimetype": "text/x-python", 2246 | "name": "python", 2247 | "nbconvert_exporter": "python", 2248 | "pygments_lexer": "ipython3", 2249 | "version": "3.7.6" 2250 | } 2251 | }, 2252 | "nbformat": 4, 2253 | "nbformat_minor": 4 2254 | } 2255 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 

2 | Get started with Python for Research 3 |

4 |

5 | 6 | 7 | 8 | 9 | 10 |

11 | 12 |

13 | Want to learn how to use Python for (Social Science) Research?
14 | This repository has everything that you need to get started!

15 | Author: Ties de Kok (Personal Page) 16 |

17 | 18 | ## Table of contents 19 | 20 | * [Introduction](#introduction) 21 | * [Who is this repository for?](#audience) 22 | * [How to use this repository?](#howtouse) 23 | * [Getting your Python setup ready](#setup) 24 | * [Installing Anaconda](#anacondainstall) 25 | * [Setting up Conda Environment](#setupenv) 26 | * [Using Python](#usingpython) 27 | * [Jupyter Notebook/Lab](#jupyter) 28 | * [Installing packages](#packages) 29 | * [Tutorial Notebooks](#notebooks) 30 | * [Exercises](#exercises) 31 | * [Code along](#codealong) 32 | * [Binder](#binder) 33 | * [Local installation](#clonerepo) 34 | * [Questions?](#questions) 35 | * [License](#license) 36 | * [Special thanks](#specialthanks) 37 | 38 |

Introduction

39 | 40 | The goal of this GitHub page is to provide you with everything you need to get started with Python for actual research projects. 41 | 42 |

Who is this repository for?

43 | 44 | The topics and techniques demonstrated in this repository are primarily oriented towards empirical research projects in fields such as Accounting, Finance, Marketing, Political Science, and other Social Sciences. 45 | 46 | However, many of the basics are also perfectly applicable if you are looking to use Python for any other type of Data Science! 47 | 48 |

How to use this repository?

49 | 50 | This repository is written to facilitate learning by doing. 51 | 52 | **If you are starting from scratch I recommend the following:** 53 | 54 | 1. Familiarize yourself with the [`Getting your Python setup ready`](#setup) and [`Using Python`](#usingpython) sections below 55 | 2. Check the [`Code along!`](#codealong) section to make sure that you can interactively use the Jupyter Notebooks 56 | 3. Work through the [`0_python_basics.ipynb`](0_python_basics.ipynb) notebook and try to get a basic grasp on the Python syntax 57 | 4. Do the "Basic Python tasks" part of the [`exercises.ipynb`](exercises.ipynb) notebook 58 | 5. Work through the [`1_opening_files.ipynb`](1_opening_files.ipynb), [`2_handling_data.ipynb`](2_handling_data.ipynb), and [`3_visualizing_data.ipynb`](3_visualizing_data.ipynb) notebooks. 59 | **Note:** the [`2_handling_data.ipynb`](2_handling_data.ipynb) notebook is very comprehensive; feel free to skip the more advanced parts at first. 60 | 6. Do the "Data handling tasks (+ some plotting)" part of the [`exercises.ipynb`](exercises.ipynb) notebook 61 | 62 | If you are interested in web-scraping: 63 | 64 | 7. Work through the [`4_web_scraping.ipynb`](4_web_scraping.ipynb) notebook 65 | 8. Do the "Web scraping" part of the [`exercises.ipynb`](exercises.ipynb) notebook 66 | 67 | If you are interested in Natural Language Processing with Python: 68 | 69 | 9. Take a look at my [Python NLP tutorial repository + notebook](https://github.com/TiesdeKok/Python_NLP_Tutorial) 70 | 71 | **If you are already familiar with the Python basics:** 72 | 73 | Use the notebooks provided in this repository selectively depending on the types of problems that you are trying to solve with Python. 74 | 75 | Everything in the notebooks is purposely sectioned by the task description. So if you, for example, are looking to merge two Pandas dataframes together, you can use the `Combining dataframes` section of the [`2_handling_data.ipynb`](2_handling_data.ipynb) notebook as a starting point. 
76 | 77 | 78 |

Getting your Python setup ready

79 | 80 | There are multiple ways to get your Python environment set up. To keep things simple I will only provide you with what I believe to be the best and easiest way to get started: the Anaconda distribution + a conda environment. 81 | 82 |

Anaconda Distribution

83 | 84 | The Anaconda Distribution bundles Python with a large collection of Python packages from the (data) science Python eco-system. 85 | 86 | By installing the Anaconda Distribution you essentially obtain everything you need to get started with Python for Research! 87 | 88 |

Step 1: Install Anaconda

89 | 90 | 1. Go to [anaconda.com/download/](https://www.anaconda.com/download/) 91 | 2. Download the **Python 3.x version** installer 92 | 3. Install Anaconda. 93 | * It is worth taking note of the installation directory in case you ever need to find it again. 94 | 4. Check if the installation works by launching a command prompt (terminal) and typing `python`; it should say Anaconda at the top. 95 | * On Windows I recommend using the `Anaconda Prompt` 96 | 97 | *Note:* Anaconda also comes with `Anaconda Navigator`; I haven't personally used it yet but it might be convenient. 98 | 99 | 

Step 2: Set up the learnpythonforresearch environment

100 | 101 | 1. Make sure you've cloned/downloaded this repository: [Clone repository](#clonerepo) 102 | 2. `cd` (i.e. change) to the folder where you extracted the ZIP file 103 | for example: `cd "C:\Files\Work\Project_1"` 104 | *Note:* if you are changing to a folder on another drive you might have to also switch drives by typing, for example, `E:` 105 | 3. Run the following command: `conda env create -f environment.yml` 106 | 4. Activate the environment with: `conda activate LearnPythonforResearch` 107 | 108 | A full list of all the packages used is provided in the `environment.yml` file. 109 | 

Python 3 vs Python 2?

110 | 111 | 112 | Python 3.x is the newer and superior version compared to Python 2.7, so I strongly recommend using Python 3.x whenever possible. There is no reason to use Python 2.7, unless you are forced to work with old Python 2.7 code. 113 | 114 | 
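As a small illustration (my own example, not from the original text), Python 3.6+ supports f-strings, a convenient string formatting syntax that does not exist in Python 2.7:

```python
# f-strings (Python 3.6+) interpolate variables directly inside string literals
package = "pandas"
version = "1.0"
print(f"{package} version {version} is installed")  # pandas version 1.0 is installed
```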

Using Python

115 | 116 | **Basic methods:** 117 | 118 | The native way to run Python code is by saving the code to a file with the ".py" extension and executing it from the console / terminal: 119 | 120 | ```python code.py``` 121 | 122 | Alternatively, you can run some quick code by starting a python or ipython interactive console by typing either `python` or `ipython` in your console / terminal. 123 | 124 |
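For example, a minimal script (the file name `code.py` is just a placeholder) could look like this:

```python
# Contents of code.py -- save the file and run it with: python code.py
message = "Hello from a Python script!"
print(message)
```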

Jupyter Notebook/Lab

125 | 126 | The above is, however, not very convenient for research purposes as we desire easy interactivity and good documentation options. 127 | Fortunately, the awesome **Jupyter Notebooks** provide a great alternative way of using Python for research purposes. 128 | 129 | [Jupyter](http://jupyter.org/) comes pre-installed with the Anaconda distribution so you should have everything already installed and ready to go. 130 | 131 | ***Note on Jupyter Lab*** 132 | 133 | > **JupyterLab 1.0: Jupyter’s Next-Generation Notebook Interface** 134 | JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular: write plugins that add new components and integrate with existing ones. 135 | 136 | Jupyter Lab is an additional interface layer that extends the functionality of Jupyter Notebooks which are the primary way you interact with Python code. 137 | 138 | ***What is the Jupyter Notebook?*** 139 | 140 | From the [Jupyter](http://jupyter.org/) website: 141 | > The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. 142 | 143 | In other words, the Jupyter Notebook allows you to program Python code straight from your browser! 144 | 145 | ***How does the Jupyter Notebook/Lab work in the background?*** 146 | 147 | The diagram below sums up the basics components of Jupyter: 148 | 149 | 150 | 151 | At the heart there is the *Jupyter Server* that handles everything, the *Jupyter Notebook* which is accessed and used through your browser, and the *kernel* that executes the code. 
We will be focusing on the natively included *Python Kernel* but Jupyter is language agnostic so you can also use it with other languages/software such as 'R'. 152 | 153 | It is worth noting that in most cases you will be running the `Jupyter Server` on your own computer and will connect to it locally in your browser (i.e. you don't need to be connected to the internet). However, it is also possible to run the Jupyter Server on a different computer, for example a high performance computation server in the cloud, and connect to it over the internet. 154 | 155 | ***How to start a Jupyter Notebook/Lab?*** 156 | 157 | The primary method that I would recommend to start a Jupyter Notebook/Lab is to use the command line (terminal) directly: 158 | 159 | 1. Open your command prompt / terminal (on Windows I recommend the Anaconda Prompt) 160 | 2. Activate the right environment with `conda activate LearnPythonForResearch` 161 | 3. `cd` (i.e. change) to the desired starting directory 162 | for example: `cd "C:\Files\Work\Project_1"` 163 | *Note:* if you are changing to a folder on another drive you might have to also switch drives by typing, for example, `E:` 164 | 4. Start the Jupyter Notebook/Lab server by typing: `jupyter notebook` or `jupyter lab` 165 | 166 | This should automatically open up the corresponding Jupyter Notebook/Lab in your default browser. 167 | You can also manually go to the Jupyter Notebook/Lab by going to `localhost:8888` with your browser. (You might be asked for a password, which you can find in the terminal window where the Jupyter server is running.) 168 | 169 | ***How to close a Jupyter Server?*** 170 | 171 | If you want to close down the Jupyter Server: open up the command prompt window that runs the server and press `CTRL + C` twice. 172 | Make sure that you have saved any open Jupyter Notebooks! 
173 | 174 | ***How to use the Jupyter Notebook?*** 175 | 176 | *Some shortcuts are worth mentioning for reference purposes:* 177 | 178 | `command mode` --> enable by pressing `esc` 179 | `edit mode` --> enable by pressing `enter` 180 | 181 | | `command mode` |`edit mode` | `both modes` 182 | |--- |--- |--- 183 | | `Y` : cell to code | `Tab` : code completion or indent | `Shift-Enter` : run cell, select below 184 | | `M` : cell to markdown | `Shift-Tab` : tooltip | `Ctrl-Enter` : run cell 185 | | `A` : insert cell above | `Ctrl-A` : select all | 186 | | `B` : insert cell below | `Ctrl-Z` : undo | 187 | | `X`: cut selected cell | 188 | 189 | 190 |

Installing Packages

191 | 192 | The Python eco-system consists of many packages and modules that people have programmed and made available for everyone to use. 193 | These packages/modules are one of the things that make Python so useful. 194 | 195 | Some packages are natively included with Python and Anaconda, but anything not included needs to be installed before you can import it. 196 | I will discuss the three primary methods of installing packages: 197 | 198 | **Method 1:** use `pip` 199 | 200 | > Many packages are available on the "Python Package Index" (i.e. "PyPI"): [https://pypi.python.org/pypi](https://pypi.python.org/pypi) 201 | > 202 | > You can install packages that are on "PyPI" by using the `pip` command: 203 | > 204 | > Example, install the `requests` package: run `pip install requests` in your command line / terminal (not in the Jupyter Notebook!). 205 | > 206 | > To uninstall you can use `pip uninstall` and to upgrade an existing package you can add the `-U` flag (`pip install -U requests`) 207 | 208 | **Method 2:** use `conda` 209 | 210 | >Sometimes when you try something with `pip` you get a compile error (especially on Windows). You can try to fix this by configuring the right compiler, but most of the time it is easier to install the package directly via Anaconda as these packages are pre-compiled. For example: 211 | > 212 | >`conda install scipy` 213 | > 214 | >Full documentation is here: [Conda documentation](https://conda.io/docs/user-guide/tasks/manage-pkgs.html) 215 | 216 | **Method 3:** install directly using the `setup.py` file 217 | 218 | >Sometimes a package is not available on PyPI or conda (you often find these packages on GitHub). Follow these steps to install those: 219 | > 220 | >1. Download the folder with all the files (if archived, make sure to unpack the folder) 221 | >2. Open your command prompt (terminal) and `cd` to the folder you just downloaded 222 | >3. Type: `python setup.py install` 223 | 224 | 
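Whichever method you used, a quick sanity check is to import the package afterwards (a sketch; `requests` is just the example package from Method 1, and not every package exposes a `__version__` attribute):

```python
import importlib

# Try to import the package by name; a successful import means the install worked.
try:
    pkg = importlib.import_module("requests")
    print("requests is installed, version:", getattr(pkg, "__version__", "unknown"))
except ImportError:
    print("requests is not installed, try: pip install requests")
```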

Tutorial Notebooks

225 | 226 | This repository covers the following topics: 227 | 228 | * [`0_python_basics.ipynb`](0_python_basics.ipynb): Basics of the Python syntax 229 | * [`1_opening_files.ipynb`](1_opening_files.ipynb): Examples on how to open TXT, CSV, Excel, Stata, SAS, JSON, and HDF files. 230 | * [`2_handling_data.ipynb`](2_handling_data.ipynb): A comprehensive overview on how to use the `Pandas` library for data wrangling. 231 | * [`3_visualizing_data.ipynb`](3_visualizing_data.ipynb): Examples on how to generate visualizations with Python. 232 | * [`4_web_scraping.ipynb`](4_web_scraping.ipynb): A comprehensive overview on how to use `Requests`, `Requests-html`, and `Selenium` for APIs and web scraping. 233 | 234 | Additionally, if you are interested in Natural Language Processing I have a notebook for that as well: 235 | * [`NLP_Notebook`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb): An introduction to Natural Language Processing with Python 236 | 237 | 

Exercises

238 | 239 | I have provided several tasks / exercises that you can try to solve in the [`exercises.ipynb`](exercises.ipynb) notebook. 240 | 241 | **Note:** To avoid the "oh, that looks easy!" trap I have not uploaded the exercises notebook with example answers. 242 | *Feel free to email me for the answer keys once you are done!* 243 | 244 | 

Code along!

245 | 246 | You can code along in two ways: 247 | 248 |

Option 1: use Binder

249 | 250 | If you want to experiment with the code in a live environment you can also use `binder`. 251 | 252 | Binder allows you to create a live environment, based on a GitHub repository, where you can execute code just as if you were on your own computer. It is very awesome! 253 | 254 | Click on the button below to launch binder: 255 | 256 | 257 | 258 | **Note: you could use binder to complete the exercises, but your work will not be saved!** 259 | 260 | 

Option 2: Set up local Python setup

261 | 262 | You can essentially "download" the contents of this repository by cloning the repository. 263 | 264 | You can do this by clicking the "Clone or download" button and then "Download ZIP": 265 | 266 | 267 | 268 | After you have downloaded and extracted the ZIP file into a folder you can follow these steps to set up your environment: 269 | 270 | 1. [Installing Anaconda](#anacondainstall) 271 | 2. [Setting up Conda Environment](#setupenv) 272 | 273 | 

Questions?

274 | 275 | If you have questions or experience problems please use the `issues` tab of this repository. 276 | 277 |

License

278 | 279 | [MIT](LICENSE) - Ties de Kok - 2020 280 | 281 |

Special Thanks

282 | 283 | https://github.com/teles/array-mixer for having an awesome readme that I used as a template. 284 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: LearnPythonforResearch 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.7 7 | - pip 8 | - jupyterlab 9 | - numpy 10 | - pandas 11 | - matplotlib 12 | - ipywidgets 13 | - requests 14 | - xlrd 15 | - openpyxl 16 | - seaborn 17 | - bokeh 18 | - hickle 19 | - lxml 20 | - cssselect 21 | - PyTables 22 | - plotnine 23 | - plotly 24 | - selenium 25 | - tqdm 26 | - pip: 27 | - qgrid 28 | - requests-html -------------------------------------------------------------------------------- /example_data/auto_df.csv: -------------------------------------------------------------------------------- 1 | ;make;price;mpg;rep78;headroom;trunk;weight;length;turn;displacement;gear_ratio;foreign 2 | 0;AMC Concord;4099;22;3.0;2.5;11;2930;186;40;121;3.57999992371;Domestic 3 | 1;AMC Pacer;4749;17;3.0;3.0;11;3350;173;40;258;2.52999997139;Domestic 4 | 2;AMC Spirit;3799;22;;3.0;12;2640;168;35;121;3.07999992371;Domestic 5 | 3;Buick Century;4816;20;3.0;4.5;16;3250;196;40;196;2.93000006676;Domestic 6 | 4;Buick Electra;7827;15;4.0;4.0;20;4080;222;43;350;2.41000008583;Domestic 7 | 5;Buick LeSabre;5788;18;3.0;4.0;21;3670;218;43;231;2.73000001907;Domestic 8 | 6;Buick Opel;4453;26;;3.0;10;2230;170;34;304;2.86999988556;Domestic 9 | 7;Buick Regal;5189;20;3.0;2.0;16;3280;200;42;196;2.93000006676;Domestic 10 | 8;Buick Riviera;10372;16;3.0;3.5;17;3880;207;43;231;2.93000006676;Domestic 11 | 9;Buick Skylark;4082;19;3.0;3.5;13;3400;200;42;231;3.07999992371;Domestic 12 | 10;Cad. Deville;11385;14;3.0;4.0;20;4330;221;44;425;2.27999997139;Domestic 13 | 11;Cad. Eldorado;14500;14;2.0;3.5;16;3900;204;43;350;2.19000005722;Domestic 14 | 12;Cad. 
Seville;15906;21;3.0;3.0;13;4290;204;45;350;2.24000000954;Domestic 15 | 13;Chev. Chevette;3299;29;3.0;2.5;9;2110;163;34;231;2.93000006676;Domestic 16 | 14;Chev. Impala;5705;16;4.0;4.0;20;3690;212;43;250;2.55999994278;Domestic 17 | 15;Chev. Malibu;4504;22;3.0;3.5;17;3180;193;31;200;2.73000001907;Domestic 18 | 16;Chev. Monte Carlo;5104;22;2.0;2.0;16;3220;200;41;200;2.73000001907;Domestic 19 | 17;Chev. Monza;3667;24;2.0;2.0;7;2750;179;40;151;2.73000001907;Domestic 20 | 18;Chev. Nova;3955;19;3.0;3.5;13;3430;197;43;250;2.55999994278;Domestic 21 | 19;Dodge Colt;3984;30;5.0;2.0;8;2120;163;35;98;3.53999996185;Domestic 22 | 20;Dodge Diplomat;4010;18;2.0;4.0;17;3600;206;46;318;2.47000002861;Domestic 23 | 21;Dodge Magnum;5886;16;2.0;4.0;17;3600;206;46;318;2.47000002861;Domestic 24 | 22;Dodge St. Regis;6342;17;2.0;4.5;21;3740;220;46;225;2.94000005722;Domestic 25 | 23;Ford Fiesta;4389;28;4.0;1.5;9;1800;147;33;98;3.15000009537;Domestic 26 | 24;Ford Mustang;4187;21;3.0;2.0;10;2650;179;43;140;3.07999992371;Domestic 27 | 25;Linc. Continental;11497;12;3.0;3.5;22;4840;233;51;400;2.47000002861;Domestic 28 | 26;Linc. Mark V;13594;12;3.0;2.5;18;4720;230;48;400;2.47000002861;Domestic 29 | 27;Linc. Versailles;13466;14;3.0;3.5;15;3830;201;41;302;2.47000002861;Domestic 30 | 28;Merc. Bobcat;3829;22;4.0;3.0;9;2580;169;39;140;2.73000001907;Domestic 31 | 29;Merc. Cougar;5379;14;4.0;3.5;16;4060;221;48;302;2.75;Domestic 32 | 30;Merc. Marquis;6165;15;3.0;3.5;23;3720;212;44;302;2.25999999046;Domestic 33 | 31;Merc. Monarch;4516;18;3.0;3.0;15;3370;198;41;250;2.43000006676;Domestic 34 | 32;Merc. XR-7;6303;14;4.0;3.0;16;4130;217;45;302;2.75;Domestic 35 | 33;Merc. 
Zephyr;3291;20;3.0;3.5;17;2830;195;43;140;3.07999992371;Domestic 36 | 34;Olds 98;8814;21;4.0;4.0;20;4060;220;43;350;2.41000008583;Domestic 37 | 35;Olds Cutl Supr;5172;19;3.0;2.0;16;3310;198;42;231;2.93000006676;Domestic 38 | 36;Olds Cutlass;4733;19;3.0;4.5;16;3300;198;42;231;2.93000006676;Domestic 39 | 37;Olds Delta 88;4890;18;4.0;4.0;20;3690;218;42;231;2.73000001907;Domestic 40 | 38;Olds Omega;4181;19;3.0;4.5;14;3370;200;43;231;3.07999992371;Domestic 41 | 39;Olds Starfire;4195;24;1.0;2.0;10;2730;180;40;151;2.73000001907;Domestic 42 | 40;Olds Toronado;10371;16;3.0;3.5;17;4030;206;43;350;2.41000008583;Domestic 43 | 41;Plym. Arrow;4647;28;3.0;2.0;11;3260;170;37;156;3.04999995232;Domestic 44 | 42;Plym. Champ;4425;34;5.0;2.5;11;1800;157;37;86;2.97000002861;Domestic 45 | 43;Plym. Horizon;4482;25;3.0;4.0;17;2200;165;36;105;3.36999988556;Domestic 46 | 44;Plym. Sapporo;6486;26;;1.5;8;2520;182;38;119;3.53999996185;Domestic 47 | 45;Plym. Volare;4060;18;2.0;5.0;16;3330;201;44;225;3.23000001907;Domestic 48 | 46;Pont. Catalina;5798;18;4.0;4.0;20;3700;214;42;231;2.73000001907;Domestic 49 | 47;Pont. Firebird;4934;18;1.0;1.5;7;3470;198;42;231;3.07999992371;Domestic 50 | 48;Pont. Grand Prix;5222;19;3.0;2.0;16;3210;201;45;231;2.93000006676;Domestic 51 | 49;Pont. Le Mans;4723;19;3.0;3.5;17;3200;199;40;231;2.93000006676;Domestic 52 | 50;Pont. Phoenix;4424;19;;3.5;13;3420;203;43;231;3.07999992371;Domestic 53 | 51;Pont. 
Sunbird;4172;24;2.0;2.0;7;2690;179;41;151;2.73000001907;Domestic 54 | 52;Audi 5000;9690;17;5.0;3.0;15;2830;189;37;131;3.20000004768;Foreign 55 | 53;Audi Fox;6295;23;3.0;2.5;11;2070;174;36;97;3.70000004768;Foreign 56 | 54;BMW 320i;9735;25;4.0;2.5;12;2650;177;34;121;3.6400001049;Foreign 57 | 55;Datsun 200;6229;23;4.0;1.5;6;2370;170;35;119;3.8900001049;Foreign 58 | 56;Datsun 210;4589;35;5.0;2.0;8;2020;165;32;85;3.70000004768;Foreign 59 | 57;Datsun 510;5079;24;4.0;2.5;8;2280;170;34;119;3.53999996185;Foreign 60 | 58;Datsun 810;8129;21;4.0;2.5;8;2750;184;38;146;3.54999995232;Foreign 61 | 59;Fiat Strada;4296;21;3.0;2.5;16;2130;161;36;105;3.36999988556;Foreign 62 | 60;Honda Accord;5799;25;5.0;3.0;10;2240;172;36;107;3.04999995232;Foreign 63 | 61;Honda Civic;4499;28;4.0;2.5;5;1760;149;34;91;3.29999995232;Foreign 64 | 62;Mazda GLC;3995;30;4.0;3.5;11;1980;154;33;86;3.73000001907;Foreign 65 | 63;Peugeot 604;12990;14;;3.5;14;3420;192;38;163;3.57999992371;Foreign 66 | 64;Renault Le Car;3895;26;3.0;3.0;10;1830;142;34;79;3.72000002861;Foreign 67 | 65;Subaru;3798;35;5.0;2.5;11;2050;164;36;97;3.80999994278;Foreign 68 | 66;Toyota Celica;5899;18;5.0;2.5;14;2410;174;36;134;3.05999994278;Foreign 69 | 67;Toyota Corolla;3748;31;5.0;3.0;9;2200;165;35;97;3.21000003815;Foreign 70 | 68;Toyota Corona;5719;18;5.0;2.0;11;2670;175;36;134;3.04999995232;Foreign 71 | 69;VW Dasher;7140;23;4.0;2.5;12;2160;172;36;97;3.74000000954;Foreign 72 | 70;VW Diesel;5397;41;5.0;3.0;15;2040;155;35;90;3.77999997139;Foreign 73 | 71;VW Rabbit;4697;25;4.0;3.0;15;1930;155;35;89;3.77999997139;Foreign 74 | 72;VW Scirocco;6850;25;4.0;2.0;16;1990;156;36;97;3.77999997139;Foreign 75 | 73;Volvo 260;11995;17;5.0;2.5;14;3170;193;37;163;2.98000001907;Foreign 76 | -------------------------------------------------------------------------------- /example_data/csv_sample.csv: -------------------------------------------------------------------------------- 1 | ,Unnamed: 0,Unnamed: 0.1,foreign,make,price,weight 2 | 
0,0,1,Domestic,AMCPacer,4749,3350 3 | 1,1,2,Domestic,AMCSpirit,3799,2640 4 | 2,2,3,Domestic,BuickCentury,4816,3250 5 | 3,3,6,Domestic,BuickOpel,4453,2230 6 | 4,4,7,Domestic,BuickRegal,5189,3280 7 | 5,5,8,Domestic,BuickRiviera,10372,3880 8 | 6,6,9,Domestic,BuickSkylark,4082,3400 9 | 7,7,14,Domestic,Chev.Impala,5705,3690 10 | 8,8,21,Domestic,DodgeMagnum,5886,3600 11 | 9,9,23,Domestic,FordFiesta,4389,1800 12 | 10,10,24,Domestic,FordMustang,4187,2650 13 | 11,11,30,Domestic,Merc.Marquis,6165,3720 14 | 12,12,31,Domestic,Merc.Monarch,4516,3370 15 | 13,13,33,Domestic,Merc.Zephyr,3291,2830 16 | 14,14,37,Domestic,OldsDelta88,4890,3690 17 | 15,15,38,Domestic,OldsOmega,4181,3370 18 | 16,16,43,Domestic,Plym.Horizon,4482,2200 19 | 17,17,48,Domestic,Pont.GrandPrix,5222,3210 20 | 18,18,50,Domestic,Pont.Phoenix,4424,3420 21 | 19,19,51,Domestic,Pont.Sunbird,4172,2690 22 | 20,20,53,Foreign,AudiFox,6295,2070 23 | 21,21,56,Foreign,Datsun210,4589,2020 24 | 22,22,57,Foreign,Datsun510,5079,2280 25 | 23,23,66,Foreign,ToyotaCelica,5899,2410 26 | 24,24,70,Foreign,VWDiesel,5397,2040 27 | -------------------------------------------------------------------------------- /example_data/dd_example.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/example_data/dd_example.h5 -------------------------------------------------------------------------------- /example_data/excel_sample.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/example_data/excel_sample.xlsx -------------------------------------------------------------------------------- /example_data/hdf_sample.h5: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/example_data/hdf_sample.h5 -------------------------------------------------------------------------------- /example_data/hkl_example.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/example_data/hkl_example.h5 -------------------------------------------------------------------------------- /example_data/json_sample.json: -------------------------------------------------------------------------------- 1 | {"foreign": {"1": "Domestic", "14": "Domestic", "2": "Domestic", "21": "Domestic", "23": "Domestic", "24": "Domestic", "3": "Domestic", "30": "Domestic", "31": "Domestic", "33": "Domestic", "37": "Domestic", "38": "Domestic", "43": "Domestic", "48": "Domestic", "50": "Domestic", "51": "Domestic", "53": "Foreign", "56": "Foreign", "57": "Foreign", "6": "Domestic", "66": "Foreign", "7": "Domestic", "70": "Foreign", "8": "Domestic", "9": "Domestic"}, "make": {"1": "AMCPacer", "14": "Chev.Impala", "2": "AMCSpirit", "21": "DodgeMagnum", "23": "FordFiesta", "24": "FordMustang", "3": "BuickCentury", "30": "Merc.Marquis", "31": "Merc.Monarch", "33": "Merc.Zephyr", "37": "OldsDelta88", "38": "OldsOmega", "43": "Plym.Horizon", "48": "Pont.GrandPrix", "50": "Pont.Phoenix", "51": "Pont.Sunbird", "53": "AudiFox", "56": "Datsun210", "57": "Datsun510", "6": "BuickOpel", "66": "ToyotaCelica", "7": "BuickRegal", "70": "VWDiesel", "8": "BuickRiviera", "9": "BuickSkylark"}, "price": {"1": 4749, "14": 5705, "2": 3799, "21": 5886, "23": 4389, "24": 4187, "3": 4816, "30": 6165, "31": 4516, "33": 3291, "37": 4890, "38": 4181, "43": 4482, "48": 5222, "50": 4424, "51": 4172, "53": 6295, "56": 4589, "57": 5079, "6": 4453, "66": 5899, "7": 5189, "70": 5397, "8": 10372, "9": 4082}, "weight": {"1": 3350, "14": 3690, "2": 2640, "21": 3600, "23": 
1800, "24": 2650, "3": 3250, "30": 3720, "31": 3370, "33": 2830, "37": 3690, "38": 3370, "43": 2200, "48": 3210, "50": 3420, "51": 2690, "53": 2070, "56": 2020, "57": 2280, "6": 2230, "66": 2410, "7": 3280, "70": 2040, "8": 3880, "9": 3400}} -------------------------------------------------------------------------------- /example_data/stata_sample.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/example_data/stata_sample.dta -------------------------------------------------------------------------------- /example_data/text_sample.txt: -------------------------------------------------------------------------------- 1 | Learning Python is great. 2 | Good luck! -------------------------------------------------------------------------------- /images/API_screenshot.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/API_screenshot.PNG -------------------------------------------------------------------------------- /images/CSSInspector_screenshot.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/CSSInspector_screenshot.PNG -------------------------------------------------------------------------------- /images/CloneRepo.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/CloneRepo.PNG -------------------------------------------------------------------------------- /images/DIV_screenshot.PNG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/DIV_screenshot.PNG -------------------------------------------------------------------------------- /images/DevTools_screenshot.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/DevTools_screenshot.PNG -------------------------------------------------------------------------------- /images/Earnings_console_screenshot.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/Earnings_console_screenshot.PNG -------------------------------------------------------------------------------- /images/Earnings_graph_screenshot.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/Earnings_graph_screenshot.PNG -------------------------------------------------------------------------------- /images/SSRN_screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/SSRN_screenshot.png -------------------------------------------------------------------------------- /images/bannerimage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/bannerimage.png -------------------------------------------------------------------------------- /images/jupyterimage.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TiesdeKok/LearnPythonforResearch/3777625b646336a67e4c9d23b270ddeff8e58854/images/jupyterimage.png -------------------------------------------------------------------------------- /postBuild: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | jupyter labextension install @jupyter-widgets/jupyterlab-manager qgrid 4 | jupyter labextension install jupyterlab-plotly@4.8.1 5 | jupyter labextension install @jupyter-widgets/jupyterlab-manager plotlywidget@4.8.1 6 | jupyter labextension install @jupyter-widgets/jupyterlab-manager --------------------------------------------------------------------------------