├── 2 clean data ├── 2 Clean data - Complete.ipynb ├── 2 Clean data.ipynb ├── results.csv └── results_clean.csv ├── 3 analyse data ├── 3 Analyse data - Complete.ipynb ├── 3 Analyse data.ipynb ├── Extra 3 Analyse data exercises - Complete.ipynb ├── Extra 3 Analyse data exercises.ipynb ├── donations per party, absolute + percentages.csv └── results_clean.csv ├── 4 scrape data ├── 4 Scrape data - Complete.ipynb ├── 4 Scrape data.ipynb ├── scrapedData single header.csv └── scrapedData.csv ├── LICENSE └── README.md /2 clean data/2 Clean data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "toc": true 7 | }, 8 | "source": [ 9 | "
`, the HTML abbreviation for table data. You can use BeautifulSoup to look for all `td`'s in this 21st row by typing: `rows[21].find_all('td')`." 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": { 296 | "ExecuteTime": { 297 | "end_time": "2018-05-10T16:14:08.924923Z", 298 | "start_time": "2018-05-10T16:14:08.917175Z" 299 | } 300 | }, 301 | "outputs": [], 302 | "source": [] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "Just for your information: you can even save the data from the `td`'s to a variable called `cells`. Simply type `cells = rows[21].find_all('td')`." 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": { 315 | "ExecuteTime": { 316 | "end_time": "2018-05-10T16:14:10.971479Z", 317 | "start_time": "2018-05-10T16:14:10.965290Z" 318 | } 319 | }, 320 | "outputs": [], 321 | "source": [] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Now that you know how to select a single row, you can probably guess how to select a data cell. Exactly, use `cells[0]` to get the first cell of `cells`." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "ExecuteTime": { 335 | "end_time": "2018-05-10T16:14:13.255237Z", 336 | "start_time": "2018-05-10T16:14:13.248465Z" 337 | } 338 | }, 339 | "outputs": [], 340 | "source": [] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "It works, but it doesn't look too good, does it? Let's get rid of the HTML bits and pieces around our data. Add `.text` to get the job done." 
347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": { 353 | "ExecuteTime": { 354 | "end_time": "2018-05-10T16:14:15.058803Z", 355 | "start_time": "2018-05-10T16:14:15.053231Z" 356 | } 357 | }, 358 | "outputs": [], 359 | "source": [] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "Looks much better, doesn't it? \n", 366 | "\n", 367 | "Unfortunately, there are too many rows in this table to get each cell like we got `Comanche Peak 105000445`. We're going to have to automate it. Luckily, automation is one of the big benefits of programming. \n", 368 | "\n", 369 | "Here's what we're going to do: \n", 370 | "1. create an empty list to be used later\n", 371 | "2. extract the table from our soup, save it to the `table` variable\n", 372 | "3. 'loop over' our table...\n", 373 | "4. ...to save the data we need for each row in the table\n", 374 | "5. add the selected data to the list\n", 375 | "6. print the list\n", 376 | "\n", 377 | "At step 3 we'll 'loop over' the table. What does that mean? Well, using a for loop, as it's called, means that we give our computer an assignment and have it carried out **for** every item in a collection. It's like when your mum told you to treat your friends to candy: **for every one of your friends, give them a piece of candy**. That's shorter than naming all your friends one by one and repeating the assignment time and time again, right? We're doing exactly the same by telling our computer: **for every row in the table, get the data inside the cells**." 
378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": { 384 | "ExecuteTime": { 385 | "end_time": "2018-05-10T16:37:25.348411Z", 386 | "start_time": "2018-05-10T16:37:25.338361Z" 387 | } 388 | }, 389 | "outputs": [], 390 | "source": [] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": { 396 | "ExecuteTime": { 397 | "end_time": "2018-05-10T16:16:29.254893Z", 398 | "start_time": "2018-05-10T16:16:29.198302Z" 399 | } 400 | }, 401 | "outputs": [], 402 | "source": [] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "Congrats! You just wrote your very first scraper - well done!\n", 409 | "\n", 410 | "## Saving the scraped data\n", 411 | "\n", 412 | "Now, of course, having your data printed inside the notebook is nice. But it would be even better to store the data in a CSV file. Remember that I explained what we'd actually be doing? Of course, things are a bit more complicated; let me explain. Here's what I told you before:\n", 413 | "\n", 414 | "- tell your computer which site to visit: where do you want to download data from? \n", 415 | " - we'll be using the `requests` library to request webpages\n", 416 | "- save the webpage (the html-page) to the computer\n", 417 | " - this too will be done with the `requests` library\n", 418 | "- from the webpage, select the data you want to have\n", 419 | " - we'll be using `BeautifulSoup` to do this\n", 420 | "- write the selection to a csv-file\n", 421 | " - this is done with the `csv` library\n", 422 | "\n", 423 | "Here's what the code will actually do: \n", 424 | "1. Create a CSV file to save data in\n", 425 | "2. Create a CSV writer to write data to the CSV file\n", 426 | "3. Tell your computer which site(s) to visit\n", 427 | "4. Get the webpage\n", 428 | "5. Select data from the webpage\n", 429 | "6. Write data with the CSV writer to the CSV file \n", 430 | "7. 
Save file\n", 431 | "\n", 432 | "## Save data to CSV\n", 433 | "\n", 434 | "Here's how to save data to a CSV file using the `csv` library - the process involves a couple of steps:\n", 435 | "1. create a file, open it, make sure it's 'writeable', use `open('filename.csv', 'w', encoding='utf8', newline='')`\n", 436 | "2. create a writer, which you need in order to write data to the file; use `csv.writer(file, delimiter=',')`, where `file` is the file object that `open` returned in step 1\n", 437 | "3. write data to the file using the writer, use `writer.writerow([data])`\n", 438 | "\n", 439 | "Of course, you can repeat step 3 as often as necessary." 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": { 446 | "ExecuteTime": { 447 | "end_time": "2018-05-10T16:24:02.221916Z", 448 | "start_time": "2018-05-10T16:24:02.212331Z" 449 | } 450 | }, 451 | "outputs": [], 452 | "source": [] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "Using the `ls` command you can see that a new file was created. " 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": { 465 | "ExecuteTime": { 466 | "end_time": "2018-05-10T16:38:31.872133Z", 467 | "start_time": "2018-05-10T16:38:31.742944Z" 468 | } 469 | }, 470 | "outputs": [], 471 | "source": [] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "## The scraper\n", 478 | "Earlier we broke our essay scraper into sentences. Now I'll be putting all these sentences together. This way, you can get a good overview of what a scraper could look like. Here's a list of what we need to do, in the exact order: \n", 479 | "1. Create a CSV file, open it, make it writeable\n", 480 | "2. Create a CSV writer to write data\n", 481 | "3. Write the column headers to the file\n", 482 | "4. Tell your computer which site(s) to visit\n", 483 | "5. Get the webpage\n", 484 | "6. Select data from the webpage\n", 485 | "7. 
Write data with the CSV writer to the CSV file \n", 486 | "8. Save file" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "ExecuteTime": { 494 | "end_time": "2018-05-10T16:32:33.718331Z", 495 | "start_time": "2018-05-10T16:32:32.641607Z" 496 | }, 497 | "collapsed": true, 498 | "jupyter": { 499 | "outputs_hidden": true 500 | } 501 | }, 502 | "outputs": [], 503 | "source": [] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "If you want to check if everything worked as it's supposed to, you can import the `scrapedData.csv` file as a dataframe using `pd.read_csv('filename.csv')`. Look at the dataframe to see if there's data in the file. Using `df.shape` you can even quickly check if there is as much data in the file as you'd expect. " 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": { 516 | "ExecuteTime": { 517 | "end_time": "2018-05-10T16:32:37.086801Z", 518 | "start_time": "2018-05-10T16:32:37.041336Z" 519 | } 520 | }, 521 | "outputs": [], 522 | "source": [] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "`df.shape` will give you the number of rows and columns of the dataframe. It's a quick way to check whether everything that should be in the CSV file is really there." 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "metadata": { 535 | "ExecuteTime": { 536 | "end_time": "2018-05-10T16:34:50.897133Z", 537 | "start_time": "2018-05-10T16:34:50.889696Z" 538 | } 539 | }, 540 | "outputs": [], 541 | "source": [] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "Note that the headers are in the dataset twice:\n", 548 | "while scraping we added a header row, but we also scraped the headers, since they are in the first row of the table and we scraped all table rows...\n", 549 | "\n", 550 | "Now what? 
\n", 551 | "\n", 552 | "You can easily delete a row by using ``df.drop(df.index[N])``, to drop the Nth row by index number.\n", 553 | "\n", 554 | "To make sure you get the index number right, why not print the first rows once more? We're in a notebook after all... You can use ``df.head()``" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "Looking at these first 5 rows, you'll find that you want to delete the row with index number 0. As stated before, you can use ``df.drop``. By default Pandas will create and return a copy of your dataset, and delete the row of your choosing in that copy. This means that the original will still include the dropped row.\n", 569 | "\n", 570 | "Consider this a safety belt when deleting data using Pandas. ;)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "To delete the first row in the original dataset, and not in a copy that Pandas returns to you, you'll need to use ``inplace=True``. The full command becomes: ``df.drop(df.index[0], inplace=True)``. \n", 585 | "\n", 586 | "``inplace=True`` will delete the row in the original dataset, and won't return anything. Try it:" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "To see that it worked, request the head of the dataframe..." 
601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": {}, 613 | "source": [ 614 | "If you want to you can save this cleaned version, by using ``df.to_csv()``..." 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "metadata": {}, 621 | "outputs": [], 622 | "source": [] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "Well done, happy web scraping!" 629 | ] 630 | } 631 | ], 632 | "metadata": { 633 | "kernelspec": { 634 | "display_name": "Python 3", 635 | "language": "python", 636 | "name": "python3" 637 | }, 638 | "language_info": { 639 | "codemirror_mode": { 640 | "name": "ipython", 641 | "version": 3 642 | }, 643 | "file_extension": ".py", 644 | "mimetype": "text/x-python", 645 | "name": "python", 646 | "nbconvert_exporter": "python", 647 | "pygments_lexer": "ipython3", 648 | "version": "3.7.5" 649 | }, 650 | "toc": { 651 | "nav_menu": {}, 652 | "number_sections": true, 653 | "sideBar": true, 654 | "skip_h1_title": true, 655 | "toc_cell": false, 656 | "toc_position": {}, 657 | "toc_section_display": "block", 658 | "toc_window_display": false 659 | } 660 | }, 661 | "nbformat": 4, 662 | "nbformat_minor": 4 663 | } 664 | -------------------------------------------------------------------------------- /4 scrape data/scrapedData single header.csv: -------------------------------------------------------------------------------- 1 | ,plantNamedocketNumber,licenseNumber,reactorType,location,OwnerOperator,NRCRegion 2 | 1,Arkansas Nuclear 105000313,DPR-51,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc. ",4 3 | 2,Arkansas Nuclear 205000368,NPF-6,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc. 
",4 4 | 3,Beaver Valley 105000334,DPR-66,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co. ,1 5 | 4,Beaver Valley 205000412,NPF-73,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co. ,1 6 | 5,Braidwood 105000456,NPF-72,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC ",3 7 | 6,Braidwood 205000457,NPF-77,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC ",3 8 | 7,Browns Ferry 105000259,DPR-33,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 9 | 8,Browns Ferry 205000260,DPR-52,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 10 | 9,Browns Ferry 305000296,DPR-68,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 11 | 10,Brunswick 105000325,DPR-71,BWR,"30 miles S of Wilmington, NC","Duke Energy Progress, LLC ",2 12 | 11,Brunswick 205000324,DPR-62,BWR,"30 miles S of Wilmington, NC","Duke Energy Progress, LLC",2 13 | 12,Byron 105000454,NPF-37,PWR,"17 miles SW of Rockford, IL","Exelon Generation Co., LLC ",3 14 | 13,Byron 205000455,NPF-66,PWR,"17 miles SW of Rockford, IL","Exelon Generation Co., LLC ",3 15 | 14,Callaway05000483,NPF-30,PWR,"25 miles ENE of Jefferson City, MO",Ameren UE ,4 16 | 15,Calvert Cliffs 105000317,DPR-53,PWR,"40 miles S of Annapolis, MD",Constellation Energy,1 17 | 16,Calvert Cliffs 205000318,DPR-69,PWR,"40 miles S of Annapolis, MD",Constellation Energy,1 18 | 17,Catawba 105000413,NPF-35,PWR,"18 miles S of Charlotte, NC","Duke Energy Carolinas, LLC ",2 19 | 18,Catawba 205000414,NPF-52,PWR,"18 miles S of Charlotte, NC","Duke Energy Carolinas, LLC",2 20 | 19,Clinton05000461,NPF-62,BWR,"23 miles SSE of Bloomington, IL","Exelon Generation Co., LLC",3 21 | 20,Columbia Generating Station05000397,NPF-21,BWR,"20 miles NNE of Pasco, WA",Energy Northwest ,4 22 | 21,Comanche Peak 105000445,NPF-87,PWR,"40 miles SW of Fort Worth, TX",TEX Operations Company LLC,4 23 | 22,Comanche Peak 205000446,NPF-89,PWR,"40 miles SW of Fort Worth, TX",TEX Operations Company 
LLC,4 24 | 23,Cooper05000298,DPR-46,BWR,"23 miles S of Nebraska City, NE",Nebraska Public Power District ,4 25 | 24,D.C. Cook 105000315,DPR-58,PWR,"13 miles S of Benton Harbor, MI",Indiana/Michigan Power Co. ,3 26 | 25,D.C. Cook 2 05000316,DPR-74,PWR,"13 miles S of Benton Harbor, MI",Indiana/Michigan Power Co. ,3 27 | 26,Davis-Besse05000346,NPF-3,PWR,"21 miles ESE of Toledo, OH",FirstEnergy Nuclear Operating Co. ,3 28 | 27,Diablo Canyon 105000275,DPR-80,PWR,"12 miles WSW of San Luis Obispo, CA",Pacific Gas & Electric Co. ,4 29 | 28,Diablo Canyon 205000323,DPR-82,PWR,"12 miles WSW of San Luis Obispo, CA",Pacific Gas & Electric Co. ,4 30 | 29,Dresden 205000237,DPR-19,BWR,"25 miles SW of Joliet, IL","Exelon Generation Co., LLC ",3 31 | 30,Dresden 305000249,DPR-25,BWR,"25 miles SW of Joliet, IL","Exelon Generation Co., LLC ",3 32 | 31,Duane Arnold05000331,DPR-49,BWR,"8 miles NW of Cedar Rapids, IA","NextEra Energy Duane Arnold, LLC ",3 33 | 32,Farley 105000348,NPF-2,PWR,"18 miles E of Dothan, AL",Southern Nuclear Operating Co. ,2 34 | 33,Farley 205000364,NPF-8,PWR,"18 miles E of Dothan, AL",Southern Nuclear Operating Co.,2 35 | 34,Fermi 205000341,NPF-43,BWR,"25 miles NE of Toledo, OH",DTE Electric Company,3 36 | 35,FitzPatrick05000333,DPR-59,BWR,"6 miles NE of Oswego, NY","Exelon FitzPatrick, LLC/Exelon Generation Company, LLC",1 37 | 36,Ginna05000244,DPR-18,PWR,"20 miles NE of Rochester, NY",Constellation Energy,1 38 | 37,Grand Gulf 105000416,NPF-29,BWR,"20 miles S of Vicksburg, MS","Entergy Nuclear Operations, Inc.",4 39 | 38,Hatch 105000321,DPR-57,BWR,"20 miles S of Vidalia, GA","Southern Nuclear Operating Co., Inc. ",2 40 | 39,Hatch 205000366,NPF-5,BWR,"20 miles S of Vidalia, GA","Southern Nuclear Operating Co., Inc. 
",2 41 | 40,Hope Creek 105000354,NPF-57,BWR,"18 miles SE of Wilmington, DE","PSEG Nuclear, LLC",1 42 | 41,Indian Point 305000286,DPR-64,PWR,"24 miles N of New York City, NY","Entergy Nuclear Operations, Inc.",1 43 | 42,La Salle 105000373,NPF-11,BWR,"11 miles SE of Ottawa, IL","Exelon Generation Co., LLC ",3 44 | 43,La Salle 205000374,NPF-18,BWR,"11 miles SE of Ottawa, IL","Exelon Generation Co., LLC ",3 45 | 44,Limerick 105000352,NPF-39,BWR,"21 miles NW of Philadelphia, PA","Exelon Generation Co., LLC ",1 46 | 45,Limerick 205000353,NPF-85,BWR,"21 miles NW of Philadelphia, PA","Exelon Generation Co., LLC ",1 47 | 46,McGuire 105000369,NPF-9,PWR,"17 miles N of Charlotte, NC","Duke Energy Carolinas, LLC",2 48 | 47,McGuire 205000370,NPF-17,PWR,"17 miles N of Charlotte, NC","Duke Energy Carolinas, LLC",2 49 | 48,Millstone 205000336,DPR-65,PWR,"3.2 miles WSW of New London, CT",Dominion Generation ,1 50 | 49,Millstone 305000423,NPF-49,PWR,"3.2 miles WSW of New London, CT",Dominion Generation ,1 51 | 50,Monticello05000263,DPR-22,BWR,"35 miles NW of Minneapolis, MN",Northern States Power Company – Minnesota,3 52 | 51,Nine Mile Point 105000220,DPR-63,BWR,"6 miles NE of Oswego, NY",Constellation Energy,1 53 | 52,Nine Mile Point 205000410,NPF-69,BWR,"6 miles NE of Oswego, NY",Constellation Energy,1 54 | 53,North Anna 105000338,NPF-4,PWR,"40 miles NW of Richmond, VA",Dominion Generation,2 55 | 54,North Anna 205000339,NPF-7,PWR,"40 miles NW of Richmond, VA",Dominion Generation,2 56 | 55,Oconee 105000269,DPR-38,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 57 | 56,Oconee 205000270,DPR-47,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 58 | 57,Oconee 305000287,DPR-55,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 59 | 58,Palisades05000255,DPR-20,PWR,"5 miles S of South Haven, MI","Entergy Nuclear Operations, Inc.",3 60 | 59,Palo Verde 105000528,NPF-41,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. 
,4 61 | 60,Palo Verde 205000529,NPF-51,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. ,4 62 | 61,Palo Verde 305000530,NPF-74,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. ,4 63 | 62,Peach Bottom 205000277,DPR-44,BWR,"17.9 miles S of Lancaster, PA","Exelon Generation Co., LLC ",1 64 | 63,Peach Bottom 305000278,DPR-56,BWR,"17.9 miles S of Lancaster, PA","Exelon Generation Co., LLC ",1 65 | 64,Perry 105000440,NPF-58,BWR,"35 miles NE of Cleveland, OH",FirstEnergy Nuclear Operating Co. ,3 66 | 65,Point Beach 105000266,DPR-24,PWR,"13 miles NNW of Manitowoc, WI","NextEra Energy Point Beach, LLC",3 67 | 66,Point Beach 205000301,DPR-27,PWR,"13 miles NNW of Manitowoc, WI","NextEra Energy Point Beach, LLC",3 68 | 67,Prairie Island 105000282,DPR-42,PWR,"28 miles SE of Minneapolis, MN",Northern States Power Company – Minnesota ,3 69 | 68,Prairie Island 205000306,DPR-60,PWR,"28 miles SE of Minneapolis, MN",Northern States Power Company – Minnesota ,3 70 | 69,Quad Cities 105000254,DPR-29,BWR,"20 miles NE of Moline, IL","Exelon Generation Co., LLC ",3 71 | 70,Quad Cities 205000265,DPR-30,BWR,"20 miles NE of Moline, IL","Exelon Generation Co., LLC ",3 72 | 71,River Bend 105000458,NPF-47,BWR,"24 miles NNW of Baton Rouge, LA","Entergy Nuclear Operations, Inc.",4 73 | 72,Robinson 205000261,DPR-23,PWR,"26 miles NW of Florence, SC","Duke Energy Progress, LLC ",2 74 | 73,Saint Lucie 105000335,DPR-67,PWR,"10 miles SE of Ft. Pierce, FL",Florida Power & Light Co. ,2 75 | 74,Saint Lucie 205000389,NPF-16,PWR,"10 miles SE of Ft. Pierce, FL",Florida Power & Light Co. 
,2 76 | 75,Salem 105000272,DPR-70,PWR,"18 miles S of Wilmington, DE","PSEG Nuclear, LLC",1 77 | 76,Salem 205000311,DPR-75,PWR,"18 miles S of Wilmington, DE","PSEG Nuclear, LLC",1 78 | 77,Seabrook 105000443,NPF-86,PWR,"13 miles S of Portsmouth, NH","NextEra Energy Seabrook, LLC",1 79 | 78,Sequoyah 105000327,DPR-77,PWR,"16 miles NE of Chattanooga, TN",Tennessee Valley Authority ,2 80 | 79,Sequoyah 205000328,DPR-79,PWR,"16 miles NE of Chattanooga, TN",Tennessee Valley Authority ,2 81 | 80,Shearon Harris 105000400,NPF-63,PWR,"20 miles SW of Raleigh, NC","Duke Energy Progress, LLC",2 82 | 81,South Texas 105000498,NPF-76,PWR,"90 miles SW of Houston, TX",STP Nuclear Operating Co. ,4 83 | 82,South Texas 205000499,NPF-80,PWR,"90 miles SW of Houston, TX",STP Nuclear Operating Co. ,4 84 | 83,Summer05000395,NPF-12,PWR,"26 miles NW of Columbia, SC",South Carolina Electric & Gas Co. ,2 85 | -------------------------------------------------------------------------------- /4 scrape data/scrapedData.csv: -------------------------------------------------------------------------------- 1 | plantNamedocketNumber,licenseNumber,reactorType,location,OwnerOperator,NRCRegion 2 | "Plant Name 3 | Docket Number",License Number,"Reactor 4 | Type",Location,Owner/Operator,NRC Region 5 | Arkansas Nuclear 105000313,DPR-51,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc. ",4 6 | Arkansas Nuclear 205000368,NPF-6,PWR,"6 miles WNW of Russellville, AR","Entergy Nuclear Operations, Inc. ",4 7 | Beaver Valley 105000334,DPR-66,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co. ,1 8 | Beaver Valley 205000412,NPF-73,PWR,"17 miles W of McCandless, PA",FirstEnergy Nuclear Operating Co. 
,1 9 | Braidwood 105000456,NPF-72,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC ",3 10 | Braidwood 205000457,NPF-77,PWR,"20 miles SSW of Joliet, IL","Exelon Generation Co., LLC ",3 11 | Browns Ferry 105000259,DPR-33,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 12 | Browns Ferry 205000260,DPR-52,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 13 | Browns Ferry 305000296,DPR-68,BWR,"32 miles W of Huntsville, AL",Tennessee Valley Authority ,2 14 | Brunswick 105000325,DPR-71,BWR,"30 miles S of Wilmington, NC","Duke Energy Progress, LLC ",2 15 | Brunswick 205000324,DPR-62,BWR,"30 miles S of Wilmington, NC","Duke Energy Progress, LLC",2 16 | Byron 105000454,NPF-37,PWR,"17 miles SW of Rockford, IL","Exelon Generation Co., LLC ",3 17 | Byron 205000455,NPF-66,PWR,"17 miles SW of Rockford, IL","Exelon Generation Co., LLC ",3 18 | Callaway05000483,NPF-30,PWR,"25 miles ENE of Jefferson City, MO",Ameren UE ,4 19 | Calvert Cliffs 105000317,DPR-53,PWR,"40 miles S of Annapolis, MD",Constellation Energy,1 20 | Calvert Cliffs 205000318,DPR-69,PWR,"40 miles S of Annapolis, MD",Constellation Energy,1 21 | Catawba 105000413,NPF-35,PWR,"18 miles S of Charlotte, NC","Duke Energy Carolinas, LLC ",2 22 | Catawba 205000414,NPF-52,PWR,"18 miles S of Charlotte, NC","Duke Energy Carolinas, LLC",2 23 | Clinton05000461,NPF-62,BWR,"23 miles SSE of Bloomington, IL","Exelon Generation Co., LLC",3 24 | Columbia Generating Station05000397,NPF-21,BWR,"20 miles NNE of Pasco, WA",Energy Northwest ,4 25 | Comanche Peak 105000445,NPF-87,PWR,"40 miles SW of Fort Worth, TX",TEX Operations Company LLC,4 26 | Comanche Peak 205000446,NPF-89,PWR,"40 miles SW of Fort Worth, TX",TEX Operations Company LLC,4 27 | Cooper05000298,DPR-46,BWR,"23 miles S of Nebraska City, NE",Nebraska Public Power District ,4 28 | D.C. Cook 105000315,DPR-58,PWR,"13 miles S of Benton Harbor, MI",Indiana/Michigan Power Co. ,3 29 | D.C. 
Cook 2 05000316,DPR-74,PWR,"13 miles S of Benton Harbor, MI",Indiana/Michigan Power Co. ,3 30 | Davis-Besse05000346,NPF-3,PWR,"21 miles ESE of Toledo, OH",FirstEnergy Nuclear Operating Co. ,3 31 | Diablo Canyon 105000275,DPR-80,PWR,"12 miles WSW of San Luis Obispo, CA",Pacific Gas & Electric Co. ,4 32 | Diablo Canyon 205000323,DPR-82,PWR,"12 miles WSW of San Luis Obispo, CA",Pacific Gas & Electric Co. ,4 33 | Dresden 205000237,DPR-19,BWR,"25 miles SW of Joliet, IL","Exelon Generation Co., LLC ",3 34 | Dresden 305000249,DPR-25,BWR,"25 miles SW of Joliet, IL","Exelon Generation Co., LLC ",3 35 | Duane Arnold05000331,DPR-49,BWR,"8 miles NW of Cedar Rapids, IA","NextEra Energy Duane Arnold, LLC ",3 36 | Farley 105000348,NPF-2,PWR,"18 miles E of Dothan, AL",Southern Nuclear Operating Co. ,2 37 | Farley 205000364,NPF-8,PWR,"18 miles E of Dothan, AL",Southern Nuclear Operating Co.,2 38 | Fermi 205000341,NPF-43,BWR,"25 miles NE of Toledo, OH",DTE Electric Company,3 39 | FitzPatrick05000333,DPR-59,BWR,"6 miles NE of Oswego, NY","Exelon FitzPatrick, LLC/Exelon Generation Company, LLC",1 40 | Ginna05000244,DPR-18,PWR,"20 miles NE of Rochester, NY",Constellation Energy,1 41 | Grand Gulf 105000416,NPF-29,BWR,"20 miles S of Vicksburg, MS","Entergy Nuclear Operations, Inc.",4 42 | Hatch 105000321,DPR-57,BWR,"20 miles S of Vidalia, GA","Southern Nuclear Operating Co., Inc. ",2 43 | Hatch 205000366,NPF-5,BWR,"20 miles S of Vidalia, GA","Southern Nuclear Operating Co., Inc. 
",2 44 | Hope Creek 105000354,NPF-57,BWR,"18 miles SE of Wilmington, DE","PSEG Nuclear, LLC",1 45 | Indian Point 305000286,DPR-64,PWR,"24 miles N of New York City, NY","Entergy Nuclear Operations, Inc.",1 46 | La Salle 105000373,NPF-11,BWR,"11 miles SE of Ottawa, IL","Exelon Generation Co., LLC ",3 47 | La Salle 205000374,NPF-18,BWR,"11 miles SE of Ottawa, IL","Exelon Generation Co., LLC ",3 48 | Limerick 105000352,NPF-39,BWR,"21 miles NW of Philadelphia, PA","Exelon Generation Co., LLC ",1 49 | Limerick 205000353,NPF-85,BWR,"21 miles NW of Philadelphia, PA","Exelon Generation Co., LLC ",1 50 | McGuire 105000369,NPF-9,PWR,"17 miles N of Charlotte, NC","Duke Energy Carolinas, LLC",2 51 | McGuire 205000370,NPF-17,PWR,"17 miles N of Charlotte, NC","Duke Energy Carolinas, LLC",2 52 | Millstone 205000336,DPR-65,PWR,"3.2 miles WSW of New London, CT",Dominion Generation ,1 53 | Millstone 305000423,NPF-49,PWR,"3.2 miles WSW of New London, CT",Dominion Generation ,1 54 | Monticello05000263,DPR-22,BWR,"35 miles NW of Minneapolis, MN",Northern States Power Company – Minnesota,3 55 | Nine Mile Point 105000220,DPR-63,BWR,"6 miles NE of Oswego, NY",Constellation Energy,1 56 | Nine Mile Point 205000410,NPF-69,BWR,"6 miles NE of Oswego, NY",Constellation Energy,1 57 | North Anna 105000338,NPF-4,PWR,"40 miles NW of Richmond, VA",Dominion Generation,2 58 | North Anna 205000339,NPF-7,PWR,"40 miles NW of Richmond, VA",Dominion Generation,2 59 | Oconee 105000269,DPR-38,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 60 | Oconee 205000270,DPR-47,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 61 | Oconee 305000287,DPR-55,PWR,"30 miles W of Greenville, SC","Duke Energy Carolinas, LLC",2 62 | Palisades05000255,DPR-20,PWR,"5 miles S of South Haven, MI","Entergy Nuclear Operations, Inc.",3 63 | Palo Verde 105000528,NPF-41,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. 
,4 64 | Palo Verde 205000529,NPF-51,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. ,4 65 | Palo Verde 305000530,NPF-74,PWR,"50 miles W of Phoenix, AZ",Arizona Public Service Co. ,4 66 | Peach Bottom 205000277,DPR-44,BWR,"17.9 miles S of Lancaster, PA","Exelon Generation Co., LLC ",1 67 | Peach Bottom 305000278,DPR-56,BWR,"17.9 miles S of Lancaster, PA","Exelon Generation Co., LLC ",1 68 | Perry 105000440,NPF-58,BWR,"35 miles NE of Cleveland, OH",FirstEnergy Nuclear Operating Co. ,3 69 | Point Beach 105000266,DPR-24,PWR,"13 miles NNW of Manitowoc, WI","NextEra Energy Point Beach, LLC",3 70 | Point Beach 205000301,DPR-27,PWR,"13 miles NNW of Manitowoc, WI","NextEra Energy Point Beach, LLC",3 71 | Prairie Island 105000282,DPR-42,PWR,"28 miles SE of Minneapolis, MN",Northern States Power Company – Minnesota ,3 72 | Prairie Island 205000306,DPR-60,PWR,"28 miles SE of Minneapolis, MN",Northern States Power Company – Minnesota ,3 73 | Quad Cities 105000254,DPR-29,BWR,"20 miles NE of Moline, IL","Exelon Generation Co., LLC ",3 74 | Quad Cities 205000265,DPR-30,BWR,"20 miles NE of Moline, IL","Exelon Generation Co., LLC ",3 75 | River Bend 105000458,NPF-47,BWR,"24 miles NNW of Baton Rouge, LA","Entergy Nuclear Operations, Inc.",4 76 | Robinson 205000261,DPR-23,PWR,"26 miles NW of Florence, SC","Duke Energy Progress, LLC ",2 77 | Saint Lucie 105000335,DPR-67,PWR,"10 miles SE of Ft. Pierce, FL",Florida Power & Light Co. ,2 78 | Saint Lucie 205000389,NPF-16,PWR,"10 miles SE of Ft. Pierce, FL",Florida Power & Light Co. 
,2 79 | Salem 105000272,DPR-70,PWR,"18 miles S of Wilmington, DE","PSEG Nuclear, LLC",1 80 | Salem 205000311,DPR-75,PWR,"18 miles S of Wilmington, DE","PSEG Nuclear, LLC",1 81 | Seabrook 105000443,NPF-86,PWR,"13 miles S of Portsmouth, NH","NextEra Energy Seabrook, LLC",1 82 | Sequoyah 105000327,DPR-77,PWR,"16 miles NE of Chattanooga, TN",Tennessee Valley Authority ,2 83 | Sequoyah 205000328,DPR-79,PWR,"16 miles NE of Chattanooga, TN",Tennessee Valley Authority ,2 84 | Shearon Harris 105000400,NPF-63,PWR,"20 miles SW of Raleigh, NC","Duke Energy Progress, LLC",2 85 | South Texas 105000498,NPF-76,PWR,"90 miles SW of Houston, TX",STP Nuclear Operating Co. ,4 86 | South Texas 205000499,NPF-80,PWR,"90 miles SW of Houston, TX",STP Nuclear Operating Co. ,4 87 | Summer05000395,NPF-12,PWR,"26 miles NW of Columbia, SC",South Carolina Electric & Gas Co. ,2 88 | Surry 105000280,DPR-32,PWR,"17 miles NW of Newport News, VA",Dominion Generation,2 89 | Surry 205000281,DPR-37,PWR,"17 miles NW of Newport News, VA",Dominion Generation,2 90 | Susquehanna 105000387,NPF-14,BWR,"70 miles NE of Harrisburg, PA","Susquehanna Nuclear, LLC",1 91 | Susquehanna 205000388,NPF-22,BWR,"70 miles NE of Harrisburg, PA","Susquehanna Nuclear, LLC",1 92 | Turkey Point 305000250,DPR-31,PWR,"20 miles S of Miami, FL",Florida Power & Light Co. ,2 93 | Turkey Point 405000251,DPR-41,PWR,"20 miles S of Miami, FL",Florida Power & Light Co. 
,2 94 | Vogtle 105000424,NPF-68,PWR,"26 miles SE of Augusta, GA",Southern Nuclear Operating Co.,2 95 | Vogtle 205000425,NPF-81,PWR,"26 miles SE of Augusta, GA",Southern Nuclear Operating Co.,2 96 | Waterford 305000382,NPF-38,PWR,"25 miles W of New Orleans, LA","Entergy Nuclear Operations, Inc.",4 97 | Watts Bar 105000390,NPF-90,PWR,"60 miles SW of Knoxville, TN",Tennessee Valley Authority ,2 98 | Watts Bar 205000391,NPF-96,PWR,"60 miles SW of Knoxville, TN",Tennessee Valley Authority ,2 99 | Wolf Creek 105000482,NPF-42,PWR,"3.5 miles NE of Burlington, KS",Wolf Creek Nuclear Operating Corp. ,4 100 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Winny de Jong 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Python for Journalists
======================
*Notebooks and files for the Python for Journalists course on [Datajournalism.com](https://datajournalism.com/watch/python-for-journalists)*

* [What is Python anyway](#what-is-python-anyway)
* [About the course](#about-the-course)
* [Course Modules](#course-modules)
    * 1 Getting started
    * 2 Clean data
    * 3 Analyse data
    * 4 Scrape data
* [Learning More, Reference And Tools](#learning-more-reference-and-tools)
* [About Us](#about-us)


What Is Python Anyway
===============================

Python is a programming language for general-purpose programming. It's popular among data journalists for its readability, ease of use and efficiency.

About the Course
==================
The course Python for Journalists is meant for journalists looking to learn the most common uses of Python for data journalism. The course has four modules. In the first you'll learn how to set up Python and all Python-related tools on your own computer. Next you'll learn how to clean up messy datasets using the Pandas library. In the third module you'll learn how to analyse data, again using the Pandas library. In the fourth and final module you'll dabble in web scraping, learning how to automatically download data from the web using both the Beautiful Soup and Requests libraries.

This Python for Journalists course is meant for those who have dabbled in Python but somehow didn't persevere, and for those who can't wait to dive in head first. Though no programming knowledge is required, it helps if you know what a terminal or command prompt is and if you are familiar with Excel.


Course Modules
===============

For all modules except module 1, Set Up, there is a Jupyter Notebook available to follow along with during the course. Each notebook contains exercises and explanations. Happy Pythoning!

### 1 Set Up

This module revolves around installing the right tools on your laptop. To follow along in the coming modules, you'll need Python 3 and several Python libraries, such as Requests, Pandas and BeautifulSoup. Jupyter Notebooks come highly recommended. The easiest route is to install all of this software in one go, using the [Anaconda distribution](https://anaconda.org/). This first module does not include a Jupyter Notebook.

**On your computer:**

* Install the [Anaconda distribution](https://www.anaconda.com/download/#macos) to get **Python 3**, the Requests, Pandas and BeautifulSoup libraries, and Jupyter Notebooks on your computer all at once.
* Note: choose the Anaconda installation that includes **Python 3**; at the time of writing that would be Python 3.6.

**Extra preparation:**
If you want to make sure you have a solid foundation to build on, you might want to learn about Python syntax first. Here are some places where you can learn about the different data types in Python, which might help before continuing with this course. (Since the following tutorials overlap, choosing one is highly recommended.)
* Online beginner tutorials at [LearnPython.org](https://www.learnpython.org/)
* Digital book [Python for you and me](https://pymbook.readthedocs.io/en/py3/)

### 2 Clean data

In this second module we'll show you how to get into your Python conda environment, and how to start a Jupyter Notebook. Once that's out of the way, you'll learn how to import a CSV file into your Jupyter Notebook, to get ready for some data cleaning.
Among other things you'll learn how to search and replace values inside a column, how to change the datatype of a column, and how to extract data from a column to populate a new column. This module includes two Jupyter Notebooks, one empty and one completed, both named 'clean data'.

### 3 Analyse data

In this third module, you'll learn how to analyse data using the Pandas library. You'll learn how to explore your dataset by looking at summary statistics (count, median, mean, percentiles, standard deviation, etc.) for each column. Next we'll look into how to sort, filter, sum and count values in columns. Finally you'll learn how to group data and create pivot tables (familiar to Excel users) with the Pandas library. This module includes two Jupyter Notebooks, one empty and one completed, both named 'analyse data'.

**Extra exercises:**
If you want to make sure you have fully grasped this module, you can take on the extra notebook that contains some exercises. Since this is a later addition to the course, there is no video to accompany this notebook. However, you should be able to work through it without a video. :) Of course there are two extra notebooks: one completed, and one for you to work in.


### 4 Scrape data

The final module revolves around scraping data using both the Requests and the BeautifulSoup libraries. Though in practice you'll likely want to scrape data first, and clean and analyse those numbers afterwards, this module comes last for training purposes. The modules on cleaning and analysing data introduced you to Python, Pandas and Jupyter Notebooks, paving the way for some basic web scraping, including a for loop to collect data as efficiently as possible. After finishing this module you should be able to write some basic web scrapers to collect data from the internet. This module includes two Jupyter Notebooks, one empty and one completed, both named 'scrape data'.
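
The scraping loop described for module 4 can be sketched roughly as follows. This is a minimal illustration, not the course notebook's own code: the HTML snippet is hand-written for the example, and in practice you would obtain it from a real page, for instance with `requests.get(url).text` for a URL of your choosing.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text of a page fetched with Requests.
html = """
<table>
  <tr><th>Plant</th><th>License</th><th>Type</th></tr>
  <tr><td>Surry 1</td><td>DPR-32</td><td>PWR</td></tr>
  <tr><td>Susquehanna 1</td><td>NPF-14</td><td>BWR</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every table row is a <tr> element.
rows = soup.find_all("tr")

records = []
for row in rows:
    # Data cells are <td> elements; the header row only has <th> cells.
    cells = row.find_all("td")
    if not cells:
        continue
    # .text drops the surrounding HTML tags and keeps only the cell content.
    records.append([cell.text.strip() for cell in cells])

print(records)
```

Collecting each row as a plain list keeps the result easy to hand over to Pandas or to write out as a CSV file for the cleaning and analysis steps.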

Learning More
=================

* Allen B. Downey's digital book [Think Python: How to Think Like a Computer Scientist](http://greenteapress.com/thinkpython2/html/index.html)
* Swaroop's free online book [A Byte of Python](https://python.swaroopch.com/)
* Dan Bader's [Python video tutorials on YouTube](https://www.youtube.com/channel/UCI0vQvr9aFn27yR6Ej6n5UA)
* Al Sweigart's [Automate the boring stuff with Python](https://automatetheboringstuff.com/) site
* [Coding for Journalists](https://coding-for-journalists.readthedocs.io/en/latest/)

### Courses
* [Your First Python Notebook](http://www.firstpythonnotebook.org/index.html): a step-by-step guide to analyzing data with Python and the Jupyter Notebook.
* DataCamp [Python Courses](https://www.datacamp.com/courses/tech:python)
* Zed Shaw's [Learn Python the Hard Way](https://learnpythonthehardway.org/)
* edX's course [Introduction to Computer Science and Programming Using Python](https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-11)
* Coursera [Python for Everybody Specialization](https://www.coursera.org/specializations/python)
* Coursera [Applied Data Science with Python Specialization](https://www.coursera.org/specializations/data-science-python)


About Us
========

### About DataJournalism.com
The European Journalism Centre believes that the use of data in journalism is a cornerstone of building resilience in any newsroom. After 10 years of experience running data journalism programmes, they created [DataJournalism.com](https://datajournalism.com). The site provides data journalists with free resources, materials, online video courses and community forums. Once you sign in, you can enroll for free in one of our premium online courses or join the discussion with the community in our forums.
Whether you are new to data journalism or deeply familiar with it, membership will expose you to like-minded data journalists and give you a free space to learn or improve your data skills.

### About Winny de Jong
Winny works as a data journalist for the Dutch national news broadcaster [NOS](https://nos.nl). There she interviews datasets instead of people, trying to find news before it is news. Winny usually speaks about the importance of data literacy, how to develop ideas, and her data journalistic workflow. She has presented for organizations such as TEDx, Brussels News Summit, DataHarvest+ and multiple journalism colleges. Every Sunday she shares the best of the data journalism web in her [data journalism newsletter](https://datajournalistiek.nl/en/newsletter/). Visit her online at [winnymedia.nl](https://winnymedia.nl) or at [her data blog](https://datajournalistiek.nl/en).

--------------------------------------------------------------------------------