├── .gitignore
├── README.md
├── index.ipynb
├── index.md
├── index_files
    └── index_54_0.png
├── quotes_data.csv
├── requirements.txt
└── static
    ├── error.gif
    ├── graphic.png
    ├── hurray.gif
    ├── recursion.gif
    ├── soup.gif
    ├── surf.gif
    └── tools.gif

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | .DS_Store

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |

Intro to Web Scraping

2 | 3 | >A code-along workshop with Flatiron School. 4 | 5 |
6 | 7 | **To launch the notebook, [click here](https://mybinder.org/v2/gh/flatiron-school/intro_to_webscraping/master?filepath=%2Findex.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/flatiron-school/intro_to_webscraping/master?filepath=%2Findex.ipynb)** 8 | 9 | The contents of this notebook were first developed by [Joél Collins](https://github.com/joelsewhere) and use the [Quotes to Scrape](http://quotes.toscrape.com/) website to demonstrate basic web scraping in Python. 10 | 11 | ### Workshop Goals: 12 | 1. Retrieve the HTML of a webpage with the `requests` library. 13 | 2. Introduction to the `tree` structure of `HTML`. 14 | 3. Use the `inspect` tool to sift through the HTML. 15 | 4. Parse HTML with the `BeautifulSoup` library. 16 | 5. Store data in a `csv` file using the `Pandas` library. 17 | 18 | Workshop dates: 19 | - 07/10/20 20 | 21 | 22 | -------------------------------------------------------------------------------- /index.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Welcome to Flatiron's Intro to Web Scraping! \n", 8 | "\n", 9 | "![](https://media.giphy.com/media/1dcWGdOBg0wcU/giphy.gif)\n", 10 | "\n", 11 | "\n", 12 | "## A few things:\n", 13 | "- This file is called a **[Jupyter Notebook](https://jupyter.org/about)**, an industry-standard tool for data science.\n", 14 | "\n", 15 | "- Each section of a Jupyter notebook is called a `cell`. \n", 16 | " - To run the code inside a cell, click into the cell and press `shift` + `enter`.\n", 17 | "\n", 18 | " \n", 19 | "### Workshop Goals:\n", 20 | "1. Retrieve the HTML of a webpage with the `requests` library.\n", 21 | "2. Introduction to the `tree` structure of `HTML`.\n", 22 | "3. Use the `inspect` tool to sift through the HTML.\n", 23 | "4. Parse HTML with the `BeautifulSoup` library.\n", 24 | "5. 
Store data in a `csv` file using the `Pandas` library.\n", 25 | "\n", 26 | "# Let's scrape some data!\n", 27 | "The data we are scraping today will be from the [Quotes to Scrape](http://quotes.toscrape.com/) website.\n", 28 | "\n", 29 | "\n", 30 | "## Step 1:\n", 31 | "> **Import the necessary tools for our project**\n", 32 | "\n", 33 | "![](https://media.giphy.com/media/KcE7Dq5f8TTXzZ1LAF/giphy.gif)\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 1, 39 | "metadata": { 40 | "ExecuteTime": { 41 | "end_time": "2021-04-21T21:10:52.701401Z", 42 | "start_time": "2021-04-21T21:10:51.086864Z" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Webscraping\n", 48 | "import requests\n", 49 | "from bs4 import BeautifulSoup\n", 50 | "\n", 51 | "# Data organization\n", 52 | "import pandas as pd\n", 53 | "\n", 54 | "# Visualization\n", 55 | "import matplotlib.pyplot as plt\n", 56 | "plt.rcParams.update({'font.size': 22})" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## Step 2\n", 64 | "> **We use the `requests` library to connect to the website we wish to scrape.**\n", 65 | "\n", 66 | "" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 2, 72 | "metadata": { 73 | "ExecuteTime": { 74 | "end_time": "2021-04-21T21:10:53.113870Z", 75 | "start_time": "2021-04-21T21:10:52.703140Z" 76 | } 77 | }, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "<Response [200]>" 83 | ] 84 | }, 85 | "execution_count": 2, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "url = 'http://quotes.toscrape.com'\n", 92 | "response = requests.get(url)\n", 93 | "response" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "✅**A `Response 200` means our request was successful!** ✅\n", 101 | "\n", 102 | "❌Let's take a quick look at an *unsuccessful* response. 
❌" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 3, 108 | "metadata": { 109 | "ExecuteTime": { 110 | "end_time": "2021-04-21T21:10:53.365799Z", 111 | "start_time": "2021-04-21T21:10:53.115768Z" 112 | } 113 | }, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/plain": [ 118 | "<Response [404]>" 119 | ] 120 | }, 121 | "execution_count": 3, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "bad_url = 'http://quotes.toscrape.com/20'\n", 128 | "requests.get(bad_url)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "**A `Response 404` means that the URL you are using in your request does not point to a valid webpage.**\n", 136 | "\n", 137 | "" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## Step 3\n", 145 | "> **We collect the html from the website by adding `.text` to the end of the response variable.** " 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 4, 151 | "metadata": { 152 | "ExecuteTime": { 153 | "end_time": "2021-04-21T21:10:53.371878Z", 154 | "start_time": "2021-04-21T21:10:53.368038Z" 155 | } 156 | }, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "'\\n\\n\\n\\t **We use `BeautifulSoup` to turn the html into something we can manipulate.**" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 5, 185 | "metadata": { 186 | "ExecuteTime": { 187 | "end_time": "2021-04-21T21:10:53.408867Z", 188 | "start_time": "2021-04-21T21:10:53.373516Z" 189 | } 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "\n", 196 | "\n", 197 | "\n", 198 | "\n", 199 | "Quotes to Scrape\n", 200 | "\n", 201 | "\n", 202 | "\n", 203 | "\n", 204 | "
\n", 205 | "
\n", 206 | "
\n", 207 | "

\n", 208 | "Quotes to Scrape\n", 209 | "

\n", 210 | "
\n", 211 | "
\n", 212 | "

\n", 213 | "Login\n", 214 | "

\n", 215 | "
\n", 216 | "
\n", 217 | "
\n", 218 | "
\n", 219 | "
\n", 220 | "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n", 221 | "by Albert Einstein\n", 222 | "(about)\n", 223 | "\n", 224 | "
\n", 225 | " Tags:\n", 226 | " \n", 227 | "change\n", 228 | "deep-thoughts\n", 229 | "thinking\n", 230 | "world\n", 231 | "
\n", 232 | "
\n", 233 | "
\n", 234 | "“It is our choices, Harry, that show what we truly are, far more than our abilities.”\n", 235 | "by J.K. Rowling\n", 236 | "(about)\n", 237 | "\n", 238 | "
\n", 239 | " Tags:\n", 240 | " \n", 241 | "abilities\n", 242 | "choices\n", 243 | "
\n", 244 | "
\n", 245 | "
\n", 246 | "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”\n", 247 | "by Albert Einstein\n", 248 | "(about)\n", 249 | "\n", 250 | "
\n", 251 | " Tags:\n", 252 | " \n", 253 | "inspirational\n", 254 | "life\n", 255 | "live\n", 256 | "miracle\n", 257 | "miracles\n", 258 | "
\n", 259 | "
\n", 260 | "
\n", 261 | "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”\n", 262 | "by Jane Austen\n", 263 | "(about)\n", 264 | "\n", 265 | "
\n", 266 | " Tags:\n", 267 | " \n", 268 | "aliteracy\n", 269 | "books\n", 270 | "classic\n", 271 | "humor\n", 272 | "
\n", 273 | "
\n", 274 | "
\n", 275 | "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”\n", 276 | "by Marilyn Monroe\n", 277 | "(about)\n", 278 | "\n", 279 | "
\n", 280 | " Tags:\n", 281 | " \n", 282 | "be-yourself\n", 283 | "inspirational\n", 284 | "
\n", 285 | "
\n", 286 | "
\n", 287 | "“Try not to become a man of success. Rather become a man of value.”\n", 288 | "by Albert Einstein\n", 289 | "(about)\n", 290 | "\n", 291 | "
\n", 292 | " Tags:\n", 293 | " \n", 294 | "adulthood\n", 295 | "success\n", 296 | "value\n", 297 | "
\n", 298 | "
\n", 299 | "
\n", 300 | "“It is better to be hated for what you are than to be loved for what you are not.”\n", 301 | "by André Gide\n", 302 | "(about)\n", 303 | "\n", 304 | "
\n", 305 | " Tags:\n", 306 | " \n", 307 | "life\n", 308 | "love\n", 309 | "
\n", 310 | "
\n", 311 | "
\n", 312 | "“I have not failed. I've just found 10,000 ways that won't work.”\n", 313 | "by Thomas A. Edison\n", 314 | "(about)\n", 315 | "\n", 316 | "
\n", 317 | " Tags:\n", 318 | " \n", 319 | "edison\n", 320 | "failure\n", 321 | "inspirational\n", 322 | "paraphrased\n", 323 | "
\n", 324 | "
\n", 325 | "
\n", 326 | "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”\n", 327 | "by Eleanor Roosevelt\n", 328 | "(about)\n", 329 | "\n", 330 | "
\n", 331 | " Tags:\n", 332 | " \n", 333 | "misattributed-eleanor-roosevelt\n", 334 | "
\n", 335 | "
\n", 336 | "
\n", 337 | "“A day without sunshine is like, you know, night.”\n", 338 | "by Steve Martin\n", 339 | "(about)\n", 340 | "\n", 341 | "
\n", 342 | " Tags:\n", 343 | " \n", 344 | "humor\n", 345 | "obvious\n", 346 | "simile\n", 347 | "
\n", 348 | "
\n", 349 | "\n", 356 | "
\n", 357 | "
\n", 358 | "

Top Ten tags

\n", 359 | "\n", 360 | "love\n", 361 | "\n", 362 | "\n", 363 | "inspirational\n", 364 | "\n", 365 | "\n", 366 | "life\n", 367 | "\n", 368 | "\n", 369 | "humor\n", 370 | "\n", 371 | "\n", 372 | "books\n", 373 | "\n", 374 | "\n", 375 | "reading\n", 376 | "\n", 377 | "\n", 378 | "friendship\n", 379 | "\n", 380 | "\n", 381 | "friends\n", 382 | "\n", 383 | "\n", 384 | "truth\n", 385 | "\n", 386 | "\n", 387 | "simile\n", 388 | "\n", 389 | "
\n", 390 | "
\n", 391 | "
\n", 392 | "
\n", 393 | "
\n", 394 | "

\n", 395 | " Quotes by: GoodReads.com\n", 396 | "

\n", 397 | "

\n", 398 | " Made with by Scrapinghub\n", 399 | "

\n", 400 | "
\n", 401 | "
\n", 402 | "\n", 403 | "" 404 | ] 405 | }, 406 | "execution_count": 5, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "soup = BeautifulSoup(html, 'lxml')\n", 413 | "soup" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "

Very Soupy

\n", 421 | "\n", 422 | "The name ***Beautiful Soup*** is an appropriate description. HTML does not make for a lovely reading experience. If you feel like you're staring at complete gibberish, you're not entirely wrong! HTML is a language designed for computers, not for human eyes.\n", 423 | "\n", 424 | "" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "**Fortunately for us,** we do not have to read through every line of the html in order to web scrape. \n", 432 | "\n", 433 | "Modern web browsers come equipped with tools that allow us to easily sift through this soupy text.\n", 434 | "\n", 435 | "\n", 436 | "## Step 4\n", 437 | ">**We open up the page we're trying to scrape in a new tab.** Click Here! 👀" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "## Step 5\n", 445 | " > **We create a list of every `div` that has a `class` of \"quote\"**\n", 446 | "\n", 447 | "**In this instance,** every item we want to collect is divided into identically labeled containers.\n", 448 | "- A div with a class of 'quote'.\n", 449 | "\n", 450 | "Not all HTML is as well organized as this page, but HTML is basically just a bunch of different organizational containers that we use to divide up text and other forms of media. Figuring out the organizational structure of a website is most of the work in web scraping. 
" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 6, 456 | "metadata": { 457 | "ExecuteTime": { 458 | "end_time": "2021-04-21T21:10:53.415596Z", 459 | "start_time": "2021-04-21T21:10:53.410245Z" 460 | } 461 | }, 462 | "outputs": [ 463 | { 464 | "data": { 465 | "text/plain": [ 466 | "10" 467 | ] 468 | }, 469 | "execution_count": 6, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "quote_divs = soup.find_all('div', {'class': 'quote'})\n", 476 | "len(quote_divs)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "## Step 6\n", 484 | "> **To figure out how to grab all the datapoints from a quote, we isolate a single quote, and work through the code for a single `div`.**" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 7, 490 | "metadata": { 491 | "ExecuteTime": { 492 | "end_time": "2021-04-21T21:10:53.419751Z", 493 | "start_time": "2021-04-21T21:10:53.417109Z" 494 | } 495 | }, 496 | "outputs": [], 497 | "source": [ 498 | "first_quote = quote_divs[0]" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "### First we grab the text of the quote" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 8, 511 | "metadata": { 512 | "ExecuteTime": { 513 | "end_time": "2021-04-21T21:10:53.426181Z", 514 | "start_time": "2021-04-21T21:10:53.422305Z" 515 | } 516 | }, 517 | "outputs": [ 518 | { 519 | "data": { 520 | "text/plain": [ 521 | "'“The world as we have created it is a process of our thinking. 
It cannot be changed without changing our thinking.”'" 522 | ] 523 | }, 524 | "execution_count": 8, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "text = first_quote.find('span', {'class':'text'})\n", 531 | "text = text.text\n", 532 | "text" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "### Next we grab the author" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 9, 545 | "metadata": { 546 | "ExecuteTime": { 547 | "end_time": "2021-04-21T21:10:53.431920Z", 548 | "start_time": "2021-04-21T21:10:53.428068Z" 549 | } 550 | }, 551 | "outputs": [ 552 | { 553 | "data": { 554 | "text/plain": [ 555 | "'Albert Einstein'" 556 | ] 557 | }, 558 | "execution_count": 9, 559 | "metadata": {}, 560 | "output_type": "execute_result" 561 | } 562 | ], 563 | "source": [ 564 | "author = first_quote.find('small', {'class': 'author'})\n", 565 | "author_name = author.text\n", 566 | "author_name" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "### Let's also grab the link that points to the author's bio!" 
574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": 10, 579 | "metadata": { 580 | "ExecuteTime": { 581 | "end_time": "2021-04-21T21:10:53.437215Z", 582 | "start_time": "2021-04-21T21:10:53.433319Z" 583 | } 584 | }, 585 | "outputs": [ 586 | { 587 | "data": { 588 | "text/plain": [ 589 | "'http://quotes.toscrape.com/author/Albert-Einstein'" 590 | ] 591 | }, 592 | "execution_count": 10, 593 | "metadata": {}, 594 | "output_type": "execute_result" 595 | } 596 | ], 597 | "source": [ 598 | "author_link = author.findNextSibling().attrs.get('href')\n", 599 | "author_link = url + author_link\n", 600 | "author_link" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "### And finally, let's grab all of the tags for the quote" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 11, 613 | "metadata": { 614 | "ExecuteTime": { 615 | "end_time": "2021-04-21T21:10:53.443153Z", 616 | "start_time": "2021-04-21T21:10:53.438609Z" 617 | } 618 | }, 619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/plain": [ 623 | "['change', 'deep-thoughts', 'thinking', 'world']" 624 | ] 625 | }, 626 | "execution_count": 11, 627 | "metadata": {}, 628 | "output_type": "execute_result" 629 | } 630 | ], 631 | "source": [ 632 | "tag_container = first_quote.find('div', {'class': 'tags'})\n", 633 | "tag_links = tag_container.find_all('a')\n", 634 | "\n", 635 | "tags = []\n", 636 | "for tag in tag_links:\n", 637 | "    tags.append(tag.text)\n", 638 | " \n", 639 | "tags" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "### Our data:" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 12, 652 | "metadata": { 653 | "ExecuteTime": { 654 | "end_time": "2021-04-21T21:10:53.448574Z", 655 | "start_time": "2021-04-21T21:10:53.444499Z" 656 | } 657 | }, 658 | "outputs": [ 659 | { 660 | "name": "stdout", 661 | "output_type": "stream", 662 | 
"text": [ 663 | "text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n", 664 | "author name: Albert Einstein\n", 665 | "author link: http://quotes.toscrape.com/author/Albert-Einstein\n", 666 | "tags: ['change', 'deep-thoughts', 'thinking', 'world']\n" 667 | ] 668 | } 669 | ], 670 | "source": [ 671 | "print('text:', text)\n", 672 | "print('author name: ', author_name)\n", 673 | "print('author link: ', author_link)\n", 674 | "print('tags: ', tags)" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "## Step 7\n", 682 | "> **We create a function to make our code reusable.**\n", 683 | "\n", 684 | "Now that we know how to collect this data from a quote div, we can compile the code into a [function](https://www.geeksforgeeks.org/functions-in-python/) called `quote_data`. This allows us to grab a quote div, feed it into the function like so...\n", 685 | "> `quote_data(quote_div)`\n", 686 | "\n", 687 | "...and receive all of the data from that div." 
688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 13, 693 | "metadata": { 694 | "ExecuteTime": { 695 | "end_time": "2021-04-21T21:10:53.455650Z", 696 | "start_time": "2021-04-21T21:10:53.450078Z" 697 | } 698 | }, 699 | "outputs": [], 700 | "source": [ 701 | "def quote_data(quote_div):\n", 702 | "    # Collect the quote\n", 703 | "    text = quote_div.find('span', {'class':'text'})\n", 704 | "    text = text.text\n", 705 | " \n", 706 | "    # Collect the author name\n", 707 | "    author = quote_div.find('small', {'class': 'author'})\n", 708 | "    author_name = author.text\n", 709 | " \n", 710 | "    # Collect author link\n", 711 | "    author_link = author.findNextSibling().attrs.get('href')\n", 712 | "    author_link = url + author_link\n", 713 | " \n", 714 | "    # Collect tags\n", 715 | "    tag_container = quote_div.find('div', {'class': 'tags'})\n", 716 | "\n", 717 | "    tag_links = tag_container.find_all('a')\n", 718 | "\n", 719 | "    tags = []\n", 720 | "    for tag in tag_links:\n", 721 | "        tags.append(tag.text)\n", 722 | " \n", 723 | "    # Return data as a dictionary\n", 724 | "    return {'author': author_name,\n", 725 | "            'text': text,\n", 726 | "            'author_link': author_link,\n", 727 | "            'tags': tags}" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "### Let's test our function." 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 14, 740 | "metadata": { 741 | "ExecuteTime": { 742 | "end_time": "2021-04-21T21:10:53.460722Z", 743 | "start_time": "2021-04-21T21:10:53.457107Z" 744 | } 745 | }, 746 | "outputs": [ 747 | { 748 | "data": { 749 | "text/plain": [ 750 | "{'author': 'Thomas A. Edison',\n", 751 | " 'text': \"“I have not failed. 
I've just found 10,000 ways that won't work.”\",\n", 752 | " 'author_link': 'http://quotes.toscrape.com/author/Thomas-A-Edison',\n", 753 | " 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}" 754 | ] 755 | }, 756 | "execution_count": 14, 757 | "metadata": {}, 758 | "output_type": "execute_result" 759 | } 760 | ], 761 | "source": [ 762 | "quote_data(quote_divs[7])" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "Now we can collect the data from every quote on the first page with a simple [```for loop```](https://www.w3schools.com/python/python_for_loops.asp)!" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 15, 775 | "metadata": { 776 | "ExecuteTime": { 777 | "end_time": "2021-04-21T21:10:53.468852Z", 778 | "start_time": "2021-04-21T21:10:53.462096Z" 779 | }, 780 | "scrolled": true 781 | }, 782 | "outputs": [ 783 | { 784 | "name": "stdout", 785 | "output_type": "stream", 786 | "text": [ 787 | "10 quotes scraped!\n" 788 | ] 789 | }, 790 | { 791 | "data": { 792 | "text/plain": [ 793 | "[{'author': 'Albert Einstein',\n", 794 | " 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n", 795 | " 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n", 796 | " 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n", 797 | " {'author': 'J.K. 
Rowling',\n", 798 | " 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n", 799 | " 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n", 800 | " 'tags': ['abilities', 'choices']}]" 801 | ] 802 | }, 803 | "execution_count": 15, 804 | "metadata": {}, 805 | "output_type": "execute_result" 806 | } 807 | ], 808 | "source": [ 809 | "page_one_data = []\n", 810 | "for div in quote_divs:\n", 811 | "    # Apply our function on each quote\n", 812 | "    data_from_div = quote_data(div)\n", 813 | "    page_one_data.append(data_from_div)\n", 814 | " \n", 815 | "print(len(page_one_data), 'quotes scraped!')\n", 816 | "page_one_data[:2]" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": { 822 | "ExecuteTime": { 823 | "end_time": "2020-07-03T18:55:14.762708Z", 824 | "start_time": "2020-07-03T18:55:14.758667Z" 825 | } 826 | }, 827 | "source": [ 828 | "# We just scraped an entire webpage!\n", 829 | "\n", 830 | "![](https://media.giphy.com/media/KiXl0vfc9XIIM/giphy.gif)" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "### Level Up: What if we wanted to scrape the quotes from *every* page?\n", 838 | "\n", 839 | "**Step 1:** The first thing we do is take the code from above that scraped the data for all of the quotes on page one, and move it into a function called `scrape_page`." 
840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": 16, 845 | "metadata": { 846 | "ExecuteTime": { 847 | "end_time": "2021-04-21T21:10:53.473642Z", 848 | "start_time": "2021-04-21T21:10:53.470305Z" 849 | } 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "def scrape_page(quote_divs):\n", 854 | "    data = []\n", 855 | "    for div in quote_divs:\n", 856 | "        div_data = quote_data(div)\n", 857 | "        data.append(div_data)\n", 858 | " \n", 859 | "    return data" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "**Step 2:** We grab the code we used at the very beginning to collect the html and the list of divs from a web page." 867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": 17, 872 | "metadata": { 873 | "ExecuteTime": { 874 | "end_time": "2021-04-21T21:10:53.601521Z", 875 | "start_time": "2021-04-21T21:10:53.475074Z" 876 | } 877 | }, 878 | "outputs": [], 879 | "source": [ 880 | "base_url = 'http://quotes.toscrape.com'\n", 881 | "response = requests.get(base_url)\n", 882 | "html = response.text\n", 883 | "soup = BeautifulSoup(html, 'lxml')\n", 884 | "quote_divs = soup.find_all('div', {'class': 'quote'})" 885 | ] 886 | }, 887 | { 888 | "cell_type": "markdown", 889 | "metadata": {}, 890 | "source": [ 891 | "**Step 3:** We feed all of the `quote_divs` into our newly made `scrape_page` function to grab all of the data from that page." 
892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": 18, 897 | "metadata": { 898 | "ExecuteTime": { 899 | "end_time": "2021-04-21T21:10:53.606735Z", 900 | "start_time": "2021-04-21T21:10:53.602884Z" 901 | } 902 | }, 903 | "outputs": [], 904 | "source": [ 905 | "data = scrape_page(quote_divs)" 906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 19, 911 | "metadata": { 912 | "ExecuteTime": { 913 | "end_time": "2021-04-21T21:10:53.611866Z", 914 | "start_time": "2021-04-21T21:10:53.608230Z" 915 | } 916 | }, 917 | "outputs": [ 918 | { 919 | "data": { 920 | "text/plain": [ 921 | "[{'author': 'Albert Einstein',\n", 922 | " 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',\n", 923 | " 'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',\n", 924 | " 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},\n", 925 | " {'author': 'J.K. Rowling',\n", 926 | " 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',\n", 927 | " 'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',\n", 928 | " 'tags': ['abilities', 'choices']}]" 929 | ] 930 | }, 931 | "execution_count": 19, 932 | "metadata": {}, 933 | "output_type": "execute_result" 934 | } 935 | ], 936 | "source": [ 937 | "data[:2]" 938 | ] 939 | }, 940 | { 941 | "cell_type": "markdown", 942 | "metadata": {}, 943 | "source": [ 944 | "**Step 4:** We check to see if there is a `Next Page` button at the bottom of the page.\n", 945 | "\n", 946 | "*This requires multiple steps.*\n", 947 | "\n", 948 | "1. We grab the outer container that has a class of `pager`." 
949 | ] 950 | }, 951 | { 952 | "cell_type": "code", 953 | "execution_count": 20, 954 | "metadata": { 955 | "ExecuteTime": { 956 | "end_time": "2021-04-21T21:10:53.616414Z", 957 | "start_time": "2021-04-21T21:10:53.613232Z" 958 | } 959 | }, 960 | "outputs": [], 961 | "source": [ 962 | "pager = soup.find('ul', {'class': 'pager'})" 963 | ] 964 | }, 965 | { 966 | "cell_type": "markdown", 967 | "metadata": {}, 968 | "source": [ 969 | "If there is no pager element on the webpage, `pager` will be set to `None`.\n", 970 | "\n", 971 | "2. We use an if check to make sure a pager exists:" 972 | ] 973 | }, 974 | { 975 | "cell_type": "code", 976 | "execution_count": 21, 977 | "metadata": { 978 | "ExecuteTime": { 979 | "end_time": "2021-04-21T21:10:53.620316Z", 980 | "start_time": "2021-04-21T21:10:53.617767Z" 981 | } 982 | }, 983 | "outputs": [], 984 | "source": [ 985 | "if pager:\n", 986 | "    next_page = pager.find('li', {'class': 'next'})" 987 | ] 988 | }, 989 | { 990 | "cell_type": "markdown", 991 | "metadata": {}, 992 | "source": [ 993 | "3. We then check to see if a `Next Page` button exists on the page. \n", 994 | "\n", 995 | " - Every page has a next button except the *last* page, which only has a `Previous Page` button. Basically, we're checking to see if the `Next Page` button exists. If it does, we \"click\" it." 
996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": 22, 1001 | "metadata": { 1002 | "ExecuteTime": { 1003 | "end_time": "2021-04-21T21:10:53.624558Z", 1004 | "start_time": "2021-04-21T21:10:53.621819Z" 1005 | } 1006 | }, 1007 | "outputs": [], 1008 | "source": [ 1009 | "if next_page:\n", 1010 | "    next_page = next_page.findChild('a')\\\n", 1011 | "        .attrs\\\n", 1012 | "        .get('href')" 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "markdown", 1017 | "metadata": {}, 1018 | "source": [ 1019 | "With most webscraping tools, \"clicking a button\" means collecting the link inside the button and making a new request.\n", 1020 | "\n", 1021 | "If a link points to a page on the same website, it is usually just the path (the part beginning with a forward slash) that needs to be appended to the base website URL. This is called a `relative` link.\n", 1022 | "\n", 1023 | "**Step 5:** Collect the relative link that points to the next page, and add it to our `base_url`." 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": 23, 1029 | "metadata": { 1030 | "ExecuteTime": { 1031 | "end_time": "2021-04-21T21:10:53.630467Z", 1032 | "start_time": "2021-04-21T21:10:53.627259Z" 1033 | } 1034 | }, 1035 | "outputs": [ 1036 | { 1037 | "data": { 1038 | "text/plain": [ 1039 | "'http://quotes.toscrape.com/page/2/'" 1040 | ] 1041 | }, 1042 | "execution_count": 23, 1043 | "metadata": {}, 1044 | "output_type": "execute_result" 1045 | } 1046 | ], 1047 | "source": [ 1048 | "next_page = base_url + next_page\n", 1049 | "next_page" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "markdown", 1054 | "metadata": {}, 1055 | "source": [ 1056 | "**Step 6:** We repeat the exact same process for this new link!\n", 1057 | "\n", 1058 | "i.e.:\n", 1059 | "1. Make a request using a URL that points to the next page.\n", 1060 | "2. Scrape quote divs\n", 1061 | "3. Collect data from every quote div on that page\n", 1062 | "4. Find the `Next page` button.\n", 1063 | "5. 
Collect the URL from the button\n", 1064 | "6. Repeat\n", 1065 | "\n", 1066 | "So how do we do this over and over again without repeating ourselves?\n", 1067 | "\n", 1068 | "The first step is to compile all of these steps into a new function called `scrape_quotes`.\n", 1069 | "\n", 1070 | "The second step is something called `recursion`. \n", 1071 | "\n", 1072 | "

Recursion

\n", 1073 | "\n", 1074 | "![](https://media.giphy.com/media/l1J9R1Q7LJGSZOxFe/giphy.gif)\n", 1075 | "\n", 1076 | "> **Recursion** is a bit of a mind bend, so don't feel bad if it is hard to wrap your brain around. It took me a while to be able to understand recursive functions!\n", 1077 | "\n", 1078 | "Essentially, recursion is when a function calls itself.\n", 1079 | "\n", 1080 | "**In this instance,** our code tells the computer: if there is a `Next page` button, run all of the code again on the page the button points to, then check whether there is a `Next page` button on *that* page. Keep repeating the process until a `Next page` button is not found." 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 24, 1086 | "metadata": { 1087 | "ExecuteTime": { 1088 | "end_time": "2021-04-21T21:10:53.637283Z", 1089 | "start_time": "2021-04-21T21:10:53.632117Z" 1090 | } 1091 | }, 1092 | "outputs": [], 1093 | "source": [ 1094 | "def scrape_quotes(url):\n", 1095 | "    base_url = 'http://quotes.toscrape.com'\n", 1096 | "    response = requests.get(url)\n", 1097 | "    html = response.text\n", 1098 | "    soup = BeautifulSoup(html, 'lxml')\n", 1099 | "    quote_divs = soup.find_all('div', {'class': 'quote'})\n", 1100 | "    data = scrape_page(quote_divs)\n", 1101 | " \n", 1102 | "    pager = soup.find('ul', {'class': 'pager'})\n", 1103 | "    if pager:\n", 1104 | "        next_page = pager.find('li', {'class': 'next'})\n", 1105 | " \n", 1106 | "        if next_page:\n", 1107 | "            next_page = next_page.findChild('a')\\\n", 1108 | "                .attrs\\\n", 1109 | "                .get('href')\n", 1110 | " \n", 1111 | "            next_page = base_url + next_page\n", 1112 | "            print('Scraping', next_page)\n", 1113 | "            ## This is where the recursion happens\n", 1114 | "            data += scrape_quotes(next_page)\n", 1115 | " \n", 1116 | " \n", 1117 | "\n", 1118 | "    return data" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "markdown", 1123 | "metadata": {}, 1124 | "source": [ 1125 | "Now 
we can set a variable called `data` that is the output of our recursive function!\n", 1126 | "\n", 1127 | "> A print statement has been added to output what page is being scraped" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": 25, 1133 | "metadata": { 1134 | "ExecuteTime": { 1135 | "end_time": "2021-04-21T21:10:55.662132Z", 1136 | "start_time": "2021-04-21T21:10:53.638724Z" 1137 | }, 1138 | "scrolled": true 1139 | }, 1140 | "outputs": [ 1141 | { 1142 | "name": "stdout", 1143 | "output_type": "stream", 1144 | "text": [ 1145 | "Scraping http://quotes.toscrape.com/page/2/\n", 1146 | "Scraping http://quotes.toscrape.com/page/3/\n", 1147 | "Scraping http://quotes.toscrape.com/page/4/\n", 1148 | "Scraping http://quotes.toscrape.com/page/5/\n", 1149 | "Scraping http://quotes.toscrape.com/page/6/\n", 1150 | "Scraping http://quotes.toscrape.com/page/7/\n", 1151 | "Scraping http://quotes.toscrape.com/page/8/\n", 1152 | "Scraping http://quotes.toscrape.com/page/9/\n", 1153 | "Scraping http://quotes.toscrape.com/page/10/\n", 1154 | "100 Quotes scraped!\n" 1155 | ] 1156 | } 1157 | ], 1158 | "source": [ 1159 | "data = scrape_quotes(url)\n", 1160 | "print(len(data), 'Quotes scraped!')" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "markdown", 1165 | "metadata": {}, 1166 | "source": [ 1167 | "Now we can visualize and explore our data!" 
1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": 26, 1173 | "metadata": { 1174 | "ExecuteTime": { 1175 | "end_time": "2021-04-21T21:10:55.860731Z", 1176 | "start_time": "2021-04-21T21:10:55.663516Z" 1177 | } 1178 | }, 1179 | "outputs": [ 1180 | { 1181 | "data": { 1182 | "image/png": "[base64-encoded PNG omitted: histogram titled 'Number of Tags']", 1183 | "text/plain": [ 1184 | "
" 1185 | ] 1186 | }, 1187 | "metadata": { 1188 | "needs_background": "light" 1189 | }, 1190 | "output_type": "display_data" 1191 | } 1192 | ], 1193 | "source": [ 1194 | "def count_tags(quote):\n", 1195 | " return len(quote['tags'])\n", 1196 | "\n", 1197 | "def tag_lengths(data):\n", 1198 | " lengths = []\n", 1199 | " for quote in data:\n", 1200 | " lengths.append(count_tags(quote))\n", 1201 | " \n", 1202 | " return lengths\n", 1203 | " \n", 1204 | "lengths = tag_lengths(data)\n", 1205 | "plt.figure(figsize=(10,6)) \n", 1206 | "plt.hist(lengths, bins=9)\n", 1207 | "plt.title('Number of Tags');" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "# Saving our data\n", 1215 | "\n", 1216 | "There are multiple ways to save data to a file. Pandas, `The Excel of Python` allows us to do this easily." 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 27, 1222 | "metadata": { 1223 | "ExecuteTime": { 1224 | "end_time": "2021-04-21T21:10:55.874367Z", 1225 | "start_time": "2021-04-21T21:10:55.861846Z" 1226 | } 1227 | }, 1228 | "outputs": [ 1229 | { 1230 | "data": { 1231 | "text/html": [ 1232 | "
\n", 1233 | "\n", 1246 | "\n", 1247 | " \n", 1248 | " \n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | "
| | author | text | author_link | tags |
|---|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | http://quotes.toscrape.com/author/Albert-Einstein | [change, deep-thoughts, thinking, world] |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | http://quotes.toscrape.com/author/J-K-Rowling | [abilities, choices] |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | http://quotes.toscrape.com/author/Albert-Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | http://quotes.toscrape.com/author/Jane-Austen | [aliteracy, books, classic, humor] |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | http://quotes.toscrape.com/author/Marilyn-Monroe | [be-yourself, inspirational] |
\n", 1294 | "
" 1295 | ], 1296 | "text/plain": [ 1297 | " author text \\\n", 1298 | "0 Albert Einstein “The world as we have created it is a process ... \n", 1299 | "1 J.K. Rowling “It is our choices, Harry, that show what we t... \n", 1300 | "2 Albert Einstein “There are only two ways to live your life. On... \n", 1301 | "3 Jane Austen “The person, be it gentleman or lady, who has ... \n", 1302 | "4 Marilyn Monroe “Imperfection is beauty, madness is genius and... \n", 1303 | "\n", 1304 | " author_link \\\n", 1305 | "0 http://quotes.toscrape.com/author/Albert-Einstein \n", 1306 | "1 http://quotes.toscrape.com/author/J-K-Rowling \n", 1307 | "2 http://quotes.toscrape.com/author/Albert-Einstein \n", 1308 | "3 http://quotes.toscrape.com/author/Jane-Austen \n", 1309 | "4 http://quotes.toscrape.com/author/Marilyn-Monroe \n", 1310 | "\n", 1311 | " tags \n", 1312 | "0 [change, deep-thoughts, thinking, world] \n", 1313 | "1 [abilities, choices] \n", 1314 | "2 [inspirational, life, live, miracle, miracles] \n", 1315 | "3 [aliteracy, books, classic, humor] \n", 1316 | "4 [be-yourself, inspirational] " 1317 | ] 1318 | }, 1319 | "execution_count": 27, 1320 | "metadata": {}, 1321 | "output_type": "execute_result" 1322 | } 1323 | ], 1324 | "source": [ 1325 | "df = pd.DataFrame(data)\n", 1326 | "df.head()" 1327 | ] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "execution_count": 28, 1332 | "metadata": { 1333 | "ExecuteTime": { 1334 | "end_time": "2021-04-21T21:10:55.880960Z", 1335 | "start_time": "2021-04-21T21:10:55.875494Z" 1336 | } 1337 | }, 1338 | "outputs": [], 1339 | "source": [ 1340 | "df.to_csv('quotes_data.csv')" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "metadata": {}, 1346 | "source": [ 1347 | "# Web scraping is a powerful tool. 
\n", 1348 | "\n", 1349 | "It can be used:\n", 1350 | "- To discover the most in-demand skills across thousands of online job postings.\n", 1351 | "- To learn the average price of a product from thousands of online sales.\n", 1352 | "- To research social media networks.\n", 1353 | "\n", 1354 | "**And so much more!** As our world continues to develop online markets and communities, the uses for web scraping continue to grow. In the end, the power of web scraping comes from the ability to create datasets that otherwise do not exist, or at the very least, are not readily available to the public.\n" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "code", 1359 | "execution_count": null, 1360 | "metadata": {}, 1361 | "outputs": [], 1362 | "source": [] 1363 | } 1364 | ], 1365 | "metadata": { 1366 | "hide_input": false, 1367 | "kernelspec": { 1368 | "display_name": "Python 3", 1369 | "language": "python", 1370 | "name": "python3" 1371 | }, 1372 | "language_info": { 1373 | "codemirror_mode": { 1374 | "name": "ipython", 1375 | "version": 3 1376 | }, 1377 | "file_extension": ".py", 1378 | "mimetype": "text/x-python", 1379 | "name": "python", 1380 | "nbconvert_exporter": "python", 1381 | "pygments_lexer": "ipython3", 1382 | "version": "3.7.4" 1383 | }, 1384 | "toc": { 1385 | "base_numbering": 1, 1386 | "nav_menu": {}, 1387 | "number_sections": false, 1388 | "sideBar": true, 1389 | "skip_h1_title": false, 1390 | "title_cell": "Table of Contents", 1391 | "title_sidebar": "Contents", 1392 | "toc_cell": false, 1393 | "toc_position": { 1394 | "height": "calc(100% - 180px)", 1395 | "left": "10px", 1396 | "top": "150px", 1397 | "width": "303.797px" 1398 | }, 1399 | "toc_section_display": true, 1400 | "toc_window_display": false 1401 | }, 1402 | "varInspector": { 1403 | "cols": { 1404 | "lenName": 16, 1405 | "lenType": 16, 1406 | "lenVar": 40 1407 | }, 1408 | "kernels_config": { 1409 | "python": { 1410 | "delete_cmd_postfix": "", 1411 | "delete_cmd_prefix": "del ", 1412 | "library": 
"var_list.py", 1413 | "varRefreshCmd": "print(var_dic_list())" 1414 | }, 1415 | "r": { 1416 | "delete_cmd_postfix": ") ", 1417 | "delete_cmd_prefix": "rm(", 1418 | "library": "var_list.r", 1419 | "varRefreshCmd": "cat(var_dic_list()) " 1420 | } 1421 | }, 1422 | "types_to_exclude": [ 1423 | "module", 1424 | "function", 1425 | "builtin_function_or_method", 1426 | "instance", 1427 | "_Feature" 1428 | ], 1429 | "window_display": false 1430 | } 1431 | }, 1432 | "nbformat": 4, 1433 | "nbformat_minor": 2 1434 | } 1435 | -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | # Welcome to Flatiron's Intro to Web Scraping! 2 | 3 | ![](https://media.giphy.com/media/1dcWGdOBg0wcU/giphy.gif) 4 | 5 | 6 | ## A few things: 7 | - This file is called a **[Jupyter Notebook](https://jupyter.org/about)**. An industry standard tool for Data Science. 8 | 9 | - Each section of a jupyter notebook is called a `cell`. 10 | - To run the code inside a cell, click into the cell and press `shift` + `enter`. 11 | 12 | 13 | ### Workshop Goals: 14 | 1. Retrieve the HTML of a webpage with the `requests` library. 15 | 2. Introduction to the `tree` structure of `HTML`. 16 | 3. Use the `inspect` tool to sift through the HTML. 17 | 4. Parse HTML with the `BeautifulSoup` library. 18 | 5. Store data in a `csv` file using the `Pandas` library. 19 | 20 | # Let's scrape some data! 21 | The data we are scraping today will be from the [Quotes to Scrape](http://quotes.toscrape.com/) website. 
## Step 1:
> **Import the necessary tools for our project**

![](https://media.giphy.com/media/KcE7Dq5f8TTXzZ1LAF/giphy.gif)

```python
# Webscraping
import requests
from bs4 import BeautifulSoup

# Data organization
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 22})
```

## Step 2
> **We use the `requests` library to connect to the website we wish to scrape.**

```python
url = 'http://quotes.toscrape.com'
response = requests.get(url)
response
```

    <Response [200]>

✅**A `Response 200` means our request was successful!** ✅

❌Let's take a quick look at an *unsuccessful* response. ❌

```python
bad_url = 'http://quotes.toscrape.com/20'
requests.get(bad_url)
```

    <Response [404]>

**A `Response 404` means that the url you are using in your request is not pointing to a valid webpage.**

## Step 3
> **We collect the html from the website by adding `.text` to the end of the response variable.**

```python
html = response.text
html[:50]
```

    '\n\n\n\t

> **We use `BeautifulSoup` to turn the html into something we can manipulate.**

```python
soup = BeautifulSoup(html, 'lxml')
soup
```

    Quotes to Scrape
    [Remainder of the parsed page; HTML tags omitted, text content only]

    Quotes to Scrape
    Login

    “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    by (about)
    Tags: change deep-thoughts thinking world

    “It is our choices, Harry, that show what we truly are, far more than our abilities.”
    by (about)
    Tags: abilities choices

    “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
    by (about)
    Tags: inspirational life live miracle miracles

    “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
    by (about)
    Tags: aliteracy books classic humor

    “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
    by (about)
    Tags: be-yourself inspirational

    “Try not to become a man of success. Rather become a man of value.”
    by (about)
    Tags: adulthood success value

    “It is better to be hated for what you are than to be loved for what you are not.”
    by (about)
    Tags: life love

    “I have not failed. I've just found 10,000 ways that won't work.”
    by (about)
    Tags: edison failure inspirational paraphrased

    “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
    by (about)
    Tags: misattributed-eleanor-roosevelt

    “A day without sunshine is like, you know, night.”
    by (about)
    Tags: humor obvious simile

    Top Ten tags
    love inspirational life humor books reading friendship friends truth simile


Very Soupy


The name ***Beautiful Soup*** is an appropriate description. HTML does not make for a lovely reading experience. If you feel like you're staring at complete gibberish, you're not entirely wrong! HTML is a language designed for computers, not for human eyes.

**Fortunately for us,** we do not have to read through every line of the html in order to web scrape.

Modern day web browsers come equipped with tools that allow us to easily sift through this soupy text.

## Step 4
>**We open up the page we're trying to scrape in a new tab.** Click Here! 👀

## Step 5
> **We create a list of every `div` that has a `class` of "quote"**

**In this instance,** every item we want to collect is divided into identically labeled containers.
- A div with a class of 'quote'.

Not all HTML is as well organized as this page, but HTML is basically just a bunch of different organizational containers that we use to divide up text and other forms of media. Figuring out the organizational structure of a website is the entire process for web scraping.

```python
quote_divs = soup.find_all('div', {'class': 'quote'})
len(quote_divs)
```

    10

## Step 6
> **To figure out how to grab all the datapoints from a quote, we isolate a single quote, and work through the code for a single `div`.**

```python
first_quote = quote_divs[0]
```

### First we grab the text of the quote

```python
text = first_quote.find('span', {'class':'text'})
text = text.text
text
```

    '“The world as we have created it is a process of our thinking.
    It cannot be changed without changing our thinking.”'

### Next we grab the author

```python
author = first_quote.find('small', {'class': 'author'})
author_name = author.text
author_name
```

    'Albert Einstein'

### Let's also grab the link that points to the author's bio!

```python
author_link = author.findNextSibling().attrs.get('href')
author_link = url + author_link
author_link
```

    'http://quotes.toscrape.com/author/Albert-Einstein'

### And finally, let's grab all of the tags for the quote

```python
tag_container = first_quote.find('div', {'class': 'tags'})
tag_links = tag_container.find_all('a')

tags = []
for tag in tag_links:
    tags.append(tag.text)

tags
```

    ['change', 'deep-thoughts', 'thinking', 'world']

### Our data:

```python
print('text:', text)
print('author name: ', author_name)
print('author link: ', author_link)
print('tags: ', tags)
```

    text: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    author name:  Albert Einstein
    author link:  http://quotes.toscrape.com/author/Albert-Einstein
    tags:  ['change', 'deep-thoughts', 'thinking', 'world']

## Step 7
> We create a function to make our code reusable.

Now that we know how to collect this data from a quote div, we can compile the code into a [function](https://www.geeksforgeeks.org/functions-in-python/) called `quote_data`. This allows us to grab a quote div, feed it into the function like so...
> `quote_data(quote_div)`

...and receive all of the data from that div.

```python
def quote_data(quote_div):
    # Collect the quote
    text = quote_div.find('span', {'class':'text'})
    text = text.text

    # Collect the author name
    author = quote_div.find('small', {'class': 'author'})
    author_name = author.text

    # Collect author link
    author_link = author.findNextSibling().attrs.get('href')
    author_link = url + author_link

    # Collect tags
    tag_container = quote_div.find('div', {'class': 'tags'})

    tag_links = tag_container.find_all('a')

    tags = []
    for tag in tag_links:
        tags.append(tag.text)

    # Return data as a dictionary
    return {'author': author_name,
            'text': text,
            'author_link': author_link,
            'tags': tags}
```

### Let's test our function.

```python
quote_data(quote_divs[7])
```

    {'author': 'Thomas A. Edison',
     'text': "“I have not failed. I've just found 10,000 ways that won't work.”",
     'author_link': 'http://quotes.toscrape.com/author/Thomas-A-Edison',
     'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}

Now we can collect the data from every quote on the first page with a simple [```for loop```](https://www.w3schools.com/python/python_for_loops.asp)!

```python
page_one_data = []
for div in quote_divs:
    # Apply our function on each quote
    data_from_div = quote_data(div)
    page_one_data.append(data_from_div)

print(len(page_one_data), 'quotes scraped!')
page_one_data[:2]
```

    10 quotes scraped!

    [{'author': 'Albert Einstein',
      'text': '“The world as we have created it is a process of our thinking.
    It cannot be changed without changing our thinking.”',
      'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',
      'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
     {'author': 'J.K. Rowling',
      'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
      'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',
      'tags': ['abilities', 'choices']}]

# We just scraped an entire webpage!

![](https://media.giphy.com/media/KiXl0vfc9XIIM/giphy.gif)

### Level Up: What if we wanted to scrape the quotes from *every* page?

**Step 1:** The first thing we do is take the code from above that scraped the data for all of the quotes on page one, and move it into a function called `scrape_page`.

```python
def scrape_page(quote_divs):
    data = []
    for div in quote_divs:
        div_data = quote_data(div)
        data.append(div_data)

    return data
```

**Step 2:** We grab the code we used at the very beginning to collect the html and the list of divs from a web page.

```python
base_url = 'http://quotes.toscrape.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
quote_divs = soup.find_all('div', {'class': 'quote'})
```

**Step 3:** We feed all of the `quote_divs` into our newly made `scrape_page` function to grab all of the data from that page.

```python
data = scrape_page(quote_divs)
```

```python
data[:2]
```

    [{'author': 'Albert Einstein',
      'text': '“The world as we have created it is a process of our thinking.
    It cannot be changed without changing our thinking.”',
      'author_link': 'http://quotes.toscrape.com/author/Albert-Einstein',
      'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
     {'author': 'J.K. Rowling',
      'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
      'author_link': 'http://quotes.toscrape.com/author/J-K-Rowling',
      'tags': ['abilities', 'choices']}]

**Step 4:** We check to see if there is a `Next Page` button at the bottom of the page.

*This requires multiple steps.*

1. We grab the outer container that has a class of `pager`.

```python
pager = soup.find('ul', {'class': 'pager'})
```

If there is no pager element on the webpage, pager will be set to `None`.

2. We use an if check to make sure a pager exists:

```python
if pager:
    next_page = pager.find('li', {'class': 'next'})
```

3. We then check to see if a `Next Page` button exists on the page.

    - Every page has a next button except the *last* page, which only has a `Previous Page` button. Basically, we're checking to see if the `Next Page` button exists. If it does, we "click" it.

```python
if next_page:
    next_page = next_page.findChild('a')\
                         .attrs\
                         .get('href')
```

With most webscraping tools, "clicking a button" means collecting the link inside the button and making a new request.

If a link points to a page on the same website, the link is usually just the path (the part that starts with a forward slash) that needs to be added to the base website url. This is called a `relative` link.
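
The relative-link idea described above can also be handled with the standard library rather than string concatenation. This is a minimal sketch, not part of the original workshop: it assumes the `href` collected from the button is `/page/2/`, and the variable name `relative_link` is only illustrative.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'
relative_link = '/page/2/'  # assumed value of the href collected from the button

# urljoin resolves the relative link against the base url
next_page = urljoin(base_url, relative_link)
print(next_page)

# A trailing slash on the base url does not produce a doubled slash
same_page = urljoin(base_url + '/', relative_link)
```

Plain concatenation (`base_url + relative_link`) works for this site as well; `urljoin` is mainly useful when the base url may or may not end with a slash.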

**Step 5:** Collect the relative link that points to the next page, and add it to our base_url

```python
next_page = base_url + next_page
next_page
```

    'http://quotes.toscrape.com/page/2/'

**Step 6:** We repeat the exact same process for this new link!

i.e.:
1. Make a request using a url that points to the next page.
2. Scrape quote divs
3. Collect data from every quote div on that page
4. Find the `Next page` button.
5. Collect the url from the button
6. Repeat

So how do we do this over and over again without repeating ourselves?

The first step is to compile all of these steps into a new function called `scrape_quotes`.

The second step is something called `recursion`.


Recursion


![](https://media.giphy.com/media/l1J9R1Q7LJGSZOxFe/giphy.gif)

> **Recursion** is a bit of a mind bend, so don't feel bad if it is hard to wrap your brain around. It took me a while to be able to understand recursive functions!

Essentially, recursion is where we use a function inside of itself.

**In this instance,** our code will be telling the computer, if there is a `Next page` button, rerun all of the code again on the page the next button points us to and check to see if there is a `Next page` button on *that* page. If there is, keep repeating the process until a `Next page` button is not found.

```python
def scrape_quotes(url):
    base_url = 'http://quotes.toscrape.com'
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    quote_divs = soup.find_all('div', {'class': 'quote'})
    data = scrape_page(quote_divs)

    pager = soup.find('ul', {'class': 'pager'})
    if pager:
        next_page = pager.find('li', {'class': 'next'})

        if next_page:
            next_page = next_page.findChild('a')\
                                 .attrs\
                                 .get('href')

            next_page = base_url + next_page
            print('Scraping', next_page)
            ## This is where the recursion happens
            data += scrape_quotes(next_page)

    return data
```

Now we can set a variable called `data` that is the output of our recursive function!
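
The recursive pattern in `scrape_quotes` can be tried offline before pointing it at a live site. The sketch below is an illustration only: `FAKE_PAGES`, `collect_quotes`, and `collect_quotes_loop` are hypothetical names, and a small dictionary stands in for the website so no requests are made.

```python
# Hypothetical stand-in for a website: each "page" holds some quotes
# and the relative link of the next page (None on the last page).
FAKE_PAGES = {
    '/page/1/': {'quotes': ['quote A', 'quote B'], 'next': '/page/2/'},
    '/page/2/': {'quotes': ['quote C'], 'next': '/page/3/'},
    '/page/3/': {'quotes': ['quote D', 'quote E'], 'next': None},
}

def collect_quotes(path):
    """Return the quotes on `path` plus those on every page after it."""
    page = FAKE_PAGES[path]
    data = list(page['quotes'])   # copy so FAKE_PAGES is never mutated
    if page['next']:              # a "Next" link exists, so recurse
        data += collect_quotes(page['next'])
    return data                   # last page: nothing more to add

def collect_quotes_loop(path):
    """Same result with a while loop instead of recursion."""
    data = []
    while path:
        page = FAKE_PAGES[path]
        data += page['quotes']
        path = page['next']
    return data

all_quotes = collect_quotes('/page/1/')
print(len(all_quotes), 'quotes collected')
```

Both functions return the same list. The loop version is worth knowing because Python limits how deeply a function can call itself (about 1000 calls by default), which could matter on a site with a very large number of pages.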

> A print statement has been added to output which page is being scraped.


```python
data = scrape_quotes(url)
print(len(data), 'Quotes scraped!')
```

    Scraping http://quotes.toscrape.com/page/2/
    Scraping http://quotes.toscrape.com/page/3/
    Scraping http://quotes.toscrape.com/page/4/
    Scraping http://quotes.toscrape.com/page/5/
    Scraping http://quotes.toscrape.com/page/6/
    Scraping http://quotes.toscrape.com/page/7/
    Scraping http://quotes.toscrape.com/page/8/
    Scraping http://quotes.toscrape.com/page/9/
    Scraping http://quotes.toscrape.com/page/10/
    100 Quotes scraped!


Now we can visualize and explore our data!


```python
def count_tags(quote):
    # Number of tags attached to a single quote
    return len(quote['tags'])

def tag_lengths(data):
    # Tag count for every quote in the dataset
    lengths = []
    for quote in data:
        lengths.append(count_tags(quote))

    return lengths

lengths = tag_lengths(data)
plt.figure(figsize=(10,6))
plt.hist(lengths, bins=9)
plt.title('Number of Tags');
```


![png](index_files/index_54_0.png)


# Saving our data

There are multiple ways to save data to a file. Pandas, sometimes called the "Excel of Python", lets us do this easily.


```python
df = pd.DataFrame(data)
df.head()
```

| | author | text | author_link | tags |
|---|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | http://quotes.toscrape.com/author/Albert-Einstein | [change, deep-thoughts, thinking, world] |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | http://quotes.toscrape.com/author/J-K-Rowling | [abilities, choices] |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | http://quotes.toscrape.com/author/Albert-Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | http://quotes.toscrape.com/author/Jane-Austen | [aliteracy, books, classic, humor] |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | http://quotes.toscrape.com/author/Marilyn-Monroe | [be-yourself, inspirational] |

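One thing to keep in mind before writing this table to a file: CSV stores plain text, so a list-valued column such as `tags` is written out as a string like `"['abilities', 'choices']"`. The sketch below is one possible way to read such rows back, assuming the column layout shown above and using only the standard library. `csv_text` is a hypothetical two-row sample with shortened quote text, and `ast.literal_eval` is one way (not the only one) to turn the text back into Python lists.

```python
import ast
import csv
import io

# Hypothetical sample in the same shape as the saved file
# (quote text shortened for brevity).
csv_text = ''',author,text,author_link,tags
0,Albert Einstein,“The world as we have created it...”,http://quotes.toscrape.com/author/Albert-Einstein,"['change', 'deep-thoughts', 'thinking', 'world']"
1,J.K. Rowling,"“It is our choices, Harry...”",http://quotes.toscrape.com/author/J-K-Rowling,"['abilities', 'choices']"
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Straight out of the file, the tags are a single string, not a list
raw_tags = rows[0]['tags']
print(type(raw_tags).__name__)

# Safely evaluate the text back into a real Python list
for row in rows:
    row['tags'] = ast.literal_eval(row['tags'])

print(rows[1]['tags'])
```

`ast.literal_eval` only accepts Python literals (lists, strings, numbers, and so on), which makes it a safer choice than `eval` for text that came from a file.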

```python
df.to_csv('quotes_data.csv')
```

# Web scraping is a powerful tool.

It can be used:
- To discover the most in-demand skills across thousands of online job postings.
- To learn the average price of a product from thousands of online sales.
- To research social media networks.

**And so much more!** As our world continues to develop online markets and communities, the uses for web scraping continue to grow. In the end, the power of web scraping comes from the ability to create datasets that otherwise do not exist, or at the very least, are not readily available to the public.

-------------------------------------------------------------------------------- /index_files/index_54_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/index_files/index_54_0.png -------------------------------------------------------------------------------- /quotes_data.csv: -------------------------------------------------------------------------------- 1 | ,author,text,author_link,tags 2 | 0,Albert Einstein,“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,http://quotes.toscrape.com/author/Albert-Einstein,"['change', 'deep-thoughts', 'thinking', 'world']" 3 | 1,J.K. Rowling,"“It is our choices, Harry, that show what we truly are, far more than our abilities.”",http://quotes.toscrape.com/author/J-K-Rowling,"['abilities', 'choices']" 4 | 2,Albert Einstein,“There are only two ways to live your life. One is as though nothing is a miracle. 
The other is as though everything is a miracle.”,http://quotes.toscrape.com/author/Albert-Einstein,"['inspirational', 'life', 'live', 'miracle', 'miracles']" 5 | 3,Jane Austen,"“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",http://quotes.toscrape.com/author/Jane-Austen,"['aliteracy', 'books', 'classic', 'humor']" 6 | 4,Marilyn Monroe,"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",http://quotes.toscrape.com/author/Marilyn-Monroe,"['be-yourself', 'inspirational']" 7 | 5,Albert Einstein,“Try not to become a man of success. Rather become a man of value.”,http://quotes.toscrape.com/author/Albert-Einstein,"['adulthood', 'success', 'value']" 8 | 6,André Gide,“It is better to be hated for what you are than to be loved for what you are not.”,http://quotes.toscrape.com/author/Andre-Gide,"['life', 'love']" 9 | 7,Thomas A. Edison,"“I have not failed. I've just found 10,000 ways that won't work.”",http://quotes.toscrape.com/author/Thomas-A-Edison,"['edison', 'failure', 'inspirational', 'paraphrased']" 10 | 8,Eleanor Roosevelt,“A woman is like a tea bag; you never know how strong it is until it's in hot water.”,http://quotes.toscrape.com/author/Eleanor-Roosevelt,['misattributed-eleanor-roosevelt'] 11 | 9,Steve Martin,"“A day without sunshine is like, you know, night.”",http://quotes.toscrape.com/author/Steve-Martin,"['humor', 'obvious', 'simile']" 12 | 10,Marilyn Monroe,"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. 
And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”",http://quotes.toscrape.com/author/Marilyn-Monroe,"['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']" 13 | 11,J.K. Rowling,"“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”",http://quotes.toscrape.com/author/J-K-Rowling,"['courage', 'friends']" 14 | 12,Albert Einstein,"“If you can't explain it to a six year old, you don't understand it yourself.”",http://quotes.toscrape.com/author/Albert-Einstein,"['simplicity', 'understand']" 15 | 13,Bob Marley,"“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”",http://quotes.toscrape.com/author/Bob-Marley,['love'] 16 | 14,Dr. Seuss,"“I like nonsense, it wakes up the brain cells. 
Fantasy is a necessary ingredient in living.”",http://quotes.toscrape.com/author/Dr-Seuss,['fantasy'] 17 | 15,Douglas Adams,"“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”",http://quotes.toscrape.com/author/Douglas-Adams,"['life', 'navigation']" 18 | 16,Elie Wiesel,"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",http://quotes.toscrape.com/author/Elie-Wiesel,"['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']" 19 | 17,Friedrich Nietzsche,"“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”",http://quotes.toscrape.com/author/Friedrich-Nietzsche,"['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']" 20 | 18,Mark Twain,"“Good friends, good books, and a sleepy conscience: this is the ideal life.”",http://quotes.toscrape.com/author/Mark-Twain,"['books', 'contentment', 'friends', 'friendship', 'life']" 21 | 19,Allen Saunders,“Life is what happens to us while we are making other plans.”,http://quotes.toscrape.com/author/Allen-Saunders,"['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']" 22 | 20,Pablo Neruda,"“I love you without knowing how, or when, or from where. 
I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”",http://quotes.toscrape.com/author/Pablo-Neruda,"['love', 'poetry']" 23 | 21,Ralph Waldo Emerson,“For every minute you are angry you lose sixty seconds of happiness.”,http://quotes.toscrape.com/author/Ralph-Waldo-Emerson,['happiness'] 24 | 22,Mother Teresa,"“If you judge people, you have no time to love them.”",http://quotes.toscrape.com/author/Mother-Teresa,['attributed-no-source'] 25 | 23,Garrison Keillor,“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”,http://quotes.toscrape.com/author/Garrison-Keillor,"['humor', 'religion']" 26 | 24,Jim Henson,“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”,http://quotes.toscrape.com/author/Jim-Henson,['humor'] 27 | 25,Dr. Seuss,"“Today you are You, that is truer than true. There is no one alive who is Youer than You.”",http://quotes.toscrape.com/author/Dr-Seuss,"['comedy', 'life', 'yourself']" 28 | 26,Albert Einstein,"“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”",http://quotes.toscrape.com/author/Albert-Einstein,"['children', 'fairy-tales']" 29 | 27,J.K. 
Rowling,"“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”",http://quotes.toscrape.com/author/J-K-Rowling,[] 30 | 28,Albert Einstein,“Logic will get you from A to Z; imagination will get you everywhere.”,http://quotes.toscrape.com/author/Albert-Einstein,['imagination'] 31 | 29,Bob Marley,"“One good thing about music, when it hits you, you feel no pain.”",http://quotes.toscrape.com/author/Bob-Marley,['music'] 32 | 30,Dr. Seuss,"“The more that you read, the more things you will know. The more that you learn, the more places you'll go.”",http://quotes.toscrape.com/author/Dr-Seuss,"['learning', 'reading', 'seuss']" 33 | 31,J.K. Rowling,"“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”",http://quotes.toscrape.com/author/J-K-Rowling,['dumbledore'] 34 | 32,Bob Marley,"“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”",http://quotes.toscrape.com/author/Bob-Marley,['friendship'] 35 | 33,Mother Teresa,“Not all of us can do great things. But we can do small things with great love.”,http://quotes.toscrape.com/author/Mother-Teresa,"['misattributed-to-mother-teresa', 'paraphrased']" 36 | 34,J.K. Rowling,"“To the well-organized mind, death is but the next great adventure.”",http://quotes.toscrape.com/author/J-K-Rowling,"['death', 'inspirational']" 37 | 35,Charles M. Schulz,“All you need is love. But a little chocolate now and then doesn't hurt.”,http://quotes.toscrape.com/author/Charles-M-Schulz,"['chocolate', 'food', 'humor']" 38 | 36,William Nicholson,“We read to know we're not alone.”,http://quotes.toscrape.com/author/William-Nicholson,"['misattributed-to-c-s-lewis', 'reading']" 39 | 37,Albert Einstein,“Any fool can know. 
The point is to understand.”,http://quotes.toscrape.com/author/Albert-Einstein,"['knowledge', 'learning', 'understanding', 'wisdom']" 40 | 38,Jorge Luis Borges,“I have always imagined that Paradise will be a kind of library.”,http://quotes.toscrape.com/author/Jorge-Luis-Borges,"['books', 'library']" 41 | 39,George Eliot,“It is never too late to be what you might have been.”,http://quotes.toscrape.com/author/George-Eliot,['inspirational'] 42 | 40,George R.R. Martin,"“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”",http://quotes.toscrape.com/author/George-R-R-Martin,"['read', 'readers', 'reading', 'reading-books']" 43 | 41,C.S. Lewis,“You can never get a cup of tea large enough or a book long enough to suit me.”,http://quotes.toscrape.com/author/C-S-Lewis,"['books', 'inspirational', 'reading', 'tea']" 44 | 42,Marilyn Monroe,“You believe lies so you eventually learn to trust no one but yourself.”,http://quotes.toscrape.com/author/Marilyn-Monroe,[] 45 | 43,Marilyn Monroe,"“If you can make a woman laugh, you can make her do anything.”",http://quotes.toscrape.com/author/Marilyn-Monroe,"['girls', 'love']" 46 | 44,Albert Einstein,"“Life is like riding a bicycle. To keep your balance, you must keep moving.”",http://quotes.toscrape.com/author/Albert-Einstein,"['life', 'simile']" 47 | 45,Marilyn Monroe,“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”,http://quotes.toscrape.com/author/Marilyn-Monroe,['love'] 48 | 46,Marilyn Monroe,"“A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”",http://quotes.toscrape.com/author/Marilyn-Monroe,['attributed-no-source'] 49 | 47,Martin Luther King Jr.,“Only in the darkness can you see the stars.”,http://quotes.toscrape.com/author/Martin-Luther-King-Jr,"['hope', 'inspirational']" 50 | 48,J.K. 
Rowling,"“It matters not what someone is born, but what they grow to be.”",http://quotes.toscrape.com/author/J-K-Rowling,['dumbledore'] 51 | 49,James Baldwin,"“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”",http://quotes.toscrape.com/author/James-Baldwin,['love'] 52 | 50,Jane Austen,"“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”",http://quotes.toscrape.com/author/Jane-Austen,"['friendship', 'love']" 53 | 51,Eleanor Roosevelt,“Do one thing every day that scares you.”,http://quotes.toscrape.com/author/Eleanor-Roosevelt,"['attributed', 'fear', 'inspiration']" 54 | 52,Marilyn Monroe,"“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”",http://quotes.toscrape.com/author/Marilyn-Monroe,['attributed-no-source'] 55 | 53,Albert Einstein,"“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. 
I see my life in terms of music.”",http://quotes.toscrape.com/author/Albert-Einstein,['music'] 56 | 54,Haruki Murakami,"“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”",http://quotes.toscrape.com/author/Haruki-Murakami,"['books', 'thought']" 57 | 55,Alexandre Dumas fils,“The difference between genius and stupidity is: genius has its limits.”,http://quotes.toscrape.com/author/Alexandre-Dumas-fils,['misattributed-to-einstein'] 58 | 56,Stephenie Meyer,"“He's like a drug for you, Bella.”",http://quotes.toscrape.com/author/Stephenie-Meyer,"['drug', 'romance', 'simile']" 59 | 57,Ernest Hemingway,“There is no friend as loyal as a book.”,http://quotes.toscrape.com/author/Ernest-Hemingway,"['books', 'friends', 'novelist-quotes']" 60 | 58,Helen Keller,"“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”",http://quotes.toscrape.com/author/Helen-Keller,['inspirational'] 61 | 59,George Bernard Shaw,“Life isn't about finding yourself. Life is about creating yourself.”,http://quotes.toscrape.com/author/George-Bernard-Shaw,"['inspirational', 'life', 'yourself']" 62 | 60,Charles Bukowski,"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”",http://quotes.toscrape.com/author/Charles-Bukowski,['alcohol'] 63 | 61,Suzanne Collins,“You don’t forget the face of the person who was your last hope.”,http://quotes.toscrape.com/author/Suzanne-Collins,['the-hunger-games'] 64 | 62,Suzanne Collins,"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”",http://quotes.toscrape.com/author/Suzanne-Collins,['humor'] 65 | 63,C.S. Lewis,"“To love at all is to be vulnerable. 
Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”",http://quotes.toscrape.com/author/C-S-Lewis,['love'] 66 | 64,J.R.R. Tolkien,“Not all those who wander are lost.”,http://quotes.toscrape.com/author/J-R-R-Tolkien,"['bilbo', 'journey', 'lost', 'quest', 'travel', 'wander']" 67 | 65,J.K. Rowling,"“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”",http://quotes.toscrape.com/author/J-K-Rowling,['live-death-love'] 68 | 66,Ernest Hemingway,“There is nothing to writing. All you do is sit down at a typewriter and bleed.”,http://quotes.toscrape.com/author/Ernest-Hemingway,"['good', 'writing']" 69 | 67,Ralph Waldo Emerson,“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”,http://quotes.toscrape.com/author/Ralph-Waldo-Emerson,"['life', 'regrets']" 70 | 68,Mark Twain,“I have never let my schooling interfere with my education.”,http://quotes.toscrape.com/author/Mark-Twain,['education'] 71 | 69,Dr. Seuss,“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. 
Now my troubles are going to have troubles with me!”,http://quotes.toscrape.com/author/Dr-Seuss,['troubles'] 72 | 70,Alfred Tennyson,“If I had a flower for every time I thought of you...I could walk through my garden forever.”,http://quotes.toscrape.com/author/Alfred-Tennyson,"['friendship', 'love']" 73 | 71,Charles Bukowski,“Some people never go crazy. What truly horrible lives they must lead.”,http://quotes.toscrape.com/author/Charles-Bukowski,['humor'] 74 | 72,Terry Pratchett,"“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”",http://quotes.toscrape.com/author/Terry-Pratchett,"['humor', 'open-mind', 'thinking']" 75 | 73,Dr. Seuss,"“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”",http://quotes.toscrape.com/author/Dr-Seuss,"['humor', 'philosophy']" 76 | 74,J.D. Salinger,"“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”",http://quotes.toscrape.com/author/J-D-Salinger,"['authors', 'books', 'literature', 'reading', 'writing']" 77 | 75,George Carlin,“The reason I talk to myself is because I’m the only one whose answers I accept.”,http://quotes.toscrape.com/author/George-Carlin,"['humor', 'insanity', 'lies', 'lying', 'self-indulgence', 'truth']" 78 | 76,John Lennon,"“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”",http://quotes.toscrape.com/author/John-Lennon,"['beatles', 'connection', 'dreamers', 'dreaming', 'dreams', 'hope', 'inspirational', 'peace']" 79 | 77,W.C. Fields,“I am free of all prejudice. I hate everyone equally. 
”,http://quotes.toscrape.com/author/W-C-Fields,"['humor', 'sinister']" 80 | 78,Ayn Rand,“The question isn't who is going to let me; it's who is going to stop me.”,http://quotes.toscrape.com/author/Ayn-Rand,[] 81 | 79,Mark Twain,“′Classic′ - a book which people praise and don't read.”,http://quotes.toscrape.com/author/Mark-Twain,"['books', 'classic', 'reading']" 82 | 80,Albert Einstein,“Anyone who has never made a mistake has never tried anything new.”,http://quotes.toscrape.com/author/Albert-Einstein,['mistakes'] 83 | 81,Jane Austen,"“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”",http://quotes.toscrape.com/author/Jane-Austen,"['humor', 'love', 'romantic', 'women']" 84 | 82,J.K. Rowling,"“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”",http://quotes.toscrape.com/author/J-K-Rowling,['integrity'] 85 | 83,Jane Austen,"“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”",http://quotes.toscrape.com/author/Jane-Austen,"['books', 'library', 'reading']" 86 | 84,Jane Austen,"“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”",http://quotes.toscrape.com/author/Jane-Austen,"['elizabeth-bennet', 'jane-austen']" 87 | 85,C.S. Lewis,“Some day you will be old enough to start reading fairy tales again.”,http://quotes.toscrape.com/author/C-S-Lewis,"['age', 'fairytales', 'growing-up']" 88 | 86,C.S. 
Lewis,“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”,http://quotes.toscrape.com/author/C-S-Lewis,['god'] 89 | 87,Mark Twain,“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”,http://quotes.toscrape.com/author/Mark-Twain,"['death', 'life']" 90 | 88,Mark Twain,“A lie can travel half way around the world while the truth is putting on its shoes.”,http://quotes.toscrape.com/author/Mark-Twain,"['misattributed-mark-twain', 'truth']" 91 | 89,C.S. Lewis,"“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”",http://quotes.toscrape.com/author/C-S-Lewis,"['christianity', 'faith', 'religion', 'sun']" 92 | 90,J.K. Rowling,"“The truth."" Dumbledore sighed. ""It is a beautiful and terrible thing, and should therefore be treated with great caution.”",http://quotes.toscrape.com/author/J-K-Rowling,['truth'] 93 | 91,Jimi Hendrix,"“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”",http://quotes.toscrape.com/author/Jimi-Hendrix,"['death', 'life']" 94 | 92,J.M. Barrie,“To die will be an awfully big adventure.”,http://quotes.toscrape.com/author/J-M-Barrie,"['adventure', 'love']" 95 | 93,E.E. Cummings,“It takes courage to grow up and become who you really are.”,http://quotes.toscrape.com/author/E-E-Cummings,['courage'] 96 | 94,Khaled Hosseini,“But better to get hurt by the truth than comforted with a lie.”,http://quotes.toscrape.com/author/Khaled-Hosseini,['life'] 97 | 95,Harper Lee,“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”,http://quotes.toscrape.com/author/Harper-Lee,['better-life-empathy'] 98 | 96,Madeleine L'Engle,"“You have to write the book that wants to be written. 
And if the book will be too difficult for grown-ups, then you write it for children.”",http://quotes.toscrape.com/author/Madeleine-LEngle,"['books', 'children', 'difficult', 'grown-ups', 'write', 'writers', 'writing']" 99 | 97,Mark Twain,“Never tell the truth to people who are not worthy of it.”,http://quotes.toscrape.com/author/Mark-Twain,['truth'] 100 | 98,Dr. Seuss,"“A person's a person, no matter how small.”",http://quotes.toscrape.com/author/Dr-Seuss,['inspirational'] 101 | 99,George R.R. Martin,"“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",http://quotes.toscrape.com/author/George-R-R-Martin,"['books', 'mind']" 102 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | beautifulsoup4==4.9.1 2 | brotlipy==0.7.0 3 | certifi==2020.6.20 4 | cffi==1.14.0 5 | chardet==3.0.4 6 | cryptography==2.9.2 7 | cycler==0.10.0 8 | decorator==4.4.2 9 | escapism==1.0.1 10 | idna==2.10 11 | ipython-genutils==0.2.0 12 | Jinja2==2.11.2 13 | kiwisolver==1.2.0 14 | lxml==4.5.1 15 | MarkupSafe==1.1.1 16 | matplotlib==3.2.2 17 | numpy==1.18.5 18 | pandas==1.0.5 19 | pycparser==2.20 20 | pyOpenSSL==19.1.0 21 | pyparsing==2.4.7 22 | PySocks==1.7.1 23 | python-dateutil==2.8.1 24 | python-json-logger==0.1.11 25 | pytz==2020.1 26 | requests==2.24.0 27 | ruamel.yaml==0.16.10 28 | ruamel.yaml.clib==0.2.0 29 | semver==2.10.2 30 | six==1.15.0 31 | soupsieve==2.0.1 32 | toml==0.10.1 33 | tornado==6.0.4 34 | traitlets==4.3.3 35 | urllib3==1.25.9 36 | websocket-client==0.57.0 37 | -------------------------------------------------------------------------------- /static/error.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/error.gif 
-------------------------------------------------------------------------------- /static/graphic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/graphic.png -------------------------------------------------------------------------------- /static/hurray.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/hurray.gif -------------------------------------------------------------------------------- /static/recursion.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/recursion.gif -------------------------------------------------------------------------------- /static/soup.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/soup.gif -------------------------------------------------------------------------------- /static/surf.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/surf.gif -------------------------------------------------------------------------------- /static/tools.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flatiron-school/intro_to_webscraping/5d050cb71c20ce3846419aba7eefcbea2257578e/static/tools.gif --------------------------------------------------------------------------------