├── .gitignore ├── 2020_Getting_Data_From_the_Internet ├── QandA │ └── gettingdatafromtheinternetqa.pdf ├── Slide_Decks │ └── gettingdatafromtheinternet16apr2020.pdf └── code │ ├── Football API │ ├── Football.ipynb │ └── images │ │ ├── matchday.jpg │ │ └── results.jpg │ ├── Multiple Files Covid │ ├── covid.ipynb │ └── images │ │ └── actual_filename.jpg │ ├── Single files │ └── Simple_File_requests.ipynb │ └── Web Scraping Tesco │ └── Tesco.ipynb ├── 2020_Web-scraping_and_API ├── code │ ├── README.md │ ├── auth │ │ └── guardian-api-key.txt │ ├── datasets │ │ └── extract_main_charity.csv │ ├── images │ │ └── UKDS_Logos_Col_Grey_300dpi.png │ ├── web-scraping-apis-code-2020-04-30.ipynb │ └── web-scraping-websites-code-2020-04-23.ipynb ├── faq │ └── README.md ├── installation │ └── README.md ├── reading-list │ └── README.md └── webinars │ ├── README.md │ ├── stick-figure1.png │ ├── ukds-nfod-web-scraping-apis-2020-04-30.pdf │ ├── ukds-nfod-web-scraping-case-study-2020-03-27.pdf │ └── ukds-nfod-web-scraping-websites-2020-04-23.pdf ├── LICENSE ├── README.md ├── postBuild └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | webinars/ukds-nfod-web-scraping-websites-2020-04-23.pptx 2 | webinars/ukds-nfod-web-scraping-apis-2020-04-30.pptx 3 | webinars/ukds-nfod-web-scraping-notes-2020-04-23.ipynb 4 | webinars/ukds-nfod-web-scraping-notes-2020-04-30.ipynb 5 | code/downloads/ 6 | .ipynb_checkpoints -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/QandA/gettingdatafromtheinternetqa.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Getting_Data_From_the_Internet/QandA/gettingdatafromtheinternetqa.pdf -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/Slide_Decks/gettingdatafromtheinternet16apr2020.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Getting_Data_From_the_Internet/Slide_Decks/gettingdatafromtheinternet16apr2020.pdf -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Football API/Football.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Using an API\n", 8 | "\n", 9 | "In this example we will be using the `api.football-data.org` api.\n", 10 | "The documentation for the Api can be found here https://www.football-data.org/ \n", 11 | "\n", 12 | "\n", 13 | "## Before starting make sure that you have read the API documentation" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "# These are the libraries we will need to use\n", 23 | "\n", 24 | "import requests\n", 25 | "import json\n", 26 | "import csv\n", 27 | "import time\n", 28 | "import pandas as pd" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## A simple call\n", 36 | "\n", 37 | "## We can use the JSON.dumps method to see the returned data or\n", 38 | "\n", 39 | "## We can write it to a file and use other tools 
to look at it. \n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "r=requests.get(\"http://api.football-data.org/v2/competitions/2021/matches\", \n", 49 | " params={'matchday': '1', 'season': '2018'}, \n", 50 | " headers={'X-Auth-Token':'ae3c0b8b3a544082889c24df68c7e951'})\n", 51 | "\n", 52 | "\n", 53 | "# making the json more readable\n", 54 | "#print(json.dumps(r.json(), indent=4))\n", 55 | "\n", 56 | "# writing the json to a file\n", 57 | "with open(\"games.json\", \"w\") as outfile : outfile.write(json.dumps(r.json(), indent=4))\n" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# What it really looks like\n", 67 | "print(r.text) " 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Using the same approach as we did for multiple files \n", 75 | "## We can make multiple calls by iterating through the matchdays\n", 76 | "\n", 77 | "## Why might this code cause a problem? - Read the API documents! " 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "url = 'https://api.football-data.org/v2/competitions/2021/matches'\n", 87 | "headers = { 'X-Auth-Token': '' } # Your API Key\n", 88 | "\n", 89 | "filters = {'matchday': '1', 'season': '2018'}\n", 90 | "\n", 91 | "for md in range(1,39) :\n", 92 | " filters['matchday'] = str(md)\n", 93 | " r = requests.get(url, headers=headers, params=filters)\n", 94 | " print(r.url)\n", 95 | " \n" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "## Now we need to parse the json to extract the pieces of information we are going to need.\n", 103 | "\n", 104 | "## First however, we will have alook at the saved JSON file\n", 105 | "\n", 106 | "## We will start by extracting indivdual elements from the first match \n", 107 | "## (In Python this will be the index value of `0`)." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "response = r.json()\n", 117 | "\n", 118 | "print(response['matches'][0])" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## Notice that this is a complete dictionary object. From which we can extract the values from the keys we want\n", 126 | "\n", 127 | "## We need the Home team and the Away team and the Goals scored by each at Full Time.\n", 128 | "\n", 129 | "## There is a lot more details in the dictionary object, so for fun we will also extract the score at Half Time. 
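Aside (not part of the original notebook): once you know the key paths, the same extraction can be wrapped in a small helper. This is only a sketch; the key names (`homeTeam`, `awayTeam`, `score`, `halfTime`, `fullTime`) are the ones used in the cells that follow, and it assumes `response = r.json()` has been run.

```python
# Minimal sketch: pull the fields we care about out of one match dictionary
def extract_match(match):
    return {
        "home_team": match["homeTeam"]["name"],
        "away_team": match["awayTeam"]["name"],
        "ht_home": match["score"]["halfTime"]["homeTeam"],
        "ht_away": match["score"]["halfTime"]["awayTeam"],
        "ft_home": match["score"]["fullTime"]["homeTeam"],
        "ft_away": match["score"]["fullTime"]["awayTeam"],
    }

# Example usage (assumes response = r.json() has already been run):
# print(extract_match(response["matches"][0]))
```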
" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "print(response['matches'][0]['score'])" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "print(response['matches'][0]['score']['fullTime']['homeTeam'])\n", 148 | "print(response['matches'][0]['score']['fullTime']['awayTeam'])" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "print(response['matches'][0]['homeTeam']['name'])\n", 158 | "print(response['matches'][0]['awayTeam']['name'])" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "print(response['matches'][0]['score']['halfTime']['homeTeam'])\n", 168 | "print(response['matches'][0]['score']['halfTime']['awayTeam'])" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "For clarity we will assign these values to simple variables" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "HomeTeam = response['matches'][0]['homeTeam']['name']\n", 185 | "AwayTeam = response['matches'][0]['awayTeam']['name']\n", 186 | "HomeTeamHalfTime = response['matches'][0]['score']['halfTime']['homeTeam']\n", 187 | "AwayTeamHalfTime = response['matches'][0]['score']['halfTime']['awayTeam']\n", 188 | "HomeTeamFullTime = response['matches'][0]['score']['fullTime']['homeTeam']\n", 189 | "AwayTeamFullTime = response['matches'][0]['score']['fullTime']['awayTeam']" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## As we want to save all of the results we will write them all to a simple `csv` file for future use.\n", 197 | "\n", 198 | "## Because eventually we are going to retrieve the data for all of the `matchday`s. 
It might be useful to write this to the file as well.\n", 199 | "\n", 200 | "## If we go back and look at the JSON, we can see that `matchday` is recorded as part of the `filter` entry\n", 201 | "\n", 202 | "![Matchday](./images/matchday.jpg)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "Matchday = response['filters']['matchday']" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "## To write a csv file we will use the Python `csv` package.\n", 219 | "\n", 220 | "## We will try out our code by just writing out the single row we have so far.\n", 221 | "## In addition to the data elements, as it is a csv file we will also write out a header record with all of the column names in it" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "\n", 231 | "fw = open('results.csv', 'w') \n", 232 | "outfile = csv.writer(fw, delimiter=',', lineterminator='\\r')\n", 233 | "\n", 234 | "# The header record\n", 235 | "outfile.writerow([\"matchday\", \"Home_team\", \"Away_team\", \"Home_goals\", \"Away_goals\", \"HT_Home_goals\", \"HT_Away_Goals\"])\n", 236 | "\n", 237 | "#The data\n", 238 | "outfile.writerow([Matchday, HomeTeam, AwayTeam, HomeTeamFullTime, AwayTeamFullTime, HomeTeamHalfTime, AwayTeamHalfTime] )\n", 239 | " \n", 240 | "# close the file \n", 241 | "fw.close()\n", 242 | "\n" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "We can see what this looks like by browsing the file \n", 250 | "\n", 251 | "![Result](./images/results.jpg)\n", 252 | "\n", 253 | "\n", 254 | "Now we are ready to start putting it all together.\n", 255 | "\n", 256 | "1. Each API call gets us all of the results for a given Matchday.\n", 257 | "2. 
We need to make a seperate API call for each of the 38 matchdays" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "fw = open('results.csv', 'w') \n", 267 | "outfile = csv.writer(fw, delimiter=',', lineterminator='\\r')\n", 268 | "\n", 269 | "# The header record\n", 270 | "outfile.writerow([\"matchday\", \"Home_team\", \"Away_team\", \"Home_goals\", \"Away_goals\", \"HT_Home_goals\", \"HT_Away_Goals\"])\n", 271 | "\n", 272 | "url = 'https://api.football-data.org/v2/competitions/2021/matches'\n", 273 | "headers = { 'X-Auth-Token': 'ae3c0b8b3a544082889c24df68c7e951' } # This is my API Key\n", 274 | "\n", 275 | "filters = {'matchday': '1', 'season': '2018'}\n", 276 | "\n", 277 | "for md in range(1,39) :\n", 278 | " filters['matchday'] = str(md)\n", 279 | " r = requests.get(url, headers=headers, params=filters)\n", 280 | " response = r.json()\n", 281 | " \n", 282 | " time.sleep(10)\n", 283 | " \n", 284 | " Matchday = response['filters']['matchday']\n", 285 | " for game in range(0,10) :\n", 286 | " HomeTeam = response['matches'][game]['homeTeam']['name']\n", 287 | " AwayTeam = response['matches'][game]['awayTeam']['name']\n", 288 | " HomeTeamHalfTime = response['matches'][game]['score']['halfTime']['homeTeam']\n", 289 | " AwayTeamHalfTime = response['matches'][game]['score']['halfTime']['awayTeam']\n", 290 | " HomeTeamFullTime = response['matches'][game]['score']['fullTime']['homeTeam']\n", 291 | " AwayTeamFullTime = response['matches'][game]['score']['fullTime']['awayTeam']\n", 292 | " \n", 293 | " #The data\n", 294 | " outfile.writerow([Matchday, HomeTeam, AwayTeam, HomeTeamFullTime, AwayTeamFullTime, HomeTeamHalfTime, AwayTeamHalfTime] ) \n", 295 | " \n", 296 | "# close the file \n", 297 | "fw.close()" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "## With the results file we can generate a league table" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "\n", 314 | "table = {}\n", 315 | "\n", 316 | "f = open('results.csv', 'r')\n", 317 | "f_csv = csv.reader(f, delimiter=',')\n", 318 | "next(f_csv)\n", 319 | "\n", 320 | "for line in f_csv:\n", 321 | " \n", 322 | " #print (line[0], line[1], line[2], line[3], line[4])\n", 323 | " if line[1] in table : # repeat all of this for line[2] - Away team\n", 324 | " table[line[1]]['played'] += 1\n", 325 | " table[line[2]]['played'] += 1\n", 326 | " if line[3] > line[4] :\n", 327 | " table[line[1]]['won'] += 1\n", 328 | " table[line[1]]['points'] += 3\n", 329 | " table[line[2]]['lost'] += 1\n", 330 | " if line[3] == line[4] :\n", 331 | " table[line[1]]['drawn'] += 1\n", 332 | " table[line[1]]['points'] += 1\n", 333 | " table[line[2]]['drawn'] += 1\n", 334 | " table[line[2]]['points'] += 1\n", 335 | " if line[3] < line[4] :\n", 336 | " table[line[1]]['lost'] += 1\n", 337 | " table[line[2]]['won'] += 1\n", 338 | " table[line[2]]['points'] += 3\n", 339 | " table[line[1]]['GF'] += int(line[3]) \n", 340 | " table[line[1]]['GA'] += int(line[4])\n", 341 | " table[line[2]]['GF'] += int(line[4]) \n", 342 | " table[line[2]]['GA'] += int(line[3])\n", 343 | " \n", 344 | " else :\n", 345 | " table[line[1]] = {}\n", 346 | " table[line[1]]['played'] = 0\n", 347 | " table[line[1]]['won'] = 0\n", 348 | " table[line[1]]['drawn'] = 0\n", 349 | " table[line[1]]['lost'] = 0\n", 350 | " table[line[1]]['GF'] = 0\n", 351 | " 
table[line[1]]['GA'] = 0\n", 352 | " table[line[1]]['points'] = 0\n", 353 | " table[line[1]]['played'] += 1\n", 354 | " table[line[2]] = {}\n", 355 | " table[line[2]]['played'] = 0\n", 356 | " table[line[2]]['won'] = 0\n", 357 | " table[line[2]]['drawn'] = 0\n", 358 | " table[line[2]]['lost'] = 0\n", 359 | " table[line[2]]['GF'] = 0\n", 360 | " table[line[2]]['GA'] = 0\n", 361 | " table[line[2]]['points'] = 0\n", 362 | " table[line[2]]['played'] += 1 \n", 363 | " if line[3] > line[4] :\n", 364 | " table[line[1]]['won'] += 1\n", 365 | " table[line[1]]['points'] += 3\n", 366 | " table[line[2]]['lost'] += 1 \n", 367 | " if line[3] == line[4] :\n", 368 | " table[line[1]]['drawn'] += 1\n", 369 | " table[line[1]]['points'] += 1\n", 370 | " table[line[2]]['drawn'] += 1\n", 371 | " table[line[2]]['points'] += 1\n", 372 | " if line[3] < line[4] :\n", 373 | " table[line[1]]['lost'] += 1\n", 374 | " table[line[2]]['won'] += 1\n", 375 | " table[line[2]]['points'] += 3\n", 376 | " table[line[1]]['GF'] += int(line[3]) \n", 377 | " table[line[1]]['GA'] += int(line[4])\n", 378 | " table[line[2]]['GF'] += int(line[4]) \n", 379 | " table[line[2]]['GA'] += int(line[3])\n", 380 | " \n", 381 | "\n", 382 | "# then need to flatten the dictionary and print\n", 383 | " \n", 384 | "f.close()\n", 385 | "\n", 386 | "# create the dataframe\n", 387 | "League_Table = pd.DataFrame(columns = ['Team', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Points'])\n", 388 | "\n", 389 | "for team in table :\n", 390 | " print(team, table[team]['won'], table[team]['drawn'], table[team]['lost'], table[team]['GF'], table[team]['GA'], (table[team]['GF'] - table[team]['GA']), table[team]['points'])\n", 391 | " League_Table = League_Table.append({'Team': team , 'W' : table[team]['won'], 'D' : table[team]['drawn'], 'L' : table[team]['lost'], 'GF' : table[team]['GF'], 'GA' : table[team]['GA'], 'GD' : (table[team]['GF'] - table[team]['GA']), 'Points' : table[team]['points']}, ignore_index=True)\n", 392 | "\n", 393 | "print(League_Table.head())\n", 394 | "\n", 395 | "# Sort the dataframe and save to a file\n", 396 | "League_Table.sort_values(['Points', 'GD'], ascending=[False, False], inplace=True )\n", 397 | "League_Table.to_csv('League_Table.csv', index = False)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "## To get a half-time table I just need to change the comparisons (3 -> 5 and 4 -> 6)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "table = {}\n", 414 | "\n", 415 | "f = open('results.csv', 'r')\n", 416 | "f_csv = csv.reader(f, delimiter=',')\n", 417 | "next(f_csv)\n", 418 | "\n", 419 | "for line in f_csv:\n", 420 | " \n", 421 | " #print (line[0], line[1], line[2], line[3], line[4])\n", 422 | " if line[1] in table : # repeat all of this for line[2] - Away team\n", 423 | " table[line[1]]['played'] += 1\n", 424 | " table[line[2]]['played'] += 1\n", 425 | " if line[5] > line[6] :\n", 426 | " table[line[1]]['won'] += 1\n", 427 | " table[line[1]]['points'] += 3\n", 428 | " table[line[2]]['lost'] += 1\n", 429 | " if line[5] == line[6] :\n", 430 | " table[line[1]]['drawn'] += 1\n", 431 | " table[line[1]]['points'] += 1\n", 432 | " table[line[2]]['drawn'] += 1\n", 433 | " table[line[2]]['points'] += 1\n", 434 | " if line[5] < line[6] :\n", 435 | " table[line[1]]['lost'] += 1\n", 436 | " table[line[2]]['won'] += 1\n", 437 | " table[line[2]]['points'] += 3\n", 438 | " table[line[1]]['GF'] += 
int(line[3]) \n", 439 | " table[line[1]]['GA'] += int(line[4])\n", 440 | " table[line[2]]['GF'] += int(line[4]) \n", 441 | " table[line[2]]['GA'] += int(line[3])\n", 442 | " \n", 443 | " else :\n", 444 | " table[line[1]] = {}\n", 445 | " table[line[1]]['played'] = 0\n", 446 | " table[line[1]]['won'] = 0\n", 447 | " table[line[1]]['drawn'] = 0\n", 448 | " table[line[1]]['lost'] = 0\n", 449 | " table[line[1]]['GF'] = 0\n", 450 | " table[line[1]]['GA'] = 0\n", 451 | " table[line[1]]['points'] = 0\n", 452 | " table[line[1]]['played'] += 1\n", 453 | " table[line[2]] = {}\n", 454 | " table[line[2]]['played'] = 0\n", 455 | " table[line[2]]['won'] = 0\n", 456 | " table[line[2]]['drawn'] = 0\n", 457 | " table[line[2]]['lost'] = 0\n", 458 | " table[line[2]]['GF'] = 0\n", 459 | " table[line[2]]['GA'] = 0\n", 460 | " table[line[2]]['points'] = 0\n", 461 | " table[line[2]]['played'] += 1 \n", 462 | " if line[5] > line[6] :\n", 463 | " table[line[1]]['won'] += 1\n", 464 | " table[line[1]]['points'] += 3\n", 465 | " table[line[2]]['lost'] += 1 \n", 466 | " if line[5] == line[6] :\n", 467 | " table[line[1]]['drawn'] += 1\n", 468 | " table[line[1]]['points'] += 1\n", 469 | " table[line[2]]['drawn'] += 1\n", 470 | " table[line[2]]['points'] += 1\n", 471 | " if line[5] < line[6] :\n", 472 | " table[line[1]]['lost'] += 1\n", 473 | " table[line[2]]['won'] += 1\n", 474 | " table[line[2]]['points'] += 3\n", 475 | " table[line[1]]['GF'] += int(line[3]) \n", 476 | " table[line[1]]['GA'] += int(line[4])\n", 477 | " table[line[2]]['GF'] += int(line[4]) \n", 478 | " table[line[2]]['GA'] += int(line[3])\n", 479 | " \n", 480 | "\n", 481 | "# then need to flatten the dictionary and print\n", 482 | " \n", 483 | "f.close()\n", 484 | "\n", 485 | "# create the dataframe\n", 486 | "League_Table = pd.DataFrame(columns = ['Team', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Points'])\n", 487 | "\n", 488 | "for team in table :\n", 489 | " #print(team, table[team]['won'], table[team]['drawn'], table[team]['lost'], table[team]['GF'], table[team]['GA'], (table[team]['GF'] - table[team]['GA']), table[team]['points'])\n", 490 | " League_Table = League_Table.append({'Team': team , 'W' : table[team]['won'], 'D' : table[team]['drawn'], 'L' : table[team]['lost'], 'GF' : table[team]['GF'], 'GA' : table[team]['GA'], 'GD' : (table[team]['GF'] - table[team]['GA']), 'Points' : table[team]['points']}, ignore_index=True)\n", 491 | "\n", 492 | "#print(League_Table.head())\n", 493 | "\n", 494 | "# Sort the dataframe and save to a file\n", 495 | "League_Table.sort_values(['Points', 'GD'], ascending=[False, False], inplace=True )\n", 496 | "League_Table.to_csv('League_Table (HT).csv', index = False)\n" 497 | ] 498 | } 499 | ], 500 | "metadata": { 501 | "kernelspec": { 502 | "display_name": "Python 3", 503 | "language": "python", 504 | "name": "python3" 505 | }, 506 | "language_info": { 507 | "codemirror_mode": { 508 | "name": "ipython", 509 | "version": 3 510 | }, 511 | "file_extension": ".py", 512 | "mimetype": "text/x-python", 513 | "name": "python", 514 | "nbconvert_exporter": "python", 515 | "pygments_lexer": "ipython3", 516 | "version": "3.7.5" 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 4 521 | } 522 | -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Football API/images/matchday.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Getting_Data_From_the_Internet/code/Football API/images/matchday.jpg -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Football API/images/results.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Getting_Data_From_the_Internet/code/Football API/images/results.jpg -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Multiple Files Covid/covid.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests\n" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## The code below is what we have seen before\n", 17 | "\n", 18 | "## The file comes from the Webpage `https://www.getthedata.com/covid-19/utla-by-day`\n", 19 | "\n", 20 | "## With the link to the download file shown as `cases_by_utla_2020-03-16.csv`\n", 21 | "\n", 22 | "## But the actual file we would download is shown at the bottom left of the screen as we hover over the link. It is:\n", 23 | "\n", 24 | "## ![Actual Filename](./images/actual_filename.jpg)\n", 25 | "\n", 26 | "## If we want to download the file by other means then this is the filename we must use." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "\n", 36 | "myurl = 'https://www.getthedata.com/downloads/cases_by_utla_2020-03-16.csv'\n", 37 | "\n", 38 | "savefilename = '200316.csv'\n", 39 | "r = requests.get(myurl)\n", 40 | "r.status_code\n", 41 | "file = open(savefilename, \"wb\")\n", 42 | "file.write(r.content)\n", 43 | "file.close()" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## If we wanted to download multiple files we can use the structure of the filenames and requests the downloads in a loop structure\n", 51 | "## Like we see below. \n", 52 | "## As we want to save the files we use a similar technique to name the output files." 
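A hedged variation on the loop below (not in the original notebook): the notebook builds the date part of each URL by concatenating integers, which works for 16-31 March but needs editing for other months and does not zero-pad single-digit days. Using `datetime` produces the same URLs and output names (the URL stem and the `200316.csv`-style filenames are taken from the surrounding cells) and rolls over month boundaries automatically.

```python
from datetime import date, timedelta

base = 'https://www.getthedata.com/downloads/cases_by_utla_'

start = date(2020, 3, 16)          # first file we want
for offset in range(16):           # 16 days: 16-31 March 2020
    d = start + timedelta(days=offset)
    myurl = base + d.isoformat() + '.csv'            # e.g. ...cases_by_utla_2020-03-16.csv
    savefilename = d.strftime('%y%m%d') + '.csv'     # e.g. 200316.csv
    print(myurl, '->', savefilename)
```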
53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "stem = 'https://www.getthedata.com/downloads/cases_by_utla_2020-03-'\n", 62 | "ftype = '.csv'\n", 63 | "\n", 64 | "ym = 'd:/data/covid19/2003'\n", 65 | "\n", 66 | "for i in range(16,32):\n", 67 | " print(stem + str(i) + ftype)\n", 68 | " print(ym + str(i) + ftype) " 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## Then we can put it all together" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "stem = 'https://www.getthedata.com/downloads/cases_by_utla_2020-03-'\n", 85 | "ftype = '.csv'\n", 86 | "\n", 87 | "ym = 'd:/data/covid19/2003'\n", 88 | "\n", 89 | "for i in range(16,32):\n", 90 | " # set up URL and filename\n", 91 | " myurl = stem + str(i) + ftype\n", 92 | " savefilename = ym + str(i) + ftype\n", 93 | " \n", 94 | " # Make the request\n", 95 | " r = requests.get(myurl)\n", 96 | " print(r.status_code)\n", 97 | " \n", 98 | " # Write the output file\n", 99 | " file = open(savefilename, \"wb\")\n", 100 | " file.write(r.content)\n", 101 | " file.close()\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Once downloaded we can use Excel Data | Import from other sources | from folder to combine the files into one." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [] 117 | } 118 | ], 119 | "metadata": { 120 | "kernelspec": { 121 | "display_name": "Python 3", 122 | "language": "python", 123 | "name": "python3" 124 | }, 125 | "language_info": { 126 | "codemirror_mode": { 127 | "name": "ipython", 128 | "version": 3 129 | }, 130 | "file_extension": ".py", 131 | "mimetype": "text/x-python", 132 | "name": "python", 133 | "nbconvert_exporter": "python", 134 | "pygments_lexer": "ipython3", 135 | "version": "3.7.5" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 4 140 | } 141 | -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Multiple Files Covid/images/actual_filename.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Getting_Data_From_the_Internet/code/Multiple Files Covid/images/actual_filename.jpg -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Single files/Simple_File_requests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Commandline Options\n", 8 | "\n", 9 | "* wget (wget --help)\n", 10 | "* curl (curl -- hrlp)\n", 11 | "\n", 12 | "\n", 13 | "## These are Linux tools nor natively available in Windows. Although in Windows 10 you can install the Windows Linux system and access them from there. 
\n", 14 | "\n", 15 | "## There are also other tools you can install to make them available\n", 16 | "\n", 17 | "## In general though if you are not used to the commandline it is probalby best to ignore them.\n", 18 | "\n", 19 | "\n", 20 | "wget www.bbc.co.uk -O bbc_wget.html\n", 21 | "curl www.bbc.co.uk -o bbc_curl.html\n", 22 | "\n", 23 | "wget http://api.football-data.org/v2/competitions -O football_competitions.json\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import requests" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 7, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "# A csv file \n", 42 | "#myurl = 'https://www.getthedata.com/downloads/cases_by_utla_2020-03-16.csv'\n", 43 | "#savefilename = '200316.csv'\n", 44 | "\n", 45 | "# A JSON File\n", 46 | "#myurl = 'http://api.football-data.org/v2/competitions'\n", 47 | "#savefilename = 'competitions.json'\n", 48 | "\n", 49 | "# HTML\n", 50 | "myurl = 'https://www.bbc.co.uk/news/business-52276149'\n", 51 | "savefilename = 'bbc_business.html'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 8, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "200\n" 64 | ] 65 | } 66 | ], 67 | "source": [ 68 | "\n", 69 | "r = requests.get(myurl)\n", 70 | "print(r.status_code)\n", 71 | "\n", 72 | "file = open(savefilename, \"wb\")\n", 73 | "file.write(r.content)\n", 74 | "file.close()" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "'https://www.getthedata.com/downloads/cases_by_utla_2020-03-16.csv'" 86 | ] 87 | }, 88 | "execution_count": 4, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "r.url\n" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [] 103 | } 104 | ], 105 | "metadata": { 106 | "kernelspec": { 107 | "display_name": "Python 3", 108 | "language": "python", 109 | "name": "python3" 110 | }, 111 | "language_info": { 112 | "codemirror_mode": { 113 | "name": "ipython", 114 | "version": 3 115 | }, 116 | "file_extension": ".py", 117 | "mimetype": "text/x-python", 118 | "name": "python", 119 | "nbconvert_exporter": "python", 120 | "pygments_lexer": "ipython3", 121 | "version": "3.7.5" 122 | } 123 | }, 124 | "nbformat": 4, 125 | "nbformat_minor": 4 126 | } 127 | -------------------------------------------------------------------------------- /2020_Getting_Data_From_the_Internet/code/Web Scraping Tesco/Tesco.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Tesco Store finder\n", 8 | "\n", 9 | "The aim of this notebook is to collect the various pieces of information for all of the Tesco stores in the UK\n", 10 | "\n", 11 | "\n", 12 | "Approach\n", 13 | "\n", 14 | "If you use the Tesco store locator website ( https://www.tesco.com/store-locator/uk/ )you can find a list of local (you specify the locality) tesco stores. Each of which has its own web page and included in the web page is detailed information about the store. 
\n", 15 | "\n", 16 | "When you go to one of the store web pages you will notice that the URL is something like this:\n", 17 | "\n", 18 | "\n", 19 | "https://www.tesco.com/store-locator/uk/?bid=4634\n", 20 | "\n", 21 | "\n", 22 | "Internally Tesco are using four digit numbers to identify their stores" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "These are real stores \n", 30 | "https://www.tesco.com/store-locator/uk/?bid=4634\n", 31 | "\n", 32 | "https://www.tesco.com/store-locator/uk/?bid=6367\n", 33 | "\n", 34 | "\n", 35 | "This one isn't\n", 36 | "\n", 37 | "\n", 38 | "https://www.tesco.com/store-locator/uk/?bid=9999\n", 39 | "\n", 40 | "\n", 41 | "There are about ~2500 Tesco stores inthe UK so you can see that a lot of the number range 1000 - 9999 will not actually represent a real store.\n", 42 | "\n", 43 | "\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Using the Browser Inspector\n", 51 | "\n", 52 | "All modern browsers allow you to access the underlying HTML Code which makes up a Web page\n", 53 | "\n", 54 | "It is the job of the Browser to interpret the HTML and present the information it represents on the screen in a user friendlt manner.\n", 55 | "\n", 56 | "In order to Web scrape, you do need to have some understanding of HTML but not a great deal. Like most coding languages it is easier to read than to write and we only need to be able to read it a little bit, e.g. recognise different components or tags and a bit about the syntax of tags. \n", 57 | "\n", 58 | "A more important requirement is to be able to match what we see on the screen with the underlying HTML. A thorough understanding of the HTML and CSS code will allow you to do this, but there is a far easier way.\n", 59 | "\n", 60 | "This involves using the developer tools found in all modern browsers and in particular the 'element inspector'. This allows you to select an element on the web page; a table, part of a table, a link, almost anything and have the corresponding HTML code highlighted." 
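To make the inspector-to-code step concrete, here is a small self-contained sketch (not part of the original notebook; the HTML fragment is invented for illustration and is not Tesco's real markup). The point is that the tag name and the distinguishing attribute the inspector shows you are exactly what you later hand to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for whatever the element inspector highlighted
html = """
<div class="address">
  <h1 title="3517">Example Extra</h1>
  <span itemprop="streetAddress">1 High Street, Example Town, EX1 2MP</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# tag name + attribute seen in the inspector -> unique selection
address = soup.find("span", {"itemprop": "streetAddress"})
print(address.text)
```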
61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Information that we might want to scrape\n", 68 | "\n", 69 | "\n", 70 | "* Store Name\n", 71 | "* Store Address\n", 72 | "* Store Geo-location \n", 73 | "* store type\n", 74 | "* Store Post Code\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## The packages we need" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 1, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "import pandas as pd\n", 91 | "import requests\n", 92 | "from bs4 import BeautifulSoup as bs # pip install beautifulsoup4 but import from bs4\n", 93 | "import time\n", 94 | "import folium # !pip install folium - not included with the Anaconda python, so you may need to install\n", 95 | "from folium.plugins import MarkerCluster" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "The 'get' methods from requests only needs to be given a string representing a url\n", 103 | "\n", 104 | "Quite often if you need to provide multiple parameters you would build the url string up and then call " 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Quick example to show how Beautifulsoup works" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "r = requests.get('https://www.bbc.co.uk')\n", 121 | "##print(r.text)\n" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## We can make the output look a bit better" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "soup = bs(r.text) \n", 138 | "prettyHTML = soup.prettify() \n", 139 | "#print(prettyHTML)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## We can find all of the images within the Web page" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "for imagelink in soup.find_all('img'):\n", 156 | " url = imagelink.get('src')\n", 157 | " print(url)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "## We can find all of the links within the Web page" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "urls = soup.findAll('a')\n", 176 | "for url in urls :\n", 177 | " print(url.get('href'))" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "## `find` and `FindAll` allow you to search for tags\n", 185 | "\n", 186 | "## `get` allows you to select parameter values or tag values\n", 187 | "\n", 188 | "## Typically we will be finding tags and then extracting values from them.\n", 189 | "\n", 190 | "## What we need to ensure when doing this is that we have selected the correct tags. 
In any given webpage some of the common tags will occur many times as we have just seen with the 'a' tag.\n", 191 | "\n", 192 | "## we can do this by either using a chain of tags which is unique and ends in the tag we want or make use of the parameters and values within a tag and find a unique combination which will identify the specific tag we want.\n", 193 | "\n", 194 | "## This is why we need to inspect the HTML in order to identify these unique combinations.\n", 195 | "\n", 196 | "## In HTML tags are written in a specific way \n", 197 | "\n" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "## Start with a single file" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "r = requests.get('https://www.tesco.com/store-locator/uk/?bid=6367')\n", 214 | "soup = bs(r.text) " 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "## Get the Title" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "for h1 in soup.find_all('h1'):\n", 231 | " store_id = h1.get('title')\n", 232 | " print(h1.text)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "## Get the Store Id" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "for h1 in soup.findAll('h1'):\n", 249 | " store_id = h1.get('title')\n", 250 | " print(store_id)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "## Get the address" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "for h2 in soup.find_all('h2'):\n", 267 | " if h2.text == 'Address' :\n", 268 | " print(h2.findNext('span').text)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## or" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "# Store Address\n", 285 | "address = soup.find('div', {'class': 'address'}).find('span', {'itemprop': 'streetAddress'} ).text\n", 286 | "print(address)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "## or" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "# if the class makes the element unique, I can use it.\n", 303 | "address = soup.find('span', {'itemprop': 'streetAddress'} )\n", 304 | "print(address.text)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "## As long as you uniquely identify the tag you want, how you get there generally doesn't matter" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "## Get the Longitude and Latitude\n", 319 | "\n", 320 | "## This is a bit more involved and includes some simple python code to extract the actual values" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "# Latitude and Longitude\n", 330 | "imagelink 
= soup.find('img')\n", 331 | "url = imagelink.get('src')\n", 332 | "\n", 333 | "item_list = url.split('/')\n", 334 | "lat = item_list[8].split(',')[0]\n", 335 | "lng = item_list[8].split(',')[1]\n", 336 | "\n", 337 | "print ('lat =', lat, 'lng =', lng)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "## That is all of the bits of information we wanted to collect - that is the Web scraping done\n", 345 | "\n", 346 | "## So now we will put it all together and add a few Python bits to accumulate the data from many stores in a single saved dataset.\n", 347 | "\n", 348 | "\n", 349 | "## We need to:\n", 350 | "\n", 351 | "1. Make use of the file naming convention to loop through a large number of possible stores, accepting that some won't exist.\n", 352 | "2. Although wasteful of space we will save all of the 'requested' files separately and then process them with Beautifulsoup\n", 353 | "3. Create a CSV file of all of the data we extract from the files using Beautifulsoup.\n" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "# 1 --- DO NOT RUN\n", 363 | "\n", 364 | "stem = 'https://www.tesco.com/store-locator/uk/?bid=' \n", 365 | "filename_prefix = './stores/'\n", 366 | "filename_suffix = '.html'\n", 367 | "for x in range(3000,4000) :\n", 368 | " r = requests.get(stem + str(x))\n", 369 | " filename = filename_prefix + str(x) + filename_suffix\n", 370 | " f = open(filename, \"w\")\n", 371 | " f.write(r.text)\n", 372 | " f.close()\n", 373 | " time.sleep(5) # explain why this is included - 1000 stores with a 5 sec wait = 5000+ secs to run\n", 374 | " # the wait is added as a courtesy so as not to overload the server" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "## Given that we expect some of the values we used not to be actual stores, we need to know how to identify a 'missing' store\n", 382 | "\n", 383 | "### In fact all calls return an HTML file, so we check to see which include 'Error' in the title.\n", 384 | "### \n", 385 | "### In other scenarios the requests call could return a status_code of 404 (file not found), which you could check for using the `status_code` value included as part of the response."
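The next cell identifies a 'missing' store from the page title. As a hedged aside (not in the original notebook), this is roughly what the `status_code` check mentioned above would look like on a site that does return 404 for missing pages; the Tesco locator instead returns a normal error page, which is why the title check is used here.

```python
import requests

r = requests.get('https://www.tesco.com/store-locator/uk/?bid=9999')

if r.status_code == 404:        # page genuinely not found
    print("Ignore me")
elif r.status_code == 200:      # page exists - go on and scrape it
    print("process me")
else:
    print("Unexpected status:", r.status_code)
```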
386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "rm = requests.get('https://www.tesco.com/store-locator/uk/?bid=9999')\n", 395 | "soup = bs(rm.text) \n", 396 | "#prettyHTML = soup.prettify() \n", 397 | "#print(prettyHTML)\n", 398 | "title = soup.find('title')\n", 399 | "print(title.text)\n", 400 | "if title.text[0:5] == 'Error' :\n", 401 | " print(\"Ignore me\")\n", 402 | "else :\n", 403 | " print(\"process me\")" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "## Now that we have all of the files and can identify the 'duds'\n", 411 | "\n", 412 | "## We are ready to do the scraping\n", 413 | "\n", 414 | "## You could combine this step with the previous and do the scraping as you request the files" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "# 2\n", 424 | "\n", 425 | "\n", 426 | "# create the dataframe\n", 427 | "store_info = pd.DataFrame(columns = ['Id', 'Name', 'Address', 'lat', 'lng'])\n", 428 | "\n", 429 | "# read the data from a file\n", 430 | "folder = './stores/'\n", 431 | "for i in range(3000,4000) :\n", 432 | " print(i)\n", 433 | " filename = folder + str(i) + '.html' \n", 434 | " with open(filename, 'r') as file_handle: content = file_handle.read()\n", 435 | " soup = bs(content) \n", 436 | " title = soup.find('title')\n", 437 | " if title.text[0:5] != 'Error' :\n", 438 | " h1 = soup.find('h1')\n", 439 | " store_id = h1.get('title')\n", 440 | " store_name = h1.text\n", 441 | " address = soup.find('div', {'class': 'address'}).find('span', {'itemprop': 'streetAddress'} ).text\n", 442 | " imagelink = soup.find('img')\n", 443 | " url = imagelink.get('src')\n", 444 | " item_list = url.split('/')\n", 445 | " lat = item_list[8].split(',')[0]\n", 446 | " lng = item_list[8].split(',')[1]\n", 447 | " #print(store_id + \" ++ \" + store_name + \" ++ \" + address + \" ++ \" + lat + \" ++ \" + lng)\n", 448 | " store_info = store_info.append({'Id' : store_id, 'Name' : store_name, \"Address\" : address, \"lat\" : lat, \"lng\" : lng}, ignore_index=True)\n", 449 | "print(store_info.shape)\n", 450 | "print(store_info.head())" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "## The store type is part of the store name and the post code is included in the address, so extracting them is just Python code" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "# 3\n", 467 | "\n", 468 | "### Adding the store_type and post_code columns\n", 469 | "\n", 470 | "def last_element(s, split_on) :\n", 471 | " l = s.split(split_on)\n", 472 | " return l[len(l) - 1]\n", 473 | "\n", 474 | "store_info['store_type'] = store_info.apply(lambda row: last_element(row.Name, ' '), axis=1)\n", 475 | "store_info['post_code'] = store_info.apply(lambda row: last_element(row.Address, ','), axis=1)\n", 476 | "store_info.head()\n", 477 | "store_info.to_csv('xstore_info.csv', index = False)" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "## Now we have our file of data, lets put it on a map" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "data = 
pd.read_csv('store_info.csv')\n", 494 | "uk = folium.Map(location=[53, -1], control_scale=True, zoom_start=7)\n", 495 | "\n", 496 | "\n", 497 | "# adding the markers and pop ups\n", 498 | "for i in range(0,len(data)):\n", 499 | " popup_data = data.iloc[i]['post_code'] +'\\n' + data.iloc[i]['store_type']\n", 500 | " folium.Marker([ data.iloc[i]['lat'], data.iloc[i]['lng']], popup=popup_data).add_to(uk)\n", 501 | "uk\n", 502 | "#uk.save('Tesco_stores.html')" 503 | ] 504 | } 505 | ], 506 | "metadata": { 507 | "kernelspec": { 508 | "display_name": "Python 3", 509 | "language": "python", 510 | "name": "python3" 511 | }, 512 | "language_info": { 513 | "codemirror_mode": { 514 | "name": "ipython", 515 | "version": 3 516 | }, 517 | "file_extension": ".py", 518 | "mimetype": "text/x-python", 519 | "name": "python", 520 | "nbconvert_exporter": "python", 521 | "pygments_lexer": "ipython3", 522 | "version": "3.7.5" 523 | } 524 | }, 525 | "nbformat": 4, 526 | "nbformat_minor": 4 527 | } 528 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/code/README.md: -------------------------------------------------------------------------------- 1 | # Interactive coding materials 2 | 3 | We have developed a number of Jupyter notebooks containing a mix of Python code, narrative and output. 4 | 5 | If you would like to run and/or edit the code without installing any software on your machine, click on the button below. This launches a **Binder** service allowing you to interact with the code through your web browser - as it is temporary, you will lose your work when you log out and you will be booted out if you don’t do anything for a long time. 6 | 7 | Once Binder has been launched, click on the notebook you want to run. (*Don't worry if takes up to a minute to launch*) 8 | 9 | ### Launch Web-scraping for Social Science Research as a Binder service: [![Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/UKDataServiceOpen/web-scraping/master?filepath=code)
10 | 11 | Alternatively, you can download the notebook files and run them on your own machine. See our guidance on installing Python and Jupyter [here](https://github.com/UKDataServiceOpen/new-forms-of-data/blob/master/installation.md). 12 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/code/auth/guardian-api-key.txt: -------------------------------------------------------------------------------- 1 | c95f2225-52ab-4bb8-a4f0-434bc42a9761 -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/code/images/UKDS_Logos_Col_Grey_300dpi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Web-scraping_and_API/code/images/UKDS_Logos_Col_Grey_300dpi.png -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/code/web-scraping-apis-code-2020-04-30.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "skip" 8 | } 9 | }, 10 | "source": [ 11 | "![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "slideshow": { 18 | "slide_type": "skip" 19 | } 20 | }, 21 | "source": [ 22 | "# Web-scraping for Social Science Research" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "slideshow": { 29 | "slide_type": "skip" 30 | } 31 | }, 32 | "source": [ 33 | "Welcome to the UK Data Service training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platorms, text data, conducting simulations (agent based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.\n", 34 | "\n", 35 | "* To access training materials for the entire series: [Training Materials]\n", 36 | "\n", 37 | "* To keep up to date with upcoming and past training events: [Events]\n", 38 | "\n", 39 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 40 | "\n", 41 | "Dr Julia Kasmire and Dr Diarmuid McDonnell
\n", 42 | "UK Data Service
\n", 43 | "University of Manchester
\n", 44 | "April 2020" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "slideshow": { 51 | "slide_type": "skip" 52 | }, 53 | "toc": true 54 | }, 55 | "source": [ 56 | "
Table of Contents
\n", 57 | "
" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": { 63 | "slideshow": { 64 | "slide_type": "skip" 65 | } 66 | }, 67 | "source": [ 68 | "## Introduction\n", 69 | "\n", 70 | "In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:\n", 71 | "1. Collecting data stored on web pages. [LINK]\n", 72 | "2. Downloading data from online databases using Application Programming Interfaces (APIs). [Focus of this notebook]\n", 73 | " \n", 74 | "Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection. \n", 75 | "\n", 76 | "Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which computational methods can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease.\n", 77 | "\n", 78 | "In this lesson we will search the Guardian newspaper's online database for articles referring to Covid-19." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "slideshow": { 85 | "slide_type": "skip" 86 | } 87 | }, 88 | "source": [ 89 | "## Guide to using this resource\n", 90 | "\n", 91 | "This learning resource was built using Jupyter Notebook, an open-source software application that allows you to mix code, results and narrative in a single document. As Barba et al. (2019) espouse:\n", 92 | "> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.\n", 93 | "\n", 94 | "If you are familiar with Jupyter notebooks then skip ahead to the main content (*Collecting data from online databases using an API*). Otherwise, the following is a quick guide to navigating and interacting with the notebook." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": { 100 | "slideshow": { 101 | "slide_type": "slide" 102 | } 103 | }, 104 | "source": [ 105 | "### Interaction\n", 106 | "\n", 107 | "**You only need to execute the code that is contained in sections which are marked by `In []`.**\n", 108 | "\n", 109 | "To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).\n", 110 | "\n", 111 | "Try it for yourself:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "slideshow": { 119 | "slide_type": "subslide" 120 | } 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "print(\"Enter your name and press enter:\")\n", 125 | "name = input()\n", 126 | "print(\"\\r\")\n", 127 | "print(\"Hello {}, enjoy learning more about Python and web-scraping!\".format(name)) " 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": { 133 | "slideshow": { 134 | "slide_type": "subslide" 135 | } 136 | }, 137 | "source": [ 138 | "### Learn more\n", 139 | "\n", 140 | "Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. 
To learn more about additional notebook features, we recommend working through some of the materials provided by Dani Arribas-Bel at the University of Liverpool. " 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "slideshow": { 147 | "slide_type": "skip" 148 | } 149 | }, 150 | "source": [ 151 | "## Collecting data from online databases using an API\n", 152 | "\n", 153 | "### What is an API?\n", 154 | "\n", 155 | "An Application Programming Interface (API) is\n", 156 | "> a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service\" (Oxford English Dictionary). \n", 157 | "\n", 158 | "In essence: an API acts as an intermediary between software applications. Think of an API's role as similar to that of a translator faciliating a conversation between two individuals who do not speak the same language. Neither individual needs to know the other's language, just how to formulate their response in a way the translator can understand. Similarly, an API simplifies how applications communicate with each other.\n", 159 | "\n", 160 | "It performs this role by providing a set of protocols/standards for making *requests* and formulating *responses* between applications. For example, a smart phone application might need real-time traffic data from an online database. An API can validate the application's request for data, and handle the online database's response (i.e., the transfer of data to the application). In the absence of an API, the smart phone application would need to know a lot more technical information about the online database in order to communicate with it (e.g., what commands does the database understand?). But thanks to the API, the smart phone application only needs to know how to formulate a request that the API understands, which then communicates the request to the database.\n", 161 | "\n", 162 | "(If you want to learn more about APIs, especially from a technical perspective, we highly recommend one of our previous webinars: What are APIs?)\n", 163 | "\n", 164 | "### Reasons to interact with an API\n", 165 | "\n", 166 | "Many public, private and charitable institutions collect and share data of value to social scientists. Often they deposit their data to a data portal - e.g., UK Government Open Data -, allowing you to download the files as and when needed. However, another approach they can adopt is to allow access to the underlying information that is stored in their database through an API. Using this method, individuals can send a customised *request* for information to the database; if the request is valid, the database *responds* by providing you with the information you asked for. Think of using an API as the difference between downloading a raw data file which then needs to be filtered to arrive at the information you need, and performing the filtering when you request the data, so only what you need is returned (the API method).\n", 167 | "\n", 168 | "Before we delve into writing code to capture data through an API, let's clearly state the logic underpinning the technique." 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": { 174 | "slideshow": { 175 | "slide_type": "skip" 176 | } 177 | }, 178 | "source": [ 179 | "### Logic of using an API\n", 180 | "\n", 181 | "We begin by identifying an online database containing information of interest. Then we need to **know** the following:\n", 182 | "1. 
The location of the API (i.e., web address) through which the database can be accessed. For example, the UK Police API can be accessed via https://data.police.uk/api.\n", 183 | "2. The terms of use associated with the API. Many APIs restrict the number of requests you can make over a given time period, while others require registration in order to authenticate who is trying to access the data. For example, the UK Police API does not require you to provide authentication but restricts the number of requests for data you can make (15 per second) - the number of allowable requests is known as the *rate limit*.\n", 184 | "3. The location of the data of interest on the API. For example, data on street-level crime from the UK Police API is available at: https://data.police.uk/api/crimes-street. The location of the data is known as its *endpoint*.\n", 185 | "\n", 186 | "We can usually find all of the information we need by reading the API's documentation e.g., https://data.police.uk/docs/.\n", 187 | "\n", 188 | "Then we need to **do** the following:\n", 189 | "4. Register your use of the API (if required).\n", 190 | "5. Request data from the endpoint of interest, supplying authentication if required. This process is known as *making a call* to the API.\n", 191 | "6. Write this data to a file for future use." 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": { 197 | "slideshow": { 198 | "slide_type": "skip" 199 | } 200 | }, 201 | "source": [ 202 | "## Example: Capturing Covid-19 data" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": { 208 | "slideshow": { 209 | "slide_type": "skip" 210 | } 211 | }, 212 | "source": [ 213 | "The Guardian API provides access to some of the data and metadata associcated with its content. For example, you can query its database to search for articles relating to certain topics (e.g., \"environment\", \"covid-19\"), or articles published over a certain date range." 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "slideshow": { 220 | "slide_type": "skip" 221 | } 222 | }, 223 | "source": [ 224 | "### Locating the API\n", 225 | "\n", 226 | "In some cases the organisation will allow you to explore the API without the need to write code or register for an API key. For instance, the Guardian API can be interacted with through the following user interface https://open-platform.theguardian.com/explore/.\n", 227 | "\n", 228 | "**TASK**: take some time to interact with the Guardian API interface using the link above.\n", 229 | "\n", 230 | "(Note: it possible to load websites into Python in order to view them, however the Guardian API doesn't allow this. See the example code below for how it would work for a different website - just remove the quotation marks enclosing the code and run the cell)." 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": { 237 | "slideshow": { 238 | "slide_type": "skip" 239 | } 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "\"\"\"\n", 244 | "from IPython.display import IFrame\n", 245 | "\n", 246 | "IFrame(\"https://ukdataservice.ac.uk/\", width=\"600\", height=\"650\")\n", 247 | "\"\"\"" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "slideshow": { 254 | "slide_type": "skip" 255 | } 256 | }, 257 | "source": [ 258 | "However, interacting with an API through a user interface (i.e., text boxes, drop-down menues) is slow, untransparent, labour intensive and often not possible. 
We will instead focus on writing code that performs the requests for us. Therefore, we need to use the following web address to access the API: https://content.guardianapis.com.\n", 259 | "\n", 260 | "(Note that you cannot request this web address through your browser; this is because access to the API is only possible by providing authentication i.e., an API key. Try for yourself by clicking on the links)." 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": { 266 | "slideshow": { 267 | "slide_type": "skip" 268 | } 269 | }, 270 | "source": [ 271 | "### API terms of use\n", 272 | "\n", 273 | "The Guardian API is well documented (not always the case, unfortunately) and we can clearly identify what is required in order to interact with it. Firstly, the API requires authentication, in the form of an API key, to be provided when making requests. The API key is generated when you register your use of the API.\n", 274 | "\n", 275 | "Secondly, the API provides multiple levels of access, each with its own set of restrictions and usage allowances. The free level of access allows you to make up to 12 calls per second, with a daily limit of 5,000; requests are also restricted to text content (no access to images, audio, video) - if you need more than the free level of access provides, there is a commercial product with custom rate limits, access to other type of data (e.g., images, videos) etc.\n", 276 | "\n", 277 | "See https://open-platform.theguardian.com/access/ for full information on levels of access." 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": { 283 | "slideshow": { 284 | "slide_type": "skip" 285 | } 286 | }, 287 | "source": [ 288 | "### Locating data\n", 289 | "\n", 290 | "The Guardian API allows access to five endpoints:\n", 291 | "* Content - https://content.guardianapis.com/search\n", 292 | "* Tags - http://content.guardianapis.com/tags\n", 293 | "* Sections - https://content.guardianapis.com/sections\n", 294 | "* Editions - https://content.guardianapis.com/editions\n", 295 | "* Single Item - https://content.guardianapis.com/" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": { 301 | "slideshow": { 302 | "slide_type": "slide" 303 | } 304 | }, 305 | "source": [ 306 | "### Registering use of API\n", 307 | "\n", 308 | "We need an API key in order to start requesting data. The API key acts as a user's unique id when accessing the API. For the purposes of this lesson we have registered our use and been given an API key which is contained in a file called *guardian-api-key.txt*. \n", 309 | "\n", 310 | "Run the code below to check if the file exists:" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": { 317 | "slideshow": { 318 | "slide_type": "subslide" 319 | } 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "import os\n", 324 | "\n", 325 | "os.listdir(\"./auth\")" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": { 331 | "slideshow": { 332 | "slide_type": "subslide" 333 | } 334 | }, 335 | "source": [ 336 | "Good, now we need to load in the API key from this file.\n", 337 | "\n", 338 | "(Delete the `#` symbol if you want to see the value of the API key)." 
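An aside before the key-loading cell that follows: the notebook reads the key with a bare `open(...).read()`, which works but leaves the file handle open and keeps any trailing newline. A slightly more defensive sketch is shown below; the `GUARDIAN_API_KEY` environment variable is purely an illustrative assumption, not something the notebook requires.

```python
import os

def load_api_key(path="./auth/guardian-api-key.txt", env_var="GUARDIAN_API_KEY"):
    """Return the Guardian API key from an environment variable or a local file."""
    key = os.environ.get(env_var)   # hypothetical variable name - set it yourself if preferred
    if key is None:
        with open(path, "r") as f:  # the context manager closes the file for us
            key = f.read().strip()  # drop any trailing newline or spaces
    return key

# api_key = load_api_key()
```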
339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "slideshow": { 346 | "slide_type": "fragment" 347 | } 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "api_key = open(\"./auth/guardian-api-key.txt\", \"r\").read() # open the file and read its contents\n", 352 | "api_key" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "slideshow": { 359 | "slide_type": "skip" 360 | } 361 | }, 362 | "source": [ 363 | "You should generate your own API key, and keep it private and secure, when using the Guardian API for your own purposes." 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": { 369 | "slideshow": { 370 | "slide_type": "slide" 371 | } 372 | }, 373 | "source": [ 374 | "### Requesting data\n", 375 | "\n", 376 | "We're ready for the interesting bit: requesting data through the API. We'll focus on finding and saving data about articles relating to the Covid-19 public health crisis.\n", 377 | "\n", 378 | "There is a preliminary step, which is setting up Python with the modules it needs to interact with the API." 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": { 385 | "slideshow": { 386 | "slide_type": "subslide" 387 | } 388 | }, 389 | "outputs": [], 390 | "source": [ 391 | "# Import modules\n", 392 | "\n", 393 | "import os # module for navigating your machine (e.g., file directories)\n", 394 | "import requests # module for requesting urls\n", 395 | "import json # module for working with JSON data structures\n", 396 | "from datetime import datetime # module for working with dates and time\n", 397 | "print(\"Succesfully imported necessary modules\")" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": { 403 | "slideshow": { 404 | "slide_type": "skip" 405 | } 406 | }, 407 | "source": [ 408 | "Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install ` in Stata, or `install.packages()` in R. For now just understand that many useful modules need to be imported every time you start a new Python session." 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": { 414 | "slideshow": { 415 | "slide_type": "subslide" 416 | } 417 | }, 418 | "source": [ 419 | "Next, let's search for articles mentioning the term \"covid-19\"." 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": { 426 | "slideshow": { 427 | "slide_type": "subslide" 428 | } 429 | }, 430 | "outputs": [], 431 | "source": [ 432 | "# Define web address and search terms\n", 433 | "\n", 434 | "baseurl = \"http://content.guardianapis.com/search?\" # base web address\n", 435 | "searchterm = \"covid-19\" # term we want to search for\n", 436 | "auth = {\"api-key\": api_key} # authentication\n", 437 | "webadd = baseurl + \"q=\" + searchterm # construct web address for requesting\n", 438 | "print(webadd)\n", 439 | "\n", 440 | "# Make call to API\n", 441 | "\n", 442 | "response = requests.get(webadd, headers=auth) # request the web address\n", 443 | "response.status_code # check if API was requested successfully" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": { 449 | "slideshow": { 450 | "slide_type": "skip" 451 | } 452 | }, 453 | "source": [ 454 | "Let's unpack the above code, as there is a lot happening. 
First, we define a variable (also known as an 'object' in Python) called `baseurl` that contains the base web address of the *Content* endpoint. Then we define a variable containing the term we are interested in searching for (`searchterm`). Next, we define a variable to store the API key that is needed when making the request (`auth`). Finally we concatenate these separate elements to form a valid web address that can be requested from the API (`webadd`).\n", 455 | "\n", 456 | "The next step is to use the `get()` method of the `requests` module to request the web address, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.\n", 457 | "\n", 458 | "Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see here for some guidance).\n", 459 | "\n", 460 | "For example, the following would also work (bonus points if you can name the brewery lending its name to the variables):" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "slideshow": { 468 | "slide_type": "subslide" 469 | } 470 | }, 471 | "outputs": [], 472 | "source": [ 473 | "highlander = \"http://content.guardianapis.com/search?\" # base web address\n", 474 | "jarl = \"covid-19\" # term we want to search for\n", 475 | "avalanche = {\"api-key\": api_key} # authentication\n", 476 | "hurricane_jack = highlander + \"q=\" + jarl # construct web address for requesting\n", 477 | "print(hurricane_jack)\n", 478 | "\n", 479 | "# Make call to API\n", 480 | "\n", 481 | "beers = requests.get(webadd, headers=avalanche) # request the web address\n", 482 | "beers.status_code # check if API was requested successfully" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": { 488 | "slideshow": { 489 | "slide_type": "skip" 490 | } 491 | }, 492 | "source": [ 493 | "Back to the request:\n", 494 | "\n", 495 | "Good, we get a status code of _200_ - this means we made a successful call to the API. Lau, Gonzalez and Nolan provide a succinct description of different types of status codes:\n", 496 | "\n", 497 | "* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)\n", 498 | "* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)\n", 499 | "* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)\n", 500 | "* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)\n", 501 | "* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 
500 Internal Server Error, 503 Service Unavailable)\n", 502 | "\n", 503 | "For clarity:\n", 504 | "* **Client**: your machine\n", 505 | "* **Server**: the machine you are requesting the resource from\n", 506 | "\n", 507 | "(See Appendix A for more examples of how the `requests` module works and what information it returns.)" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": { 513 | "slideshow": { 514 | "slide_type": "skip" 515 | } 516 | }, 517 | "source": [ 518 | "You may be wondering exactly what it is we requested. To see the content of our request, we can call the `json()` method on the `response` variable:" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": { 525 | "slideshow": { 526 | "slide_type": "subslide" 527 | } 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "data = response.json() \n", 532 | "data # view the content of the response" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": { 538 | "slideshow": { 539 | "slide_type": "skip" 540 | } 541 | }, 542 | "source": [ 543 | "Just by scanning the first few lines we can pick out useful metadata about the response: we can see our search has yielded over 15,000 results, and there are ten results per page spread out over 1,500 pages. We can also see where the actual content of the search results is contained, in a section helpfully called `results`.\n", 544 | "\n", 545 | "It is important we familiarise ourselves with the hierarchical structure of the returned data. Note how we needed to call the `json()` method on the `response` variable. JSON (Javascript Object Notation) is a hierarchical data structure based on key-value pairs (known as *items*), which are separated by commas (Brooker, 2020; Tagliaferri, n.d.). For example, the `pageSize` key stores the value `10`. The hierarchical structure is evident by observing how a key can contain a list of other key-value pairs (items). For example, the `results` key contains a list of items relating to each search result.\n", 546 | "\n", 547 | "A JSON data structure (known in Python as a *dictionary*) can be difficult to understand at first, in no small part due to the unappealing presentation format. Visually, it is worth noting that this data structure begins and ends with curly braces (`{}`). \n", 548 | "\n", 549 | "Let's examine some Python methods for navigating and processing this data structure." 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": { 555 | "slideshow": { 556 | "slide_type": "subslide" 557 | } 558 | }, 559 | "source": [ 560 | "#### Navigating a dictionary (JSON) variable" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": { 566 | "slideshow": { 567 | "slide_type": "subslide" 568 | } 569 | }, 570 | "source": [ 571 | "The first thing we should do is list the keys contained in the dictionary:" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "slideshow": { 579 | "slide_type": "fragment" 580 | } 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "data.keys()" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": { 590 | "slideshow": { 591 | "slide_type": "subslide" 592 | } 593 | }, 594 | "source": [ 595 | "If you're wondering why only one key is listed, remember that we are dealing with a hierarchical data structure. 
The value of the `response` key is itself another dictionary, therefore we can access those keys as follows: " 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "slideshow": { 603 | "slide_type": "fragment" 604 | } 605 | }, 606 | "outputs": [], 607 | "source": [ 608 | "data[\"response\"].keys() # list keys contained in the \"response\" key" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": { 614 | "slideshow": { 615 | "slide_type": "subslide" 616 | } 617 | }, 618 | "source": [ 619 | "Note how we mentioned that keys can contain a list of other key-value pairs. A *list* is a Python data type used to store ordered sequences of elements (Tagliaferri, n.d.). Let's see how we can deal with lists by examining the `results` key:" 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": { 626 | "slideshow": { 627 | "slide_type": "fragment" 628 | } 629 | }, 630 | "outputs": [], 631 | "source": [ 632 | "search_results = data[\"response\"][\"results\"] \n", 633 | "search_results # view the contents of the \"results\" key nested within the \"response\" key" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": { 639 | "slideshow": { 640 | "slide_type": "skip" 641 | } 642 | }, 643 | "source": [ 644 | "Visually, a list begins and ends with square parentheses (`[]`) - however we can just ask Python to confirm that the `search_results` variable is a list:" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "slideshow": { 652 | "slide_type": "subslide" 653 | } 654 | }, 655 | "outputs": [], 656 | "source": [ 657 | "type(search_results)" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": { 663 | "slideshow": { 664 | "slide_type": "skip" 665 | } 666 | }, 667 | "source": [ 668 | "We can check how many elements are in a list by calling on the `len()` function:" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": { 675 | "slideshow": { 676 | "slide_type": "subslide" 677 | } 678 | }, 679 | "outputs": [], 680 | "source": [ 681 | "len(search_results)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": { 687 | "slideshow": { 688 | "slide_type": "skip" 689 | } 690 | }, 691 | "source": [ 692 | "We can view each element in the list of search results as follows:" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": { 699 | "slideshow": { 700 | "slide_type": "subslide" 701 | } 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "# View the values of certain keys in each element in the list\n", 706 | "\n", 707 | "for result in search_results:\n", 708 | " print(result[\"type\"]) # view content type\n", 709 | " print(result[\"sectionName\"]) # view newspaper section content appeared in\n", 710 | " print(result[\"webPublicationDate\"]) # view date content was published online\n", 711 | " print(\"\\r\")\n", 712 | " print(\"-------------\")\n", 713 | " print(\"\\r\")" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": { 719 | "slideshow": { 720 | "slide_type": "skip" 721 | } 722 | }, 723 | "source": [ 724 | "It takes a bit of time to get used to unfamiliar data structures and types. 
If you are a quantitative social scientist, you may be used to working with data structured in a tabular (variable-by-case) format: every row is an observation, every column is a variable, and every cell is a value.\n", 725 | "\n", 726 | "In section 4.7, we'll see how we can convert data from a JSON structure to a tabular format, but just keep in mind that JSON is a popular means of storing and sharing data found on the web and it is worth becoming proficient in managing and manipulating it." 727 | ] 728 | }, 729 | { 730 | "cell_type": "markdown", 731 | "metadata": { 732 | "slideshow": { 733 | "slide_type": "slide" 734 | } 735 | }, 736 | "source": [ 737 | "### Saving results\n", 738 | "\n", 739 | "The final task is to save the data to a file that we can use in the future. We'll write the data to a JSON file format, as this is the structure the data were returned in." 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": { 746 | "slideshow": { 747 | "slide_type": "subslide" 748 | } 749 | }, 750 | "outputs": [], 751 | "source": [ 752 | "# Create a downloads folder\n", 753 | "\n", 754 | "try:\n", 755 | " os.mkdir(\"./downloads\")\n", 756 | "except:\n", 757 | " print(\"Unable to create folder: already exists\")" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": { 763 | "slideshow": { 764 | "slide_type": "skip" 765 | } 766 | }, 767 | "source": [ 768 | "The use of \"./\" tells the `os.mkdir()` command that the \"downloads\" folder should be created at the same level of the directory where this notebook is located. So if this notebook was stored in a directory located at \"C:/Users/joebloggs/notebooks\", the `os.mkdir()` command would result in a new folder located at \"C:/Users/joebloggs/notebooks/downloads\".\n", 769 | " \n", 770 | "(Technically the \"./\" is not needed and you could just write `os.mkdir(\"downloads\")` but it's good practice to be explicit.)" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": { 777 | "scrolled": true, 778 | "slideshow": { 779 | "slide_type": "subslide" 780 | } 781 | }, 782 | "outputs": [], 783 | "source": [ 784 | "# Write the results to a JSON file\n", 785 | "\n", 786 | "date = datetime.now().strftime(\"%Y-%m-%d\") # get today's date in YYYY-MM-DD format\n", 787 | "print(date)\n", 788 | "\n", 789 | "outfile = \"./downloads/guardian-api-covid-19-search-\" + date + \".json\"\n", 790 | "\n", 791 | "with open(outfile, \"w\") as f:\n", 792 | " json.dump(data, f)" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "metadata": { 798 | "slideshow": { 799 | "slide_type": "skip" 800 | } 801 | }, 802 | "source": [ 803 | "The code above defines a name and location for the file which will store the results of the API request. We then open the file in *write* mode and save (or \"dump\") the contents of the `data` variable to it." 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": { 809 | "slideshow": { 810 | "slide_type": "subslide" 811 | } 812 | }, 813 | "source": [ 814 | "How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it." 
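As a compact alternative to the two checks that follow, the sketch below tests in one go that the file exists, is non-empty and can be re-loaded as JSON. It assumes the `outfile` variable defined in the cell above.

```python
import json
import os

# Assumes `outfile` was defined in the previous cell
if os.path.isfile(outfile) and os.path.getsize(outfile) > 0:
    with open(outfile, "r") as f:
        check = json.load(f)  # raises an error if the file does not contain valid JSON
    print("File written: {} bytes, {} top-level key(s)".format(os.path.getsize(outfile), len(check)))
else:
    print("Something went wrong: {} is missing or empty".format(outfile))
```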
815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": null, 820 | "metadata": { 821 | "slideshow": { 822 | "slide_type": "subslide" 823 | } 824 | }, 825 | "outputs": [], 826 | "source": [ 827 | "# a) Check presence of file in \"downloads\" folder\n", 828 | "\n", 829 | "os.listdir(\"./downloads\")" 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": null, 835 | "metadata": { 836 | "slideshow": { 837 | "slide_type": "subslide" 838 | } 839 | }, 840 | "outputs": [], 841 | "source": [ 842 | "# b) Open file and read (import) its contents\n", 843 | "\n", 844 | "with open(outfile, \"r\") as f:\n", 845 | " data = json.load(f) # use the \"load()\" method of the \"json\" module\n", 846 | " \n", 847 | "data" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "metadata": { 853 | "slideshow": { 854 | "slide_type": "subslide" 855 | } 856 | }, 857 | "source": [ 858 | "And Voila, we have successfully requested data using an API!" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": { 864 | "slideshow": { 865 | "slide_type": "slide" 866 | } 867 | }, 868 | "source": [ 869 | "### Refining Covid-19 data collection\n", 870 | "\n", 871 | "We will complete our work gathering data on articles relating to Covid-19 by refining our request and handling of the response. In particular, we will deal with the following outstanding issues:\n", 872 | "1. Including additional search terms\n", 873 | "2. Dealing with multiple pages of results\n", 874 | "3. Requesting the contents (text) of some of the articles contained in our list of search results\n", 875 | "4. Handling the rate limit" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": { 881 | "slideshow": { 882 | "slide_type": "skip" 883 | } 884 | }, 885 | "source": [ 886 | "The following blocks of code contain fewer comments explaining what each command is doing, so feel free to add these back in if you like.\n", 887 | "\n", 888 | "This section - and the notebook in general - is sequential, therefore not running the code in descending order will invariably lead to errors down the line. If this occurs, return to this point and run the code cells one-by-one.\n", 889 | "\n", 890 | "Let's get some the preliminaries out of the way before we begin:" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": null, 896 | "metadata": { 897 | "slideshow": { 898 | "slide_type": "subslide" 899 | } 900 | }, 901 | "outputs": [], 902 | "source": [ 903 | "import os\n", 904 | "import requests\n", 905 | "import json\n", 906 | "import pandas as pd\n", 907 | "from datetime import datetime\n", 908 | "from bs4 import BeautifulSoup as soup\n", 909 | "\n", 910 | "date = datetime.now().strftime(\"%Y-%m-%d\")\n", 911 | "\n", 912 | "api_key = open(\"./auth/guardian-api-key.txt\", \"r\").read() # open the file and read its contents" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": { 918 | "slideshow": { 919 | "slide_type": "subslide" 920 | } 921 | }, 922 | "source": [ 923 | "#### Including additional search terms\n", 924 | "\n", 925 | "We can use logical operators - AND, OR and NOT - to include additional search terms in our request." 
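One aside before the next cell: the notebook builds the query string by concatenating pieces of text, which is fine for simple terms but becomes fiddly once search terms contain spaces or special characters. The `requests` module can build and encode the query string for you through its `params` argument - a minimal sketch, reusing the `api_key` variable loaded above:

```python
import requests

baseurl = "http://content.guardianapis.com/search"  # note: no trailing "?" needed
auth = {"api-key": api_key}                         # same authentication header as before
params = {"q": "covid-19 OR coronavirus"}           # requests URL-encodes the spaces for us

response = requests.get(baseurl, params=params, headers=auth)
print(response.url)          # view the encoded web address that was actually requested
print(response.status_code)  # 200 indicates success
```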
926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "execution_count": null, 931 | "metadata": { 932 | "slideshow": { 933 | "slide_type": "subslide" 934 | } 935 | }, 936 | "outputs": [], 937 | "source": [ 938 | "# Define web address and search terms\n", 939 | "\n", 940 | "baseurl = \"http://content.guardianapis.com/search?\"\n", 941 | "searchterms = \"covid-19 OR coronavirus\" # terms we want to search for\n", 942 | "auth = {\"api-key\": api_key}\n", 943 | "\n", 944 | "webadd = baseurl + \"q=\" + searchterms\n", 945 | "print(webadd)\n", 946 | "\n", 947 | "# Make call to API\n", 948 | "\n", 949 | "response = requests.get(webadd, headers=auth)\n", 950 | "response.status_code" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "metadata": { 957 | "slideshow": { 958 | "slide_type": "subslide" 959 | } 960 | }, 961 | "outputs": [], 962 | "source": [ 963 | "data = response.json()\n", 964 | "data[\"response\"][\"total\"]" 965 | ] 966 | }, 967 | { 968 | "cell_type": "markdown", 969 | "metadata": { 970 | "slideshow": { 971 | "slide_type": "subslide" 972 | } 973 | }, 974 | "source": [ 975 | "Note how the total number of results is higher than if we just searched for \"covid-19\" on its own." 976 | ] 977 | }, 978 | { 979 | "cell_type": "markdown", 980 | "metadata": { 981 | "slideshow": { 982 | "slide_type": "skip" 983 | } 984 | }, 985 | "source": [ 986 | "**EXERCISE**: Adapt the code above to include other search terms that you consider relevant to Covid-19 (e.g., covid-19 AND Scotland). Consult the API's documentation if you need help constructing more complicated search queries (*Query term* section)." 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": null, 992 | "metadata": { 993 | "slideshow": { 994 | "slide_type": "skip" 995 | } 996 | }, 997 | "outputs": [], 998 | "source": [ 999 | "# INSERT EXERCISE CODE HERE" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "metadata": { 1005 | "slideshow": { 1006 | "slide_type": "subslide" 1007 | } 1008 | }, 1009 | "source": [ 1010 | "#### Dealing with multiple pages of results\n", 1011 | "\n", 1012 | "Sticking with our previous example, we noted how the search results were restricted to ten per page. APIs often do not return all of the results for a given request due to the computing resource needed to process large volumes of data. 
There are a couple of ways of dealing with this issue:\n", 1013 | "* Increasing the number of results returned per page; AND\n", 1014 | "* Requesting each individual page of results\n", 1015 | "\n", 1016 | "First, increase the number of results per page using the `page-size` filter:" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "metadata": { 1023 | "slideshow": { 1024 | "slide_type": "subslide" 1025 | } 1026 | }, 1027 | "outputs": [], 1028 | "source": [ 1029 | "baseurl = \"http://content.guardianapis.com/search?\"\n", 1030 | "searchterms = \"covid-19 OR coronavirus\"\n", 1031 | "auth = {\"api-key\": api_key}\n", 1032 | "numresults = \"50\" # number of results per page\n", 1033 | "\n", 1034 | "webadd = baseurl + \"q=\" + searchterms + \"&page-size=\" + numresults\n", 1035 | "print(webadd)\n", 1036 | "\n", 1037 | "# Make call to API\n", 1038 | "\n", 1039 | "response = requests.get(webadd, headers=auth)\n", 1040 | "response.status_code" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "markdown", 1045 | "metadata": { 1046 | "slideshow": { 1047 | "slide_type": "subslide" 1048 | } 1049 | }, 1050 | "source": [ 1051 | "Confirm the `page-size` filter was applied:" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "execution_count": null, 1057 | "metadata": { 1058 | "slideshow": { 1059 | "slide_type": "fragment" 1060 | } 1061 | }, 1062 | "outputs": [], 1063 | "source": [ 1064 | "data = response.json()\n", 1065 | "data[\"response\"][\"pageSize\"]" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "markdown", 1070 | "metadata": { 1071 | "slideshow": { 1072 | "slide_type": "subslide" 1073 | } 1074 | }, 1075 | "source": [ 1076 | "Good, now let's request each page of results. We start by finding out how many pages we need to request:" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": null, 1082 | "metadata": { 1083 | "slideshow": { 1084 | "slide_type": "fragment" 1085 | } 1086 | }, 1087 | "outputs": [], 1088 | "source": [ 1089 | "total_pages = data[\"response\"][\"pages\"]\n", 1090 | "total_pages = total_pages + 1\n", 1091 | "total_pages" 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "markdown", 1096 | "metadata": { 1097 | "slideshow": { 1098 | "slide_type": "subslide" 1099 | } 1100 | }, 1101 | "source": [ 1102 | "Note how we needed to add one to the total number of pages. This is because of the way Python loops over a range of numbers. 
For example, the code below loops over a range of numbers beginning at one and **up to but not including** five:" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": null, 1108 | "metadata": { 1109 | "slideshow": { 1110 | "slide_type": "fragment" 1111 | } 1112 | }, 1113 | "outputs": [], 1114 | "source": [ 1115 | "for i in range(1, 5):\n", 1116 | " print(i)" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "metadata": { 1122 | "slideshow": { 1123 | "slide_type": "subslide" 1124 | } 1125 | }, 1126 | "source": [ 1127 | "Requesting the total number of pages may take a while, so let's just request the first 20:" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": null, 1133 | "metadata": { 1134 | "slideshow": { 1135 | "slide_type": "subslide" 1136 | } 1137 | }, 1138 | "outputs": [], 1139 | "source": [ 1140 | "for pagenum in range(1, 21):\n", 1141 | " \n", 1142 | " webadd = baseurl + \"q=\" + searchterms + \"&page-size=\" + numresults \\\n", 1143 | " + \"&page=\" + str(pagenum)\n", 1144 | " print(webadd)\n", 1145 | "\n", 1146 | " response = requests.get(webadd, headers=auth)\n", 1147 | " data = response.json()\n", 1148 | " \n", 1149 | " outfile = \"./downloads/guardian-api-covid-19-search-page-\" \\\n", 1150 | " + str(pagenum) + \"-\" + date + \".json\"\n", 1151 | "\n", 1152 | " with open(outfile, \"w\") as f:\n", 1153 | " json.dump(data, f)" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "code", 1158 | "execution_count": null, 1159 | "metadata": { 1160 | "slideshow": { 1161 | "slide_type": "subslide" 1162 | } 1163 | }, 1164 | "outputs": [], 1165 | "source": [ 1166 | "os.listdir(\"./downloads\")" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "markdown", 1171 | "metadata": { 1172 | "slideshow": { 1173 | "slide_type": "subslide" 1174 | } 1175 | }, 1176 | "source": [ 1177 | "#### Requesting article contents\n", 1178 | "\n", 1179 | "The list of search results contain data and metadata about individual articles. What if we wanted the contents of the article i.e., the text? Thankfully the API has an endpoint (*Single Item*) that provides access to this data.\n", 1180 | "\n", 1181 | "First, we need to extract an article from our list of search results. We do this by accessing its *positional value* (index) in the list: " 1182 | ] 1183 | }, 1184 | { 1185 | "cell_type": "code", 1186 | "execution_count": null, 1187 | "metadata": { 1188 | "slideshow": { 1189 | "slide_type": "subslide" 1190 | } 1191 | }, 1192 | "outputs": [], 1193 | "source": [ 1194 | "search_results = data[\"response\"][\"results\"]\n", 1195 | "article = search_results[0] # extract first article in list of results\n", 1196 | "article" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": { 1202 | "slideshow": { 1203 | "slide_type": "skip" 1204 | } 1205 | }, 1206 | "source": [ 1207 | "In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in a list is located at position zero (e.g., `search_results[0]`), the second item at position one (e.g., `search_results[1]`) etc.\n", 1208 | "\n", 1209 | "**TASK**: extract a different article using another index value. Remember, there are 50 elements in the `search_results` list, so `[49]` is the highest index value you can refer to." 
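If it helps with the task above, the sketch below lists every result alongside its index using `enumerate()`, and shows that negative indices count backwards from the end of the list, so `search_results[-1]` is the final article. It only uses keys we have already seen in the search results.

```python
# List each result with its positional index
for index, result in enumerate(search_results):
    print(index, result["webPublicationDate"], result["sectionName"])

# Negative indices count backwards from the end of the list
last_article = search_results[-1]   # same as search_results[len(search_results) - 1]
print(last_article["type"])
```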
1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": { 1215 | "slideshow": { 1216 | "slide_type": "subslide" 1217 | } 1218 | }, 1219 | "source": [ 1220 | "Next, we need to request its contents using the web address contained in the `apiUrl` key and specifying we want the body (text) of the article returned also:" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": null, 1226 | "metadata": { 1227 | "slideshow": { 1228 | "slide_type": "subslide" 1229 | } 1230 | }, 1231 | "outputs": [], 1232 | "source": [ 1233 | "baseurl = article[\"apiUrl\"]\n", 1234 | "auth = {\"api-key\": api_key}\n", 1235 | "field = \"body\"\n", 1236 | "\n", 1237 | "webadd = baseurl + \"?show-fields=\" + field\n", 1238 | "\n", 1239 | "# Make call to API\n", 1240 | "\n", 1241 | "response = requests.get(webadd, headers=auth)\n", 1242 | "response.status_code" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "code", 1247 | "execution_count": null, 1248 | "metadata": { 1249 | "scrolled": true, 1250 | "slideshow": { 1251 | "slide_type": "subslide" 1252 | } 1253 | }, 1254 | "outputs": [], 1255 | "source": [ 1256 | "data = response.json()\n", 1257 | "data" 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "markdown", 1262 | "metadata": { 1263 | "slideshow": { 1264 | "slide_type": "skip" 1265 | } 1266 | }, 1267 | "source": [ 1268 | "Note we now have a key called `fields` which contains a dictionary with one key (`body`):" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": null, 1274 | "metadata": { 1275 | "slideshow": { 1276 | "slide_type": "skip" 1277 | } 1278 | }, 1279 | "outputs": [], 1280 | "source": [ 1281 | "data[\"response\"][\"content\"][\"fields\"].keys()" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": null, 1287 | "metadata": { 1288 | "slideshow": { 1289 | "slide_type": "subslide" 1290 | } 1291 | }, 1292 | "outputs": [], 1293 | "source": [ 1294 | "text = data[\"response\"][\"content\"][\"fields\"][\"body\"]\n", 1295 | "text" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "markdown", 1300 | "metadata": { 1301 | "slideshow": { 1302 | "slide_type": "skip" 1303 | } 1304 | }, 1305 | "source": [ 1306 | "The article's content is returned as a long piece of unstructured text (known as a *string* data type in Python). However you may have noticed lots of strange symbols or tags scattered throughout (e.g., `
<p>
`, ``) - their presence confirms that the content is actually a piece of HTML, the programming language that web pages are written in. Thankfully, Python has a module for working with HTML data.\n", 1307 | "\n", 1308 | "`BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:" 1309 | ] 1310 | }, 1311 | { 1312 | "cell_type": "code", 1313 | "execution_count": null, 1314 | "metadata": { 1315 | "slideshow": { 1316 | "slide_type": "subslide" 1317 | } 1318 | }, 1319 | "outputs": [], 1320 | "source": [ 1321 | "from bs4 import BeautifulSoup as soup # module for parsing web pages\n", 1322 | "\n", 1323 | "soup_text = soup(text, \"html.parser\") # convert text to HTML\n", 1324 | "type(soup_text)" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": { 1330 | "slideshow": { 1331 | "slide_type": "subslide" 1332 | } 1333 | }, 1334 | "source": [ 1335 | "In the above code, we apply the `soup()` method from the `BeautifulSoup` module to the `text` variable, and store the results in a new variable called `soup_text`. \n", 1336 | "\n", 1337 | "Now that the raw text has been converted to a different data type, we are able to navigate and extract the information of interest. For example, let's extract all of the links in the article:" 1338 | ] 1339 | }, 1340 | { 1341 | "cell_type": "code", 1342 | "execution_count": null, 1343 | "metadata": { 1344 | "slideshow": { 1345 | "slide_type": "subslide" 1346 | } 1347 | }, 1348 | "outputs": [], 1349 | "source": [ 1350 | "links = soup_text.find_all(\"a\") # find all tags\n", 1351 | "links" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "markdown", 1356 | "metadata": { 1357 | "slideshow": { 1358 | "slide_type": "skip" 1359 | } 1360 | }, 1361 | "source": [ 1362 | "We used the `find_all()` method to search for all `` tags in the article. And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags there are by calling on the `len()` function:" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "code", 1367 | "execution_count": null, 1368 | "metadata": { 1369 | "slideshow": { 1370 | "slide_type": "subslide" 1371 | } 1372 | }, 1373 | "outputs": [], 1374 | "source": [ 1375 | "len(links)" 1376 | ] 1377 | }, 1378 | { 1379 | "cell_type": "markdown", 1380 | "metadata": { 1381 | "slideshow": { 1382 | "slide_type": "skip" 1383 | } 1384 | }, 1385 | "source": [ 1386 | "The number of links represents the number of other resources (e.g., other articles, files on the web) this article refers to.\n", 1387 | "\n", 1388 | "We can view each element in the list of results as follows:" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": null, 1394 | "metadata": { 1395 | "slideshow": { 1396 | "slide_type": "subslide" 1397 | } 1398 | }, 1399 | "outputs": [], 1400 | "source": [ 1401 | "for link in links:\n", 1402 | " print(\"--------\")\n", 1403 | " print(link.get(\"href\")) # extract the URL from within the tag\n", 1404 | " print(\"--------\")\n", 1405 | " print(\"\\r\") # print some blank space for better formatting" 1406 | ] 1407 | }, 1408 | { 1409 | "cell_type": "markdown", 1410 | "metadata": { 1411 | "slideshow": { 1412 | "slide_type": "subslide" 1413 | } 1414 | }, 1415 | "source": [ 1416 | "Parsing HTML data is a topic in itself and we cover it in much more detail in our Websites as a Source of Data lesson." 
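For many research purposes you will want the readable text rather than the links. As a brief aside (not covered further here), `BeautifulSoup` can strip out the HTML tags via its `get_text()` method - a minimal sketch, reusing the `soup_text` variable created above:

```python
# Extract the article's readable text, dropping all HTML tags
plain_text = soup_text.get_text(separator=" ", strip=True)

print(plain_text[:500])          # preview the first 500 characters
print(len(plain_text.split()))   # rough word count
```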
1417 |    ]
1418 |   },
1419 |   {
1420 |    "cell_type": "markdown",
1421 |    "metadata": {
1422 |     "slideshow": {
1423 |      "slide_type": "subslide"
1424 |     }
1425 |    },
1426 |    "source": [
1427 |     "Finally, let's conclude by saving the article's data and metadata to a JSON file:"
1428 |    ]
1429 |   },
1430 |   {
1431 |    "cell_type": "code",
1432 |    "execution_count": null,
1433 |    "metadata": {
1434 |     "slideshow": {
1435 |      "slide_type": "subslide"
1436 |     }
1437 |    },
1438 |    "outputs": [],
1439 |    "source": [
1440 |     "# Use the article's unique id to name the file\n",
1441 |     "\n",
1442 |     "article_id = data[\"response\"][\"content\"][\"id\"].replace(\"/\", \"-\")\n",
1443 |     "#\n",
1444 |     "# We need to remove forward slashes (\"/\") as our computer interprets\n",
1445 |     "# these as folder separators.\n",
1446 |     "#\n",
1447 |     "\n",
1448 |     "outfile = \"./downloads/\" + article_id + \".json\"\n",
1449 |     "\n",
1450 |     "with open(outfile, \"w\") as f:\n",
1451 |     "    json.dump(data, f)\n",
1452 |     "    \n",
1453 |     "outfile # view the file name "
1454 |    ]
1455 |   },
1456 |   {
1457 |    "cell_type": "markdown",
1458 |    "metadata": {
1459 |     "slideshow": {
1460 |      "slide_type": "subslide"
1461 |     }
1462 |    },
1463 |    "source": [
1464 |     "#### Handling the rate limit\n",
1465 |     "\n",
1466 |     "The Guardian API's free level of access allows up to 12 calls (requests) per second, up to a daily limit of 5,000. You may consider this more than adequate for your purposes, but for argument's sake let's say you're worried about breaching these limits. How can you keep track of the number of calls you make?\n",
1467 |     "\n",
1468 |     "Thankfully, APIs include some metadata with the response that tells you the rate limit and how many calls you have remaining. First, let's see how we can access this metadata:"
1469 |    ]
1470 |   },
1471 |   {
1472 |    "cell_type": "code",
1473 |    "execution_count": null,
1474 |    "metadata": {
1475 |     "slideshow": {
1476 |      "slide_type": "subslide"
1477 |     }
1478 |    },
1479 |    "outputs": [],
1480 |    "source": [
1481 |     "baseurl = \"http://content.guardianapis.com/search?\"\n",
1482 |     "searchterms = \"covid-19 OR coronavirus\"\n",
1483 |     "auth = {\"api-key\": api_key}\n",
1484 |     "\n",
1485 |     "webadd = baseurl + \"q=\" + searchterms\n",
1486 |     "print(webadd)\n",
1487 |     "\n",
1488 |     "response = requests.get(webadd, headers=auth)\n",
1489 |     "response.headers # view response metadata"
1490 |    ]
1491 |   },
1492 |   {
1493 |    "cell_type": "markdown",
1494 |    "metadata": {
1495 |     "slideshow": {
1496 |      "slide_type": "subslide"
1497 |     }
1498 |    },
1499 |    "source": [
1500 |     "While most of this metadata won't be of interest (to a researcher, though software/web developers may disagree), some fields provide important information for tracking our use of the API:\n",
1501 |     "* `X-RateLimit-Limit-day` - the number of calls you can make per day\n",
1502 |     "* `X-RateLimit-Limit-minute` - the number of calls you can make per minute\n",
1503 |     "* `X-RateLimit-Remaining-day` - the number of calls remaining per day\n",
1504 |     "* `X-RateLimit-Remaining-minute` - the number of calls remaining per minute\n",
1505 |     "\n",
1506 |     "In addition, the `Content-Type` field is interesting as it tells us what format the content of the response is returned as (e.g., JSON, XML, CSV)."
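A complementary tactic to tracking your usage (covered next) is simply to pause between calls so you stay well inside the per-second and per-minute limits. The sketch below uses Python's built-in `time` module, which is not imported elsewhere in this notebook, and reuses the `api_key` variable; the half-second pause is a cautious, arbitrary choice rather than a value taken from the Guardian's documentation.

```python
import time

import requests

auth = {"api-key": api_key}

for pagenum in range(1, 4):  # a handful of pages, just to illustrate
    webadd = "http://content.guardianapis.com/search?q=covid-19&page=" + str(pagenum)
    response = requests.get(webadd, headers=auth)
    print(pagenum, response.status_code, response.headers.get("X-RateLimit-Remaining-day"))
    time.sleep(0.5)  # wait half a second before making the next call
```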
1507 | ] 1508 | }, 1509 | { 1510 | "cell_type": "markdown", 1511 | "metadata": { 1512 | "slideshow": { 1513 | "slide_type": "subslide" 1514 | } 1515 | }, 1516 | "source": [ 1517 | "Let's use some of this metadata to define some variables that track our use of the API:" 1518 | ] 1519 | }, 1520 | { 1521 | "cell_type": "code", 1522 | "execution_count": null, 1523 | "metadata": { 1524 | "slideshow": { 1525 | "slide_type": "subslide" 1526 | } 1527 | }, 1528 | "outputs": [], 1529 | "source": [ 1530 | "rate_limit_daily = response.headers[\"X-RateLimit-Limit-day\"]\n", 1531 | "remaining_daily = response.headers[\"X-RateLimit-Remaining-day\"]\n", 1532 | "total_calls = int(rate_limit_daily) - int(remaining_daily)\n", 1533 | "\n", 1534 | "print(\"We have made {} calls to the API today.\".format(total_calls))" 1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "markdown", 1539 | "metadata": { 1540 | "slideshow": { 1541 | "slide_type": "skip" 1542 | } 1543 | }, 1544 | "source": [ 1545 | "(Note how we needed to convert the `rate_limit_daily` and `remaining_daily` variables to integers before we could perform mathematical operations on them.)" 1546 | ] 1547 | }, 1548 | { 1549 | "cell_type": "markdown", 1550 | "metadata": { 1551 | "slideshow": { 1552 | "slide_type": "subslide" 1553 | } 1554 | }, 1555 | "source": [ 1556 | "Great, now we have a way of keeping track of the number of calls we make to the API. Let's test this approach by making more requests:" 1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "code", 1561 | "execution_count": null, 1562 | "metadata": { 1563 | "slideshow": { 1564 | "slide_type": "subslide" 1565 | } 1566 | }, 1567 | "outputs": [], 1568 | "source": [ 1569 | "# Initialise the variables counting the number of calls\n", 1570 | "\n", 1571 | "if \"total_calls\" in globals():\n", 1572 | " print(\"You have made {} calls this session\".format(total_calls))\n", 1573 | " counter = total_calls\n", 1574 | "else:\n", 1575 | " counter = 0\n", 1576 | "\n", 1577 | "# Request data from Guardian API\n", 1578 | "\n", 1579 | "for pagenum in range(1, 50):\n", 1580 | " \n", 1581 | " baseurl = \"http://content.guardianapis.com/search?\"\n", 1582 | " searchterms = \"covid-19 OR coronavirus\"\n", 1583 | " auth = {\"api-key\": api_key}\n", 1584 | " numresults = \"50\"\n", 1585 | " \n", 1586 | " webadd = baseurl + \"q=\" + searchterms + \"&page-size=\" + \\\n", 1587 | " numresults + \"&page=\" + str(pagenum)\n", 1588 | "\n", 1589 | " # Make call to API\n", 1590 | "\n", 1591 | " response = requests.get(webadd, headers=auth)\n", 1592 | " response.status_code\n", 1593 | " \n", 1594 | " # Get metadata\n", 1595 | " \n", 1596 | " rate_limit_daily = response.headers[\"X-RateLimit-Limit-day\"]\n", 1597 | " remaining_daily = response.headers[\"X-RateLimit-Remaining-day\"]\n", 1598 | " total_calls = int(rate_limit_daily) - int(remaining_daily)\n", 1599 | " \n", 1600 | " # Update counter\n", 1601 | " \n", 1602 | " counter += 1\n", 1603 | " print(\"Call number {} to the API\".format(counter))\n", 1604 | "\n", 1605 | "# Update overall number of calls\n", 1606 | "\n", 1607 | "if \"total_calls\" not in globals():\n", 1608 | " total_calls = counter\n", 1609 | "elif \"total_calls\" in globals():\n", 1610 | " total_calls = total_calls + (counter - total_calls)\n", 1611 | "else:\n", 1612 | " pass\n", 1613 | "\n", 1614 | "print(\"You have made {} total calls to the API in this session\".format(total_calls)) " 1615 | ] 1616 | }, 1617 | { 1618 | "cell_type": "markdown", 1619 | "metadata": { 1620 | "slideshow": { 1621 | "slide_type": 
"skip" 1622 | } 1623 | }, 1624 | "source": [ 1625 | "**TASK**: Re-run the code above another two or three times to see whether the `total_calls` variable is correctly keeping track of the number of requests." 1626 | ] 1627 | }, 1628 | { 1629 | "cell_type": "markdown", 1630 | "metadata": { 1631 | "slideshow": { 1632 | "slide_type": "skip" 1633 | } 1634 | }, 1635 | "source": [ 1636 | "There are quite a few new techniques introduced in the code above, so let's unpack each element one at a time:" 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "raw", 1641 | "metadata": { 1642 | "slideshow": { 1643 | "slide_type": "skip" 1644 | } 1645 | }, 1646 | "source": [ 1647 | "if \"total_calls\" in globals():\n", 1648 | " print(\"You have made {} calls this session\".format(total_calls))\n", 1649 | " counter = total_calls\n", 1650 | "else:\n", 1651 | " counter = 0" 1652 | ] 1653 | }, 1654 | { 1655 | "cell_type": "markdown", 1656 | "metadata": { 1657 | "slideshow": { 1658 | "slide_type": "skip" 1659 | } 1660 | }, 1661 | "source": [ 1662 | "The above block of code checks whether the `total_calls` variable exists: if it does then the variable keeping track of individual calls (`counter`) is initialised to the value of `total_calls`; if `total_calls` does not exist then `counter` is set to 0 (i.e., we have not made any calls to the API yet)." 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "raw", 1667 | "metadata": { 1668 | "slideshow": { 1669 | "slide_type": "skip" 1670 | } 1671 | }, 1672 | "source": [ 1673 | "# Request data from Guardian API\n", 1674 | "\n", 1675 | "for pagenum in range(1, 6):\n", 1676 | " \n", 1677 | " baseurl = \"http://content.guardianapis.com/search?\"\n", 1678 | " searchterms = \"covid-19 OR coronavirus\"\n", 1679 | " auth = {\"api-key\": api_key}\n", 1680 | " numresults = \"50\"\n", 1681 | " \n", 1682 | " webadd = baseurl + \"q=\" + searchterms + \"&page-size=\" + \\\n", 1683 | " numresults + \"&page=\" + str(pagenum)\n", 1684 | "\n", 1685 | " # Make call to API\n", 1686 | "\n", 1687 | " response = requests.get(webadd, headers=auth)\n", 1688 | " response.status_code\n", 1689 | " \n", 1690 | " # Get metadata\n", 1691 | " \n", 1692 | " rate_limit_daily = response.headers[\"X-RateLimit-Limit-day\"]\n", 1693 | " remaining_daily = response.headers[\"X-RateLimit-Remaining-day\"]\n", 1694 | " total_calls = int(rate_limit_daily) - int(remaining_daily)\n", 1695 | " \n", 1696 | " # Update counter\n", 1697 | " \n", 1698 | " counter += 1\n", 1699 | " print(\"Call number {} to the API\".format(counter))" 1700 | ] 1701 | }, 1702 | { 1703 | "cell_type": "markdown", 1704 | "metadata": { 1705 | "slideshow": { 1706 | "slide_type": "skip" 1707 | } 1708 | }, 1709 | "source": [ 1710 | "The above block performs the usual request to the API, but at the end it increases the `counter` variable by 1 to record the fact we made a call." 
1711 | ] 1712 | }, 1713 | { 1714 | "cell_type": "raw", 1715 | "metadata": { 1716 | "slideshow": { 1717 | "slide_type": "skip" 1718 | } 1719 | }, 1720 | "source": [ 1721 | "# Update overall number of calls\n", 1722 | "\n", 1723 | "if \"total_calls\" not in globals():\n", 1724 | " total_calls = counter\n", 1725 | "elif \"total_calls\" in globals():\n", 1726 | " total_calls = total_calls + (counter - total_calls)\n", 1727 | "else:\n", 1728 | " pass\n", 1729 | "\n", 1730 | "print(\"You have made {} total calls to the API in this session\".format(total_calls))" 1731 | ] 1732 | }, 1733 | { 1734 | "cell_type": "markdown", 1735 | "metadata": { 1736 | "slideshow": { 1737 | "slide_type": "skip" 1738 | } 1739 | }, 1740 | "source": [ 1741 | "Finally, this block updates the `total_calls` variable, conditional on whether it already exists or not." 1742 | ] 1743 | }, 1744 | { 1745 | "cell_type": "markdown", 1746 | "metadata": { 1747 | "slideshow": { 1748 | "slide_type": "skip" 1749 | } 1750 | }, 1751 | "source": [ 1752 | "Note that there usually isn't a penalty for exceeding the call limit; what happens is you are restricted from making further calls until a sufficient period of time has passed. We say \"usually\", because it depends on the API in question, whether you have paid for a commercial/custom level of access etc." 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "markdown", 1757 | "metadata": { 1758 | "slideshow": { 1759 | "slide_type": "skip" 1760 | } 1761 | }, 1762 | "source": [ 1763 | "### Concluding remarks on Covid-19 data\n", 1764 | "\n", 1765 | "The Covid-19 pandemic is a seismic public health crisis that will dominate our lives for the foreseeable future. The example code above is not a craven attempt to provide some topicality to these materials, nor is it simply a particularly good example for learning how to interact with an API. There are real opportunities for social scientists to capture and analyse data on this phenomenon, such as The Guardian's reporting of the crisis.\n", 1766 | "\n", 1767 | "You may also be interested in the publicly available data repository provided by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE): https://github.com/CSSEGISandData/COVID-19. Updated daily, this resource provides CSV (Comma Separated Values) files of global Covid-19 statistics (e.g., country-level time series), as well as PDF copies of the World Health Organisation's situation reports.\n", 1768 | "\n", 1769 | "At a UK level, the NHS releases data about Covid-19 symptoms reported through its NHS Pathways and 111 online platforms: NHS Open Data. Data on reported cases is also provided by Public Health England (PHE): COVID-19: track coronavirus cases. Many of these datasets are available as openly available as CSV files.\n", 1770 | "\n", 1771 | "There is a collaborative Google document capturing sources of social data relating to COVID-19, curated by Dr Ben Baumberg Geiger.\n", 1772 | "\n", 1773 | "Finally, the Office for National Statistics (ONS) provides data and experimental indicators of social like in the UK under Covid-19: Coronavirus (COVID-19)." 
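As a practical pointer for the resources above: daily CSV files such as those in the JHU CSSE repository can be pulled straight into `pandas` (already imported in this notebook) with `read_csv()`, which accepts URLs as well as local paths. The URL below is a placeholder rather than a real file path - check the repository for the file you actually want.

```python
import pandas as pd

# Hypothetical location of a CSV file - replace with a real raw-file URL
# from https://github.com/CSSEGISandData/COVID-19
csv_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/<path-to-csv-file>"

# covid_df = pd.read_csv(csv_url)   # uncomment once you have a real URL
# covid_df.head()                   # preview the first few rows
```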
1774 | ] 1775 | }, 1776 | { 1777 | "cell_type": "markdown", 1778 | "metadata": { 1779 | "slideshow": { 1780 | "slide_type": "slide" 1781 | } 1782 | }, 1783 | "source": [ 1784 | "### A social research example: UK police data\n", 1785 | "\n", 1786 | "We have also produced sample code for interacting with the UK Police API, which provides data on policing activities and street-level crime in England, Wales and Northern Ireland.\n", 1787 | "\n", 1788 | "UK Police API coding demonstration" 1789 | ] 1790 | }, 1791 | { 1792 | "cell_type": "markdown", 1793 | "metadata": { 1794 | "slideshow": { 1795 | "slide_type": "skip" 1796 | } 1797 | }, 1798 | "source": [ 1799 | "## Value, limitations and ethics\n", 1800 | "\n", 1801 | "Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of using APIs, however it is crucial we reflect critically on its value, limitations and ethical implications for social science purposes.\n", 1802 | "\n", 1803 | "### Value\n", 1804 | "\n", 1805 | "* The process of interacting with an API is a common and mature computational method, with lots of established packages (e.g., `requests`in Python), examples and help available. As a result the learning curve is not as steep as with other methods, and it is possible for a beginner to access data from an API in under an hour (see the UK Police API example in Appendix B, if you do not believe us).\n", 1806 | "* APIs provide access to data that is intended to be shared; thus, the data are already in a more user-friendly format (e.g., as JSON or CSV files). Compare this to data scraped from a web page, which often needs extensive cleaning and parsing in order to be analysed and shared. \n", 1807 | "* The richness of some of the information and data made available through APIs is a point worth repeating. Many public, private and charitable institutions use their web sites to release and regularly update information of value to social scientists. Getting a handle on the volume, variety and velocity of such information is extremely challenging (if not impossible) without the use of computational methods.\n", 1808 | "* APIs provide flexible access to data: you do not need to bulk download data and then filter through the results to arrive at what you need. APIs allow you to format your request in a way that returns exactly and only what you require.\n", 1809 | "* Finally, the data you need might only be available through an API. This might be the case if you require organisational and financial information on UK companies, thereby necessitating the use of the Companies House API.\n", 1810 | "\n", 1811 | "### Limitations\n", 1812 | "\n", 1813 | "* APIs restrict the number of requests for data you can make. For example, the Guardian API allows up to 5,000 requests per day, the Companies House API up to 600 per five-minute period. These limits ensure the API can offer a reliable service to a wide user base (in contrast to a single user constantly requesting data and preventing others from doing so). Therefore you need to plan and structure your requests in such a way that you comply with the limit **and** retrieve the data you need for your analysis.\n", 1814 | "* The quality of an API's official documentation can vary wildly. 
Some APIs provide accurate, detailed guides on how to make requests and handle responses; others are sparsely written and it can be a pain figuring out what data are available at each endpoint, how to correctly specify the web address, what the rate limit is etc. And if guidance is available, it might not be for the programming language you are using.\n", 1815 | "* Data protection laws, such as the EU's General Data Protection Regulations (GDPR), impinge on the data you collect through APIs. GDPR means that you are responsible for processing, securing, storing, using and deleting an individual's personal data, even if it’s publicly available through an API. This is a critical and detailed area of data-driven activities, and we encourage you to consult relevant guidance (see *Further reading and resources* section).\n", 1816 | "* APIs can be updated, resulting in changes to the rate limit, authentication requirements, endpoints providing access to the data, cost of using the service etc. It can be a lot of work maintaining your code, especially if you make it available for use by others.\n", 1817 | "* An API is a product and you must comply with the Terms of Service/Use associated with it; else there can be legal implications resulting from non-compliance - for example, there might be restrictions around sharing data collected from an API.\n", 1818 | "* Interacting with APIs, and engaging in computational social science generally, is dependent on your computing setup. For example, you may not possess administrative rights on your machine, preventing you from scheduling your script to run on a regular basis (e.g., your computer automatically goes to sleep after a set period of time and you cannot change this setting). There are ways around this and you do not need a high performance computing setup to collect data from an API, but it is worth keeping in mind nonetheless.\n", 1819 | "\n", 1820 | "### Ethical considerations\n", 1821 | "\n", 1822 | "For the purposes of this discussion, we will assume you have sought and received ethical approval for a piece of research through the usual institutional processes: you've already considered consent, harm to researcher and participant, data security and curation etc. One of these aspects, *informed consent*, needs major consideration when it comes to using some APIs (Lomborg & Bechmann 2014). Let's take Twitter user data, which is available via an API (we'll work with this in a future training series). Users of Twitter will have signed up to the Terms of Service/Use when registering on the platform, where they will have agreed various clauses regarding the use and sharing of their information. However, can they reasonably be said to have given consent to participating in research using this data? What if you capture a user's personal data through the API, which at a later date the user deletes from their own profile: should you use this information in your research? There are no easy answers and we encourage you to consider these issues prior to beginning your data collection (see further reading suggestions)." 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "markdown", 1827 | "metadata": { 1828 | "slideshow": { 1829 | "slide_type": "skip" 1830 | } 1831 | }, 1832 | "source": [ 1833 | "## Conclusion\n", 1834 | "\n", 1835 | "Interacting with an API is a simple yet powerful computational method for collecting data of value for social science research. It provides a relatively gentle introduction to using programming languages, also. 
However, \"with great power comes great responsibility\" (sorry). APIs take you into the realm of data protection, Terms of Service/Use, and many murky ethical issues. Wielded sensibly and sensitively, collecting data from APIs is a valuable and exciting social science research method.\n", 1836 | "\n", 1837 | "Good luck on your data-driven travels!" 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "markdown", 1842 | "metadata": { 1843 | "slideshow": { 1844 | "slide_type": "skip" 1845 | } 1846 | }, 1847 | "source": [ 1848 | "## Bibliography\n", 1849 | "\n", 1850 | "Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. https://jupyter4edu.github.io/jupyter-edu-book/.\n", 1851 | "\n", 1852 | "Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.\n", 1853 | "\n", 1854 | "Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org\n", 1855 | "\n", 1856 | "Lomborg, S., and Bechmann, A. (2014). Using APIs for Data Collection on Social Media. *The Information Society: An International Journal*, 30(4): 256-265.\n", 1857 | "\n", 1858 | "Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" 1859 | ] 1860 | }, 1861 | { 1862 | "cell_type": "markdown", 1863 | "metadata": { 1864 | "slideshow": { 1865 | "slide_type": "skip" 1866 | } 1867 | }, 1868 | "source": [ 1869 | "## Further reading and resources\n", 1870 | "\n", 1871 | "We publish a list of useful books, papers, websites and other resources on our web-scraping Github repository: [Reading list]\n", 1872 | "\n", 1873 | "The help documentation for the `requests` module is refreshingly readable and useful:\n", 1874 | "* `requests`\n", 1875 | "\n", 1876 | "You may also be interested in the following articles specifically relating to APIs:\n", 1877 | "* How To Use Web APIs in Python 3\n", 1878 | "* Guide to Data Protection\n", 1879 | "* Social Media Research: A Guide to Ethics" 1880 | ] 1881 | }, 1882 | { 1883 | "cell_type": "markdown", 1884 | "metadata": { 1885 | "slideshow": { 1886 | "slide_type": "skip" 1887 | } 1888 | }, 1889 | "source": [ 1890 | "## Appendices" 1891 | ] 1892 | }, 1893 | { 1894 | "cell_type": "markdown", 1895 | "metadata": { 1896 | "slideshow": { 1897 | "slide_type": "skip" 1898 | } 1899 | }, 1900 | "source": [ 1901 | "### Appendix A - Requesting URLs\n", 1902 | "\n", 1903 | "We refer to a website's location on the internet as its web address or Uniform Resource Locator (URL).\n", 1904 | "\n", 1905 | "In Python we've made use of the excellent `requests` module. By calling the `requests.get()` method, we mimic the manual process of launching a web browser and visiting a website. The `requests` module achieves this by placing a _request_ to the server hosting the website (e.g., show me the contents of the website), and handling the _response_ that is returned (e.g., the contents of the website and some metadata about the request). This _request-response_ protocol is known as HTTP (HyperText Transfer Protocol); HTTP allows computers to communicate with each other over the internet - you can learn more about it at W3 Schools.\n", 1906 | "\n", 1907 | "Run the code below to learn more about the data and metadata returned by `requests.get()`." 
1908 | ] 1909 | }, 1910 | { 1911 | "cell_type": "code", 1912 | "execution_count": null, 1913 | "metadata": { 1914 | "slideshow": { 1915 | "slide_type": "skip" 1916 | } 1917 | }, 1918 | "outputs": [], 1919 | "source": [ 1920 | "import requests\n", 1921 | "\n", 1922 | "url = \"https://httpbin.org/html\"\n", 1923 | "response = requests.get(url)\n", 1924 | "\n", 1925 | "print(\"1. {}\".format(response)) # returns the object type (i.e. a response) and status code\n", 1926 | "print(\"\\r\")\n", 1927 | "\n", 1928 | "print(\"2. {}\".format(response.headers)) # returns a dictionary of response headers\n", 1929 | "print(\"\\r\")\n", 1930 | "\n", 1931 | "print(\"3. {}\".format(response.headers[\"Date\"])) # return a particular header\n", 1932 | "print(\"\\r\")\n", 1933 | "\n", 1934 | "print(\"4. {}\".format(response.request)) # returns the request object that requested this response\n", 1935 | "print(\"\\r\")\n", 1936 | "\n", 1937 | "print(\"5. {}\".format(response.url)) # returns the URL of the response\n", 1938 | "print(\"\\r\")\n", 1939 | "\n", 1940 | "#print(response.text) # returns the text contained in the response (i.e. the paragraphs, headers etc of the web page)\n", 1941 | "#print(response.content) # returns the content of the response (i.e. the HTML contents of the web page)\n", 1942 | "\n", 1943 | "# Visit https://www.w3schools.com/python/ref_requests_response.asp for a full list of what is returned by the server\n", 1944 | "# in response to a request." 1945 | ] 1946 | }, 1947 | { 1948 | "cell_type": "markdown", 1949 | "metadata": { 1950 | "slideshow": { 1951 | "slide_type": "skip" 1952 | } 1953 | }, 1954 | "source": [ 1955 | "-- END OF FILE --" 1956 | ] 1957 | } 1958 | ], 1959 | "metadata": { 1960 | "kernelspec": { 1961 | "display_name": "Python 3", 1962 | "language": "python", 1963 | "name": "python3" 1964 | }, 1965 | "language_info": { 1966 | "codemirror_mode": { 1967 | "name": "ipython", 1968 | "version": 3 1969 | }, 1970 | "file_extension": ".py", 1971 | "mimetype": "text/x-python", 1972 | "name": "python", 1973 | "nbconvert_exporter": "python", 1974 | "pygments_lexer": "ipython3", 1975 | "version": "3.7.3" 1976 | }, 1977 | "toc": { 1978 | "base_numbering": 1, 1979 | "nav_menu": { 1980 | "height": "86.7333px", 1981 | "width": "160px" 1982 | }, 1983 | "number_sections": true, 1984 | "sideBar": false, 1985 | "skip_h1_title": true, 1986 | "title_cell": "Table of Contents", 1987 | "title_sidebar": "Contents", 1988 | "toc_cell": true, 1989 | "toc_position": { 1990 | "height": "100px", 1991 | "left": "366px", 1992 | "top": "669.033px", 1993 | "width": "165px" 1994 | }, 1995 | "toc_section_display": true, 1996 | "toc_window_display": false 1997 | } 1998 | }, 1999 | "nbformat": 4, 2000 | "nbformat_minor": 2 2001 | } 2002 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/code/web-scraping-websites-code-2020-04-23.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "skip" 8 | } 9 | }, 10 | "source": [ 11 | "![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "extensions": { 18 | "jupyter_dashboards": { 19 | "version": 1, 20 | "views": { 21 | "grid_default": {}, 22 | "report_default": { 23 | "hidden": false 24 | } 25 | } 26 | } 27 | }, 28 | "slideshow": { 29 | "slide_type": "skip" 30 | } 31 | }, 32 | "source": [ 33 
| "# Web-scraping for Social Science Research" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "extensions": { 40 | "jupyter_dashboards": { 41 | "version": 1, 42 | "views": { 43 | "grid_default": {}, 44 | "report_default": { 45 | "hidden": false 46 | } 47 | } 48 | } 49 | }, 50 | "slideshow": { 51 | "slide_type": "skip" 52 | } 53 | }, 54 | "source": [ 55 | "Welcome to the UK Data Service training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platorms, text data, conducting simulations (agent based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.\n", 56 | "\n", 57 | "* To access training materials for the entire series: [Training Materials]\n", 58 | "\n", 59 | "* To keep up to date with upcoming and past training events: [Events]\n", 60 | "\n", 61 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 62 | "\n", 63 | "Dr Julia Kasmire and Dr Diarmuid McDonnell
\n", 64 | "UK Data Service
\n", 65 | "University of Manchester
\n", 66 | "April 2020" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": { 72 | "slideshow": { 73 | "slide_type": "skip" 74 | }, 75 | "toc": true 76 | }, 77 | "source": [ 78 | "

Table of Contents

\n", 79 | "" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": { 85 | "extensions": { 86 | "jupyter_dashboards": { 87 | "version": 1, 88 | "views": { 89 | "grid_default": {}, 90 | "report_default": { 91 | "hidden": false 92 | } 93 | } 94 | } 95 | }, 96 | "slideshow": { 97 | "slide_type": "skip" 98 | } 99 | }, 100 | "source": [ 101 | "## Introduction\n", 102 | "\n", 103 | "In this training series we cover some of the essential skills needed to collect data from the web. In particular we focus on two different approaches:\n", 104 | "1. Collecting data stored on web pages. [Focus of this notebook]\n", 105 | "2. Downloading data from online databases using Application Programming Interfaces (APIs). [LINK]\n", 106 | " \n", 107 | "Do not be alarmed by the technical aspects: both approaches can be implemented using simple code, a standard desktop or laptop, and a decent internet connection. \n", 108 | "\n", 109 | "Given the Covid-19 public health crisis in which this programme of work occurred, we will examine ways in which computational methods can provide valuable data for studying this phenomenon. This is a fast moving, evolving public health emergency that, in addition to other impacts, will shape research agendas across the sciences for years to come. Therefore it is important to learn how we, as social scientists, can access or generate data that will provide a better understanding of this disease." 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "extensions": { 116 | "jupyter_dashboards": { 117 | "version": 1, 118 | "views": { 119 | "grid_default": {}, 120 | "report_default": { 121 | "hidden": false 122 | } 123 | } 124 | } 125 | }, 126 | "slideshow": { 127 | "slide_type": "skip" 128 | } 129 | }, 130 | "source": [ 131 | "## Guide to using this resource\n", 132 | "\n", 133 | "This learning resource was built using Jupyter Notebook, an open-source software application that allows you to mix code, results and narrative in a single document. As Barba et al. (2019) espouse:\n", 134 | "> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.\n", 135 | "\n", 136 | "If you are familiar with Jupyter notebooks then skip ahead to the main content (*Collecting data from web pages*). Otherwise, the following is a quick guide to navigating and interacting with the notebook." 
137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": { 142 | "extensions": { 143 | "jupyter_dashboards": { 144 | "version": 1, 145 | "views": { 146 | "grid_default": {}, 147 | "report_default": { 148 | "hidden": false 149 | } 150 | } 151 | } 152 | }, 153 | "slideshow": { 154 | "slide_type": "slide" 155 | } 156 | }, 157 | "source": [ 158 | "### Interaction\n", 159 | "\n", 160 | "**You only need to execute the code that is contained in sections which are marked by `In []`.**\n", 161 | "\n", 162 | "To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).\n", 163 | "\n", 164 | "Try it for yourself:" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": { 171 | "slideshow": { 172 | "slide_type": "subslide" 173 | } 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "print(\"Enter your name and press enter:\")\n", 178 | "name = input()\n", 179 | "print(\"Hello {}, enjoy learning more about Python and web-scraping!\".format(name)) " 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": { 185 | "extensions": { 186 | "jupyter_dashboards": { 187 | "version": 1, 188 | "views": { 189 | "grid_default": {}, 190 | "report_default": { 191 | "hidden": false 192 | } 193 | } 194 | } 195 | }, 196 | "slideshow": { 197 | "slide_type": "subslide" 198 | } 199 | }, 200 | "source": [ 201 | "### Learn more\n", 202 | "\n", 203 | "Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the materials provided by Dani Arribas-Bel at the University of Liverpool. " 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "extensions": { 210 | "jupyter_dashboards": { 211 | "version": 1, 212 | "views": { 213 | "grid_default": {}, 214 | "report_default": { 215 | "hidden": false 216 | } 217 | } 218 | } 219 | }, 220 | "slideshow": { 221 | "slide_type": "skip" 222 | } 223 | }, 224 | "source": [ 225 | "## Collecting data from web pages\n", 226 | "\n", 227 | "### What is web-scraping?\n", 228 | "\n", 229 | "What is web-scraping? It is a computational technique for capturing information stored on a web page. \"Computational\" is the key word, as it is possible to perform this task manually, though that carries considerable disadvantages in terms of accuracy and labour resource.\n", 230 | "\n", 231 | "It is generally implemented using a programming script - that is, executable code written in some programming language, though there are software applications that you can use - a previous webinar by our colleague Peter Smyth on the 16 April demonstrated how to use MS Excel to collect data from a website.\n", 232 | "\n", 233 | "It is relatively simple to implement using open-source programming languages e.g., Python, R. You do not need to be highly computationally literate, nor write screeds of code: this is a popular and mature computational method, with tons of documentation and examples for you to learn from.\n", 234 | "\n", 235 | "### Reasons to engage in web-scraping\n", 236 | "\n", 237 | "Websites can be an important source of publicly available information on phenomena of interest - for instance, they are used to store and disseminate files, text, photos, videos, tables etc. 
However, the data stored on websites are typically not structured or formatted for ease of use by researchers: for example, it may not be possible to perform a bulk download of all the files you need (think of needing the annual accounts of all registered companies in London for your research...), or the information may not even be held in a file and instead spread across paragraphs and tables throughout a web page (or worse, web pages). Luckily, web-scraping provides a means of quickly and accurately capturing and formatting data stored on web pages.\n", 238 | "\n", 239 | "Before we delve into writing code to capture data from the web, let's clearly state the logic underpinning the technique." 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": { 245 | "extensions": { 246 | "jupyter_dashboards": { 247 | "version": 1, 248 | "views": { 249 | "grid_default": {}, 250 | "report_default": { 251 | "hidden": false 252 | } 253 | } 254 | } 255 | }, 256 | "slideshow": { 257 | "slide_type": "skip" 258 | } 259 | }, 260 | "source": [ 261 | "### Logic of web-scraping\n", 262 | "\n", 263 | "We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:\n", 264 | "1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via https://ukdataservice.ac.uk/.\n", 265 | "2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.\n", 266 | "\n", 267 | "And **do** the following:\n", 268 | "3. Request the web page using its web address.\n", 269 | "4. Parse the structure of the web page so your programming language can work with its contents.\n", 270 | "5. Extract the information we are interested in.\n", 271 | "6. Write this information to a file for future use.\n", 272 | "\n", 273 | "For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed. \n", 274 | "\n", 275 | "For our first example, let's convert the steps above into executable Python code for capturing data about Covid-19." 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": { 281 | "extensions": { 282 | "jupyter_dashboards": { 283 | "version": 1, 284 | "views": { 285 | "grid_default": {}, 286 | "report_default": { 287 | "hidden": false 288 | } 289 | } 290 | } 291 | }, 292 | "slideshow": { 293 | "slide_type": "skip" 294 | } 295 | }, 296 | "source": [ 297 | "## Example: Capturing Covid-19 data" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": { 303 | "extensions": { 304 | "jupyter_dashboards": { 305 | "version": 1, 306 | "views": { 307 | "grid_default": {}, 308 | "report_default": { 309 | "hidden": false 310 | } 311 | } 312 | } 313 | }, 314 | "slideshow": { 315 | "slide_type": "skip" 316 | } 317 | }, 318 | "source": [ 319 | "Worldometer is a website that provides up-to-date statistics on the following domains: the global population; food, water and energy consumption; environmental degradation and many others (known as its Real Time Statistics Project). 
In its own words:[1]\n", 320 | "> Worldometer is run by an international team of developers, researchers, and volunteers with the goal of making world statistics available in a thought-provoking and time relevant format to a wide audience around the world. Worldometer is owned by Dadax, an independent company. We have no political, governmental, or corporate affiliation.\n", 321 | "\n", 322 | "Since the outbreak of Covid-19 it has provided regular daily snapshots on the progress of this disease, both globally and at a country level.\n", 323 | "\n", 324 | "[1]: https://www.worldometers.info/about/" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": { 330 | "extensions": { 331 | "jupyter_dashboards": { 332 | "version": 1, 333 | "views": { 334 | "grid_default": {}, 335 | "report_default": { 336 | "hidden": false 337 | } 338 | } 339 | } 340 | }, 341 | "slideshow": { 342 | "slide_type": "skip" 343 | } 344 | }, 345 | "source": [ 346 | "### Identifying the web address\n", 347 | "\n", 348 | "The website can be accessed here: https://www.worldometers.info/coronavirus/\n", 349 | "\n", 350 | "Let's work through the steps necessary to collect data about the number of Covid-19 cases, deaths and recoveries globally.\n", 351 | "\n", 352 | "First, let's become familiar with this website: click on the link below to view the web page in your browser: https://www.worldometers.info/coronavirus/\n", 353 | "\n", 354 | "(Note: it is possible to load websites into Python in order to view them, however the website we are interested in doesn't allow this. See the example code below for how it would work for a different website - just remove the quotation marks enclosing the code and run the cell)." 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": { 361 | "slideshow": { 362 | "slide_type": "skip" 363 | } 364 | }, 365 | "outputs": [], 366 | "source": [ 367 | "\"\"\"\n", 368 | "from IPython.display import IFrame\n", 369 | "\n", 370 | "IFrame(\"https://ukdataservice.ac.uk/\", width=\"600\", height=\"650\")\n", 371 | "\"\"\"" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": { 377 | "extensions": { 378 | "jupyter_dashboards": { 379 | "version": 1, 380 | "views": { 381 | "grid_default": {}, 382 | "report_default": { 383 | "hidden": false 384 | } 385 | } 386 | } 387 | }, 388 | "slideshow": { 389 | "slide_type": "skip" 390 | } 391 | }, 392 | "source": [ 393 | "### Locating information\n", 394 | "\n", 395 | "The statistics we need are near the top of the page under the following headings:\n", 396 | "* Coronavirus Cases:\n", 397 | "* Deaths:\n", 398 | "* Recovered:\n", 399 | "\n", 400 | "However, we need more information than this in order to scrape the statistics. Websites are written in a language called HyperText Markup Language (HTML), which can be understood as follows:[2]\n", 401 | "* HTML describes the structure of a web page\n", 402 | "* HTML consists of a series of elements\n", 403 | "* HTML elements tell the browser how to display the content\n", 404 | "* HTML elements are represented by tags\n", 405 | "* HTML tags label pieces of content such as \"heading\", \"paragraph\", \"table\", and so on\n", 406 | "* Browsers do not display the HTML tags, but use them to render the content of the page \n", 407 | "\n", 408 | "#### Visually inspecting the underlying HTML code\n", 409 | "\n", 410 | "Therefore, what we need are the tags that identify the section of the web page where the statistics are stored. 
We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options. \n", 411 | "\n", 412 | "**TASK**: Try this yourself with the Worldometer web page.\n", 413 | "\n", 414 | "The snippet below shows sample source code for the section of the Covid-19 web page we are interested in.\n", 415 | "\n", 416 | "\n", 417 | "\n", 418 | "[2]: https://www.w3schools.com/html/html_intro.asp" 419 | ] 420 | }, 421 | { 422 | "cell_type": "raw", 423 | "metadata": { 424 | "extensions": { 425 | "jupyter_dashboards": { 426 | "version": 1, 427 | "views": { 428 | "grid_default": {}, 429 | "report_default": { 430 | "hidden": false 431 | } 432 | } 433 | } 434 | }, 435 | "slideshow": { 436 | "slide_type": "skip" 437 | } 438 | }, 439 | "source": [ 440 | "
\n", 441 | "

Coronavirus Cases:

\n", 442 | "
\n", 443 | " 252,731 \n", 444 | "\n", 445 | "
\n", 446 | "
\n", 447 | "\n", 448 | "\n", 449 | "
view by country
\n", 450 | "\n", 451 | "
\n", 452 | "

Deaths:

\n", 453 | "
\n", 454 | " 10,405\n", 455 | "
\n", 456 | "
\n", 457 | "\n", 458 | "\n", 459 | "
\n", 460 | "

Recovered:

\n", 461 | "
\n", 462 | " 89,056\n", 463 | "
\n", 464 | "
" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": { 470 | "extensions": { 471 | "jupyter_dashboards": { 472 | "version": 1, 473 | "views": { 474 | "grid_default": {}, 475 | "report_default": { 476 | "hidden": false 477 | } 478 | } 479 | } 480 | }, 481 | "slideshow": { 482 | "slide_type": "skip" 483 | } 484 | }, 485 | "source": [ 486 | "In the above example, we can see multiple tags that contain various elements (e.g., text content, other tags). For instance, we can see that the Covid-19 statistics are enclosed in `<\\span>` tags, which themselves are located within `
<div>` tags.\n", 487 | "\n", 488 | "As you can see, exploring and locating the contents of a web page remains a manual and visual process, and in Brooker's estimation (2020, 252):\n", 489 | "> Hence, more so than the actual Python, it's the detective work of unpicking the internal structure of a webpage that is probably the most vital skill here." 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": { 495 | "extensions": { 496 | "jupyter_dashboards": { 497 | "version": 1, 498 | "views": { 499 | "grid_default": {}, 500 | "report_default": { 501 | "hidden": false 502 | } 503 | } 504 | } 505 | }, 506 | "slideshow": { 507 | "slide_type": "slide" 508 | } 509 | }, 510 | "source": [ 511 | "### Requesting the web page\n", 512 | "\n", 513 | "Now that we possess the necessary information, let's begin the process of scraping the web page. There is a preliminary step, which is setting up Python with the modules it needs to perform the web-scrape." 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": { 520 | "extensions": { 521 | "jupyter_dashboards": { 522 | "version": 1, 523 | "views": { 524 | "grid_default": {}, 525 | "report_default": { 526 | "hidden": true 527 | } 528 | } 529 | } 530 | }, 531 | "slideshow": { 532 | "slide_type": "subslide" 533 | } 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "# Import modules\n", 538 | "\n", 539 | "import os # module for navigating your machine (e.g., file directories)\n", 540 | "import requests # module for requesting urls\n", 541 | "import csv # module for handling csv files\n", 542 | "import pandas as pd # module for handling data\n", 543 | "from datetime import datetime # module for working with dates and time\n", 544 | "from bs4 import BeautifulSoup as soup # module for parsing web pages\n", 545 | "\n", 546 | "print(\"Successfully imported necessary modules\")" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": { 552 | "extensions": { 553 | "jupyter_dashboards": { 554 | "version": 1, 555 | "views": { 556 | "grid_default": {}, 557 | "report_default": { 558 | "hidden": false 559 | } 560 | } 561 | } 562 | }, 563 | "slideshow": { 564 | "slide_type": "skip" 565 | } 566 | }, 567 | "source": [ 568 | "Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install ` in Stata, or `install.packages()` in R. For now just understand that many useful modules need to be imported every time you start a new Python session." 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": { 574 | "slideshow": { 575 | "slide_type": "subslide" 576 | } 577 | }, 578 | "source": [ 579 | "Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL)."
580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": { 586 | "extensions": { 587 | "jupyter_dashboards": { 588 | "version": 1, 589 | "views": { 590 | "grid_default": {}, 591 | "report_default": { 592 | "hidden": false 593 | } 594 | } 595 | } 596 | }, 597 | "slideshow": { 598 | "slide_type": "subslide" 599 | } 600 | }, 601 | "outputs": [], 602 | "source": [ 603 | "# Define the URL where the web page can be accessed\n", 604 | "\n", 605 | "url = \"https://www.worldometers.info/coronavirus/\"\n", 606 | "\n", 607 | "# Request the web page from the URL\n", 608 | "\n", 609 | "response = requests.get(url, allow_redirects=True) # request the url\n", 610 | "# response.headers\n", 611 | "response.status_code # check if page was requested successfully" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": { 617 | "slideshow": { 618 | "slide_type": "skip" 619 | } 620 | }, 621 | "source": [ 622 | "First, we define a variable (also known as an 'object' in Python) called `url` that contains the web address of the page we want to request. Next, we use the `get()` method of the `requests` module to request the web page, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.\n", 623 | "\n", 624 | "Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see here for some guidance).\n", 625 | "\n", 626 | "For example, the following would also work:" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": { 633 | "slideshow": { 634 | "slide_type": "subslide" 635 | } 636 | }, 637 | "outputs": [], 638 | "source": [ 639 | "web_address = \"https://www.worldometers.info/coronavirus/\"\n", 640 | "\n", 641 | "scrape_result = requests.get(web_address, allow_redirects=True)\n", 642 | "scrape_result.status_code" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": { 648 | "extensions": { 649 | "jupyter_dashboards": { 650 | "version": 1, 651 | "views": { 652 | "grid_default": {}, 653 | "report_default": { 654 | "hidden": false 655 | } 656 | } 657 | } 658 | }, 659 | "slideshow": { 660 | "slide_type": "skip" 661 | } 662 | }, 663 | "source": [ 664 | "Back to the request:\n", 665 | "\n", 666 | "Good, we get a status code of _200_ - this means we successfully requested the web page. Lau, Gonzalez and Nolan provide a succinct description of different types of response status codes:\n", 667 | "\n", 668 | "* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)\n", 669 | "* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)\n", 670 | "* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)\n", 671 | "* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)\n", 672 | "* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 
500 Internal Server Error, 503 Service Unavailable)\n", 673 | "\n", 674 | "For clarity:\n", 675 | "* **Client**: your machine\n", 676 | "* **Server**: the machine you are requesting the web page from" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": { 682 | "slideshow": { 683 | "slide_type": "skip" 684 | } 685 | }, 686 | "source": [ 687 | "You may be wondering exactly what it is we requested: if you were to type the URL (https://www.worldometers.info/coronavirus/) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can examine the `text` attribute of the `response` variable:" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": { 694 | "slideshow": { 695 | "slide_type": "subslide" 696 | } 697 | }, 698 | "outputs": [], 699 | "source": [ 700 | "response.text[:1000]" 701 | ] 702 | }, 703 | { 704 | "cell_type": "markdown", 705 | "metadata": { 706 | "slideshow": { 707 | "slide_type": "skip" 708 | } 709 | }, 710 | "source": [ 711 | "This shows us a sample of the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, the result of this request will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.\n", 712 | "\n", 713 | "(See Appendix A for more examples of how the `requests` module works and what information it returns.)" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": { 719 | "extensions": { 720 | "jupyter_dashboards": { 721 | "version": 1, 722 | "views": { 723 | "grid_default": {}, 724 | "report_default": { 725 | "hidden": false 726 | } 727 | } 728 | } 729 | }, 730 | "slideshow": { 731 | "slide_type": "slide" 732 | } 733 | }, 734 | "source": [ 735 | "### Parsing the web page\n", 736 | "\n", 737 | "Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the `response.text` attribute into a `BeautifulSoup` variable. `BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. 
Let's see how it works in practice:" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "metadata": { 744 | "extensions": { 745 | "jupyter_dashboards": { 746 | "version": 1, 747 | "views": { 748 | "grid_default": {}, 749 | "report_default": { 750 | "hidden": true 751 | } 752 | } 753 | } 754 | }, 755 | "slideshow": { 756 | "slide_type": "subslide" 757 | } 758 | }, 759 | "outputs": [], 760 | "source": [ 761 | "# Extract the contents of the webpage from the response\n", 762 | "\n", 763 | "soup_response = soup(response.text, \"html.parser\") # Parse the text as a Beautiful Soup object\n", 764 | "soup_sample = soup(response.text[:1000], \"html.parser\") # Parse a sample of the text\n", 765 | "soup_sample" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": { 771 | "extensions": { 772 | "jupyter_dashboards": { 773 | "version": 1, 774 | "views": { 775 | "grid_default": {}, 776 | "report_default": { 777 | "hidden": false 778 | } 779 | } 780 | } 781 | }, 782 | "slideshow": { 783 | "slide_type": "skip" 784 | } 785 | }, 786 | "source": [ 787 | "`BeautifulSoup` has taken the unstructured text contained in `response.text` and parsed it as HTML: now we can clearly see the hierarchical structure and tags that comprise a web page's HTML. \n", 788 | "\n", 789 | "Note again how we call on a method (`soup()`) from a module (`BeautifulSoup`) and store the results in a variable (`soup_response`).\n", 790 | "\n", 791 | "Of course, we've only displayed a sample of the code here for readability. What about the full text contained in `soup_response`: how do we navigate such voluminous results? Thankfully the `BeautifulSoup` module provides some intuitive methods for doing so." 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": { 797 | "slideshow": { 798 | "slide_type": "skip" 799 | } 800 | }, 801 | "source": [ 802 | "**TASK**: view the full contents of the web page stored in the variable `soup_response`." 803 | ] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "execution_count": null, 808 | "metadata": { 809 | "slideshow": { 810 | "slide_type": "skip" 811 | } 812 | }, 813 | "outputs": [], 814 | "source": [ 815 | "# INSERT CODE HERE" 816 | ] 817 | }, 818 | { 819 | "cell_type": "markdown", 820 | "metadata": { 821 | "extensions": { 822 | "jupyter_dashboards": { 823 | "version": 1, 824 | "views": { 825 | "grid_default": {}, 826 | "report_default": { 827 | "hidden": false 828 | } 829 | } 830 | } 831 | }, 832 | "slideshow": { 833 | "slide_type": "slide" 834 | } 835 | }, 836 | "source": [ 837 | "### Extracting information\n", 838 | "\n", 839 | "Now that we have parsed the web page, we can use Python to navigate and extract the information of interest. To begin with, let's locate the section of the web page containing the overall Covid-19 statistics." 
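As a warm-up, here is a minimal, self-contained sketch of how `find()` and `find_all()` behave. The HTML snippet is made up purely for illustration; it only mimics the structure we are about to search for on the real page.

```python
from bs4 import BeautifulSoup as soup

# Toy HTML (for illustration only) that mimics the structure we are searching for
toy_html = """
<div id="maincounter-wrap"><h1>Cases:</h1><span>1,234</span></div>
<div id="maincounter-wrap"><h1>Deaths:</h1><span>56</span></div>
"""

toy_soup = soup(toy_html, "html.parser")

first_div = toy_soup.find("div", id="maincounter-wrap")     # returns the first matching tag only
all_divs = toy_soup.find_all("div", id="maincounter-wrap")  # returns a list of every matching tag

print(first_div.find("span").text)  # the text enclosed by the first <span>: '1,234'
print(len(all_divs))                # 2
```

The same two methods are applied to the real page, stored in `soup_response`, in the next cell.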
840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": { 846 | "extensions": { 847 | "jupyter_dashboards": { 848 | "version": 1, 849 | "views": { 850 | "grid_default": {}, 851 | "report_default": { 852 | "hidden": true 853 | } 854 | } 855 | } 856 | }, 857 | "slideshow": { 858 | "slide_type": "subslide" 859 | } 860 | }, 861 | "outputs": [], 862 | "source": [ 863 | "# Find the sections containing the data of interest\n", 864 | "\n", 865 | "sections = soup_response.find_all(\"div\", id=\"maincounter-wrap\")\n", 866 | "sections" 867 | ] 868 | }, 869 | { 870 | "cell_type": "markdown", 871 | "metadata": { 872 | "extensions": { 873 | "jupyter_dashboards": { 874 | "version": 1, 875 | "views": { 876 | "grid_default": {}, 877 | "report_default": { 878 | "hidden": false 879 | } 880 | } 881 | } 882 | }, 883 | "slideshow": { 884 | "slide_type": "skip" 885 | } 886 | }, 887 | "source": [ 888 | "We used the `find_all()` method to search for all `
<div>` tags where the id=\"maincounter-wrap\". And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags match this id by calling on the `len()` function:" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": null, 894 | "metadata": { 895 | "extensions": { 896 | "jupyter_dashboards": { 897 | "version": 1, 898 | "views": { 899 | "grid_default": {}, 900 | "report_default": { 901 | "hidden": true 902 | } 903 | } 904 | } 905 | }, 906 | "slideshow": { 907 | "slide_type": "subslide" 908 | } 909 | }, 910 | "outputs": [], 911 | "source": [ 912 | "len(sections)" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": { 918 | "slideshow": { 919 | "slide_type": "skip" 920 | } 921 | }, 922 | "source": [ 923 | "We can view each element in the list of results as follows:" 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": null, 929 | "metadata": { 930 | "slideshow": { 931 | "slide_type": "subslide" 932 | } 933 | }, 934 | "outputs": [], 935 | "source": [ 936 | "for section in sections:\n", 937 | " print(\"--------\")\n", 938 | " print(section)\n", 939 | " print(\"--------\")\n", 940 | " print(\"\\r\") # print some blank space for better formatting" 941 | ] 942 | }, 943 | { 944 | "cell_type": "markdown", 945 | "metadata": { 946 | "slideshow": { 947 | "slide_type": "subslide" 948 | } 949 | }, 950 | "source": [ 951 | "We are nearing the end of our scrape. The penultimate task is to extract the statistics within the `<span>` tags and store them in some variables. We do this by accessing each item in the _sections_ list using its positional value (index)." 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "execution_count": null, 957 | "metadata": { 958 | "extensions": { 959 | "jupyter_dashboards": { 960 | "version": 1, 961 | "views": { 962 | "grid_default": {}, 963 | "report_default": { 964 | "hidden": true 965 | } 966 | } 967 | } 968 | }, 969 | "slideshow": { 970 | "slide_type": "fragment" 971 | } 972 | }, 973 | "outputs": [], 974 | "source": [ 975 | "cases = sections[0].find(\"span\").text.replace(\" \", \"\").replace(\",\", \"\")\n", 976 | "deaths = sections[1].find(\"span\").text.replace(\",\", \"\")\n", 977 | "recov = sections[2].find(\"span\").text.replace(\",\", \"\")\n", 978 | "print(\"Number of cases: {}; deaths: {}; and recoveries: {}.\".format(cases, deaths, recov))" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": { 984 | "slideshow": { 985 | "slide_type": "skip" 986 | } 987 | }, 988 | "source": [ 989 | "The above code performs a couple of operations:\n", 990 | "* For each item (i.e., set of `
<div>` tags) in the list, it finds the `<span>` tags and extracts the text enclosed within them.\n", 991 | "* We clean the text by removing blank spaces and commas.\n", 992 | "\n", 993 | "In this example, referring to an item's positional index works because our list of `<div>
` tags stored in the `sections` variable is ordered: the tag containing the number of cases appears before the tag containing the number of deaths, which appears before the tag containing the number of recovered patients.\n", 994 | "\n", 995 | "In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in the list is accessed using `sections[0]`, the second using `sections[1]` etc.\n", 996 | "\n", 997 | "(To learn more about lists in Python, see Chapter 22 of *How to Code in Python 3*)" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "markdown", 1002 | "metadata": { 1003 | "extensions": { 1004 | "jupyter_dashboards": { 1005 | "version": 1, 1006 | "views": { 1007 | "grid_default": {}, 1008 | "report_default": { 1009 | "hidden": false 1010 | } 1011 | } 1012 | } 1013 | }, 1014 | "slideshow": { 1015 | "slide_type": "slide" 1016 | } 1017 | }, 1018 | "source": [ 1019 | "### Saving results from the scrape\n", 1020 | "\n", 1021 | "The final task is to save the variables to a file that we can use in the future. We'll write to a Comma-Separated Values (CSV) file for this purpose, as it is an open-source, text-based file format that is commonly used for sharing data on the web." 1022 | ] 1023 | }, 1024 | { 1025 | "cell_type": "code", 1026 | "execution_count": null, 1027 | "metadata": { 1028 | "extensions": { 1029 | "jupyter_dashboards": { 1030 | "version": 1, 1031 | "views": { 1032 | "grid_default": {}, 1033 | "report_default": { 1034 | "hidden": true 1035 | } 1036 | } 1037 | } 1038 | }, 1039 | "slideshow": { 1040 | "slide_type": "subslide" 1041 | } 1042 | }, 1043 | "outputs": [], 1044 | "source": [ 1045 | "# Create a downloads folder\n", 1046 | "\n", 1047 | "try:\n", 1048 | " os.mkdir(\"./downloads\")\n", 1049 | "except:\n", 1050 | " print(\"Unable to create folder: already exists\")" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "markdown", 1055 | "metadata": { 1056 | "slideshow": { 1057 | "slide_type": "skip" 1058 | } 1059 | }, 1060 | "source": [ 1061 | "The use of \"./\" tells the `os.mkdir()` command that the \"downloads\" folder should be created at the same level of the directory where this notebook is located. 
So if this notebook was stored in a directory located at \"C:/Users/joebloggs/notebooks\", the `os.mkdir()` command would result in a new folder located at \"C:/Users/joebloggs/notebooks/downloads\".\n", 1062 | " \n", 1063 | "(Technically the \"./\" is not needed and you could just write `os.mkdir(\"downloads\")` but it's good practice to be explicit)" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": null, 1069 | "metadata": { 1070 | "extensions": { 1071 | "jupyter_dashboards": { 1072 | "version": 1, 1073 | "views": { 1074 | "grid_default": {}, 1075 | "report_default": { 1076 | "hidden": true 1077 | } 1078 | } 1079 | } 1080 | }, 1081 | "slideshow": { 1082 | "slide_type": "subslide" 1083 | } 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "# Write the results to a CSV file\n", 1088 | "\n", 1089 | "date = datetime.now().strftime(\"%Y-%m-%d\") # get today's date in YYYY-MM-DD format\n", 1090 | "print(date)\n", 1091 | "\n", 1092 | "variables = [\"Cases\", \"Deaths\", \"Recoveries\"] # define variable names for the file\n", 1093 | "outfile = \"./downloads/covid-19-statistics-\" + date + \".csv\" # define a file for writing the results\n", 1094 | "obs = cases, deaths, recov # define an observation (row)\n", 1095 | "\n", 1096 | "with open(outfile, \"w\", newline=\"\") as f: # with the file open in \"write\" mode, and giving it a shorter name (f)\n", 1097 | " writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV\n", 1098 | " writer.writerow(variables) # write the variable names to the first row of the file\n", 1099 | " writer.writerow(obs) # write the observation to the next row in the file" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "metadata": { 1105 | "slideshow": { 1106 | "slide_type": "skip" 1107 | } 1108 | }, 1109 | "source": [ 1110 | "The code above defines some headers and a name and location for the file which will store the results of the scrape. We then open the file in *write* mode, and write the headers to the first row, and the statistics to subsequent rows." 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": { 1116 | "extensions": { 1117 | "jupyter_dashboards": { 1118 | "version": 1, 1119 | "views": { 1120 | "grid_default": {}, 1121 | "report_default": { 1122 | "hidden": false 1123 | } 1124 | } 1125 | } 1126 | }, 1127 | "slideshow": { 1128 | "slide_type": "subslide" 1129 | } 1130 | }, 1131 | "source": [ 1132 | "How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it." 
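A brief aside on the writing step above: because the date forms part of the filename, re-running the notebook produces one file per day. If you would rather accumulate the daily snapshots in a single file, a minimal sketch is to open a combined file in append mode and only write the header row once. The combined filename below is hypothetical, and the sketch assumes the `variables` and `obs` objects defined in the previous cell.

```python
import csv
import os

combined_file = "./downloads/covid-19-statistics-combined.csv"  # hypothetical filename for the running log
write_header = not os.path.exists(combined_file)                # only write headers if the file is new

with open(combined_file, "a", newline="") as f:  # "a" appends rather than overwriting
    writer = csv.writer(f)
    if write_header:
        writer.writerow(variables)  # variable names defined in the cell above
    writer.writerow(obs)            # today's observation, also defined above
```

Returning to the check of today's file: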
1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "code", 1137 | "execution_count": null, 1138 | "metadata": { 1139 | "extensions": { 1140 | "jupyter_dashboards": { 1141 | "version": 1, 1142 | "views": { 1143 | "grid_default": {}, 1144 | "report_default": { 1145 | "hidden": true 1146 | } 1147 | } 1148 | } 1149 | }, 1150 | "slideshow": { 1151 | "slide_type": "fragment" 1152 | } 1153 | }, 1154 | "outputs": [], 1155 | "source": [ 1156 | "# Check presence of file in \"downloads\" folder\n", 1157 | "\n", 1158 | "os.listdir(\"./downloads\")" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": null, 1164 | "metadata": { 1165 | "extensions": { 1166 | "jupyter_dashboards": { 1167 | "version": 1, 1168 | "views": { 1169 | "grid_default": {}, 1170 | "report_default": { 1171 | "hidden": true 1172 | } 1173 | } 1174 | } 1175 | }, 1176 | "slideshow": { 1177 | "slide_type": "subslide" 1178 | } 1179 | }, 1180 | "outputs": [], 1181 | "source": [ 1182 | "# Open file and read (import) its contents\n", 1183 | "\n", 1184 | "with open(outfile, \"r\") as f:\n", 1185 | " data = f.read()\n", 1186 | " \n", 1187 | "print(data) " 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "markdown", 1192 | "metadata": { 1193 | "extensions": { 1194 | "jupyter_dashboards": { 1195 | "version": 1, 1196 | "views": { 1197 | "grid_default": {}, 1198 | "report_default": { 1199 | "hidden": false 1200 | } 1201 | } 1202 | } 1203 | }, 1204 | "slideshow": { 1205 | "slide_type": "skip" 1206 | } 1207 | }, 1208 | "source": [ 1209 | "And Voila, we have successfully scraped a web page!" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": { 1215 | "extensions": { 1216 | "jupyter_dashboards": { 1217 | "version": 1, 1218 | "views": { 1219 | "grid_default": {}, 1220 | "report_default": { 1221 | "hidden": false 1222 | } 1223 | } 1224 | } 1225 | }, 1226 | "slideshow": { 1227 | "slide_type": "slide" 1228 | } 1229 | }, 1230 | "source": [ 1231 | "### Country-level Covid-19 data\n", 1232 | "\n", 1233 | "We will complete our work gathering data on the Covid-19 pandemic by employing the techniques we learned previously to capture country-level statistics. We'll assume some knowledge on your part and thus you'll notice fewer annotations explaining what is happening at each step. As we progress through this example, there are some tasks for you to complete and some questions to be answered also.\n", 1234 | "\n", 1235 | "**TASK**: if you need to, re-acquaint yourself with the Worldometer website - the table is located near the bottom of the page." 
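The country-level figures sit in an HTML table, so the scrape follows a standard table pattern: find the `<table>`, take its rows (`<tr>`), then take the cells (`<td>`) in each row. Here is a minimal, self-contained sketch using made-up HTML to show the pattern before we apply it to the live page.

```python
from bs4 import BeautifulSoup as soup

# Toy HTML table (for illustration only), not the real Worldometer markup
toy_table = """
<table id="example">
  <tbody>
    <tr><td>Country A</td><td>100</td></tr>
    <tr><td>Country B</td><td>50</td></tr>
  </tbody>
</table>
"""

example_table = soup(toy_table, "html.parser").find("table", id="example")
example_rows = example_table.find("tbody").find_all("tr")

for row in example_rows:
    cells = [cell.text.strip() for cell in row.find_all("td")]  # one list of cell values per row
    print(cells)
```

The cells that follow apply exactly this pattern to the live page.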
1236 | ] 1237 | }, 1238 | { 1239 | "cell_type": "markdown", 1240 | "metadata": { 1241 | "slideshow": { 1242 | "slide_type": "subslide" 1243 | } 1244 | }, 1245 | "source": [ 1246 | "First, let's get some of the preliminaries out of the way:" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": null, 1252 | "metadata": { 1253 | "slideshow": { 1254 | "slide_type": "fragment" 1255 | } 1256 | }, 1257 | "outputs": [], 1258 | "source": [ 1259 | "import os\n", 1260 | "import requests\n", 1261 | "import csv\n", 1262 | "import pandas as pd\n", 1263 | "from datetime import datetime\n", 1264 | "from bs4 import BeautifulSoup as soup\n", 1265 | "\n", 1266 | "date = datetime.now().strftime(\"%Y-%m-%d\")" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "markdown", 1271 | "metadata": { 1272 | "slideshow": { 1273 | "slide_type": "subslide" 1274 | } 1275 | }, 1276 | "source": [ 1277 | "Request the web page and parse it:" 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "code", 1282 | "execution_count": null, 1283 | "metadata": { 1284 | "slideshow": { 1285 | "slide_type": "fragment" 1286 | } 1287 | }, 1288 | "outputs": [], 1289 | "source": [ 1290 | "url = \"https://www.worldometers.info/coronavirus/\"\n", 1291 | "\n", 1292 | "response = requests.get(url, allow_redirects=True)\n", 1293 | "response.headers\n", 1294 | "soup_response = soup(response.text, \"html.parser\")\n", 1295 | "#\n", 1296 | "# QUESTION: How do you know if the web page was requested successfully?\n", 1297 | "#" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "metadata": { 1303 | "slideshow": { 1304 | "slide_type": "subslide" 1305 | } 1306 | }, 1307 | "source": [ 1308 | "Find the table containing country-level statistics:" 1309 | ] 1310 | }, 1311 | { 1312 | "cell_type": "code", 1313 | "execution_count": null, 1314 | "metadata": { 1315 | "slideshow": { 1316 | "slide_type": "fragment" 1317 | } 1318 | }, 1319 | "outputs": [], 1320 | "source": [ 1321 | "table = soup_response.find(\"table\", id=\"main_table_countries_today\").find(\"tbody\")\n", 1322 | "rows = table.find_all(\"tr\", style=\"\")" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "markdown", 1327 | "metadata": { 1328 | "slideshow": { 1329 | "slide_type": "subslide" 1330 | } 1331 | }, 1332 | "source": [ 1333 | "Extract the information contained in each row in the table: " 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "code", 1338 | "execution_count": null, 1339 | "metadata": { 1340 | "scrolled": true, 1341 | "slideshow": { 1342 | "slide_type": "fragment" 1343 | } 1344 | }, 1345 | "outputs": [], 1346 | "source": [ 1347 | "global_info = []\n", 1348 | "for row in rows:\n", 1349 | " columns = row.find_all(\"td\")\n", 1350 | " country_info = [column.text.strip() for column in columns]\n", 1351 | " global_info.append(country_info)\n", 1352 | "\n", 1353 | "print(global_info[0:3])\n", 1354 | "print(\"\\r\")\n", 1355 | "print(\"Number of rows in table: {}\".format(len(global_info)))\n", 1356 | "print(\"\\r\")\n", 1357 | "\n", 1358 | "del global_info[0] # delete first row containing world statistics" 1359 | ] 1360 | }, 1361 | { 1362 | "cell_type": "markdown", 1363 | "metadata": { 1364 | "slideshow": { 1365 | "slide_type": "skip" 1366 | } 1367 | }, 1368 | "source": [ 1369 | "First, we define a blank list to store statistics for each country (`global_info = []`); then for each row in the table, we extract the contents of each column and store the results in a list (`country_info = [column.text.strip() for column in columns]`); finally we add the 
results for each country to the overall list (`global_info.append(country_info)`)." 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "markdown", 1374 | "metadata": { 1375 | "slideshow": { 1376 | "slide_type": "subslide" 1377 | } 1378 | }, 1379 | "source": [ 1380 | "We save the results of the scrape to a file:" 1381 | ] 1382 | }, 1383 | { 1384 | "cell_type": "code", 1385 | "execution_count": null, 1386 | "metadata": { 1387 | "slideshow": { 1388 | "slide_type": "fragment" 1389 | } 1390 | }, 1391 | "outputs": [], 1392 | "source": [ 1393 | "try:\n", 1394 | " os.mkdir(\"./downloads\")\n", 1395 | "except OSError as error:\n", 1396 | " print(\"Folder already exists\")\n", 1397 | "\n", 1398 | "variables = [\"Number\", \"Country\", \"Total Cases\", \"New Cases\", \"Total Deaths\", \n", 1399 | " \"New Deaths\", \"Total Recovered\", \"Active Cases\", \n", 1400 | " \"Serious_Critical\", \"Total Cases Per 1m Pop\", \"Deaths Per 1m Pop\",\n", 1401 | " \"Total Tests\", \"Tests Per 1m Pop\", \"Population\"]\n", 1402 | "outfile = \"./downloads/covid-19-country-statistics-\" + date + \".csv\"\n", 1403 | "print(outfile)\n", 1404 | "\n", 1405 | "with open(outfile, \"w\", newline=\"\") as f:\n", 1406 | " writer = csv.writer(f)\n", 1407 | " writer.writerow(variables)\n", 1408 | " for country in global_info:\n", 1409 | " writer.writerow(country)" 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "markdown", 1414 | "metadata": { 1415 | "slideshow": { 1416 | "slide_type": "subslide" 1417 | } 1418 | }, 1419 | "source": [ 1420 | "Finally, we check the file was created; if so we load it into Python and examine its contents:" 1421 | ] 1422 | }, 1423 | { 1424 | "cell_type": "code", 1425 | "execution_count": null, 1426 | "metadata": { 1427 | "slideshow": { 1428 | "slide_type": "fragment" 1429 | } 1430 | }, 1431 | "outputs": [], 1432 | "source": [ 1433 | "data = pd.read_csv(outfile, encoding = \"ISO-8859-1\", index_col=False, usecols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])\n", 1434 | "data.sample(5)" 1435 | ] 1436 | }, 1437 | { 1438 | "cell_type": "code", 1439 | "execution_count": null, 1440 | "metadata": { 1441 | "scrolled": true, 1442 | "slideshow": { 1443 | "slide_type": "subslide" 1444 | } 1445 | }, 1446 | "outputs": [], 1447 | "source": [ 1448 | "data[data[\"Country\"]==\"Ireland\"]" 1449 | ] 1450 | }, 1451 | { 1452 | "cell_type": "markdown", 1453 | "metadata": {}, 1454 | "source": [ 1455 | "**TASK: change the code below to view the records for Ireland, Spain, and Kuwait.**" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "code", 1460 | "execution_count": null, 1461 | "metadata": {}, 1462 | "outputs": [], 1463 | "source": [ 1464 | "data[data[\"Country\"]==\"Kuwait\"]" 1465 | ] 1466 | }, 1467 | { 1468 | "cell_type": "markdown", 1469 | "metadata": { 1470 | "extensions": { 1471 | "jupyter_dashboards": { 1472 | "version": 1, 1473 | "views": { 1474 | "grid_default": {}, 1475 | "report_default": { 1476 | "hidden": false 1477 | } 1478 | } 1479 | } 1480 | }, 1481 | "slideshow": { 1482 | "slide_type": "skip" 1483 | } 1484 | }, 1485 | "source": [ 1486 | "### Concluding remarks on Covid-19 data\n", 1487 | "\n", 1488 | "The Covid-19 pandemic is a seismic public health crisis that will dominate our lives for the foreseeable future. The example code above is not a craven attempt to provide some topicality to these materials, nor is it simply a particularly good example for learning web-scraping techniques. 
There are real opportunities for social scientists to capture and analyse data on this phenomenon, starting with the core figures provided through the Worldometer website.\n", 1489 | "\n", 1490 | "You may also be interested in the publicly available data repository provided by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE): https://github.com/CSSEGISandData/COVID-19. Updated daily, this resource provides CSV (Comma Separated Values) files of global Covid-19 statistics (e.g., country-level time series), as well as PDF copies of the World Health Organisation's situation reports.\n", 1491 | "\n", 1492 | "At a UK level, the NHS releases data about Covid-19 symptoms reported through its NHS Pathways and 111 online platforms: NHS Open Data. Data on reported cases is also provided by Public Health England (PHE): COVID-19: track coronavirus cases. Many of these datasets are openly available as CSV files - you can learn how to download files in the [charity data example](#section_9_3).\n", 1493 | "\n", 1494 | "There is a collaborative Google document capturing sources of social data relating to COVID-19, curated by Dr Ben Baumberg Geiger.\n", 1495 | "\n", 1496 | "Finally, the Office for National Statistics (ONS) provides data and experimental indicators of social life in the UK under Covid-19: Coronavirus (COVID-19)." 1497 | ] 1498 | }, 1499 | { 1500 | "cell_type": "markdown", 1501 | "metadata": { 1502 | "slideshow": { 1503 | "slide_type": "skip" 1504 | } 1505 | }, 1506 | "source": [ 1507 | "## Value, limitations and ethics\n", 1508 | "\n", 1509 | "Computational methods for collecting data from the web are an increasingly important component of a social scientist's toolkit. They enable individuals to collect and reshape data - qualitative and quantitative - that otherwise would be inaccessible for research purposes. Thus far, this notebook has focused on the logic and practice of web-scraping, however it is crucial we reflect critically on its value, limitations and ethical implications for social science purposes.\n", 1510 | "\n", 1511 | "### Value\n", 1512 | "\n", 1513 | "* Web-scraping is a mature computational method, with lots of established packages (e.g., `requests` and `BeautifulSoup` in Python), examples and help available. As a result the learning curve is not as steep as with other methods, and it is possible for a beginner to create and execute a functioning web-scraping script in a matter of hours.\n", 1514 | "* Using computational, rather than manual, methods provides the ability to schedule or automate your data collection activities. For instance, you could schedule the Covid-19 code to execute at a set time every day.\n", 1515 | "* The richness of some of the information and data stored on web pages is a point worth repeating. Many public, private and charitable institutions use their web sites to release and regularly update information of value to social scientists. Getting a handle on the volume, variety and velocity of this information is extremely challenging without the use of computational methods.\n", 1516 | "* Computational methods not only enable accurate, real-time and reliable data collection, they also enable the reshaping of data into familiar formats (e.g., a CSV file, a database, a text document). 
While Python and HTML might be unfamiliar, the data that is returned through web-scraping can be formatted in such a way as to be compatible with your usual analytical methods (e.g., regression modelling, content analysis) and software applications (e.g., Stata, NVivo). In fact, we would go as far as to say that computational methods are particularly valuable to social scientists from a data collection and processing perspective, and you can achieve much without ever engaging in \"big data analytics\" (e.g., machine learning, neural networks, natural language processing).\n", 1517 | "\n", 1518 | "### Limitations\n", 1519 | "\n", 1520 | "* Web-scraping may contravene the Terms of Service (ToS) of a website. Much like open datasets will have a licence stipulating how the data may be used, information stored on the web can also come with restrictions on use. For example, the Worldometer Covid-19 data that we scraped cannot be used without their permission, even though the ToS do not expressly prohibit web-scraping. In contrast, the beneficiary data provided by the Charity Commission for England and Wales (see Appendix B) is available under the Open Government Licence (OGL) Version 2, which permits copying, publishing, distribution and transmission.
So even in instances where scraping data is not prohibited, you may not be able to use it without seeking permission from the data owner. The safest approach is to seek permission in advance of conducting a web-scrape, especially if you intend to build a working relationship with the data owner - do not rely on the argument that the information is publicly available to begin with. In certain cases it may be easier to manually record or collect the data you are interested in.\n", 1521 | "* This brings us to a related point concerning the legal basis for collecting and using data from websites. In the UK there is no specific law prohibiting web-scraping or the use of data obtained via this method; however there are other laws which impinge on this activity. Copyright or intellectual property law may prohibit what information, if any, may be scraped from a website (see this example).
Data protection laws, such as the EU's General Data Protection Regulation (GDPR), also influence whether and how you collect data about individuals. This means you take responsibility for processing personal data, even if it’s publicly available. This is a critical and detailed area of data-driven activities, and we encourage you to consult relevant guidance (see *Further reading and resources* section).\n", 1522 | "* Web pages are frequently updated, and changes to their structure can break your code: e.g., the URL for a file may change, or a table element may be given a different id or moved to a different web page. It can be a lot of work maintaining your code, especially if you make it available for use by others.\n", 1523 | "* Some websites may be sophisticated enough that they throttle or block scraping of their contents. For example, they may \"blacklist\" (ban) your IP address - your computer's unique id on the internet - from making requests to their server.\n", 1524 | "* Web-scraping, and computational social science in general, is dependent on your computing setup. For example, you may not possess administrative rights on your machine, preventing you from scheduling your script to run on a regular basis (e.g., your computer automatically goes to sleep after a set period of time and you cannot change this setting). There are ways around this and you do not need a high performance computing setup to scrape data from the web, but it is worth keeping in mind nonetheless.\n", 1525 | "\n", 1526 | "See the *Further reading and resources* section for useful articles exploring many of these issues.\n", 1527 | "\n", 1528 | "### Ethical considerations\n", 1529 | "\n", 1530 | "For the purposes of this discussion, we will assume you have sought and received ethical approval for a piece of research through the usual institutional processes: you've already considered consent, harm to researcher and participant, data security and curation etc. Instead, we will focus on a major ethical implication specific to web-scraping: the impact on the data owner's website. Each request you make to a website consumes computational resources, on your end and theirs: the server (i.e., machine) hosting the website must use some of its processing power and bandwidth to respond to the request. Web-scraping, especially frequently scheduled scripts, can overload a server by making too many requests, causing the website to crash. Individuals and organisations may rely on a website for vital and timely information, and causing it to crash could carry significant real-world implications." 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "markdown", 1535 | "metadata": { 1536 | "slideshow": { 1537 | "slide_type": "skip" 1538 | } 1539 | }, 1540 | "source": [ 1541 | "## Conclusion\n", 1542 | "\n", 1543 | "Web-scraping is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to using programming languages. However, \"with great power comes great responsibility\" (sorry). Web-scraping takes you into the realm of data protection, website Terms of Service (ToS), and many murky ethical issues. Wielded sensibly and sensitively, web-scraping is a valuable and exciting social science research method. \n", 1544 | "\n", 1545 | "Good luck on your data-driven travels!" 
1546 | ] 1547 | }, 1548 | { 1549 | "cell_type": "markdown", 1550 | "metadata": { 1551 | "slideshow": { 1552 | "slide_type": "skip" 1553 | } 1554 | }, 1555 | "source": [ 1556 | "## Bibliography\n", 1557 | "\n", 1558 | "Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. https://jupyter4edu.github.io/jupyter-edu-book/.\n", 1559 | "\n", 1560 | "Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.\n", 1561 | "\n", 1562 | "Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org\n", 1563 | "\n", 1564 | "Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "markdown", 1569 | "metadata": { 1570 | "slideshow": { 1571 | "slide_type": "skip" 1572 | } 1573 | }, 1574 | "source": [ 1575 | "## Further reading and resources\n", 1576 | "\n", 1577 | "We publish a list of useful books, papers, websites and other resources on our web-scraping Github repository: [Reading list]\n", 1578 | "\n", 1579 | "The help documentation for the `requests` and `BeautifulSoup` modules is refreshingly readable and useful:\n", 1580 | "* `requests`\n", 1581 | "* `BeautifulSoup` \n", 1582 | "\n", 1583 | "You may also be interested in the following articles specifically relating to web-scraping:\n", 1584 | "* Guide to Data Protection\n", 1585 | "* Collecting social media data for research\n", 1586 | "* Web Scraping and Crawling Are Perfectly Legal, Right?\n", 1587 | "* Web Crawling and Screen Scraping – the Legal Position" 1588 | ] 1589 | }, 1590 | { 1591 | "cell_type": "markdown", 1592 | "metadata": { 1593 | "slideshow": { 1594 | "slide_type": "skip" 1595 | } 1596 | }, 1597 | "source": [ 1598 | "## Appendices" 1599 | ] 1600 | }, 1601 | { 1602 | "cell_type": "markdown", 1603 | "metadata": { 1604 | "extensions": { 1605 | "jupyter_dashboards": { 1606 | "version": 1, 1607 | "views": { 1608 | "grid_default": {}, 1609 | "report_default": { 1610 | "hidden": false 1611 | } 1612 | } 1613 | } 1614 | }, 1615 | "slideshow": { 1616 | "slide_type": "skip" 1617 | } 1618 | }, 1619 | "source": [ 1620 | "### Appendix A - Requesting URLs\n", 1621 | "\n", 1622 | "In Python we've made use of the excellent `requests` module. By calling the `requests.get()` method, we mimic the manual process of launching a web browser and visiting a website. The `requests` module achieves this by placing a _request_ to the server hosting the website (e.g., show me the contents of the website), and handling the _response_ that is returned (e.g., the contents of the website and some metadata about the request). This _request-response_ protocol is known as HTTP (HyperText Transfer Protocol); HTTP allows computers to communicate with each other over the internet - you can learn more about it at W3 Schools.\n", 1623 | "\n", 1624 | "Run the code below to learn more about the data and metadata returned by `requests.get()`." 
1625 | ] 1626 | }, 1627 | { 1628 | "cell_type": "code", 1629 | "execution_count": null, 1630 | "metadata": { 1631 | "extensions": { 1632 | "jupyter_dashboards": { 1633 | "version": 1, 1634 | "views": { 1635 | "grid_default": {}, 1636 | "report_default": { 1637 | "hidden": true 1638 | } 1639 | } 1640 | } 1641 | }, 1642 | "scrolled": true, 1643 | "slideshow": { 1644 | "slide_type": "skip" 1645 | } 1646 | }, 1647 | "outputs": [], 1648 | "source": [ 1649 | "import requests\n", 1650 | "\n", 1651 | "url = \"https://httpbin.org/html\"\n", 1652 | "response = requests.get(url)\n", 1653 | "\n", 1654 | "print(\"1. {}\".format(response)) # returns the object type (i.e. a response) and status code\n", 1655 | "print(\"\\r\")\n", 1656 | "\n", 1657 | "print(\"2. {}\".format(response.headers)) # returns a dictionary of response headers\n", 1658 | "print(\"\\r\")\n", 1659 | "\n", 1660 | "print(\"3. {}\".format(response.headers[\"Date\"])) # return a particular header\n", 1661 | "print(\"\\r\")\n", 1662 | "\n", 1663 | "print(\"4. {}\".format(response.request)) # returns the request object that requested this response\n", 1664 | "print(\"\\r\")\n", 1665 | "\n", 1666 | "print(\"5. {}\".format(response.url)) # returns the URL of the response\n", 1667 | "print(\"\\r\")\n", 1668 | "\n", 1669 | "#print(response.text) # returns the text contained in the response (i.e. the paragraphs, headers etc of the web page)\n", 1670 | "#print(response.content) # returns the content of the response (i.e. the HTML contents of the web page)\n", 1671 | "\n", 1672 | "# Visit https://www.w3schools.com/python/ref_requests_response.asp for a full list of what is returned by the server\n", 1673 | "# in response to a request." 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "markdown", 1678 | "metadata": { 1679 | "slideshow": { 1680 | "slide_type": "skip" 1681 | } 1682 | }, 1683 | "source": [ 1684 | "-- END OF FILE --" 1685 | ] 1686 | } 1687 | ], 1688 | "metadata": { 1689 | "extensions": { 1690 | "jupyter_dashboards": { 1691 | "activeView": "grid_default", 1692 | "version": 1, 1693 | "views": { 1694 | "grid_default": { 1695 | "cellMargin": 10, 1696 | "defaultCellHeight": 20, 1697 | "maxColumns": 12, 1698 | "name": "grid", 1699 | "type": "grid" 1700 | }, 1701 | "report_default": { 1702 | "name": "report", 1703 | "type": "report" 1704 | } 1705 | } 1706 | } 1707 | }, 1708 | "kernelspec": { 1709 | "display_name": "Python 3", 1710 | "language": "python", 1711 | "name": "python3" 1712 | }, 1713 | "language_info": { 1714 | "codemirror_mode": { 1715 | "name": "ipython", 1716 | "version": 3 1717 | }, 1718 | "file_extension": ".py", 1719 | "mimetype": "text/x-python", 1720 | "name": "python", 1721 | "nbconvert_exporter": "python", 1722 | "pygments_lexer": "ipython3", 1723 | "version": "3.7.3" 1724 | }, 1725 | "toc": { 1726 | "base_numbering": 1, 1727 | "nav_menu": { 1728 | "height": "472px", 1729 | "width": "285px" 1730 | }, 1731 | "number_sections": true, 1732 | "sideBar": true, 1733 | "skip_h1_title": true, 1734 | "title_cell": "Table of Contents", 1735 | "title_sidebar": "Contents", 1736 | "toc_cell": true, 1737 | "toc_position": { 1738 | "height": "calc(100% - 180px)", 1739 | "left": "10px", 1740 | "top": "150px", 1741 | "width": "165px" 1742 | }, 1743 | "toc_section_display": true, 1744 | "toc_window_display": false 1745 | } 1746 | }, 1747 | "nbformat": 4, 1748 | "nbformat_minor": 2 1749 | } 1750 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/faq/README.md: 
-------------------------------------------------------------------------------- 1 | # Web-scraping for Social Science Research 2 | 3 | ## Frequently Asked Questions 4 | 5 | The following is a sample of questions that have been posed during the *Web-scraping for Social Science Research* webinar series. 6 | 7 | ### Q. Is web-scraping legal? 8 | In the United Kingdom (UK) there is no specific law prohibiting the capture or use of data from websites. However, many websites have Terms of Service/Use that must be abided by when you use them. The information stored on websites can also be subject to copyright or data protection laws. If you work for a non-commercial/public research organisation or a cultural heritage institution (e.g., libraries, museums and archives), then your web-scraping activities may be covered under the [text and data mining (TDM) EU directive](http://lr-coordination.eu/News/What%E2%80%99s-new-in-the-Directive). 9 | 10 | ### Q. Can I collect qualitative data using web-scraping? 11 | Yes - in fact, most of the information contained on websites is stored as text. It is relatively simple to capture this information using web-scraping techniques and save the results to a file. 12 | 13 | ### Q. Is there a code of ethics for web-scraping? 14 | There isn't one we have come across. However, from an academic perspective web-scraping is a research method and thus part of a wider research design. Therefore research involving this method needs to receive ethical approval from your institution. This will typically involve demonstrating approaches, solutions and mitigations in terms of participant consent, harm to researcher and participant, data security and protection etc. 15 | 16 | In terms of good practice specific to web-scraping: 17 | * Unless the Terms of Service/Use explicitly permit web-scraping, always seek permission to scrape data from a website. 18 | * Think very carefully about how you will process, store and share data, especially data relating to individuals. 19 | * Do not make unnecessary requests to a website. Web-scraping can overload a website and cause it to crash. 20 | 21 | ### Q. Can I conduct web-scraping in R / Stata / other packages? 22 | Yes. We have chosen Python as we think it is particularly good and easy to learn for this task. However, many other programming languages and software applications have web-scraping modules. While the exact commands will differ, the underpinning logic remains the same. 23 | 24 | ### Q. What kinds of data can you scrape from web pages? 25 | While web pages tend to mainly contain text (e.g., paragraphs or tables of interest), it is possible to scrape other types of data such as files, images etc. This is possible because these types of data tend to be stored as links on a web page; therefore you scrape these links and then request them using Python's `requests` module. However, there may be issues with permissions around the collection and use of certain images, files, or videos (e.g., copyright). 26 | 27 | ### Q. Can you scrape data from social media sites? 28 | In theory yes, but social media platforms (and many other organisations producing large amounts of data) prefer you to access their data via an Application Programming Interface (API) that connects to their online database. There are still restrictions around the nature, volume and use of data that you can access, but these tend to be well documented. 29 | 30 | ### Q. Are there security concerns when scraping data from the web? 
31 | When you request a web page, either manually or through a programming script, you communicate certain information about yourself. This includes the IP address of the computer you made the request from, i.e., its unique id on the web. Generally speaking, security concerns stem from requesting web pages, no matter what technique you use. As the requester, you may want to mask your IP address (we won't show you how); as the website owner, you may want to take action to limit or ban certain forms and frequencies of web-scraping activities. This article provides a succinct overview of some of these issues. 32 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/installation/README.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | These instructions guide you through the process of installing Python and other packages (e.g., Jupyter Notebook) on your machine. There are also instructions for creating a computing environment that enables you to run the web-scraping code associated with the *Web-scraping for Social Science Research* training series. 4 | 5 | ### Install Python 6 | 7 | Installing and managing Python on your machine is tricky for a first-timer, so we have written detailed and clear instructions for this task: [Instructions] 8 | 9 | ### Reproduce *Web-scraping for Social Science Research* code 10 | 11 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/reading-list/README.md: -------------------------------------------------------------------------------- 1 | # Reading List 2 | 3 | The following are suggested books, papers, websites and other resources for developing your web-scraping and Python skills further. 4 | 5 | ### Web-scraping 6 | 7 | Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org
8 | Note: Chapter 7 covers web technologies but the book is an excellent introduction to quantitative data analysis in Python in general. 9 | 10 | Sweigart, A. (2015). *Automate the Boring Stuff with Python*. https://automatetheboringstuff.com/2e
11 | Note: Chapter 12 is a readable treatment of the practical and technical aspects of web-scraping. 12 | 13 | ### Python 14 | 15 | Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.
16 | Note: One of the best (and few) introductions to computational methods for social science research; mixes theoretical and practical content very well. 17 | 18 | Downey, A. (2015). *Think Python: How to Think Like a Computer Scientist*. Needham, Massachusetts: Green Tea Press. http://greenteapress.com/thinkpython2
19 | Note: A thorough and clear account of the main features of Python. 20 | 21 | King's College London (2019). *Code Camp*. https://github.com/kingsgeocomp/code-camp
22 | Note: A brilliant set of notebooks and lessons for learning the fundamentals of Python. 23 | 24 | VanderPlas, J. (2016). *Python Data Science Handbook: Essential Tools for Working with Data*. https://jakevdp.github.io/PythonDataScienceHandbook/
25 | Note: An in-depth treatment of core data science tasks in Python - calculations, data frames, visualisation and machine learning. 26 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/webinars/README.md: -------------------------------------------------------------------------------- 1 | # Webinars 2 | 3 | ## 1. Case Study 4 | This webinar demonstrates the research potential of web-scraping by describing its role in generating a linked administrative dataset to study the causal effect of a regulatory intervention in the UK charity sector. Presented by [Dr Diarmuid McDonnell](https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html) of the UK Data Service, this webinar will cover the process of scraping data about charities, practical and ethical implications, and the advantages and disadvantages of using this form of data for social science research more generally. 5 | * [Watch recording](https://www.youtube.com/watch?v=ygA1bONLq-4) 6 | * [Download slides](./ukds-nfod-web-scraping-case-study-2020-03-27.pdf) 7 | 8 | ## 2. Websites as a Source of Data 9 | This webinar delineates the value, logic and process of capturing data stored on websites. Presented by [Dr Diarmuid McDonnell](https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html) of the UK Data Service, this webinar will cover the step-by-step process of collecting data from a web page, including providing sample code written in the popular Python programming language. It demonstrates web-scraping techniques for capturing real-time information on the Covid-19 pandemic, as well as for the author’s own research specialism (charitable organisations). 10 | * [Watch recording](https://www.youtube.com/watch?v=Q-UaAFBtUDw) 11 | * [Download slides](./ukds-nfod-web-scraping-websites-2020-04-23.pdf) 12 | 13 | ## 3. APIs as a Source of Data 14 | This webinar delineates the value, logic and process of capturing data stored in online databases through an API (application programming interface). Presented by [Dr Diarmuid McDonnell](https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html) of the UK Data Service, this webinar will cover the step-by-step process of downloading data via an API, including providing sample code written in the popular Python programming language. It demonstrates techniques for downloading public information on the Covid-19 pandemic, as well as for a range of other social science subjects (e.g., crime data via the Police UK API, business information via the Companies House API). 
15 | * [Watch recording](https://www.youtube.com/watch?v=nbsy7IzYm-0) 16 | * [Download slides](./ukds-nfod-web-scraping-apis-2020-04-30.pdf) 17 | -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/webinars/stick-figure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Web-scraping_and_API/webinars/stick-figure1.png -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-apis-2020-04-30.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-apis-2020-04-30.pdf -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-case-study-2020-03-27.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-case-study-2020-03-27.pdf -------------------------------------------------------------------------------- /2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-websites-2020-04-23.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/web-scraping/4b82af70c2b5603721993a33679238fbd1d840ed/2020_Web-scraping_and_API/webinars/ukds-nfod-web-scraping-websites-2020-04-23.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Diarmuid McDonnell 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![UKDS Logo](./code/images/UKDS_Logos_Col_Grey_300dpi.png)
2 |
3 | # Web-scraping for Social Science Research 4 | 5 | Vast swathes of our social interactions and personal behaviours are now conducted online and/or captured digitally. In addition to common sources such as social media/network platforms and text corpora, websites and online databases contain rich information of relevance to social science research. Thus, computational methods for collecting data from the web are an increasingly important component of a social scientist’s toolkit. 6 | 7 | ## Topics 8 | 9 | The following topics are covered under this training series: 10 | 1. **Case Study** - understand the research potential of web-scraping through examining a published piece of social science research. 11 | 2. **Websites as a Source of Data** - learn how to collect data from websites using Python. 12 | 3. **APIs as a Source of Data** - learn how to download data from online databases using Python and Application Programming Interfaces (APIs). 13 | 14 | ## Materials 15 | 16 | The training materials - including webinar recordings, slides, and sample Python code - can be found in the following folders: 17 | * [code](./code) - run and/or download web-scraping code using our Jupyter notebook resources. 18 | * [faq](./faq) - read through some of the frequently asked questions that are posed during our webinars. 19 | * [installation](./installation) - view instructions for how to download and install Python and other packages necessary for working with new forms of data. 20 | * [reading-list](./reading-list) - explore further resources including articles, books, online resources and more. 21 | * [webinars](./webinars) - watch recordings of our webinars and download the underpinning slides. 22 | 23 | ## Acknowledgements 24 | 25 | We are grateful to UKRI through the Economic and Social Research Council for their generous funding of this training series. 26 | 27 | ## Further Information 28 | 29 | * To access learning materials from the wider *Computational Social Science* training series: [Training Materials] 30 | * To keep up to date with upcoming and past training events: [Events] 31 | * To get in contact with feedback, ideas or to seek assistance: [Help] 32 | 33 | Thank you and good luck on your journey exploring new forms of data!
34 | 35 | Dr Julia Kasmire and Dr Diarmuid McDonnell
36 | UK Data Service
37 | University of Manchester
38 | -------------------------------------------------------------------------------- /postBuild: -------------------------------------------------------------------------------- 1 | ## Enable Table of Contents 2 | 3 | jupyter contrib nbextension install --user 4 | jupyter nbextension enable --py widgetsnbextension 5 | jupyter nbextension enable rise 6 | jupyter nbextension enable toc2/main -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests==2.23.0 2 | pandas==1.0.1 3 | beautifulsoup4==4.8.2 4 | jupyter-contrib-nbextensions==0.5.1 5 | rise==5.6.1 --------------------------------------------------------------------------------