├── Get the weather-GitHub.ipynb ├── GetEMSdata-GitHub.ipynb ├── README.md ├── Simple Web Scraper.ipynb ├── WS_images ├── CopyPaste1.png ├── CopyPaste2.png ├── simpleHTMLpageSS.png ├── simpleHTMLpageSS2.png └── simpleHTMLpageSS3.png └── Web Scraper for Weather.ipynb /GetEMSdata-GitHub.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": false 7 | }, 8 | "source": [ 9 | "This notebook was written to scrape a publicly listed emergency call log for the Eugene, OR area. It may be useful as a guide for projects that need to scrape data from a webpage whose content is generated dynamically by a JavaScript interface." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 7, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "from selenium import webdriver #selenium is used to interact with the webpage, so the program can 'click' buttons.\n", 21 | "import pandas as pd #the data will be saved locally as a csv file. Pandas is a nice way to write/read/work with those files.\n", 22 | "from selenium.webdriver.firefox.firefox_binary import FirefoxBinary #This will let the program open the webpage in a new Firefox window.\n", 23 | "from bs4 import BeautifulSoup #BeautifulSoup is used to parse the HTML of the downloaded website to find the particular information desired.\n", 24 | "import time #I will need to delay the program to give the webpage time to open. time will be used for that.\n", 25 | "import sys #This is only used to add a location to my path: the folder where I keep a file selenium needs (geckodriver)." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "The next section adds my current directory to my path. The main reason for this is selenium. Selenium uses geckodriver to talk to Firefox, and that application has to be downloaded separately. Ideally, I would have geckodriver in a folder already on Python's path. Instead, I just have it saved in the same folder where I'm collecting all my data from the EMS call log. Not great, I know, but it works." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 8, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "sys.path\n", 44 | "sys.path.append('/path/to/the/example_file.py')\n", 45 | "sys.path.append('C:\\\\Users\\\\Kyle\\\\Documents\\\\Blog Posts\\\\EugeneEMSCalls')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "The call log website gives the date with the month abbreviation. This dictionary lets me easily convert that into a two-digit string." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 9, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "Months={'Jan':'01','Feb':'02','Mar':'03','Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 10, 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "#url of the website with the call log data\n", 75 | "url='http://coeapps.eugene-or.gov/ruralfirecad'" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "The next section opens a new Firefox browser window. 
This is the tab where the url will be loaded and where the code will 'click' javascript buttons and download the HTML of the page. " 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 11, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "binary = FirefoxBinary('C:\\\\Program Files (x86)\\\\Mozilla Firefox\\\\firefox.exe')\n", 94 | "driver = webdriver.Firefox(firefox_binary=binary)\n", 95 | "driver.get(url)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 12, 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "outputs": [ 105 | { 106 | "name": "stdout", 107 | "output_type": "stream", 108 | "text": [ 109 | "Number of Calls on Jan 9, 2017: 107\n" 110 | ] 111 | } 112 | ], 113 | "source": [ 114 | "#quick check to see if selenium correctly got to the page. \n", 115 | "#this searches the HTML of the page for the HTML element id'd as 'callSummary',\n", 116 | "#and prints the text.\n", 117 | "summary=driver.find_element_by_id('callSummary').text\n", 118 | "print(summary)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "The web scraping code. The way this deals with time is a bit wonky. Initially, it will step through every day of the current month from first day to last day. Those days that are still in the future won't have any calls and will not be saved into csv files. Then, the program steps back one month and repeats; again, going from the first to the last day. Eventually, the program reaches the set end date, where it breaks out of the while loop." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 13, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "endIt=0;\n", 137 | "while endIt==0:\n", 138 | " calendarOptions=driver.find_element_by_id('calendar').text.split()\n", 139 | " monthsDays=[];\n", 140 | " for elm in calendarOptions:\n", 141 | " try:\n", 142 | " potlDay=int(elm)\n", 143 | " if potlDay<=31:\n", 144 | " monthsDays.append(elm)\n", 145 | " except:\n", 146 | " pass\n", 147 | " for day in monthsDays:\n", 148 | " driver.find_element_by_link_text(day).click()\n", 149 | " time.sleep(1) #giving firefox time to open.\n", 150 | " summary=driver.find_element_by_id('callSummary').text;\n", 151 | " dateData=summary[summary.index('on')+3:summary.index(':')].replace(',','').split(' ');\n", 152 | " if len(dateData[1])==1:\n", 153 | " dateData[1]='0'+dateData[1]\n", 154 | " date=int(dateData[2]+str(Months[dateData[0]])+dateData[1]);\n", 155 | " if int(date)==20161201: #End date is located here.\n", 156 | " endIt=1;\n", 157 | " break\n", 158 | " html = driver.page_source;\n", 159 | " soup = BeautifulSoup(html,'lxml'); #using BeautifulSoup to find the call logs.\n", 160 | " EMSdata=soup.find('table', class_='tablesorter');\n", 161 | " colNames1=EMSdata.thead.findAll('th') #recording the column names.\n", 162 | " colNames2=[];\n", 163 | " data1=[]\n", 164 | " for x in range(0,len(colNames1)):\n", 165 | " colNames2.append(colNames1[x].string.strip()) #saving each column value.\n", 166 | " data1.append([])\n", 167 | " for row in EMSdata.findAll(\"tr\"): #saving the individual call log data.\n", 168 | " cells = row.findAll('td')\n", 169 | " if len(cells)==len(colNames1):\n", 170 | " for y in range(0,len(cells)):\n", 171 | " data1[y].append(cells[y].string.strip())\n", 172 | " EMSdata1=pd.DataFrame(); #initializing a data frame to save 1 days worth of calls.\n", 173 | " for x in 
range(0,len(colNames2)):\n", 174 | " EMSdata1[colNames2[x]]=data1[x]\n", 175 | " EMSdata1['Date']=date;\n", 176 | " try:\n", 177 | " EMSdata1.to_csv('%s.csv'%(EMSdata1.loc[0,'Date']),index=False) #saving csv file of daily call logs.\n", 178 | " except:\n", 179 | " pass\n", 180 | " time.sleep(1) #giving time to save csv before moving on.\n", 181 | " if endIt==0:\n", 182 | " driver.find_element_by_link_text('Prev').click()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "outputs": [], 192 | "source": [] 193 | } 194 | ], 195 | "metadata": { 196 | "anaconda-cloud": {}, 197 | "kernelspec": { 198 | "display_name": "Python [default]", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.5.2" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 1 217 | } 218 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Web Scraping 2 | Guided example for web scraping in Python using urlopen from urllib.request, beautifulsoup, and pandas. 3 | 4 | # One Sentence Definition of Web Scraping 5 | Web scraping is having your computer visit many web pages, collect (scrape) data from each page, and save it locally to your computer for future use. It’s something you could do with copy/paste and an Excel table, but the sheer number of pages makes it impractical. 6 | 7 | ## Keys to Web Scraping: 8 | The keys to web scraping are patterns. There needs to be some pattern the program can follow to go from one web page to the next. The desired data needs to be in some pattern, so the web scraper can reliably collect it. Finding these patterns is the tricky, time consuming process that is at the very beginning. But after they're discovered, writing the code of the web scraper is easy. 9 | 10 | # Simple Web Scraper 11 | 12 | Here is the finished web scraper built in the rest of this document. I’ll go through each step, explaining why and what each element does, but this is the end goal. 13 | 14 | ```python 15 | from urllib.request import urlopen 16 | from bs4 import BeautifulSoup 17 | import pandas as pd 18 | 19 | df=pd.DataFrame() 20 | df=pd.DataFrame(columns=['emails']) 21 | for x in range(0,10): 22 | url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm' 23 | html=urlopen(url) 24 | soup=BeautifulSoup(html.read(),'lxml') 25 | subSoup=soup.find_all('p',class_='whs2'); 26 | for elm in subSoup: 27 | ret=elm.contents[0] 28 | if 'pip for Python 2.X or pip for Python 3.X are good ways to acquire new packages. 37 | 38 | This web scraper will make use of three modules: 39 | 40 | 1) urlopen from urllib.request : to download page contents from a given valid url 41 | 42 | 2) BeautifulSoup from bs4 : to navigate the HTML of the downloaded page 43 | 44 | 3) pandas : to store our scraped data 45 | 46 | # Target Web Page and Desired Data 47 | For the purposes of this little tutorial, I'll show how to scrape just a single web page. 
One with very simple HTML to make it easier to understand what's going on: 48 | 49 | 50 | 51 | This page has minimal information, but let's say I want to collect the email address: 52 | 53 | 54 | 55 | # urlopen from urllib.request 56 | 57 | urlopen is a no-frills way of making a web scraper, and the recipe is simple: (1) Give urlopen a valid url. (2) Watch as urlopen downloads the HTML from that url. 58 | 59 | ```python 60 | from urllib.request import urlopen 61 | url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm' 62 | html=urlopen(url) 63 | print(html) 64 | ``` 65 | If you fed urlopen a valid url, you should print something along the lines of: 66 | 67 | ```python 68 | <http.client.HTTPResponse object at 0x...> 69 | ``` 70 | 71 | ## Patterns for Downloading many pages 72 | 73 | Writing a web scraper for a single page makes no sense. But often the many pages you'll want to scrape have some pattern in their urls. For example, these urls lead to pages with historical daily weather data for Eugene OR. The day in question is explicitly stated in the url: 75 | 76 | >https://www.wunderground.com/history/airport/KEUG/2016/01/05/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999 76 | 77 | >https://www.wunderground.com/history/airport/KEUG/2016/01/04/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999 78 | 79 | So, you could visit many of these pages by writing a Python script that creates new url strings by changing the date. 80 | 81 | For example, downloading the html of every day in Jan 2016: 82 | ```python 83 | for x in range(1,32): 84 | url="https://www.wunderground.com/history/airport/KEUG/2016/01/%s/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999" % (x) 85 | html=urlopen(url) 86 | ``` 87 | 88 | 89 | 90 | # BeautifulSoup to Navigate the HTML 91 | 92 | So now you have the HTML from the page, but no way of reading it. Cue BeautifulSoup. BeautifulSoup will do two major things for us. (1) It will decode the HTML into something we can read in the Python script. (2) It will let us navigate through the many lines of HTML by making use of the tags and labels of HTML coding. 93 | 94 | ```python 95 | from bs4 import BeautifulSoup 96 | soup=BeautifulSoup(html.read(),'lxml') 97 | print(soup) 98 | ``` 99 | 100 | This is the tedious bit. See if you can find the email address in the parsed HTML: 101 | 102 | ```html 103 | 104 | 105 | 106 | 107 | 108 | 109 | Example of a simple HTML page 110 | 111 | 117 | 137 | 143 | 144 | 145 | 146 | 147 | 176 | 177 | 183 |

Example of a simple HTML page

184 |

Hypertext Markup Language (HTML) is the most common language used to 185 | create documents on the World Wide Web. HTML uses hundreds of different 186 | tags to define a layout for web pages. Most tags require an opening <tag> 187 | and a closing </tag>.

188 |

Example:  <b>On 189 | a webpage, this sentence would be in bold print.</b>

190 |

Below is an example of a very simple page:

191 |

192 |

 This 193 | is the code used to make the page:

194 |

<HTML>

195 |

<HEAD>

196 |

<TITLE>Your Title Here</TITLE> 197 |

198 |

</HEAD>

199 |

<BODY BGCOLOR="FFFFFF"> 200 |

201 |

<CENTER><IMG SRC="clouds.jpg" 202 | ALIGN="BOTTOM"> </CENTER>

203 |

<HR>

204 |

<a href="http://somegreatsite.com">Link 205 | Name</a>

206 |

is a link to another nifty site

207 |

<H1>This is a Header</H1>

208 |

<H2>This is a Medium Header</H2> 209 |

210 |

Send me mail at <a href="mailto:support@yourcompany.com">

211 |

support@yourcompany.com</a>.

212 |

<P> This is a new paragraph!

213 |

<P> <B>This is a new paragraph!</B> 214 |

215 |

<BR> <B><I>This is a new 216 | sentence without a paragraph break, in bold italics.</I></B> 217 |

218 |

<HR>

219 |

</BODY>

220 |

</HTML>

221 |

 

222 |

 

223 |

 

224 | 230 | 231 | 232 | 233 | ``` 234 | And that HTML is from a very simple webpage. Most pages you'll want to scrape are much more complex and sorting through them can be a chore. It's a tedious step that all writers of web scrapers must suffer through. 235 | 236 | You can make your life easier by figuring out where the information you want is located within the html of the webpage when it's open in your browser. 237 | 238 | ## Inspect Element to Locate the HTML Tags 239 | 240 | In Firefox, right click on the information you care about and select 'inspect element' (other browsers have a similar feature). The browser will let you look under the hood at the html of the page and take you right to the section you care about. 241 | 242 | 243 | 244 | Here I can note all the HTML tags I can use to locate the email address with BeautifulSoup: 245 | 246 | It's in a paragraph ('p'). It has a class ('whs2'). The address is in a link (< a > < /a >). The email address contains the special character ‘@’. 247 | 248 | That's more than enough for BeautifulSoup. 249 | 250 | 251 | ## Navigating with BeautifulSoup 252 | There are a lot of tools in BeautifulSoup, but we'll make do with three: find, find_all, and contents. 253 | 254 | find gives the first instance of some HTML tag: 255 | 256 | ```python 257 | soup.find('p') 258 | ``` 259 | 260 | ```html 261 |

Hypertext Markup Language (HTML) is the most common language used to 262 | create documents on the World Wide Web. HTML uses hundreds of different 263 | tags to define a layout for web pages. Most tags require an opening <tag> 264 | and a closing </tag>.

265 | ``` 266 | 267 | contents gives you any text find found (in the form of a list of length 1): 268 | 269 | ```python 270 | soup.find('p').contents 271 | ``` 272 | 273 | ```text 274 | ['Hypertext Markup Language (HTML) is the most common language used to \n create documents on the World Wide Web. HTML uses hundreds of different \n tags to define a layout for web pages. Most tags require an opening <tag> \n and a closing </tag>.'] 275 | ``` 276 | 277 | find_all returns every instance of that tag. Again, in a list: 278 | 279 | ```python 280 | soup.find_all('p') 281 | ``` 282 | 283 | ```html 284 | [

Hypertext Markup Language (HTML) is the most common language used to 285 | create documents on the World Wide Web. HTML uses hundreds of different 286 | tags to define a layout for web pages. Most tags require an opening <tag> 287 | and a closing </tag>.

, 288 |

Example:  <b>On 289 | a webpage, this sentence would be in bold print.</b>

, 290 |

Below is an example of a very simple page:

, 291 |

, 292 |

 This 293 | is the code used to make the page:

, 294 |

<HTML>

, 295 |

<HEAD>

, 296 |

<TITLE>Your Title Here</TITLE> 297 |

, 298 |

</HEAD>

, 299 |

<BODY BGCOLOR="FFFFFF"> 300 |

, 301 |

<CENTER><IMG SRC="clouds.jpg" 302 | ALIGN="BOTTOM"> </CENTER>

, 303 |

<HR>

, 304 |

<a href="http://somegreatsite.com">Link 305 | Name</a>

, 306 |

is a link to another nifty site

, 307 |

<H1>This is a Header</H1>

, 308 |

<H2>This is a Medium Header</H2> 309 |

, 310 |

Send me mail at <a href="mailto:support@yourcompany.com">

, 311 |

support@yourcompany.com</a>.

, 312 |

<P> This is a new paragraph!

, 313 |

<P> <B>This is a new paragraph!</B> 314 |

, 315 |

<BR> <B><I>This is a new 316 | sentence without a paragraph break, in bold italics.</I></B> 317 |

, 318 |

<HR>

, 319 |

</BODY>

, 320 |

</HTML>

, 321 |

 

, 322 |

 

, 323 |

 

] 324 | ``` 325 | 326 | If the HTML object also has a class, you can narrow your results even further: 327 | 328 | ```python 329 | soup.find_all('p', class_='whs2') 330 | ``` 331 | 332 | ```html 333 | [

<HTML>

, 334 |

<HEAD>

, 335 |

<TITLE>Your Title Here</TITLE> 336 |

, 337 |

</HEAD>

, 338 |

<BODY BGCOLOR="FFFFFF"> 339 |

, 340 |

<CENTER><IMG SRC="clouds.jpg" 341 | ALIGN="BOTTOM"> </CENTER>

, 342 |

<HR>

, 343 |

<a href="http://somegreatsite.com">Link 344 | Name</a>

, 345 |

is a link to another nifty site

, 346 |

<H1>This is a Header</H1>

, 347 |

<H2>This is a Medium Header</H2> 348 |

, 349 |

Send me mail at <a href="mailto:support@yourcompany.com">

, 350 |

support@yourcompany.com</a>.

, 351 |

<P> This is a new paragraph!

, 352 |

<P> <B>This is a new paragraph!</B> 353 |

, 354 |

<BR> <B><I>This is a new 355 | sentence without a paragraph break, in bold italics.</I></B> 356 |

, 357 |

<HR>

, 358 |

</BODY>

, 359 |

</HTML>

, 360 |

 

, 361 |

 

] 362 | ``` 363 | 364 | Finally, because HTML elements returned by find_all are in a list, you can iterate through them: 365 | 366 | ```python 367 | subSoup=soup.find_all('p',class_='whs2'); 368 | for elm in subSoup: 369 | print(elm.contents) 370 | ``` 371 | 372 | ```text 373 | [' '] 374 | [' '] 375 | ['Your Title Here \n '] 376 | [' '] 377 | [' \n '] 378 | ['
'] 379 | ['
'] 380 | ['Link \n Name '] 381 | ['is a link to another nifty site '] 382 | ['

This is a Header

'] 383 | ['

This is a Medium Header

\n '] 384 | ['Send me mail at '] 385 | ['support@yourcompany.com. '] 386 | ['

This is a new paragraph! '] 387 | ['

This is a new paragraph! \n '] 388 | ['
This is a new \n sentence without a paragraph break, in bold italics. \n '] 389 | ['


'] 390 | [' '] 391 | [' '] 392 | ['\xa0'] 393 | ['\xa0'] 394 | ``` 395 | 396 | Using this and a bit chunky python, we can pull out the email address: 397 | 398 | ```python 399 | subSoup=soup.find_all('p',class_='whs2'); 400 | for elm in subSoup: 401 | ret=elm.contents[0] 402 | if ' 410 | a chunky bit of python string processing gives us just the email: support@yourcompany.com 411 | ``` 412 | 413 | # Pandas to Save the Data in a Table 414 | 415 | If we scraped many more websites and took more data from each page, we'd want some easy way to access it in the future. (Printing 10,000 items in a Python notebook is a terrible idea.) 416 | 417 | There are many ways to do this, but let's focus on creating a DataFrame (think Excel table) using pandas. Why? Because pandas is really useful in data science for doing exploratory data analysis, parallel processing, sorting/cleaning/reformatting data, etc. It's easier to save the data into a pandas DataFrame right from the start than to have to transcribe it in later. 418 | 419 | First import Pandas: 420 | 421 | ```python 422 | import pandas as pd 423 | ``` 424 | 425 | And initialize an empty DataFrame: 426 | 427 | ```python 428 | df=pd.DataFrame() 429 | ``` 430 | 431 | (Note: Do this outside of any for-loops you have going on!) 432 | 433 | If you already know the names of the columns you want, you can assign that now: 434 | 435 | ```python 436 | df=pd.DataFrame(columns=['emails']) 437 | ``` 438 | 439 | 440 | 441 | 442 | 443 |
<table> <tr> <th></th> <th>Emails</th> </tr> </table>
444 | 445 | And add the data to the DataFrame, in the form of a list: 446 | 447 | ```python 448 | df['Emails']=[email] 449 | ``` 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 |
<table> <tr> <th></th> <th>Emails</th> </tr> <tr> <th>0</th> <td>support@yourcompany.com</td> </tr> </table>
459 | 460 | (This would be more impressive if we had more than 1 email to put into the DataFrame.) 461 | 462 | If we had a second email (say: support2@yourcompany.com), we could append it to the DataFrame by stating the row and column it should be added to: 463 | 464 | ```python 465 | df.loc[1,'Emails']='support2@yourcompany.com' 466 | ``` 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 |
<table> <tr> <th></th> <th>Emails</th> </tr> <tr> <th>0</th> <td>support@yourcompany.com</td> </tr> <tr> <th>1</th> <td>support2@yourcompany.com</td> </tr> </table>
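Scaling that up is just a matter of doing the same assignment inside a loop. Here is a minimal sketch of the pattern; the `scraped_emails` list is a hypothetical stand-in for whatever your scraper actually collects from each page:

```python
import pandas as pd

df = pd.DataFrame(columns=['Emails'])

# hypothetical stand-in for the addresses your scraper pulls out of each page
scraped_emails = ['support@yourcompany.com', 'support2@yourcompany.com']

for email in scraped_emails:
    # len(df) is always the index of the next empty row,
    # so each new result lands just below the previous one
    df.loc[len(df), 'Emails'] = email
```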
479 | 480 | In this way, each time our web scraper goes to a new page and scrapes the desired data, we can save it in the next row of the DataFrame. 481 | 482 | After the scraping is complete, you can save the DataFrame to a csv file with: 483 | 484 | ```python 485 | df.to_csv('Scraped_emails.csv',index=False) 486 | ``` 487 | 488 | And later import it directly into a DataFrame with: 489 | 490 | ```python 491 | df=pd.read_csv('Scraped_emails.csv') 492 | ``` 493 | 494 | # Warning and Practice Problem 495 | 496 | That's it. That should be enough to go forth and write a web scraper of your own. But first, a warning: 497 | 498 | Be a good web scraper. Web scrapers can visit many more pages than a human, and they never click on ads. For these and other reasons, many sites ban or limit the use of web scrapers. Make sure to check with the site (often in its robots.txt file) before collecting data. And even if scraping is allowed, think about limiting the rate at which you request pages. It will keep you in the site's good books.
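As a rough sketch of what being polite can look like in practice (the `base` url here is just a placeholder), the standard library's urllib.robotparser can check a site's robots.txt for you, and a short time.sleep between requests keeps your scraper from hammering the server:

```python
import time
from urllib import robotparser
from urllib.request import urlopen

# placeholder site; swap in whatever you actually plan to scrape
base = 'http://help.websiteos.com'

rp = robotparser.RobotFileParser()
rp.set_url(base + '/robots.txt')
rp.read()

urls = [base + '/websiteos/example_of_a_simple_html_page.htm']
for url in urls:
    if rp.can_fetch('*', url):  # only request pages robots.txt allows
        html = urlopen(url)
        # ...parse with BeautifulSoup and save to the DataFrame as above...
        time.sleep(2)  # pause between requests to keep the load light
```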
499 | 500 | With everything presented here, you can scrape much more meaningful data than some fake email address. If you want to practice this yourself, write a web scraper that collects the mean daily temperature for Eugene OR from Jan. 1, 2016 to Jan. 31, 2016 from 501 | this site 502 | 503 | If you get stuck, take a look at a very similar web scraper in this directory. 504 | -------------------------------------------------------------------------------- /Simple Web Scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "The Python script used to generate the README file" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 3, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from urllib.request import urlopen\n", 19 | "from bs4 import BeautifulSoup\n", 20 | "import pandas as pd" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 7, 26 | "metadata": { 27 | "collapsed": false 28 | }, 29 | "outputs": [ 30 | { 31 | "data": { 32 | "text/html": [ 33 | "
\n", 34 | "\n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | "
emails
0support2@yourcompany0.com
1support2@yourcompany1.com
2support2@yourcompany2.com
3support2@yourcompany3.com
4support2@yourcompany4.com
5support2@yourcompany5.com
6support2@yourcompany6.com
7support2@yourcompany7.com
8support2@yourcompany8.com
9support2@yourcompany9.com
\n", 84 | "
" 85 | ], 86 | "text/plain": [ 87 | " emails\n", 88 | "0 support2@yourcompany0.com\n", 89 | "1 support2@yourcompany1.com\n", 90 | "2 support2@yourcompany2.com\n", 91 | "3 support2@yourcompany3.com\n", 92 | "4 support2@yourcompany4.com\n", 93 | "5 support2@yourcompany5.com\n", 94 | "6 support2@yourcompany6.com\n", 95 | "7 support2@yourcompany7.com\n", 96 | "8 support2@yourcompany8.com\n", 97 | "9 support2@yourcompany9.com" 98 | ] 99 | }, 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "output_type": "execute_result" 103 | } 104 | ], 105 | "source": [ 106 | "df=pd.DataFrame()\n", 107 | "df=pd.DataFrame(columns=['emails'])\n", 108 | "for x in range(0,10):\n", 109 | " url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm'\n", 110 | " html=urlopen(url)\n", 111 | " soup=BeautifulSoup(html.read(),'lxml')\n", 112 | " subSoup=soup.find_all('p',class_='whs2');\n", 113 | " for elm in subSoup:\n", 114 | " ret=elm.contents[0]\n", 115 | " if '\n", 55 | "\n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " 
\n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | "
DateMean TemperatureMean Temperature AverageMax TemperatureMax Temperature AverageMax Temperature RecordMin TemperatureMin Temperature AverageMin Temperature Record
0201701173641454767283415
1201701163241404764243416
2201701152841354764213410
3201701142741324760213415
4201701132740304762243410
520170112304034476226346
620170111384046476230348
720170110364039466534348
8201701093540394663313410
920170108314037466525346
1020170107264028466024349
11201701062640394662123412
12201701052340344667123412
13201701042840344669213414
1420170103304036466024349
1520170102344037465730348
1620170101363939456133347
17201612313339364562293410
18201612303939474566313310
19201612294339534563333312
20201612283639414565303319
21201612274439474561403316
2220161226333943456223331
232016122535394445622533-1
242016122435394445602633-4
252016122340394545663533-2
2620161222383945456230334
2720161221313933456229336
2820161220403951456428337
2920161219383945456430334
\n", 433 | "" 434 | ], 435 | "text/plain": [ 436 | " Date Mean Temperature Mean Temperature Average Max Temperature \\\n", 437 | "0 20170117 36 41 45 \n", 438 | "1 20170116 32 41 40 \n", 439 | "2 20170115 28 41 35 \n", 440 | "3 20170114 27 41 32 \n", 441 | "4 20170113 27 40 30 \n", 442 | "5 20170112 30 40 34 \n", 443 | "6 20170111 38 40 46 \n", 444 | "7 20170110 36 40 39 \n", 445 | "8 20170109 35 40 39 \n", 446 | "9 20170108 31 40 37 \n", 447 | "10 20170107 26 40 28 \n", 448 | "11 20170106 26 40 39 \n", 449 | "12 20170105 23 40 34 \n", 450 | "13 20170104 28 40 34 \n", 451 | "14 20170103 30 40 36 \n", 452 | "15 20170102 34 40 37 \n", 453 | "16 20170101 36 39 39 \n", 454 | "17 20161231 33 39 36 \n", 455 | "18 20161230 39 39 47 \n", 456 | "19 20161229 43 39 53 \n", 457 | "20 20161228 36 39 41 \n", 458 | "21 20161227 44 39 47 \n", 459 | "22 20161226 33 39 43 \n", 460 | "23 20161225 35 39 44 \n", 461 | "24 20161224 35 39 44 \n", 462 | "25 20161223 40 39 45 \n", 463 | "26 20161222 38 39 45 \n", 464 | "27 20161221 31 39 33 \n", 465 | "28 20161220 40 39 51 \n", 466 | "29 20161219 38 39 45 \n", 467 | "\n", 468 | " Max Temperature Average Max Temperature Record Min Temperature \\\n", 469 | "0 47 67 28 \n", 470 | "1 47 64 24 \n", 471 | "2 47 64 21 \n", 472 | "3 47 60 21 \n", 473 | "4 47 62 24 \n", 474 | "5 47 62 26 \n", 475 | "6 47 62 30 \n", 476 | "7 46 65 34 \n", 477 | "8 46 63 31 \n", 478 | "9 46 65 25 \n", 479 | "10 46 60 24 \n", 480 | "11 46 62 12 \n", 481 | "12 46 67 12 \n", 482 | "13 46 69 21 \n", 483 | "14 46 60 24 \n", 484 | "15 46 57 30 \n", 485 | "16 45 61 33 \n", 486 | "17 45 62 29 \n", 487 | "18 45 66 31 \n", 488 | "19 45 63 33 \n", 489 | "20 45 65 30 \n", 490 | "21 45 61 40 \n", 491 | "22 45 62 23 \n", 492 | "23 45 62 25 \n", 493 | "24 45 60 26 \n", 494 | "25 45 66 35 \n", 495 | "26 45 62 30 \n", 496 | "27 45 62 29 \n", 497 | "28 45 64 28 \n", 498 | "29 45 64 30 \n", 499 | "\n", 500 | " Min Temperature Average Min Temperature Record \n", 501 | "0 34 15 \n", 502 | "1 34 16 \n", 503 | "2 34 10 \n", 504 | "3 34 15 \n", 505 | "4 34 10 \n", 506 | "5 34 6 \n", 507 | "6 34 8 \n", 508 | "7 34 8 \n", 509 | "8 34 10 \n", 510 | "9 34 6 \n", 511 | "10 34 9 \n", 512 | "11 34 12 \n", 513 | "12 34 12 \n", 514 | "13 34 14 \n", 515 | "14 34 9 \n", 516 | "15 34 8 \n", 517 | "16 34 7 \n", 518 | "17 34 10 \n", 519 | "18 33 10 \n", 520 | "19 33 12 \n", 521 | "20 33 19 \n", 522 | "21 33 16 \n", 523 | "22 33 1 \n", 524 | "23 33 -1 \n", 525 | "24 33 -4 \n", 526 | "25 33 -2 \n", 527 | "26 33 4 \n", 528 | "27 33 6 \n", 529 | "28 33 7 \n", 530 | "29 33 4 " 531 | ] 532 | }, 533 | "execution_count": 9, 534 | "metadata": {}, 535 | "output_type": "execute_result" 536 | } 537 | ], 538 | "source": [ 539 | "#Added ease by using the package datetime to iterate through the days\n", 540 | "day=datetime.date.today();\n", 541 | "one_day = datetime.timedelta(days=1);\n", 542 | "#the rows I'll scrape from the table in the web page:\n", 543 | "temps=['Mean Temperature','Max Temperature','Min Temperature']\n", 544 | "#Initialize the pandas dataframe with a column for each value I will scrape from each page.\n", 545 | "df=pd.DataFrame(columns=['Date','Mean Temperature','Mean Temperature Average',\n", 546 | " 'Max Temperature','Max Temperature Average','Max Temperature Record',\n", 547 | " 'Min Temperature','Min Temperature Average','Min Temperature Record'])\n", 548 | "#Number of days worth of data I will scrape, starting today and working backwards:\n", 549 | "for t in range(0,30):\n", 550 | " #creating new url for each pass of 
the for-loop\n", 551 | " dayStr=day.strftime('%Y %m %d')\n", 552 | " yr, mon, da=dayStr.split(' ')\n", 553 | " day=day-one_day\n", 554 | " url=\"\"\"https://www.wunderground.com/history/airport/KEUG/%s/%s/%s/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999\"\"\" %(yr,mon,da);\n", 555 | " #opening and parsing the html of the page\n", 556 | " html = urlopen(url);\n", 557 | " soup=BeautifulSoup(html.read(),'html.parser');\n", 558 | " #finding a subsection of the html. A table which contains all the data I desire\n", 559 | " soup2=soup.find('table', class_='responsive airport-history-summary-table');\n", 560 | " #locating most rows in the table\n", 561 | " classSet=soup2.find_all(class_='indent');\n", 562 | " vals=[];\n", 563 | " #iterating through the rows of the table to find the ones with data I desire\n", 564 | " for y in range(0,len(classSet)):\n", 565 | " a=classSet[y]\n", 566 | " cat=str(a.contents).replace('[','').replace(']','');\n", 567 | " if cat in temps:\n", 568 | " for x in range(0,10):\n", 569 | " if 'wx-value' in str(a):\n", 570 | " b=a.find(class_='wx-value')\n", 571 | " c=str(b.contents).replace(\"['\",\"\").replace(\"']\",\"\");\n", 572 | " vals.append(int(c))\n", 573 | " try:\n", 574 | " a=a.next_sibling\n", 575 | " except:\n", 576 | " break\n", 577 | " df.loc[t,'Date']=yr+mon+da\n", 578 | " df.iloc[t,1:]=vals\n", 579 | "df" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": { 586 | "collapsed": true 587 | }, 588 | "outputs": [], 589 | "source": [] 590 | } 591 | ], 592 | "metadata": { 593 | "anaconda-cloud": {}, 594 | "kernelspec": { 595 | "display_name": "Python [default]", 596 | "language": "python", 597 | "name": "python3" 598 | }, 599 | "language_info": { 600 | "codemirror_mode": { 601 | "name": "ipython", 602 | "version": 3 603 | }, 604 | "file_extension": ".py", 605 | "mimetype": "text/x-python", 606 | "name": "python", 607 | "nbconvert_exporter": "python", 608 | "pygments_lexer": "ipython3", 609 | "version": "3.5.2" 610 | } 611 | }, 612 | "nbformat": 4, 613 | "nbformat_minor": 1 614 | } 615 | --------------------------------------------------------------------------------