├── Get the weather-GitHub.ipynb
├── GetEMSdata-GitHub.ipynb
├── README.md
├── Simple Web Scraper.ipynb
├── WS_images
│   ├── CopyPaste1.png
│   ├── CopyPaste2.png
│   ├── simpleHTMLpageSS.png
│   ├── simpleHTMLpageSS2.png
│   └── simpleHTMLpageSS3.png
└── Web Scraper for Weather.ipynb
/GetEMSdata-GitHub.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": false
7 | },
8 | "source": [
9 | "This notebook was written to scrape a publically listed emergency call log for the Eugene OR area. It may be useful as a guide for projects that need to scrape data from a webpage with dynamically generated content from a javascript interface."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 7,
15 | "metadata": {
16 | "collapsed": true
17 | },
18 | "outputs": [],
19 | "source": [
20 | "from selenium import webdriver #selenium is used to interact with the webpage, so the program can 'click' buttons.\n",
21 | "import pandas as pd #the data will be saved locally as a csv file. Pandas is a nice way to write/read/work with those files.\n",
22 | "from selenium.webdriver.firefox.firefox_binary import FirefoxBinary #This will let the program open the webpage on a new Firefox window.\n",
23 | "from bs4 import BeautifulSoup #BeautifulSoup is used to parse the HTML of the downloaded website to find the particular information desired.\n",
24 | "import time #I will need to delay the program to give the webpage time to open. time will be used for that.\n",
25 | "import sys #This is only used to assign a location to my path. The location where I have a needed file for selenium."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "the next section adds my current directory to my path. The main reason for this is selenium. Selenium uses geckodriver to talk with Firefox. That application is something you have to download special. Ideally, I would have geckodriver in a folder in python's path already. Instead, I just have it saved into the same folder I'm collecting all my data from the EMS call log. Not great, I know, but it works."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 8,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "sys.path\n",
44 | "sys.path.append('/path/to/the/example_file.py')\n",
45 | "sys.path.append('C:\\\\Users\\\\Kyle\\\\Documents\\\\Blog Posts\\\\EugeneEMSCalls')"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "The call log website gives the date with the month abbreviation. This dictionary lets me easily convert that into a 2 number string."
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 9,
58 | "metadata": {
59 | "collapsed": true
60 | },
61 | "outputs": [],
62 | "source": [
63 | "Months={'Jan':'01','Feb':'02','Mar':'03','Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 10,
69 | "metadata": {
70 | "collapsed": true
71 | },
72 | "outputs": [],
73 | "source": [
74 | "#url of the website with the call log data\n",
75 | "url='http://coeapps.eugene-or.gov/ruralfirecad'"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "The next section opens a new window of Firefox browser. This is the tab where the url will be loaded and where the code will 'click' javascript buttons and download the HTML of the page. "
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 11,
88 | "metadata": {
89 | "collapsed": false
90 | },
91 | "outputs": [],
92 | "source": [
93 | "binary = FirefoxBinary('C:\\\\Program Files (x86)\\\\Mozilla Firefox\\\\firefox.exe')\n",
94 | "driver = webdriver.Firefox(firefox_binary=binary)\n",
95 | "driver.get(url)"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 12,
101 | "metadata": {
102 | "collapsed": false
103 | },
104 | "outputs": [
105 | {
106 | "name": "stdout",
107 | "output_type": "stream",
108 | "text": [
109 | "Number of Calls on Jan 9, 2017: 107\n"
110 | ]
111 | }
112 | ],
113 | "source": [
114 | "#quick check to see if selenium correctly got to the page. \n",
115 | "#this searches the HTML of the page for the HTML element id'd as 'callSummary',\n",
116 | "#and prints the text.\n",
117 | "summary=driver.find_element_by_id('callSummary').text\n",
118 | "print(summary)"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "The web scraping code. The way this deals with time is a bit wonky. Initially, it will step through every day of the current month from first day to last day. Those days that are still in the future won't have any calls and will not be saved into csv files. Then, the program steps back one month and repeats; again, going from the first to the last day. Eventually, the program reaches the set end date, where it breaks out of the while loop."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 13,
131 | "metadata": {
132 | "collapsed": false
133 | },
134 | "outputs": [],
135 | "source": [
136 | "endIt=0;\n",
137 | "while endIt==0:\n",
138 | " calendarOptions=driver.find_element_by_id('calendar').text.split()\n",
139 | " monthsDays=[];\n",
140 | " for elm in calendarOptions:\n",
141 | " try:\n",
142 | " potlDay=int(elm)\n",
143 | " if potlDay<=31:\n",
144 | " monthsDays.append(elm)\n",
145 | " except:\n",
146 | " pass\n",
147 | " for day in monthsDays:\n",
148 | " driver.find_element_by_link_text(day).click()\n",
149 | " time.sleep(1) #giving firefox time to open.\n",
150 | " summary=driver.find_element_by_id('callSummary').text;\n",
151 | " dateData=summary[summary.index('on')+3:summary.index(':')].replace(',','').split(' ');\n",
152 | " if len(dateData[1])==1:\n",
153 | " dateData[1]='0'+dateData[1]\n",
154 | " date=int(dateData[2]+str(Months[dateData[0]])+dateData[1]);\n",
155 | " if int(date)==20161201: #End date is located here.\n",
156 | " endIt=1;\n",
157 | " break\n",
158 | " html = driver.page_source;\n",
159 | " soup = BeautifulSoup(html,'lxml'); #using BeautifulSoup to find the call logs.\n",
160 | " EMSdata=soup.find('table', class_='tablesorter');\n",
161 | " colNames1=EMSdata.thead.findAll('th') #recording the column names.\n",
162 | " colNames2=[];\n",
163 | " data1=[]\n",
164 | " for x in range(0,len(colNames1)):\n",
165 | " colNames2.append(colNames1[x].string.strip()) #saving each column value.\n",
166 | " data1.append([])\n",
167 | " for row in EMSdata.findAll(\"tr\"): #saving the individual call log data.\n",
168 | " cells = row.findAll('td')\n",
169 | " if len(cells)==len(colNames1):\n",
170 | " for y in range(0,len(cells)):\n",
171 | " data1[y].append(cells[y].string.strip())\n",
172 | " EMSdata1=pd.DataFrame(); #initializing a data frame to save 1 days worth of calls.\n",
173 | " for x in range(0,len(colNames2)):\n",
174 | " EMSdata1[colNames2[x]]=data1[x]\n",
175 | " EMSdata1['Date']=date;\n",
176 | " try:\n",
177 | " EMSdata1.to_csv('%s.csv'%(EMSdata1.loc[0,'Date']),index=False) #saving csv file of daily call logs.\n",
178 | " except:\n",
179 | " pass\n",
180 | " time.sleep(1) #giving time to save csv before moving on.\n",
181 | " if endIt==0:\n",
182 | " driver.find_element_by_link_text('Prev').click()"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "outputs": [],
192 | "source": []
193 | }
194 | ],
195 | "metadata": {
196 | "anaconda-cloud": {},
197 | "kernelspec": {
198 | "display_name": "Python [default]",
199 | "language": "python",
200 | "name": "python3"
201 | },
202 | "language_info": {
203 | "codemirror_mode": {
204 | "name": "ipython",
205 | "version": 3
206 | },
207 | "file_extension": ".py",
208 | "mimetype": "text/x-python",
209 | "name": "python",
210 | "nbconvert_exporter": "python",
211 | "pygments_lexer": "ipython3",
212 | "version": "3.5.2"
213 | }
214 | },
215 | "nbformat": 4,
216 | "nbformat_minor": 1
217 | }
218 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping
2 | A guided example of web scraping in Python using urlopen from urllib.request, BeautifulSoup, and pandas.
3 |
4 | # One Sentence Definition of Web Scraping
5 | Web scraping is having your computer visit many web pages, collect (scrape) data from each page, and save it locally for future use. It’s something you could do with copy/paste and an Excel table, but the sheer number of pages makes doing it by hand impractical.
6 |
7 | ## Keys to Web Scraping:
8 | The keys to web scraping are patterns. There needs to be some pattern the program can follow to go from one web page to the next, and the desired data needs to appear in some pattern so the web scraper can reliably collect it. Finding these patterns is the tricky, time-consuming part at the very beginning. But after they're discovered, writing the code for the web scraper is easy.
9 |
10 | # Simple Web Scraper
11 |
12 | Here is the finished web scraper built up in the rest of this document. I’ll go through each step, explaining what each element does and why, but this is the end goal.
13 |
14 | ```python
15 | from urllib.request import urlopen
16 | from bs4 import BeautifulSoup
17 | import pandas as pd
18 |
19 | df=pd.DataFrame()
20 | df=pd.DataFrame(columns=['emails'])
21 | for x in range(0,10):
22 |     url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm'
23 |     html=urlopen(url)
24 |     soup=BeautifulSoup(html.read(),'lxml')
25 |     subSoup=soup.find_all('p',class_='whs2');
26 |     for elm in subSoup:
27 |         ret=elm.contents[0]
28 |         if '@' in str(ret):
29 |             email=str(ret)
30 | df['emails']=[email]
31 | df.to_csv('Scraped_emails.csv',index=False)
32 | ```
33 | 
34 | # Needed Modules
35 | 
36 | pip for Python 2.X or pip for Python 3.X are good ways to acquire new packages.
37 |
38 | This web scraper will make use of three modules:
39 |
40 | 1) urlopen from urllib.request : to download page contents from a given valid url
41 |
42 | 2) BeautifulSoup from bs4 : to navigate the HTML of the downloaded page
43 |
44 | 3) pandas : to store our scraped data
45 |
46 | # Target Web Page and Desired Data
47 | For the purposes of this little tutorial, I'll show how to scrape just a single web page, one with very simple HTML, so it's easier to understand what's going on:
48 |
49 |
50 |
51 | This page has minimal information, but let's say I want to collect the email address:
52 |
53 |
54 |
55 | # urlopen from urllib.request
56 |
57 | urlopen is a no-frills way of making a web scraper, and the recipe is simple: (1) Give urlopen a valid url. (2) Watch as urlopen downloads the HTML from that url.
58 |
59 | ```python
60 | from urllib.request import urlopen
61 | url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm'
62 | html=urlopen(url)
63 | print(html)
64 | ```
65 | If you fed urlopen a valid url, the print statement should show something along the lines of:
66 | 
67 | ```text
68 | <http.client.HTTPResponse object at 0x...>
69 | ```
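
That object isn't the HTML itself; it's an open connection you can read from. If you want to peek at the raw HTML as text, read and decode it (note that read() consumes the response, so call urlopen again before handing the page to BeautifulSoup later on):

```python
html=urlopen(url)
raw=html.read()            # the raw page as bytes
text=raw.decode('utf-8')   # decode to a string (most pages are utf-8, but not all)
print(text[:200])          # peek at the first 200 characters
```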
70 |
71 | ## Patterns for Downloading many pages
72 |
73 | Writing a web scraper for a single page makes no sense. But often the many pages you'll want to scrape have some pattern in their urls. For example, these urls lead to pages with historical daily weather data for Eugene, OR. The day in question is explicitly stated in the url:
74 | 
75 | >https://www.wunderground.com/history/airport/KEUG/2016/01/05/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999
76 | 
77 | >https://www.wunderground.com/history/airport/KEUG/2016/01/04/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999
78 |
79 | So, you could visit many of these pages by writing a Python script that creates new url strings by changing the date.
80 |
81 | For example, downloading the HTML of every day in January 2016:
82 | ```python
83 | for x in range(1,32):
84 | url="https://www.wunderground.com/history/airport/KEUG/2016/01/%s/DailyHistory.htmlreq_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999" % (x)
85 | html=urlopen(url)
86 | ```
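
One small wrinkle: the dates in the example urls are zero padded ('04', '05'), while %s gives '4' and '5' for single-digit days. If the site turns out to be picky about that, pad the day number before dropping it into the url:

```python
for x in range(1,32):
    day=str(x).zfill(2)   # 1 -> '01', 10 -> '10'
    url="https://www.wunderground.com/history/airport/KEUG/2016/01/%s/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999" % (day)
    html=urlopen(url)
```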
87 | 
89 |
90 | # BeautifulSoup to Navigate the HTML
91 |
92 | So now you have the HTML from the page, but no way of reading it. Cue BeautifulSoup. BeautifulSoup will do two major things for us. (1) It will decode the HTML into something we can read in the Python script. (2) It will let us navigate through the many lines of HTML by making use of the tags and labels of HTML coding.
93 |
94 | ```python
95 | from bs4 import BeautifulSoup
96 | soup=BeautifulSoup(html.read(),'lxml')
97 | print(soup)
98 | ```
99 |
100 | This is the tedious bit. See if you can find the email address in the parsed HTML:
101 |
102 | ```html
103 | ...
104 | Example of a simple HTML page
105 | ...
106 | Example of a simple HTML page
107 | Hypertext Markup Language (HTML) is the most common language used to
108 | create documents on the World Wide Web. HTML uses hundreds of different
109 | tags to define a layout for web pages. Most tags require an opening <tag>
110 | and a closing </tag>.
111 | Example: <b>On
112 | a webpage, this sentence would be in bold print.</b>
113 | ...
114 | Send me mail at <a href="mailto:support@yourcompany.com">
115 | support@yourcompany.com</a>.
116 | <P> This is a new paragraph!
117 | <P> <B>This is a new paragraph!</B>
118 | <BR> <B><I>This is a new
119 | sentence without a paragraph break, in bold italics.</I></B>
120 | <HR>
121 | </BODY>
122 | </HTML>
123 | ...
233 | ```
234 | And that HTML is from a very simple webpage. Most pages you'll want to scrape are much more complex and sorting through them can be a chore. It's a tedious step that all writers of web scrapers must suffer through.
235 |
236 | You can make your life easier by figuring out where the information you want is located within the html of the webpage when it's open in your browser.
237 |
238 | ## Inspect Element to Locate the HTML Tags
239 |
240 | In Firefox, right click on the information you care about and select 'inspect element' (other browsers have a similar feature). The browser will let you look under the hood at the html of the page and take you right to the section you care about.
241 |
242 |
243 |
244 | Here I can note all the HTML tags I can use to locate the email address with BeautifulSoup:
245 |
246 | It's in a paragraph ('p'). It has a class ('whs2'). The address is in a link (< a > < /a >). The email address contains the special character ‘@’.
247 |
248 | That's more than enough for BeautifulSoup.
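
As a quick sanity check before writing any find calls, you can confirm that the '@' character really does show up in the page's text; get_text() flattens the whole soup into one string:

```python
print('@' in soup.get_text())   # True: on this page, only the email address contains an '@'
```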
249 |
250 |
251 | ## Navigating with BeautifulSoup
252 | There are a lot of tools in BeautifulSoup, but we'll make do with three: find, find_all, and contents.
253 |
254 | find gives the first instance of some HTML tag:
255 |
256 | ```python
257 | soup.find('p')
258 | ```
259 |
260 | ```html
261 | Hypertext Markup Language (HTML) is the most common language used to
262 | create documents on the World Wide Web. HTML uses hundreds of different
263 | tags to define a layout for web pages. Most tags require an opening <tag>
264 | and a closing </tag>.
265 | ```
266 |
267 | contents gives you any text find found (in the form of a list of length 1):
268 |
269 | ```python
270 | soup.find('p').contents
271 | ```
272 |
273 | ```text
274 | ['Hypertext Markup Language (HTML) is the most common language used to \n create documents on the World Wide Web. HTML uses hundreds of different \n tags to define a layout for web pages. Most tags require an opening \n and a closing .']
275 | ```
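
Because .contents is a list, you'll usually grab the first item with [0] when all you want is the text; this is what the finished scraper at the top does before testing for '@'. A quick illustration of that pattern:

```python
first_p=soup.find('p')
ret=first_p.contents[0]   # .contents is a list; [0] is its first item, here the paragraph's text
print('@' in str(ret))    # False for this paragraph; the test only passes for the email address
```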
276 |
277 | find_all returns every instance of that tag. Again, in a list:
278 |
279 | ```python
280 | soup.find_all('p')
281 | ```
282 |
283 | ```html
284 | [Hypertext Markup Language (HTML) is the most common language used to
285 |  create documents on the World Wide Web. HTML uses hundreds of different
286 |  tags to define a layout for web pages. Most tags require an opening <tag>
287 |  and a closing </tag>.,
288 |  Example: <b>On
289 |  a webpage, this sentence would be in bold print.</b>,
290 |  Send me mail at <a href="mailto:support@yourcompany.com">,
291 |  support@yourcompany.com</a>.,
292 |  <P> This is a new paragraph!,
293 |  <P> <B>This is a new paragraph!</B>,
294 |  <BR> <B><I>This is a new
295 |  sentence without a paragraph break, in bold italics.</I></B>,
296 |  <HR>,
297 |  </BODY>,
298 |  </HTML>,
299 |  ,
300 |  ,
301 | ]
324 | ```
325 |
326 | If the HTML object also has a class, you can narrow your results even further:
327 |
328 | ```python
329 | soup.find_all('p', class_='whs2')
330 | ```
331 |
332 | ```html
333 | [ ...,
334 |  Send me mail at <a href="mailto:support@yourcompany.com">,
335 |  support@yourcompany.com</a>.,
336 |  <P> This is a new paragraph!,
337 |  <P> <B>This is a new paragraph!</B>,
338 |  <BR> <B><I>This is a new
339 |  sentence without a paragraph break, in bold italics.</I></B>,
340 |  <HR>,
341 |  </BODY>,
342 |  </HTML>,
343 |  ,
344 | ]
362 | ```
363 |
364 | Finally, because HTML elements returned by find_all are in a list, you can iterate through them:
365 |
366 | ```python
367 | subSoup=soup.find_all('p',class_='whs2');
368 | for elm in subSoup:
369 |     print(elm.contents)
370 | ```
371 |
372 | ```text
373 | [' ']
374 | [' ']
375 | ['Your Title Here \n ']
376 | [' ']
377 | [' \n ']
378 | ['...']
379 | [' ']
380 | ['Link \n Name ']
381 | ['is a link to another nifty site ']
382 | ['...']
383 | ```
384 | 
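# pandas to Store the Data

pandas gives us a DataFrame, a simple table we can drop the scraped data into and later write out as a csv file. The snippet below is a minimal sketch that recreates the setup from the finished scraper at the top of this page: it makes an empty DataFrame and pulls the email text out of subSoup by looking for the '@' character.

```python
import pandas as pd

df=pd.DataFrame()          # an empty DataFrame that will hold the scraped emails
for elm in subSoup:        # subSoup is the find_all('p',class_='whs2') result from above
    ret=elm.contents[0]
    if '@' in str(ret):    # only the email address contains '@'
        email=str(ret).strip()
```
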
445 | And add the data to the DataFrame, in the form of a list:
446 |
447 | ```python
448 | df['Emails']=[email]
449 | ```
450 |
451 | |   | Emails                  |
452 | |---|-------------------------|
453 | | 0 | support@yourcompany.com |
454 | 
460 | (This would be more impressive if we had more than 1 email to put into the DataFrame.)
461 |
462 | If we had a second email (say: support2@yourcompany.com), we could append it to the DataFrame by stating the row and column it should be added to:
463 |
464 | ```python
465 | df.loc[1,'Emails']='support2@yourcompany.com'
466 | ```
467 |
468 | |   | Emails                   |
469 | |---|--------------------------|
470 | | 0 | support@yourcompany.com  |
471 | | 1 | support2@yourcompany.com |
472 | 
480 | In this way, each time our web scraper goes to a new page and scrapes the desired data, we can save it in the next row of the DataFrame.
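
A convenient way to do that without keeping track of the row number yourself is to use the current length of the DataFrame as the next index (just one option; a counter variable works too):

```python
for new_email in ['a@example.com','b@example.com']:   # stand-ins for freshly scraped values
    df.loc[len(df),'Emails']=new_email                # len(df) is always the index of the next empty row
```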
481 |
482 | After the scraping is complete, you can save everything to a csv file with:
483 |
484 | ```python
485 | df.to_csv('Scraped_emails.csv',index=False)
486 | ```
487 |
488 | And later import it directly into a DataFrame with:
489 |
490 | ```python
491 | df=pd.read_csv('Scraped_emails.csv')
492 | ```
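
A quick way to confirm the round trip worked is to peek at the first few rows of the reloaded DataFrame:

```python
print(df.head())   # shows the first five rows of the DataFrame
```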
493 |
494 | # Warning and Practice Problem
495 |
496 | That’s it. That should be enough to go forth and write a web scraper of your own. But first a warning:
497 |
498 | Be a good web scraper. Web scrapers can visit far more pages than a human ever could, and they never click on ads. For these and other reasons, many sites ban or limit the use of web scrapers. Make sure to check with the site (often in a robots.txt document) before collecting data. And even if they let you, think about limiting the rate at which you call up webpages. It will keep you in the good books.
499 |
500 | With everything presented here, you can scrape much more meaningful data than some fake email address. If you want to practice this yourself, write a web scraper to collect the mean daily temperature for Eugene, OR from Jan. 1, 2016 to Jan. 31, 2016 from the Weather Underground daily history pages shown above.
501 | 
502 |
503 | If you get stuck, take a look at a very similar web scraper in this directory.
504 |
--------------------------------------------------------------------------------
/Simple Web Scraper.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "The python script used to generate the readme file"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 3,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "from urllib.request import urlopen\n",
19 | "from bs4 import BeautifulSoup\n",
20 | "import pandas as pd"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 7,
26 | "metadata": {
27 | "collapsed": false
28 | },
29 | "outputs": [
30 | {
31 | "data": {
32 | "text/html": [
33 | "