├── Get the weather-GitHub.ipynb
├── GetEMSdata-GitHub.ipynb
├── README.md
├── Simple Web Scraper.ipynb
├── WS_images
│   ├── CopyPaste1.png
│   ├── CopyPaste2.png
│   ├── simpleHTMLpageSS.png
│   ├── simpleHTMLpageSS2.png
│   └── simpleHTMLpageSS3.png
└── Web Scraper for Weather.ipynb
/GetEMSdata-GitHub.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": false
7 | },
8 | "source": [
9 | "This notebook was written to scrape a publically listed emergency call log for the Eugene OR area. It may be useful as a guide for projects that need to scrape data from a webpage with dynamically generated content from a javascript interface."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 7,
15 | "metadata": {
16 | "collapsed": true
17 | },
18 | "outputs": [],
19 | "source": [
20 | "from selenium import webdriver #selenium is used to interact with the webpage, so the program can 'click' buttons.\n",
21 | "import pandas as pd #the data will be saved locally as a csv file. Pandas is a nice way to write/read/work with those files.\n",
22 | "from selenium.webdriver.firefox.firefox_binary import FirefoxBinary #This will let the program open the webpage on a new Firefox window.\n",
23 | "from bs4 import BeautifulSoup #BeautifulSoup is used to parse the HTML of the downloaded website to find the particular information desired.\n",
24 | "import time #I will need to delay the program to give the webpage time to open. time will be used for that.\n",
25 | "import sys #This is only used to assign a location to my path. The location where I have a needed file for selenium."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "the next section adds my current directory to my path. The main reason for this is selenium. Selenium uses geckodriver to talk with Firefox. That application is something you have to download special. Ideally, I would have geckodriver in a folder in python's path already. Instead, I just have it saved into the same folder I'm collecting all my data from the EMS call log. Not great, I know, but it works."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 8,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "sys.path\n",
44 | "sys.path.append('/path/to/the/example_file.py')\n",
45 | "sys.path.append('C:\\\\Users\\\\Kyle\\\\Documents\\\\Blog Posts\\\\EugeneEMSCalls')"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "The call log website gives the date with the month abbreviation. This dictionary lets me easily convert that into a 2 number string."
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 9,
58 | "metadata": {
59 | "collapsed": true
60 | },
61 | "outputs": [],
62 | "source": [
63 | "Months={'Jan':'01','Feb':'02','Mar':'03','Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 10,
69 | "metadata": {
70 | "collapsed": true
71 | },
72 | "outputs": [],
73 | "source": [
74 | "#url of the website with the call log data\n",
75 | "url='http://coeapps.eugene-or.gov/ruralfirecad'"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "The next section opens a new window of Firefox browser. This is the tab where the url will be loaded and where the code will 'click' javascript buttons and download the HTML of the page. "
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 11,
88 | "metadata": {
89 | "collapsed": false
90 | },
91 | "outputs": [],
92 | "source": [
93 | "binary = FirefoxBinary('C:\\\\Program Files (x86)\\\\Mozilla Firefox\\\\firefox.exe')\n",
94 | "driver = webdriver.Firefox(firefox_binary=binary)\n",
95 | "driver.get(url)"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 12,
101 | "metadata": {
102 | "collapsed": false
103 | },
104 | "outputs": [
105 | {
106 | "name": "stdout",
107 | "output_type": "stream",
108 | "text": [
109 | "Number of Calls on Jan 9, 2017: 107\n"
110 | ]
111 | }
112 | ],
113 | "source": [
114 | "#quick check to see if selenium correctly got to the page. \n",
115 | "#this searches the HTML of the page for the HTML element id'd as 'callSummary',\n",
116 | "#and prints the text.\n",
117 | "summary=driver.find_element_by_id('callSummary').text\n",
118 | "print(summary)"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "The web scraping code. The way this deals with time is a bit wonky. Initially, it will step through every day of the current month from first day to last day. Those days that are still in the future won't have any calls and will not be saved into csv files. Then, the program steps back one month and repeats; again, going from the first to the last day. Eventually, the program reaches the set end date, where it breaks out of the while loop."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 13,
131 | "metadata": {
132 | "collapsed": false
133 | },
134 | "outputs": [],
135 | "source": [
136 | "endIt=0;\n",
137 | "while endIt==0:\n",
138 | " calendarOptions=driver.find_element_by_id('calendar').text.split()\n",
139 | " monthsDays=[];\n",
140 | " for elm in calendarOptions:\n",
141 | " try:\n",
142 | " potlDay=int(elm)\n",
143 | " if potlDay<=31:\n",
144 | " monthsDays.append(elm)\n",
145 | " except:\n",
146 | " pass\n",
147 | " for day in monthsDays:\n",
148 | " driver.find_element_by_link_text(day).click()\n",
149 | " time.sleep(1) #giving firefox time to open.\n",
150 | " summary=driver.find_element_by_id('callSummary').text;\n",
151 | " dateData=summary[summary.index('on')+3:summary.index(':')].replace(',','').split(' ');\n",
152 | " if len(dateData[1])==1:\n",
153 | " dateData[1]='0'+dateData[1]\n",
154 | " date=int(dateData[2]+str(Months[dateData[0]])+dateData[1]);\n",
155 | " if int(date)==20161201: #End date is located here.\n",
156 | " endIt=1;\n",
157 | " break\n",
158 | " html = driver.page_source;\n",
159 | " soup = BeautifulSoup(html,'lxml'); #using BeautifulSoup to find the call logs.\n",
160 | " EMSdata=soup.find('table', class_='tablesorter');\n",
161 | " colNames1=EMSdata.thead.findAll('th') #recording the column names.\n",
162 | " colNames2=[];\n",
163 | " data1=[]\n",
164 | " for x in range(0,len(colNames1)):\n",
165 | " colNames2.append(colNames1[x].string.strip()) #saving each column value.\n",
166 | " data1.append([])\n",
167 | " for row in EMSdata.findAll(\"tr\"): #saving the individual call log data.\n",
168 | " cells = row.findAll('td')\n",
169 | " if len(cells)==len(colNames1):\n",
170 | " for y in range(0,len(cells)):\n",
171 | " data1[y].append(cells[y].string.strip())\n",
172 | " EMSdata1=pd.DataFrame(); #initializing a data frame to save 1 days worth of calls.\n",
173 | " for x in range(0,len(colNames2)):\n",
174 | " EMSdata1[colNames2[x]]=data1[x]\n",
175 | " EMSdata1['Date']=date;\n",
176 | " try:\n",
177 | " EMSdata1.to_csv('%s.csv'%(EMSdata1.loc[0,'Date']),index=False) #saving csv file of daily call logs.\n",
178 | " except:\n",
179 | " pass\n",
180 | " time.sleep(1) #giving time to save csv before moving on.\n",
181 | " if endIt==0:\n",
182 | " driver.find_element_by_link_text('Prev').click()"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "outputs": [],
192 | "source": []
193 | }
194 | ],
195 | "metadata": {
196 | "anaconda-cloud": {},
197 | "kernelspec": {
198 | "display_name": "Python [default]",
199 | "language": "python",
200 | "name": "python3"
201 | },
202 | "language_info": {
203 | "codemirror_mode": {
204 | "name": "ipython",
205 | "version": 3
206 | },
207 | "file_extension": ".py",
208 | "mimetype": "text/x-python",
209 | "name": "python",
210 | "nbconvert_exporter": "python",
211 | "pygments_lexer": "ipython3",
212 | "version": "3.5.2"
213 | }
214 | },
215 | "nbformat": 4,
216 | "nbformat_minor": 1
217 | }
218 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping
2 | A guided example of web scraping in Python using urlopen from urllib.request, BeautifulSoup, and pandas.
3 |
4 | # One Sentence Definition of Web Scraping
5 | Web scraping is having your computer visit many web pages, collect (scrape) data from each page, and save it locally for future use. It’s something you could do with copy/paste and an Excel table, but the sheer number of pages makes doing it by hand impractical.
6 |
7 | ## Keys to Web Scraping:
8 | The keys to web scraping are patterns. There needs to be some pattern the program can follow to go from one web page to the next, and the desired data needs to appear in some pattern so the web scraper can reliably collect it. Finding these patterns is the tricky, time-consuming part at the very beginning. But after they're discovered, writing the code for the web scraper is easy.
9 |
10 | # Simple Web Scraper
11 |
12 | Here is the finished web scraper built up in the rest of this document. I’ll go through each step, explaining what each element does and why, but this is the end goal.
13 |
14 | ```python
15 | from urllib.request import urlopen
16 | from bs4 import BeautifulSoup
17 | import pandas as pd
18 |
19 | df=pd.DataFrame()
20 | df=pd.DataFrame(columns=['emails'])
21 | for x in range(0,10):
22 |     url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm'
23 |     html=urlopen(url)
24 |     soup=BeautifulSoup(html.read(),'lxml')
25 |     subSoup=soup.find_all('p',class_='whs2');
26 |     for elm in subSoup:
27 |         ret=elm.contents[0]
28 |         if '@' in str(ret):
29 |             email=str(ret)
30 | df['emails']=[email]
31 | df.to_csv('Scraped_emails.csv',index=False)
32 | ```
33 | 
34 | # Needed Modules
35 | 
36 | pip for Python 2.X or pip for Python 3.X are good ways to acquire new packages.
37 |
38 | This web scraper will make use of three modules:
39 |
40 | 1) urlopen from urllib.request : to download page contents from a given valid url
41 |
42 | 2) BeautifulSoup from bs4 : to navigate the HTML of the downloaded page
43 |
44 | 3) pandas : to store our scraped data
45 |
46 | # Target Web Page and Desired Data
47 | For the purposes of this little tutorial, I'll show how to scrape just a single web page, one with very simple HTML, so it's easier to understand what's going on:
48 |
49 |
50 |
51 | This page has minimal information, but let's say I want to collect the email address:
52 |
53 |
54 |
55 | # urlopen from urllib.request
56 |
57 | urlopen is a no-frills way of making a web scraper, and the recipe is simple: (1) Give urlopen a valid url. (2) Watch as urlopen downloads the HTML from that url.
58 |
59 | ```python
60 | from urllib.request import urlopen
61 | url='http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm'
62 | html=urlopen(url)
63 | print(html)
64 | ```
65 | If you fed urlopen a valid url, the print statement should show something along the lines of:
66 | 
67 | ```text
68 | <http.client.HTTPResponse object at 0x...>
69 | ```
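
That object isn't the HTML itself; it's an open connection you can read from. If you want to peek at the raw HTML as text, read and decode it (note that read() consumes the response, so call urlopen again before handing the page to BeautifulSoup later on):

```python
html=urlopen(url)
raw=html.read()            # the raw page as bytes
text=raw.decode('utf-8')   # decode to a string (most pages are utf-8, but not all)
print(text[:200])          # peek at the first 200 characters
```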
70 |
71 | ## Patterns for Downloading many pages
72 |
73 | Writing a web scraper for a single page makes no sense. But often the many pages you'll want to scrape have some pattern in their urls. For example, these urls lead to pages with historical daily weather data for Eugene, OR. The day in question is explicitly stated in the url:
74 | 
75 | >https://www.wunderground.com/history/airport/KEUG/2016/01/05/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999
76 | 
77 | >https://www.wunderground.com/history/airport/KEUG/2016/01/04/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999
78 |
79 | So, you could visit many of these pages by writing a Python script that creates new url strings by changing the date.
80 |
81 | For example, downloading the HTML of every day in January 2016:
82 | ```python
83 | for x in range(1,32):
84 | url="https://www.wunderground.com/history/airport/KEUG/2016/01/%s/DailyHistory.htmlreq_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999" % (x)
85 | html=urlopen(url)
86 | ```
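
One small wrinkle: the dates in the example urls are zero padded ('04', '05'), while %s gives '4' and '5' for single-digit days. If the site turns out to be picky about that, pad the day number before dropping it into the url:

```python
for x in range(1,32):
    day=str(x).zfill(2)   # 1 -> '01', 10 -> '10'
    url="https://www.wunderground.com/history/airport/KEUG/2016/01/%s/DailyHistory.html?req_city=Eugene&req_state=OR&req_statename=Oregon&reqdb.zip=97404&reqdb.magic=1&reqdb.wmo=99999" % (day)
    html=urlopen(url)
```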
87 | 
89 |
90 | # BeautifulSoup to Navigate the HTML
91 |
92 | So now you have the HTML from the page, but no way of reading it. Cue BeautifulSoup. BeautifulSoup will do two major things for us. (1) It will decode the HTML into something we can read in the Python script. (2) It will let us navigate through the many lines of HTML by making use of the tags and labels of HTML coding.
93 |
94 | ```python
95 | from bs4 import BeautifulSoup
96 | soup=BeautifulSoup(html.read(),'lxml')
97 | print(soup)
98 | ```
99 |
100 | This is the tedious bit. See if you can find the email address in the parsed HTML:
101 |
102 | ```html
103 | ...
104 | Example of a simple HTML page
105 | ...
106 | Example of a simple HTML page
107 | Hypertext Markup Language (HTML) is the most common language used to
108 | create documents on the World Wide Web. HTML uses hundreds of different
109 | tags to define a layout for web pages. Most tags require an opening <tag>
110 | and a closing </tag>.
111 | Example: <b>On
112 | a webpage, this sentence would be in bold print.</b>
113 | ...
114 | Send me mail at <a href="mailto:support@yourcompany.com">
115 | support@yourcompany.com</a>.
116 | <P> This is a new paragraph!
117 | <P> <B>This is a new paragraph!</B>
118 | <BR> <B><I>This is a new
119 | sentence without a paragraph break, in bold italics.</I></B>
120 | <HR>
121 | </BODY>
122 | </HTML>
123 | ...
233 | ```
234 | And that HTML is from a very simple webpage. Most pages you'll want to scrape are much more complex and sorting through them can be a chore. It's a tedious step that all writers of web scrapers must suffer through.
235 |
236 | You can make your life easier by figuring out where the information you want is located within the html of the webpage when it's open in your browser.
237 |
238 | ## Inspect Element to Locate the HTML Tags
239 |
240 | In Firefox, right click on the information you care about and select 'inspect element' (other browsers have a similar feature). The browser will let you look under the hood at the html of the page and take you right to the section you care about.
241 |
242 |
243 |
244 | Here I can note all the HTML tags I can use to locate the email address with BeautifulSoup:
245 |
246 | It's in a paragraph ('p'). It has a class ('whs2'). The address is in a link (< a > < /a >). The email address contains the special character ‘@’.
247 |
248 | That's more than enough for BeautifulSoup.
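
As a quick sanity check before writing any find calls, you can confirm that the '@' character really does show up in the page's text; get_text() flattens the whole soup into one string:

```python
print('@' in soup.get_text())   # True: on this page, only the email address contains an '@'
```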
249 |
250 |
251 | ## Navigating with BeautifulSoup
252 | There are a lot of tools in BeautifulSoup, but we'll make do with three: find, find_all, and contents.
253 |
254 | find gives the first instance of some HTML tag:
255 |
256 | ```python
257 | soup.find('p')
258 | ```
259 |
260 | ```html
261 | Hypertext Markup Language (HTML) is the most common language used to
262 | create documents on the World Wide Web. HTML uses hundreds of different
263 | tags to define a layout for web pages. Most tags require an opening <tag>
264 | and a closing </tag>.
265 | ```
266 |
267 | contents gives you any text find found (in the form of a list of length 1):
268 |
269 | ```python
270 | soup.find('p').contents
271 | ```
272 |
273 | ```text
274 | ['Hypertext Markup Language (HTML) is the most common language used to \n create documents on the World Wide Web. HTML uses hundreds of different \n tags to define a layout for web pages. Most tags require an opening \n and a closing .']
275 | ```
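
Because .contents is a list, you'll usually grab the first item with [0] when all you want is the text; this is what the finished scraper at the top does before testing for '@'. A quick illustration of that pattern:

```python
first_p=soup.find('p')
ret=first_p.contents[0]   # .contents is a list; [0] is its first item, here the paragraph's text
print('@' in str(ret))    # False for this paragraph; the test only passes for the email address
```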
276 |
277 | find_all returns every instance of that tag. Again, in a list:
278 |
279 | ```python
280 | soup.find_all('p')
281 | ```
282 |
283 | ```html
284 | [Hypertext Markup Language (HTML) is the most common language used to
285 |  create documents on the World Wide Web. HTML uses hundreds of different
286 |  tags to define a layout for web pages. Most tags require an opening <tag>
287 |  and a closing </tag>.,
288 |  Example: <b>On
289 |  a webpage, this sentence would be in bold print.</b>,
290 |  Send me mail at <a href="mailto:support@yourcompany.com">,
291 |  support@yourcompany.com</a>.,
292 |  <P> This is a new paragraph!,
293 |  <P> <B>This is a new paragraph!</B>,
294 |  <BR> <B><I>This is a new
295 |  sentence without a paragraph break, in bold italics.</I></B>,
296 |  <HR>,
297 |  </BODY>,
298 |  </HTML>,
299 |  ,
300 |  ,
301 | ]
324 | ```
325 |
326 | If the HTML object also has a class, you can narrow your results even further:
327 |
328 | ```python
329 | soup.find_all('p', class_='whs2')
330 | ```
331 |
332 | ```html
333 | [ ...,
334 |  Send me mail at <a href="mailto:support@yourcompany.com">,
335 |  support@yourcompany.com</a>.,
336 |  <P> This is a new paragraph!,
337 |  <P> <B>This is a new paragraph!</B>,
338 |  <BR> <B><I>This is a new
339 |  sentence without a paragraph break, in bold italics.</I></B>,
340 |  <HR>,
341 |  </BODY>,
342 |  </HTML>,
343 |  ,
344 | ]
362 | ```
363 |
364 | Finally, because HTML elements returned by find_all are in a list, you can iterate through them:
365 |
366 | ```python
367 | subSoup=soup.find_all('p',class_='whs2');
368 | for elm in subSoup:
369 |     print(elm.contents)
370 | ```
371 |
372 | ```text
373 | [' ']
374 | [' ']
375 | ['Your Title Here \n ']
376 | [' ']
377 | [' \n ']
378 | ['...']
379 | [' ']
380 | ['Link \n Name ']
381 | ['is a link to another nifty site ']
382 | ['...']
383 | ```
384 | 
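# pandas to Store the Data

pandas gives us a DataFrame, a simple table we can drop the scraped data into and later write out as a csv file. The snippet below is a minimal sketch that recreates the setup from the finished scraper at the top of this page: it makes an empty DataFrame and pulls the email text out of subSoup by looking for the '@' character.

```python
import pandas as pd

df=pd.DataFrame()          # an empty DataFrame that will hold the scraped emails
for elm in subSoup:        # subSoup is the find_all('p',class_='whs2') result from above
    ret=elm.contents[0]
    if '@' in str(ret):    # only the email address contains '@'
        email=str(ret).strip()
```
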
445 | And add the data to the DataFrame, in the form of a list:
446 |
447 | ```python
448 | df['Emails']=[email]
449 | ```
450 |
451 | |   | Emails                  |
452 | |---|-------------------------|
453 | | 0 | support@yourcompany.com |
454 | 
460 | (This would be more impressive if we had more than 1 email to put into the DataFrame.)
461 |
462 | If we had a second email (say: support2@yourcompany.com), we could append it to the DataFrame by stating the row and column it should be added to:
463 |
464 | ```python
465 | df.loc[1,'Emails']='support2@yourcompany.com'
466 | ```
467 |
468 | |   | Emails                   |
469 | |---|--------------------------|
470 | | 0 | support@yourcompany.com  |
471 | | 1 | support2@yourcompany.com |
472 | 
480 | In this way, each time our web scraper goes to a new page and scrapes the desired data, we can save it in the next row of the DataFrame.
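
A convenient way to do that without keeping track of the row number yourself is to use the current length of the DataFrame as the next index (just one option; a counter variable works too):

```python
for new_email in ['a@example.com','b@example.com']:   # stand-ins for freshly scraped values
    df.loc[len(df),'Emails']=new_email                # len(df) is always the index of the next empty row
```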
481 |
482 | After the scraping is complete, you can save everything to a csv file with:
483 |
484 | ```python
485 | df.to_csv('Scraped_emails.csv',index=False)
486 | ```
487 |
488 | And later import it directly into a DataFrame with:
489 |
490 | ```python
491 | df=pd.read_csv('Scraped_emails.csv')
492 | ```
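
A quick way to confirm the round trip worked is to peek at the first few rows of the reloaded DataFrame:

```python
print(df.head())   # shows the first five rows of the DataFrame
```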
493 |
494 | # Warning and Practice Problem
495 |
496 | That’s it. That should be enough to go forth and write a web scraper of your own. But first a warning:
497 |
498 | Be a good web scraper. Web scrapers can visit far more pages than a human ever could, and they never click on ads. For these and other reasons, many sites ban or limit the use of web scrapers. Make sure to check with the site (often in a robots.txt document) before collecting data. And even if they let you, think about limiting the rate at which you call up webpages. It will keep you in the good books.
499 |
500 | With everything presented here, you can scrape much more meaningful data than some fake email address. If you want to practice this yourself, write a web scraper to collect the mean daily temperature for Eugene, OR from Jan. 1, 2016 to Jan. 31, 2016 from the Weather Underground daily history pages shown above.
501 | 
502 |
503 | If you get stuck, take a look at a very similar web scraper in this directory.
504 |
--------------------------------------------------------------------------------
/Simple Web Scraper.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "The python script used to generate the readme file"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 3,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "from urllib.request import urlopen\n",
19 | "from bs4 import BeautifulSoup\n",
20 | "import pandas as pd"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 7,
26 | "metadata": {
27 | "collapsed": false
28 | },
29 | "outputs": [
30 | {
31 | "data": {
32 | "text/html": [
33 | "