└── Robust_extraction_of_web_data_with_Python.ipynb
/Robust_extraction_of_web_data_with_Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Who am I?\n",
8 | "- Slawomir Tulski (Slaw)\n",
9 | "- currently: Big Data Engineer at WorldRemit\n",
10 | "- previously: Python Data Programmer at Import.io (web scraping start-up)\n",
11 | "- linkedin: https://www.linkedin.com/in/slawomir-tulski-091611116/\n",
12 | "- personal website: http://slawomirtulski.com/"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "### My goals for today\n",
20 | "- show how to tackle web scraping in ways other than the \"standard\" approach\n",
21 | "- present useful web-scraping tips and tricks\n",
22 | "- avoid turning this into a tutorial on popular HTML parsing / scraping libraries\n",
23 | "\n",
24 | "### Plan\n",
25 | "- Quick introduction to scraping\n",
26 | "- Stop crawling, investigate your target instead\n",
27 | " + case study 1: getting all the URLs you need from a website\n",
28 | "- Look for APIs, even if the service does not provide a (public) one\n",
29 | " + case study 2: getting an API key and using a hidden API in a store locator service\n",
30 | " + case study 3: getting available Airbnb properties in London\n",
31 | " + case study 4: API JSON response embedded in HTML\n",
32 | "- Handling JavaScript with Selenium\n",
33 | " + case study 5: handling infinite scroll\n",
34 | "- Keynotes"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "## Some basics...\n",
42 | "\n",
43 | "### How does your browser work?\n",
44 | "- The World Wide Web operates on a client/server model\n",
45 | "- The web browser contacts a web server and requests information or resources\n",
46 | "- The server locates the information (HTML, images etc.) and sends it back to the web browser\n",
47 | "- The browser displays the results\n",
48 | "- The browser can execute JavaScript code to dynamically \"do things\" (send requests, change the site's appearance and basically everything else)\n",
49 | "- 4 basic HTTP request methods: GET and POST (the ones you'll use most often while scraping), PUT and DELETE\n",
50 | "\n",
51 | "### How to see what my browser is doing?\n",
52 | "- web browsers usually have some sort of \"Developer Tools\" (if yours does not, you should think about changing your browser)\n",
53 | "- there should be a 'Network' tab which shows what is being sent between your browser and the server\n",
54 | "- you can check exactly what types of requests were sent, along with their headers, parameters, cookies etc.\n",
55 | "- the Developer Tools also include a console in which you can execute JavaScript\n",
56 | "\n",
57 | "### \"Standard\" scraping approach\n",
58 | "0. I want DATA!!!!\n",
59 | "1. don't scrape... find data somewhere else!\n",
60 | "2. don't scrape... they should provide an API!\n",
61 | "3. ok.. you're screwed. get HTML and parse it!\n",
62 | "4. you need a lot of data from different pages of one web service? - build a crawler and \"catch them all\""
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "## Using sitemaps instead of crawling whole website\n",
70 | "\n",
71 | "### what is a web \"crawler\"?\n",
72 | "* an automated bot which recurses from a start page through all the internal links it finds\n",
73 | "* theoretically, it will traverse all the URLs on a website\n",
74 | "\n",
75 | "### why it's not the best idea?\n",
76 | "* not precise (it's brute force... a lot of requests made and a lot of garbage scraped)\n",
77 | "* you need to write more code and take care of a lot of things (what type of URL did it get, can I go there?)\n",
78 | "* assumes a particular page layout and tests whatever it encounters\n",
79 | "* easy to get caught in a trap (honeypots)\n",
80 | "\n",
81 | "### what to use instead?\n",
82 | "* very often a sitemap of the whole website is already available!\n",
83 | "* very often sitemaps are hidden! if you can't see one on the page, try **/sitemap.xml** [https://www.skipthedishes.com/]\n",
84 | "* also, information about the sitemap can be found in the **robots.txt** file [https://www.walmart.com/]\n",
85 | "* if there is no sitemap, try to follow the pattern **get categories -> get pages -> get listing -> get item**"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 24,
91 | "metadata": {
92 | "collapsed": true
93 | },
94 | "outputs": [],
95 | "source": [
96 | "# built-in\n",
97 | "import json\n",
98 | "import random\n",
99 | "import re\n",
100 | "import time\n",
101 | "# 3rd party\n",
102 | "from IPython.display import HTML\n",
103 | "import pandas as pd\n",
104 | "import requests\n",
105 | "from selenium import webdriver"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 15,
111 | "metadata": {
112 | "collapsed": false
113 | },
114 | "outputs": [
115 | {
116 | "name": "stdout",
117 | "output_type": "stream",
118 | "text": [
119 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails0.xml\n",
120 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails1.xml\n",
121 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails2.xml\n",
122 | "I've got 150000 of urls with properties.\n",
123 | "Some examples:\n",
124 | "\n",
125 | "- http://www.rightmove.co.uk/property-to-rent/property-50480715.html\n",
126 | "\n",
127 | "- http://www.rightmove.co.uk/property-to-rent/property-53775521.html\n",
128 | "\n",
129 | "- http://www.rightmove.co.uk/commercial-property-for-sale/property-64919567.html\n",
130 | "\n",
131 | "- http://www.rightmove.co.uk/property-to-rent/property-68904185.html\n",
132 | "\n",
133 | "- http://www.rightmove.co.uk/commercial-property-to-let/property-47279781.html\n",
134 | "\n",
135 | "- http://www.rightmove.co.uk/property-to-rent/property-61726357.html\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "\"\"\"\n",
141 | "Case Study 1: getting all links from a sitemap\n",
142 | "\n",
143 | "You want to analyse the housing market in the UK. The data that interests you most is on http://www.rightmove.co.uk/.\n",
144 | "Unfortunately, there is no API available and you need to get the data from the HTML pages. As a first step, before getting your hands\n",
145 | "on the data, you need to know the URLs of all available properties on the website. Later, you will use those links to extract data.\n",
146 | "\n",
147 | "Find all URLs of properties on rightmove.co.uk. Be as precise as possible. Do not build inefficient crawlers.\n",
148 | "\"\"\"\n",
149 | "main_sitemap_url = 'http://www.rightmove.co.uk/sitemap.xml'\n",
150 | "main_sitemap_text = requests.get(main_sitemap_url).text\n",
151 | "properties_sitemaps = re.findall(r'(http://www\\.rightmove\\.co\\.uk/sitemap_propertydetails\\d+\\.xml)', main_sitemap_text)\n",
152 | "limit_pages = 3  # only fetch the first few sub-sitemaps for this demo\n",
153 | "all_properties = []\n",
154 | "for pmap_url in properties_sitemaps[:limit_pages]:\n",
155 | "    print('getting properties from:', pmap_url)\n",
156 | "    pmap_text = requests.get(pmap_url).text\n",
157 | "    p_urls = re.findall(r'(http://www\\.rightmove\\.co\\.uk/[\\-a-z]+/property-\\d+\\.html)', pmap_text)\n",
158 | "    all_properties.extend(p_urls)\n",
159 | "print('I\\'ve got ' + str(len(all_properties)) + ' of urls with properties.\\nSome examples:')\n",
160 | "for url in all_properties[:6]:\n",
161 | "    print('\\n- ' + url)"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {
167 | "collapsed": false
168 | },
169 | "source": [
170 | "## Look for APIs - even if the service does not provide a (public) one\n",
171 | "\n",
172 | "### why APIs are better (I know... silly question)\n",
173 | "* a website's appearance can change frequently (which will break scrapers dependent on HTML tags), but an API stays the same for longer\n",
174 | "* often, API responses contain well-structured data (e.g. in JSON or XML format)\n",
175 | "\n",
176 | "### but there is no API available for website 'X' ;(\n",
177 | "* a lot of modern web services use some kind of API internally [https://www.airbnb.co.uk/s/London/homes]\n",
178 | "* to find out if a web service is using an API, watch the network traffic in your developer tools. (I like Chrome’s tools, but Firefox, Opera etc. also have nice ones)\n",
179 | "* there are some treasures hidden in requests of type xhr, fetch, json etc.\n",
180 | "* often, you need to supply additional information with your request (like API keys or tokens)\n",
181 | "* API responses can be dynamically embedded in HTML [https://www.walmart.com/]"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 16,
187 | "metadata": {
188 | "collapsed": false
189 | },
190 | "outputs": [
191 | {
192 | "name": "stdout",
193 | "output_type": "stream",
194 | "text": [
195 | "Got API KEY from main page: 41C97F66-D0FF-11DD-8143-EF6F37ABAA09\n",
196 | "Raw response from API: {'response': {'collectioncount': 16, 'attributes': {'country': 'US', 'province': '', 'postalcode': '20004', 'city': 'WASHINGTON', 'radiusuom': 'mile', 'radius': '40', 'state': 'DC', 'address': '', 'centerpoint': '-77.0255,38.8957'}, 'activeobject': '', 'collection': [{'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 462-3146', 'email': 'truevalue17@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:30 PM', 'google_shared': None, 'uid': 1051076976, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.truevalueon17th.com/', 'sun_close_time': '- 6:00 PM', 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '1.32', 'wed_close_time': '- 7:30 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:30 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:30 PM', 'latitude': '38.911917', 'activeshiptostore': '1', 'bho': '[[\"1000\",\"1800\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0900\",\"1800\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.03849', 'ja': None, 'country': 'US', 'postalcode': '20009-2433', 'clientkey': 'L4ZK7Q8W-PA4X-4IS6-587Y-FRJLJ8Z5JF84', 'city': 'Washington', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalueon17th.com', 'sat_open_time': '9:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'True Value On 17th', 'address1': '1623 17th St NW', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '9:00 AM', 'giftcard': None, 'icon': 'default', 
'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 659-8686', 'email': 'info@districthardware.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': 1051078615, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 6:00 PM', 'yelp': None, 'sun_open_time': '11:00 AM', 'tvurl': 'http://www.thebikeshopdc.com', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '9:00 AM', 'tvpaint': None, '_distance': '1.50', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '9:00 AM', 'google_email': 'social@districthardware.com', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 'thur_open_time': '9:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:00 PM', 'latitude': '38.9038748979592', 'activeshiptostore': '1', 'bho': '[[\"1100\",\"1700\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1800\"],[\"0900\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/districthardwareandbike', 'longitude': '-77.0514083673469', 'ja': None, 'country': 'US', 'postalcode': '20037-1432', 'clientkey': '2VMQMTVK-8O6O-8EUD-BA78-S3N6DFM1QESS', 'city': 'Washington', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.districthardware.com', 'sat_open_time': '9:00 AM', 'twitterurl': 'http://www.twitter.com/dchardwarebike', 'wed_open_time': '9:00 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': '1', 'name': 'District Hardware and Bike', 'address1': '1108 24th Street NW', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '5:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 636-1701', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 4:00 PM', 'google_shared': 'Yes', 'uid': 1182202005, 'address2': None, 
'_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 4:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': None, 'sun_close_time': None, 'tue_open_time': '5:00 AM', 'tvpaint': None, '_distance': '2.79', 'wed_close_time': '- 4:00 PM', 'mon_open_time': '5:00 AM', 'google_email': 'tom@websolutions.net', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 4:00 PM', 'thur_open_time': '5:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 4:00 PM', 'latitude': '38.91536', 'activeshiptostore': None, 'bho': '[[\"9999\",\"9999\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0700\",\"1100\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.98028', 'ja': None, 'country': 'US', 'postalcode': '20002-1834', 'clientkey': 'WOM3OP7L-SBW9-JV36-8DN0-O7U209J7LRQL', 'city': 'WASHINGTON', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/kamcobldgsply', 'sat_open_time': '7:00 AM', 'twitterurl': None, 'wed_open_time': '5:00 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'KAMCO BUILDING SUPPLY', 'address1': '2100 W VIRGINIA AVENUE NE', 'sat_close_time': '- 11:00 AM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '8:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 524-2503', 'email': 'billstruevalue@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': 'YES', 'uid': 1051078648, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.billstruevalue.com', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '8:30 AM', 'tvpaint': '1', '_distance': '5.34', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:30 AM', 'google_email': 
'billstruevalue@truevalue.net', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:00 PM', 'latitude': '38.89758', 'activeshiptostore': '1', 'bho': '[[\"1000\",\"1700\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/billstruevalue', 'longitude': '-77.12483', 'ja': None, 'country': 'US', 'postalcode': '22207-2528', 'clientkey': '5JGZQATK-3G0D-KXC4-C6E3-TBAS768FPE51', 'city': 'Arlington', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.billstruevalue.com', 'sat_open_time': '8:30 AM', 'twitterurl': None, 'wed_open_time': '8:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Bills True Value', 'address1': '2213 N. Buchanan Street', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '7:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 229-3700', 'email': 'CGEH@TrueValue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': 'Yes', 'uid': 1051076988, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '9:00 AM', 'tvurl': 'http://www.truevalue.com/christophershardware', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': None, '_distance': '7.98', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '7:00 AM', 'google_email': 'mikec@truevalue.net', 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 8:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=HZUK3INI-P5FB-M95A-U15P-BCWH5QWTK30V&forceview=y&Adref=xx', 
'tue_close_time': '- 8:00 PM', 'latitude': '38.96912', 'activeshiptostore': None, 'bho': '[[\"0900\",\"1700\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0830\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/home.php#!/GlenEchoHardware', 'longitude': '-77.14004', 'ja': None, 'country': 'US', 'postalcode': '20816', 'clientkey': 'HZUK3INI-P5FB-M95A-U15P-BCWH5QWTK30V', 'city': 'Bethesda', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/christophershardware', 'sat_open_time': '8:30 AM', 'twitterurl': 'http://www.twitter.com/GlenEchoHdwe', 'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'Christophers Glen Echo Hardware', 'address1': '7301 Mcarthur Blvd', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '6:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 823-8700', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': None, 'uid': 1051078649, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': None, 'sun_close_time': None, 'tue_open_time': '6:00 AM', 'tvpaint': None, '_distance': '8.98', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '6:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 5:00 PM', 'thur_open_time': '6:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 5:00 PM', 'latitude': '38.7984625564754', 'activeshiptostore': None, 'bho': '[[\"9999\",\"9999\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0730\",\"1100\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.1362345995378', 'ja': 
None, 'country': 'US', 'postalcode': '22304-4822', 'clientkey': 'ACULYL3K-A7QA-N6TD-3KO2-763059ZYGSEK', 'city': 'ALEXANDRIA', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/kamcobldg', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '6:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'KAMCO BLDG SPLY & TRUE VALUE', 'address1': '5860 FARINGTON AVE', 'sat_close_time': '- 11:00 AM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '9:30 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 765-4110', 'email': 'hhvs@vacoxmail.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': None, 'uid': 1051078650, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '11:00 AM', 'tvurl': None, 'sun_close_time': '- 6:00 PM', 'tue_open_time': '9:30 AM', 'tvpaint': None, '_distance': '10.64', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '9:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 8:00 PM', 'thur_open_time': '9:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 8:00 PM', 'latitude': '38.7436373198134', 'activeshiptostore': None, 'bho': '[[\"1100\",\"1800\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.0570290532158', 'ja': None, 'country': 'US', 'postalcode': '22308-1203', 'clientkey': 'GKNHJ7J6-3FNY-HHZS-DWCS-HYK8QW3I0JI7', 'city': 'Alexandria', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/hollinhallvarietystore', 'sat_open_time': '9:30 AM', 'twitterurl': None, 'wed_open_time': '9:30 AM', 'google': None, 'main_id': 
'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Hollin Hall Variety Store', 'address1': '7902 Fort Hunt Rd', 'sat_close_time': '- 8:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 292-1900', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': None, 'uid': 1051076986, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.truevalue.com/ford', 'sun_close_time': None, 'tue_open_time': '7:30 AM', 'tvpaint': '1', '_distance': '11.58', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 5:00 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=VB7GDJK3-IDS0-XKHT-PNGL-VK4UG394HK6E&forceview=y&Adref=xx', 'tue_close_time': '- 5:00 PM', 'latitude': '38.730159346246', 'activeshiptostore': None, 'bho': '[[\"9999\",\"9999\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1600\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.9922404673855', 'ja': None, 'country': 'US', 'postalcode': '20744-5148', 'clientkey': 'VB7GDJK3-IDS0-XKHT-PNGL-VK4UG394HK6E', 'city': 'FORT WASHINGTON', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/ford', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'FORD LUMBER COMPANY', 'address1': '11616 LIVINGSTON RD', 'sat_close_time': '- 4:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': '1', 
'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(301) 570-1300', 'email': 'Christophers@ChristophersHW.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:30 PM', 'google_shared': 'Yes', 'uid': 1051076989, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': '1', 'sun_open_time': '9:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '17.50', 'wed_close_time': '- 7:30 PM', 'mon_open_time': '7:30 AM', 'google_email': 'christophers@christophershw.com', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:30 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:30 PM', 'latitude': '39.14891', 'activeshiptostore': None, 'bho': '[[\"0900\",\"1700\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0800\",\"1800\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.02204', 'ja': None, 'country': 'US', 'postalcode': '20860', 'clientkey': 'YOL39PAC-QNYN-CVFS-QHDO-1I4DOMOT4PCB', 'city': 'Sandy Spring', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.christophershardwarestore.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Christophers Hardware', 'address1': '500 Olney Sandy Spring Rd', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '8:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(703) 361-3141', 'email': 'jerice@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': 1051076977, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 
'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.jericeco.com/', 'sun_close_time': None, 'tue_open_time': '8:00 AM', 'tvpaint': None, '_distance': '25.53', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=7R0OKS9L-1M3V-03T6-T10M-F121Y326BYIU&forceview=y&Adref=xx', 'tue_close_time': '- 7:00 PM', 'latitude': '38.7579993877551', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/home.php?#!/pages/Manassas-VA/JE-Rice-Co/209185793694?ref=ts&__a=9&ajaxpipe=1', 'longitude': '-77.4656865306122', 'ja': None, 'country': 'US', 'postalcode': '20110', 'clientkey': '7R0OKS9L-1M3V-03T6-T10M-F121Y326BYIU', 'city': 'Manassas', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.jericeco.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'J E Rice Co.', 'address1': '9124 Mathis Ave', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(443) 607-4162', 'email': 'Chesapeakehardware@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': -2013484794, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '8:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '26.83', 
'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=Q1ZGD4YP-NXGT-H3QM-2RUD-OFD11NRUZYNC&forceview=y&Adref=xx', 'tue_close_time': '- 7:00 PM', 'latitude': '38.8150991440624', 'activeshiptostore': '1', 'bho': '[[\"0800\",\"1700\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.5376268933135', 'ja': None, 'country': 'US', 'postalcode': '20733-9639', 'clientkey': 'Q1ZGD4YP-NXGT-H3QM-2RUD-OFD11NRUZYNC', 'city': 'CHURCHTON', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/chesapeakehardware', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'CHESAPEAKE HARDWARE', 'address1': '5570-C SHADY SIDE ROAD', 'sat_close_time': '- 7:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(410) 647-4611', 'email': 'tony@clementhardware.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': -1478017856, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '9:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '28.61', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 
'tue_close_time': '- 7:00 PM', 'latitude': '39.08077', 'activeshiptostore': '1', 'bho': '[[\"0900\",\"1700\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0700\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/pages/Clement-Hardware/218395654855700', 'longitude': '-76.54892', 'ja': None, 'country': 'US', 'postalcode': '21146-2954', 'clientkey': 'HCFP6UZL-75MR-TKAO-RP2Z-NMHJAEH1PCNM', 'city': 'SEVERNA PARK', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://clementhardware.com/', 'sat_open_time': '7:00 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'CLEMENT HARDWARE', 'address1': '500 RITCHIE HIGHWAY', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 253-2131', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 6:00 PM', 'google_shared': None, 'uid': 1051078622, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 6:00 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.hyattbuildingsupply.com/', 'sun_close_time': '- 2:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': '1', '_distance': '28.77', 'wed_close_time': '- 6:00 PM', 'mon_open_time': '7:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 6:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=N7IVBTZQ-HZ6O-KGV2-ZUEX-QAHBZXTIRS3E&forceview=y&Adref=xx', 'tue_close_time': '- 6:00 PM', 'latitude': '39.2873782648885', 'activeshiptostore': None, 'bho': 
'[[\"1000\",\"1400\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0800\",\"1600\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.2075308119523', 'ja': None, 'country': 'US', 'postalcode': '20872-1830', 'clientkey': 'N7IVBTZQ-HZ6O-KGV2-ZUEX-QAHBZXTIRS3E', 'city': 'DAMASCUS', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.hyattbuildingsupply.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'HYATT TRUE VALUE', 'address1': '26200 RIDGE RD', 'sat_close_time': '- 4:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(410) 268-3939', 'email': 'jared@kbtruevalue.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': None, 'uid': 1051078638, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '8:00 AM', 'tvurl': 'http://www.kbtruevalue.com/', 'sun_close_time': '- 6:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': '1', '_distance': '28.88', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '7:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 8:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=CH10ZM2D-T28I-979R-X1UQ-NYSIN4L4XT4P&forceview=y&Adref=xx', 'tue_close_time': '- 8:00 PM', 'latitude': '38.9502181524996', 'activeshiptostore': '1', 'bho': '[[\"0800\",\"1800\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"1900\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/kbtruevalue', 'longitude': 
'-76.4927551456658', 'ja': None, 'country': 'US', 'postalcode': '21403-1756', 'clientkey': 'CH10ZM2D-T28I-979R-X1UQ-NYSIN4L4XT4P', 'city': 'Annapolis', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.kbtruevalue.com', 'sat_open_time': '7:00 AM', 'twitterurl': 'http://twitter.com/kbtruevalue', 'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'K & B True Value', 'address1': '912 Forest Dr', 'sat_close_time': '- 7:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(410) 535-0442', 'email': 'lusbymot@verizon.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:30 PM', 'google_shared': None, 'uid': 1051076983, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.lusbyhardware.com/', 'sun_close_time': None, 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '34.26', 'wed_close_time': '- 5:30 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 5:30 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 5:30 PM', 'latitude': '38.539349691187', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1930\"],[\"0730\",\"1730\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.5835104820298', 'ja': None, 'country': 'US', 'postalcode': '20678', 'clientkey': '9MZ7X579-CK0M-3IMU-TT63-OPZF55IIPXFZ', 'city': 'Prince Frederick', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.Lusbyhardware.com', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 
'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'Lusby Motor Company Inc.', 'address1': '155 Main St', 'sat_close_time': '- 5:30 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 259-2540', 'email': 'hungerford@olg.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': 'YES', 'uid': 1051076982, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.aghungerford.com/', 'sun_close_time': None, 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '35.93', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '8:00 AM', 'google_email': 'hungerford@olg.com', 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 5:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=N5VPHAFC-MVJV-T31M-RRXY-8PKZ81EMOPHM&forceview=y&Adref=xx', 'tue_close_time': '- 5:00 PM', 'latitude': '38.3786245621919', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.95422029337', 'ja': None, 'country': 'US', 'postalcode': '20664', 'clientkey': 'N5VPHAFC-MVJV-T31M-RRXY-8PKZ81EMOPHM', 'city': 'Newburg', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.aghungerford.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'A G Hungerford & Son Inc.', 'address1': '12165 Rock Point Rd', 'sat_close_time': '- 5:00 PM', 'jaurl': None, 'state': 'MD'}], 
'collectionname': 'poi'}, 'code': 1}\n"
197 | ]
198 | }
199 | ],
200 | "source": [
201 | "\"\"\"\n",
202 | "Case Study 2: getting API_KEY from html and then data from API\n",
203 | "\n",
204 | "The True Value Company is an American retailer-owned hardware cooperative with over 4,000 independent retail \n",
205 | "locations worldwide. Create a scraper that gets all available True Value shops for a given post code. The scraper\n",
206 | "should not have any API key hardcoded, as the key can change during the site's lifetime.\n",
207 | "\n",
208 | "Minimum data you should get:\n",
209 | "- address\n",
210 | "- city\n",
211 | "- country\n",
212 | "- latitude\n",
213 | "- longitude\n",
214 | "- name\n",
215 | "- postalcode\n",
216 | "- state\n",
217 | "\"\"\"\n",
218 | "main_page_url = 'http://hosted.where2getit.com/truevalue/index2015.html'\n",
219 | "main_page_text = requests.get(main_page_url).text\n",
220 | "api_key = re.findall(r\"appkey: '([0-9A-Z\\-]+)', \", main_page_text)[0]\n",
221 | "print('Got API KEY from main page: ', api_key)\n",
222 | "api_endpoint = 'http://hosted.where2getit.com/truevalue/rest/locatorsearch'\n",
223 | "POST_CODE = 20004\n",
224 | "body = {\n",
225 | "    \"request\": {\n",
226 | "        \"appkey\": api_key,\n",
227 | "        \"formdata\": {\n",
228 | "            \"geoip\": False,\n",
229 | "            \"dataview\": \"store_default\",\n",
230 | "            \"limit\": 40,\n",
231 | "            \"geolocs\": {\n",
232 | "                \"geoloc\": [\n",
233 | "                    {\n",
234 | "                        \"addressline\": str(POST_CODE)\n",
235 | "                    }\n",
236 | "                ]\n",
237 | "            },\n",
238 | "            \"searchradius\": \"40|50|80\",\n",
239 | "            \"where\": {\n",
240 | "                \"and\": {\n",
241 | "                    \"giftcard\": {\n",
242 | "                        \"eq\": \"\"\n",
243 | "                    },\n",
244 | "                    \"tvpaint\": {\n",
245 | "                        \"eq\": \"\"\n",
246 | "                    },\n",
247 | "                    \"creditcard\": {\n",
248 | "                        \"eq\": \"\"\n",
249 | "                    },\n",
250 | "                    \"localad\": {\n",
251 | "                        \"eq\": \"\"\n",
252 | "                    },\n",
253 | "                    \"ja\": {\n",
254 | "                        \"eq\": \"\"\n",
255 | "                    },\n",
256 | "                    \"tvr\": {\n",
257 | "                        \"eq\": \"\"\n",
258 | "                    },\n",
259 | "                    \"activeshiptostore\": {\n",
260 | "                        \"eq\": \"\"\n",
261 | "                    },\n",
262 | "                    \"main_id\": {\n",
263 | "                        \"eq\": \"\"\n",
264 | "                    },\n",
265 | "                    \"corronado\": {\n",
266 | "                        \"eq\": \"\"\n",
267 | "                    },\n",
268 | "                    \"tv\": {\n",
269 | "                        \"eq\": \"1\"\n",
270 | "                    }\n",
271 | "                }\n",
272 | "            },\n",
273 | "            \"false\": \"0\"\n",
274 | "        }\n",
275 | "    }\n",
276 | "}\n",
277 | "r = requests.post(api_endpoint, data=json.dumps(body))\n",
278 | "data = json.loads(r.text)\n",
279 | "print('Raw response from API: ', data)\n",
280 | "shops = [{'name': entry['name'],\n",
281 | "          'address': entry['address1'],\n",
282 | "          'postalcode': entry['postalcode'],\n",
283 | "          'city': entry['city'],\n",
284 | "          'state': entry['state'],\n",
285 | "          'country': entry['country'],\n",
286 | "          'latitude': entry['latitude'],\n",
287 | "          'longitude': entry['longitude']\n",
288 | "          } for entry in data['response']['collection']]\n"
289 | ]
290 | },
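The key-extraction step in the cell above can be made a little more defensive. The following is a minimal sketch, not the notebook's own code: `extract_app_key` and `build_locator_body` are hypothetical helper names, the sample page fragment and key are made up, and whether the locator API accepts a body without the `where` filters is an assumption that has not been checked against the live service.

```python
import json
import re

# Same pattern idea as in the cell above: the key is a run of digits,
# upper-case letters and hyphens following "appkey: '".
APPKEY_PATTERN = re.compile(r"appkey: '([0-9A-Z\-]+)'")


def extract_app_key(page_text):
    """Return the first app key found in the page, or raise ValueError."""
    match = APPKEY_PATTERN.search(page_text)
    if match is None:
        # Failing loudly is easier to debug than an IndexError from findall()[0].
        raise ValueError("appkey not found; the page layout may have changed")
    return match.group(1)


def build_locator_body(app_key, post_code, limit=40):
    """Build a trimmed-down JSON body (assumption: the filters are optional)."""
    return json.dumps({
        "request": {
            "appkey": app_key,
            "formdata": {
                "dataview": "store_default",
                "limit": limit,
                "geolocs": {"geoloc": [{"addressline": str(post_code)}]},
                "searchradius": "40|50|80",
            },
        }
    })


# Made-up page fragment and key, used only to exercise the helpers offline.
sample_page = "var cfg = { appkey: 'ABCD1234-EF56-GH78-IJ90-KLMNOPQRSTUV', lang: 'en' };"
key = extract_app_key(sample_page)
body = build_locator_body(key, 20004)
print(key)
```

Keeping the extraction in its own function means that a change in the page layout shows up as one clear error rather than a failure somewhere in the request code.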
291 | {
292 | "cell_type": "code",
293 | "execution_count": 17,
294 | "metadata": {
295 | "collapsed": false
296 | },
297 | "outputs": [
298 | {
299 | "data": {
300 | "text/html": [
301 | "<table border=\"1\" class=\"dataframe\">\n",
302 | "  <thead>\n",
303 | "    <tr style=\"text-align: right;\">\n",
304 | "      <th></th>\n",
305 | "      <th>address</th>\n",
306 | "      <th>city</th>\n",
307 | "      <th>country</th>\n",
308 | "      <th>latitude</th>\n",
309 | "      <th>longitude</th>\n",
310 | "      <th>name</th>\n",
311 | "      <th>postalcode</th>\n",
312 | "      <th>state</th>\n",
313 | "    </tr>\n",
314 | "  </thead>\n",
315 | "  <tbody>\n",
316 | "    <tr>\n",
317 | "      <th>0</th>\n",
318 | "      <td>1623 17th St NW</td>\n",
319 | "      <td>Washington</td>\n",
320 | "      <td>US</td>\n",
321 | "      <td>38.911917</td>\n",
322 | "      <td>-77.03849</td>\n",
323 | "      <td>True Value On 17th</td>\n",
324 | "      <td>20009-2433</td>\n",
325 | "      <td>DC</td>\n",
326 | "    </tr>\n",
327 | "    <tr>\n",
328 | "      <th>1</th>\n",
329 | "      <td>1108 24th Street NW</td>\n",
330 | "      <td>Washington</td>\n",
331 | "      <td>US</td>\n",
332 | "      <td>38.9038748979592</td>\n",
333 | "      <td>-77.0514083673469</td>\n",
334 | "      <td>District Hardware and Bike</td>\n",
335 | "      <td>20037-1432</td>\n",
336 | "      <td>DC</td>\n",
337 | "    </tr>\n",
338 | "    <tr>\n",
339 | "      <th>2</th>\n",
340 | "      <td>2100 W VIRGINIA AVENUE NE</td>\n",
341 | "      <td>WASHINGTON</td>\n",
342 | "      <td>US</td>\n",
343 | "      <td>38.91536</td>\n",
344 | "      <td>-76.98028</td>\n",
345 | "      <td>KAMCO BUILDING SUPPLY</td>\n",
346 | "      <td>20002-1834</td>\n",
347 | "      <td>DC</td>\n",
348 | "    </tr>\n",
349 | "    <tr>\n",
350 | "      <th>3</th>\n",
351 | "      <td>2213 N. Buchanan Street</td>\n",
352 | "      <td>Arlington</td>\n",
353 | "      <td>US</td>\n",
354 | "      <td>38.89758</td>\n",
355 | "      <td>-77.12483</td>\n",
356 | "      <td>Bills True Value</td>\n",
357 | "      <td>22207-2528</td>\n",
358 | "      <td>VA</td>\n",
359 | "    </tr>\n",
360 | "    <tr>\n",
361 | "      <th>4</th>\n",
362 | "      <td>7301 Mcarthur Blvd</td>\n",
363 | "      <td>Bethesda</td>\n",
364 | "      <td>US</td>\n",
365 | "      <td>38.96912</td>\n",
366 | "      <td>-77.14004</td>\n",
367 | "      <td>Christophers Glen Echo Hardware</td>\n",
368 | "      <td>20816</td>\n",
369 | "      <td>MD</td>\n",
370 | "    </tr>\n",
371 | "    <tr>\n",
372 | "      <th>5</th>\n",
373 | "      <td>5860 FARINGTON AVE</td>\n",
374 | "      <td>ALEXANDRIA</td>\n",
375 | "      <td>US</td>\n",
376 | "      <td>38.7984625564754</td>\n",
377 | "      <td>-77.1362345995378</td>\n",
378 | "      <td>KAMCO BLDG SPLY &amp; TRUE VALUE</td>\n",
379 | "      <td>22304-4822</td>\n",
380 | "      <td>VA</td>\n",
381 | "    </tr>\n",
382 | "    <tr>\n",
383 | "      <th>6</th>\n",
384 | "      <td>7902 Fort Hunt Rd</td>\n",
385 | "      <td>Alexandria</td>\n",
386 | "      <td>US</td>\n",
387 | "      <td>38.7436373198134</td>\n",
388 | "      <td>-77.0570290532158</td>\n",
389 | "      <td>Hollin Hall Variety Store</td>\n",
390 | "      <td>22308-1203</td>\n",
391 | "      <td>VA</td>\n",
392 | "    </tr>\n",
393 | "    <tr>\n",
394 | "      <th>7</th>\n",
395 | "      <td>11616 LIVINGSTON RD</td>\n",
396 | "      <td>FORT WASHINGTON</td>\n",
397 | "      <td>US</td>\n",
398 | "      <td>38.730159346246</td>\n",
399 | "      <td>-76.9922404673855</td>\n",
400 | "      <td>FORD LUMBER COMPANY</td>\n",
401 | "      <td>20744-5148</td>\n",
402 | "      <td>MD</td>\n",
403 | "    </tr>\n",
404 | "    <tr>\n",
405 | "      <th>8</th>\n",
406 | "      <td>500 Olney Sandy Spring Rd</td>\n",
407 | "      <td>Sandy Spring</td>\n",
408 | "      <td>US</td>\n",
409 | "      <td>39.14891</td>\n",
410 | "      <td>-77.02204</td>\n",
411 | "      <td>Christophers Hardware</td>\n",
412 | "      <td>20860</td>\n",
413 | "      <td>MD</td>\n",
414 | "    </tr>\n",
415 | "    <tr>\n",
416 | "      <th>9</th>\n",
417 | "      <td>9124 Mathis Ave</td>\n",
418 | "      <td>Manassas</td>\n",
419 | "      <td>US</td>\n",
420 | "      <td>38.7579993877551</td>\n",
421 | "      <td>-77.4656865306122</td>\n",
422 | "      <td>J E Rice Co.</td>\n",
423 | "      <td>20110</td>\n",
424 | "      <td>VA</td>\n",
425 | "    </tr>\n",
426 | "    <tr>\n",
427 | "      <th>10</th>\n",
428 | "      <td>5570-C SHADY SIDE ROAD</td>\n",
429 | "      <td>CHURCHTON</td>\n",
430 | "      <td>US</td>\n",
431 | "      <td>38.8150991440624</td>\n",
432 | "      <td>-76.5376268933135</td>\n",
433 | "      <td>CHESAPEAKE HARDWARE</td>\n",
434 | "      <td>20733-9639</td>\n",
435 | "      <td>MD</td>\n",
436 | "    </tr>\n",
437 | "    <tr>\n",
438 | "      <th>11</th>\n",
439 | "      <td>500 RITCHIE HIGHWAY</td>\n",
440 | "      <td>SEVERNA PARK</td>\n",
441 | "      <td>US</td>\n",
442 | "      <td>39.08077</td>\n",
443 | "      <td>-76.54892</td>\n",
444 | "      <td>CLEMENT HARDWARE</td>\n",
445 | "      <td>21146-2954</td>\n",
446 | "      <td>MD</td>\n",
447 | "    </tr>\n",
448 | "    <tr>\n",
449 | "      <th>12</th>\n",
450 | "      <td>26200 RIDGE RD</td>\n",
451 | "      <td>DAMASCUS</td>\n",
452 | "      <td>US</td>\n",
453 | "      <td>39.2873782648885</td>\n",
454 | "      <td>-77.2075308119523</td>\n",
455 | "      <td>HYATT TRUE VALUE</td>\n",
456 | "      <td>20872-1830</td>\n",
457 | "      <td>MD</td>\n",
458 | "    </tr>\n",
459 | "    <tr>\n",
460 | "      <th>13</th>\n",
461 | "      <td>912 Forest Dr</td>\n",
462 | "      <td>Annapolis</td>\n",
463 | "      <td>US</td>\n",
464 | "      <td>38.9502181524996</td>\n",
465 | "      <td>-76.4927551456658</td>\n",
466 | "      <td>K &amp; B True Value</td>\n",
467 | "      <td>21403-1756</td>\n",
468 | "      <td>MD</td>\n",
469 | "    </tr>\n",
470 | "    <tr>\n",
471 | "      <th>14</th>\n",
472 | "      <td>155 Main St</td>\n",
473 | "      <td>Prince Frederick</td>\n",
474 | "      <td>US</td>\n",
475 | "      <td>38.539349691187</td>\n",
476 | "      <td>-76.5835104820298</td>\n",
477 | "      <td>Lusby Motor Company Inc.</td>\n",
478 | "      <td>20678</td>\n",
479 | "      <td>MD</td>\n",
480 | "    </tr>\n",
481 | "    <tr>\n",
482 | "      <th>15</th>\n",
483 | "      <td>12165 Rock Point Rd</td>\n",
484 | "      <td>Newburg</td>\n",
485 | "      <td>US</td>\n",
486 | "      <td>38.3786245621919</td>\n",
487 | "      <td>-76.95422029337</td>\n",
488 | "      <td>A G Hungerford &amp; Son Inc.</td>\n",
489 | "      <td>20664</td>\n",
490 | "      <td>MD</td>\n",
491 | "    </tr>\n",
492 | "  </tbody>\n",
493 | "</table>"
494 | ],
495 | "text/plain": [
496 | "<IPython.core.display.HTML object>"
497 | ]
498 | },
499 | "execution_count": 17,
500 | "metadata": {},
501 | "output_type": "execute_result"
502 | }
503 | ],
504 | "source": [
505 | "HTML(pd.DataFrame(shops).to_html())"
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": 18,
511 | "metadata": {
512 | "collapsed": false
513 | },
514 | "outputs": [
515 | {
516 | "name": "stdout",
517 | "output_type": "stream",
518 | "text": [
519 | "Got listing from: https://www.airbnb.co.uk/api/v2/explore_tabs?metadata_only=false&items_per_grid=20&version=1.2.8&luxury_pre_launch=false&_intents=p1&screen_size=small&locale=en-GB&timezone_offset=60&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&is_new_cards_experiment=false&is_standard_search=true&_format=for_explore_search_web&fetch_filters=true&currency=GBP&supports_for_you_v3=true&location=London&is_guided_search=true&s_tag=DOIPutuT&selected_tab_id=home_tab&section_offset=0&auto_ib=false&allow_override%5B%5D=&refinements%5B%5D=homes\n"
520 | ]
521 | },
522 | {
523 | "data": {
524 | "text/plain": [
525 | "'\\nnote:\\nsome airbnb usefull endpoints\\n# get listings\\nhttps://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB\\n# get booking detials\\nhttps://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB\\n'"
526 | ]
527 | },
528 | "execution_count": 18,
529 | "metadata": {},
530 | "output_type": "execute_result"
531 | }
532 | ],
533 | "source": [
534 | "\"\"\"\n",
535 | "Case Study 3: get available airbnb properties in London\n",
536 | "\n",
537 | "You want to visit London, and Airbnb looks like a nice option. As a crazy data geek, you want to run\n",
538 | "some fancy algorithms to choose a better apartment to rent, so you need data! Get all available Airbnb \n",
539 | "properties in London. You are interested in pricing, location, rating, number of reviews, images and more. \n",
540 | "You also don't like to repeat yourself, so you need to build a scraper that will survive until your next trip.\n",
541 | "\"\"\"\n",
542 | "\n",
543 | "headers = {'accept-encoding': 'gzip, deflate, br',\n",
544 | "           'x-requested-with': 'XMLHttpRequest',\n",
545 | "           'accept-language': 'en-US,en;q=0.8,pl;q=0.6',\n",
546 | "           'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',\n",
547 | "           'accept': 'application/json, text/javascript, */*; q=0.01',\n",
548 | "           'referer': 'https://www.airbnb.co.uk/s/London/homes',\n",
549 | "           'authority': 'www.airbnb.co.uk'}\n",
550 | "# get the API key embedded in the HTML (yes... same story again :))\n",
551 | "html = requests.get('https://www.airbnb.co.uk/s/London/homes', headers=headers).text\n",
552 | "api_key = re.findall(r'key\\":\\"([a-zA-Z0-9]*)\\"},\\"deep_link', html)[0]\n",
553 | "# get first listing\n",
554 | "endpoint = 'https://www.airbnb.co.uk/api/v2/explore_tabs'\n",
555 | "params = {'version':'1.2.8',\n",
556 | "          '_format':'for_explore_search_web',\n",
557 | "          'items_per_grid':'20',\n",
558 | "          'fetch_filters':'true',\n",
559 | "          'is_guided_search':'true',\n",
560 | "          'is_new_cards_experiment':'false',\n",
561 | "          'supports_for_you_v3':'true',\n",
562 | "          'screen_size':'small',\n",
563 | "          'timezone_offset':'60',\n",
564 | "          'auto_ib':'false',\n",
565 | "          'luxury_pre_launch':'false',\n",
566 | "          'metadata_only':'false',\n",
567 | "          'is_standard_search':'true',\n",
568 | "          'refinements[]':'homes',\n",
569 | "          'selected_tab_id':'home_tab',\n",
570 | "          'location':'London',\n",
571 | "          'allow_override[]':'',\n",
572 | "          's_tag':'DOIPutuT',\n",
573 | "          'section_offset':'0',\n",
574 | "          '_intents':'p1',\n",
575 | "          'key':api_key,\n",
576 | "          'currency':'GBP',\n",
577 | "          'locale':'en-GB'}\n",
578 | "r = requests.get(endpoint, params=params)\n",
579 | "print('Got listing from: ', r.url)\n",
580 | "ds = json.loads(r.text)\n",
581 | "\n",
582 | "\"\"\"\n",
583 | "note:\n",
584 | "some useful airbnb endpoints\n",
585 | "# get listings\n",
586 | "https://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB\n",
587 | "# get booking details\n",
588 | "https://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB\n",
589 | "\"\"\""
590 | ]
591 | },
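The response fetched above carries a `has_next_page` flag under `explore_tabs[0]['pagination_metadata']`, which is what makes pagination possible. The following is a minimal offline sketch of a bounded pagination loop, not the notebook's own loop: `collect_pages`, `has_next_page` and `fake_fetch` are hypothetical names, `fetch_page` stands in for a real request, and how a page index maps onto the `section_offset` parameter is an assumption left to the caller.

```python
def has_next_page(response):
    """Read the pagination flag defensively; missing keys count as 'no more pages'."""
    tabs = response.get('explore_tabs') or [{}]
    return bool(tabs[0].get('pagination_metadata', {}).get('has_next_page'))


def collect_pages(fetch_page, first_page, page_limit=3):
    """Collect raw responses until the API reports no next page or the limit is hit.

    fetch_page is any callable taking a zero-based page index and returning a
    parsed response dict (hypothetical; in practice it would wrap requests.get).
    """
    pages = [first_page]
    current = first_page
    page_no = 1
    while has_next_page(current) and page_no < page_limit:
        current = fetch_page(page_no)
        pages.append(current)
        page_no += 1
    return pages


def fake_fetch(page_index):
    """Stand-in for the real API: pretends that exactly three pages exist."""
    return {'explore_tabs': [{'pagination_metadata': {'has_next_page': page_index < 2}}]}


pages = collect_pages(fake_fetch, fake_fetch(0), page_limit=5)
print(len(pages))
```

Capping the loop with `page_limit` keeps a test run short and avoids hammering the site while the scraper is still being developed.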
592 | {
593 | "cell_type": "code",
594 | "execution_count": 19,
595 | "metadata": {
596 | "collapsed": false
597 | },
598 | "outputs": [
599 | {
600 | "name": "stdout",
601 | "output_type": "stream",
602 | "text": [
603 | "will get page: 2\n",
604 | "No. of properties: 20\n",
605 | "will get page: 3\n",
606 | "No. of properties: 40\n",
607 | "will get page: 4\n",
608 | "No. of properties: 60\n",
609 | "will get page: 5\n",
610 | "No. of properties: 80\n",
611 | "will get page: 6\n",
612 | "No. of properties: 100\n"
613 | ]
614 | }
615 | ],
616 | "source": [
617 | "# paginate and get more properties\n",
618 | "props = []\n",
619 | "page_no = 1\n",
620 | "page_limit = 3\n",
621 | "while (ds['explore_tabs'][0]['pagination_metadata']['has_next_page'] == True) and (page_no\n",
654 | " \n",
655 | " \n",
656 | " | \n",
657 | " currency | \n",
658 | " latitude | \n",
659 | " longitude | \n",
660 | " name | \n",
661 | " person_capacity | \n",
662 | " pic | \n",
663 | " price | \n",
664 | " price_type | \n",
665 | " rating | \n",
666 | " room_type | \n",
667 | "
\n",
668 | " \n",
669 | " \n",
670 | " \n",
671 | " | 0 | \n",
672 | " GBP | \n",
673 | " 51.516746 | \n",
674 | " -0.050351 | \n",
675 | " (HAR-A)PRIVATE ROOM FOR 5PPL CLOSE TO TOWER BR... | \n",
676 | " 5 | \n",
677 | " https://a0.muscache.com/im/pictures/786aa625-2... | \n",
678 | " 25 | \n",
679 | " nightly | \n",
680 | " 5.0 | \n",
681 | " Private room | \n",
682 | "
\n",
683 | " \n",
684 | " | 1 | \n",
685 | " GBP | \n",
686 | " 51.524362 | \n",
687 | " -0.116995 | \n",
688 | " Double Room nr Soho | Russell Square |Kings Cross | \n",
689 | " 4 | \n",
690 | " https://a0.muscache.com/im/pictures/affe8de1-a... | \n",
691 | " 62 | \n",
692 | " nightly | \n",
693 | " 4.5 | \n",
694 | " Private room | \n",
695 | "
\n",
696 | " \n",
697 | " | 2 | \n",
698 | " GBP | \n",
699 | " 51.486756 | \n",
700 | " -0.104479 | \n",
701 | " 1 double room in Central London | \n",
702 | " 2 | \n",
703 | " https://a0.muscache.com/im/pictures/6bfb24c2-c... | \n",
704 | " 18 | \n",
705 | " nightly | \n",
706 | " 4.5 | \n",
707 | " Private room | \n",
708 | "
\n",
709 | " \n",
710 | " | 3 | \n",
711 | " GBP | \n",
712 | " 51.511007 | \n",
713 | " -0.226281 | \n",
714 | " THE QUEENS HOSTEL, 6 BED MIXED DORM E | \n",
715 | " 6 | \n",
716 | " https://a0.muscache.com/im/pictures/328d9beb-6... | \n",
717 | " 21 | \n",
718 | " nightly | \n",
719 | " 4.5 | \n",
720 | " Private room | \n",
721 | "
\n",
722 | " \n",
723 | " | 4 | \n",
724 | " GBP | \n",
725 | " 51.564283 | \n",
726 | " -0.120809 | \n",
727 | " Spacious Double room in Holloway, London | \n",
728 | " 2 | \n",
729 | " https://a0.muscache.com/im/pictures/ae707340-f... | \n",
730 | " 19 | \n",
731 | " nightly | \n",
732 | " 4.5 | \n",
733 | " Private room | \n",
734 | "
\n",
735 | " \n",
736 | " | 5 | \n",
737 | " GBP | \n",
738 | " 51.548527 | \n",
739 | " -0.226324 | \n",
740 | " Modern room 10 min from Central London | \n",
741 | " 2 | \n",
742 | " https://a0.muscache.com/im/pictures/a0e3a2af-8... | \n",
743 | " 40 | \n",
744 | " nightly | \n",
745 | " 5.0 | \n",
746 | " Private room | \n",
747 | "
\n",
748 | " \n",
749 | " | 6 | \n",
750 | " GBP | \n",
751 | " 51.491728 | \n",
752 | " -0.014746 | \n",
753 | " Double Room in Canary Wharf hs | \n",
754 | " 2 | \n",
755 | " https://a0.muscache.com/im/pictures/ebd8e53d-0... | \n",
756 | " 25 | \n",
757 | " nightly | \n",
758 | " 4.5 | \n",
759 | " Private room | \n",
760 | "
\n",
761 | " \n",
762 | " | 7 | \n",
763 | " GBP | \n",
764 | " 51.452218 | \n",
765 | " -0.025420 | \n",
766 | " Comfortable, Clean London room | \n",
767 | " 2 | \n",
768 | " https://a0.muscache.com/im/pictures/ae5469b5-9... | \n",
769 | " 25 | \n",
770 | " nightly | \n",
771 | " 5.0 | \n",
772 | " Private room | \n",
773 | "
\n",
774 | " \n",
775 | " | 8 | \n",
776 | " GBP | \n",
777 | " 51.501680 | \n",
778 | " -0.052571 | \n",
779 | " Double room, 2min from the station | \n",
780 | " 2 | \n",
781 | " https://a0.muscache.com/im/pictures/4cfc5263-5... | \n",
782 | " 57 | \n",
783 | " nightly | \n",
784 | " 4.5 | \n",
785 | " Private room | \n",
786 | "
\n",
787 | " \n",
788 | " | 9 | \n",
789 | " GBP | \n",
790 | " 51.624089 | \n",
791 | " -0.054654 | \n",
792 | " En-Suite Bedroom with Bathroom | \n",
793 | " 3 | \n",
794 | " https://a0.muscache.com/im/pictures/97921358/8... | \n",
795 | " 19 | \n",
796 | " nightly | \n",
797 | " 5.0 | \n",
798 | " Private room | \n",
799 | "
\n",
800 | " \n",
801 | " | 10 | \n",
802 | " GBP | \n",
803 | " 51.483282 | \n",
804 | " -0.132566 | \n",
805 | " GORGEOUS RIVERSIDE LUXURY, WITH SPA AND POOL | \n",
806 | " 7 | \n",
807 | " https://a0.muscache.com/im/pictures/088b6118-2... | \n",
808 | " 154 | \n",
809 | " nightly | \n",
810 | " 4.5 | \n",
811 | " Entire home/flat | \n",
812 | "
\n",
813 | " \n",
814 | " | 11 | \n",
815 | " GBP | \n",
816 | " 51.554006 | \n",
817 | " -0.242773 | \n",
818 | " Nice & Equipped Double Bedroom in Dollis Hill!... | \n",
819 | " 2 | \n",
820 | " https://a0.muscache.com/im/pictures/88df04b4-b... | \n",
821 | " 31 | \n",
822 | " nightly | \n",
823 | " 4.5 | \n",
824 | " Private room | \n",
825 | "
\n",
826 | " \n",
827 | " | 12 | \n",
828 | " GBP | \n",
829 | " 51.497491 | \n",
830 | " -0.060441 | \n",
831 | " 3. Lovely Room in Centre London+Wifi | \n",
832 | " 1 | \n",
833 | " https://a0.muscache.com/im/pictures/d349ce62-e... | \n",
834 | " 36 | \n",
835 | " nightly | \n",
836 | " 4.5 | \n",
837 | " Private room | \n",
838 | "
\n",
839 | " \n",
840 | " | 13 | \n",
841 | " GBP | \n",
842 | " 51.512330 | \n",
843 | " -0.066251 | \n",
844 | " Small Room 10 -near Tower of London and Shored... | \n",
845 | " 2 | \n",
846 | " https://a0.muscache.com/im/pictures/4f4e9ad7-d... | \n",
847 | " 40 | \n",
848 | " nightly | \n",
849 | " 4.5 | \n",
850 | " Private room | \n",
851 | "
\n",
852 | " \n",
853 | " | 14 | \n",
854 | " GBP | \n",
855 | " 51.476615 | \n",
856 | " -0.132970 | \n",
857 | " SMALL Stockwell station single room-£17 | \n",
858 | " 1 | \n",
859 | " https://a0.muscache.com/im/pictures/35f34bb4-1... | \n",
860 | " 18 | \n",
861 | " nightly | \n",
862 | " 4.5 | \n",
863 | " Private room | \n",
864 | "
\n",
865 | " \n",
866 | " | 15 | \n",
867 | " GBP | \n",
868 | " 51.489384 | \n",
869 | " -0.099203 | \n",
870 | " Double Bedroom Apartment with Balcony in Zone 1 | \n",
871 | " 2 | \n",
872 | " https://a0.muscache.com/im/pictures/54855d11-1... | \n",
873 | " 21 | \n",
874 | " nightly | \n",
875 | " 4.5 | \n",
876 | " Private room | \n",
877 | "
\n",
878 | " \n",
879 | " | 16 | \n",
880 | " GBP | \n",
881 | " 51.525883 | \n",
882 | " -0.100857 | \n",
883 | " Bright and welcoming flat in central London | \n",
884 | " 4 | \n",
885 | " https://a0.muscache.com/im/pictures/86509cc9-0... | \n",
886 | " 91 | \n",
887 | " nightly | \n",
888 | " 4.5 | \n",
889 | " Entire home/flat | \n",
890 | "
\n",
891 | " \n",
892 | " | 17 | \n",
893 | " GBP | \n",
894 | " 51.517944 | \n",
895 | " -0.069536 | \n",
896 | " CS11 Nice Single Room Central London | \n",
897 | " 1 | \n",
898 | " https://a0.muscache.com/im/pictures/0e601324-0... | \n",
899 | " 33 | \n",
900 | " nightly | \n",
901 | " 4.5 | \n",
902 | " Private room | \n",
903 | "
\n",
904 | " \n",
905 | " | 18 | \n",
906 | " GBP | \n",
907 | " 51.523661 | \n",
908 | " -0.172201 | \n",
909 | " Cosy One Bed 3rd Floor Close to Marble Arch | \n",
910 | " 4 | \n",
911 | " https://a0.muscache.com/im/pictures/2666a91d-0... | \n",
912 | " 62 | \n",
913 | " nightly | \n",
914 | " 4.5 | \n",
915 | " Entire home/flat | \n",
916 | "
\n",
917 | " \n",
918 | " | 19 | \n",
919 | " GBP | \n",
920 | " 51.492525 | \n",
921 | " -0.096608 | \n",
922 | " Room Big Ben + Breakfast (R5a) | \n",
923 | " 1 | \n",
924 | " https://a0.muscache.com/im/pictures/f589d5df-3... | \n",
925 | " 19 | \n",
926 | " nightly | \n",
927 | " 5.0 | \n",
928 | " Shared room | \n",
929 | "
\n",
930 | " \n",
931 | " | 20 | \n",
932 | " GBP | \n",
933 | " 51.491423 | \n",
934 | " -0.139555 | \n",
935 | " Brand new studio in Victoria 11A | \n",
936 | " 2 | \n",
937 | " https://a0.muscache.com/im/pictures/11490113/4... | \n",
938 | " 72 | \n",
939 | " nightly | \n",
940 | " 4.5 | \n",
941 | " Entire home/flat | \n",
942 | "
\n",
943 | " \n",
944 | " | 21 | \n",
945 | " GBP | \n",
946 | " 51.510864 | \n",
947 | " -0.181942 | \n",
948 | " 6 Bed Mixed Dormitory Ensuite | \n",
949 | " 6 | \n",
950 | " https://a0.muscache.com/im/pictures/4d8c2dec-0... | \n",
951 | " 21 | \n",
952 | " nightly | \n",
953 | " 4.0 | \n",
954 | " Shared room | \n",
955 | "
\n",
956 | " \n",
957 | " | 22 | \n",
958 | " GBP | \n",
959 | " 51.451279 | \n",
960 | " 0.015362 | \n",
961 | " Exceptional room zone 3 by station | \n",
962 | " 2 | \n",
963 | " https://a0.muscache.com/im/pictures/4edf624d-0... | \n",
964 | " 25 | \n",
965 | " nightly | \n",
966 | " 4.5 | \n",
967 | " Private room | \n",
968 | "
\n",
969 | " \n",
970 | " | 23 | \n",
971 | " GBP | \n",
972 | " 51.510506 | \n",
973 | " -0.129266 | \n",
974 | " Trafalgar Square, Peaceful Room.\\nFemale frien... | \n",
975 | " 2 | \n",
976 | " https://a0.muscache.com/im/pictures/e0265be9-4... | \n",
977 | " 58 | \n",
978 | " nightly | \n",
979 | " 5.0 | \n",
980 | " Private room | \n",
981 | "
\n",
982 | " \n",
983 | " | 24 | \n",
984 | " GBP | \n",
985 | " 51.555259 | \n",
986 | " -0.252571 | \n",
987 | " Charming Room, 2 Double beds, ensuite (BA-II) | \n",
988 | " 4 | \n",
989 | " https://a0.muscache.com/im/pictures/ddf9b001-b... | \n",
990 | " 33 | \n",
991 | " nightly | \n",
992 | " 4.5 | \n",
993 | " Private room | \n",
994 | "
\n",
995 | " \n",
996 | " | 25 | \n",
997 | " GBP | \n",
998 | " 51.479652 | \n",
999 | " -0.169762 | \n",
1000 | " Luxurious double @ heart of london | \n",
1001 | " 2 | \n",
1002 | " https://a0.muscache.com/im/pictures/1528064c-5... | \n",
1003 | " 36 | \n",
1004 | " nightly | \n",
1005 | " 5.0 | \n",
1006 | " Private room | \n",
1007 | "
\n",
1008 | " \n",
1009 | " | 26 | \n",
1010 | " GBP | \n",
1011 | " 51.580548 | \n",
1012 | " -0.233645 | \n",
1013 | " SUPER VALUE, PRETTY & SAFE, EASY ACCESS TO CENTRE | \n",
1014 | " 1 | \n",
1015 | " https://a0.muscache.com/im/pictures/5a878555-6... | \n",
1016 | " 24 | \n",
1017 | " nightly | \n",
1018 | " 5.0 | \n",
1019 | " Private room | \n",
1020 | "
\n",
1021 | " \n",
1022 | " | 27 | \n",
1023 | " GBP | \n",
1024 | " 51.416261 | \n",
1025 | " -0.101107 | \n",
1026 | " The Snug - Completely independent studio | \n",
1027 | " 3 | \n",
1028 | " https://a0.muscache.com/im/pictures/a7b7ac9c-6... | \n",
1029 | " 43 | \n",
1030 | " nightly | \n",
1031 | " 5.0 | \n",
1032 | " Entire home/flat | \n",
1033 | "
\n",
1034 | " \n",
1035 | " | 28 | \n",
1036 | " GBP | \n",
1037 | " 51.496717 | \n",
1038 | " -0.101200 | \n",
1039 | " Great central London flat | \n",
1040 | " 6 | \n",
1041 | " https://a0.muscache.com/im/pictures/0c6c56c4-a... | \n",
1042 | " 119 | \n",
1043 | " nightly | \n",
1044 | " 4.5 | \n",
1045 | " Entire home/flat | \n",
1046 | "
\n",
1047 | " \n",
1048 | " | 29 | \n",
1049 | " GBP | \n",
1050 | " 51.470231 | \n",
1051 | " -0.090346 | \n",
1052 | " Cosy double in creative, social garden house | \n",
1053 | " 2 | \n",
1054 | " https://a0.muscache.com/im/pictures/cb24dc8f-3... | \n",
1055 | " 33 | \n",
1056 | " nightly | \n",
1057 | " 4.5 | \n",
1058 | " Private room | \n",
1059 | "
\n",
1060 | " \n",
1061 | " | 30 | \n",
1062 | " GBP | \n",
1063 | " 51.530786 | \n",
1064 | " -0.056957 | \n",
1065 | " - Double Room Shoreditch \"Sparrow\", 2min to Train | \n",
1066 | " 2 | \n",
1067 | " https://a0.muscache.com/im/pictures/5d305b15-f... | \n",
1068 | " 25 | \n",
1069 | " nightly | \n",
1070 | " 4.5 | \n",
1071 | " Private room | \n",
1072 | "
\n",
1073 | " \n",
1074 | " | 31 | \n",
1075 | " GBP | \n",
1076 | " 51.521552 | \n",
1077 | " -0.045261 | \n",
1078 | " Central bedroom, close to underground, perfect! | \n",
1079 | " 3 | \n",
1080 | " https://a0.muscache.com/im/pictures/c66b07ea-e... | \n",
1081 | " 36 | \n",
1082 | " nightly | \n",
1083 | " 4.5 | \n",
1084 | " Private room | \n",
1085 | "
\n",
1086 | " \n",
1087 | " | 32 | \n",
1088 | " GBP | \n",
1089 | " 51.499814 | \n",
1090 | " -0.113586 | \n",
1091 | " **AMAZING CITY CENTRE APARTMENT** | \n",
1092 | " 2 | \n",
1093 | " https://a0.muscache.com/im/pictures/79e6ba4f-0... | \n",
1094 | " 71 | \n",
1095 | " nightly | \n",
1096 | " 5.0 | \n",
1097 | " Private room | \n",
1098 | "
\n",
1099 | " \n",
1100 | " | 33 | \n",
1101 | " GBP | \n",
1102 | " 51.492054 | \n",
1103 | " -0.096903 | \n",
1104 | " Safestay London Elephant & Castle | \n",
1105 | " 6 | \n",
1106 | " https://a0.muscache.com/im/pictures/d59e71e6-4... | \n",
1107 | " 15 | \n",
1108 | " nightly | \n",
1109 | " 4.5 | \n",
1110 | " Shared room | \n",
1111 | "
\n",
1112 | " \n",
1113 | " | 34 | \n",
1114 | " GBP | \n",
1115 | " 51.519988 | \n",
1116 | " -0.041878 | \n",
1117 | " (ION-B) Private room for 4 in London | \n",
1118 | " 4 | \n",
1119 | " https://a0.muscache.com/im/pictures/58980ef0-1... | \n",
1120 | " 25 | \n",
1121 | " nightly | \n",
1122 | " 5.0 | \n",
1123 | " Private room | \n",
1124 | "
\n",
1125 | " \n",
1126 | " | 35 | \n",
1127 | " GBP | \n",
1128 | " 51.516530 | \n",
1129 | " -0.062436 | \n",
1130 | " (4BFORD-3)PRIVATE ROOM FOR 2 CLOSE TO TOWER BR... | \n",
1131 | " 2 | \n",
1132 | " https://a0.muscache.com/im/pictures/cf9a8ff6-5... | \n",
1133 | " 21 | \n",
1134 | " nightly | \n",
1135 | " 5.0 | \n",
1136 | " Private room | \n",
1137 | "
\n",
1138 | " \n",
1139 | " | 36 | \n",
1140 | " GBP | \n",
1141 | " 51.489336 | \n",
1142 | " -0.230869 | \n",
1143 | " The Muse Haus II - Riverside Room | \n",
1144 | " 3 | \n",
1145 | " https://a0.muscache.com/im/pictures/da06a628-0... | \n",
1146 | " 49 | \n",
1147 | " nightly | \n",
1148 | " 5.0 | \n",
1149 | " Private room | \n",
1150 | "
\n",
1151 | " \n",
1152 | " | 37 | \n",
1153 | " GBP | \n",
1154 | " 51.553402 | \n",
1155 | " -0.241072 | \n",
1156 | " Large Balcony Double Bedroom in Dollis Hill! BR5 | \n",
1157 | " 2 | \n",
1158 | " https://a0.muscache.com/im/pictures/64093166-7... | \n",
1159 | " 31 | \n",
1160 | " nightly | \n",
1161 | " 4.5 | \n",
1162 | " Private room | \n",
1163 | "
\n",
1164 | " \n",
1165 | " | 38 | \n",
1166 | " GBP | \n",
1167 | " 51.512739 | \n",
1168 | " -0.226962 | \n",
1169 | " THE QUEENS HOSTEL , 6 BED MIXED DORM C | \n",
1170 | " 6 | \n",
1171 | " https://a0.muscache.com/im/pictures/d5752e25-1... | \n",
1172 | " 21 | \n",
1173 | " nightly | \n",
1174 | " 4.5 | \n",
1175 | " Shared room | \n",
1176 | "
\n",
1177 | " \n",
1178 | " | 39 | \n",
1179 | " GBP | \n",
1180 | " 51.539388 | \n",
1181 | " -0.202903 | \n",
1182 | " Double Room in Contemporary Flat | \n",
1183 | " 2 | \n",
1184 | " https://a0.muscache.com/im/pictures/26516a3a-1... | \n",
1185 | " 40 | \n",
1186 | " nightly | \n",
1187 | " 4.5 | \n",
1188 | " Private room | \n",
1189 | "
\n",
1190 | " \n",
1191 | " | 40 | \n",
1192 | " GBP | \n",
1193 | " 51.490277 | \n",
1194 | " -0.097092 | \n",
1195 | " Room Big Ben + Breakfast. Zone 1 (Q3) | \n",
1196 | " 2 | \n",
1197 | " https://a0.muscache.com/im/pictures/82998940/6... | \n",
1198 | " 30 | \n",
1199 | " nightly | \n",
1200 | " 4.0 | \n",
1201 | " Private room | \n",
1202 | "
\n",
1203 | " \n",
1204 | " | 41 | \n",
1205 | " GBP | \n",
1206 | " 51.492041 | \n",
1207 | " -0.017314 | \n",
1208 | " Large Double with Private Bathroom-Canary Whar... | \n",
1209 | " 3 | \n",
1210 | " https://a0.muscache.com/im/pictures/6df596d1-4... | \n",
1211 | " 31 | \n",
1212 | " nightly | \n",
1213 | " 4.5 | \n",
1214 | " Private room | \n",
1215 | "
\n",
1216 | " \n",
1217 | " | 42 | \n",
1218 | " GBP | \n",
1219 | " 51.533898 | \n",
1220 | " -0.130299 | \n",
1221 | " Double in Kings Cross Houseshare | \n",
1222 | " 2 | \n",
1223 | " https://a0.muscache.com/im/pictures/57128460/a... | \n",
1224 | " 88 | \n",
1225 | " nightly | \n",
1226 | " 4.5 | \n",
1227 | " Private room | \n",
1228 | "
\n",
1229 | " \n",
1230 | " | 43 | \n",
1231 | " GBP | \n",
1232 | " 51.497440 | \n",
1233 | " -0.009651 | \n",
1234 | " Room next to Canary Wharf and Greenwich | \n",
1235 | " 2 | \n",
1236 | " https://a0.muscache.com/im/pictures/e747b1f1-1... | \n",
1237 | " 31 | \n",
1238 | " nightly | \n",
1239 | " 4.5 | \n",
1240 | " Private room | \n",
1241 | "
\n",
1242 | " \n",
1243 | " | 44 | \n",
1244 | " GBP | \n",
1245 | " 51.551801 | \n",
1246 | " -0.239277 | \n",
1247 | " Garden Facing Double Bedroom in Dollis Hill! BR4 | \n",
1248 | " 2 | \n",
1249 | " https://a0.muscache.com/im/pictures/da6da5bc-7... | \n",
1250 | " 31 | \n",
1251 | " nightly | \n",
1252 | " 4.5 | \n",
1253 | " Private room | \n",
1254 | "
\n",
1255 | " \n",
1256 | " | 45 | \n",
1257 | " GBP | \n",
1258 | " 51.500920 | \n",
1259 | " -0.114502 | \n",
1260 | " LUXURY CITY CENTRE APARTMENT | \n",
1261 | " 2 | \n",
1262 | " https://a0.muscache.com/im/pictures/050fc4a8-1... | \n",
1263 | " 51 | \n",
1264 | " nightly | \n",
1265 | " 5.0 | \n",
1266 | " Private room | \n",
1267 | "
\n",
1268 | " \n",
1269 | " | 46 | \n",
1270 | " GBP | \n",
1271 | " 51.548562 | \n",
1272 | " -0.225982 | \n",
1273 | " Clean Double Room 5min walk to Underground Sta... | \n",
1274 | " 2 | \n",
1275 | " https://a0.muscache.com/im/pictures/bfe713da-b... | \n",
1276 | " 38 | \n",
1277 | " nightly | \n",
1278 | " 5.0 | \n",
1279 | " Private room | \n",
1280 | "
\n",
1281 | " \n",
1282 | " | 47 | \n",
1283 | " GBP | \n",
1284 | " 51.492564 | \n",
1285 | " -0.096076 | \n",
1286 | " Nice Room Big Ben + Breakfast (R7a) | \n",
1287 | " 1 | \n",
1288 | " https://a0.muscache.com/im/pictures/60cf80f1-8... | \n",
1289 | " 22 | \n",
1290 | " nightly | \n",
1291 | " 4.5 | \n",
1292 | " Shared room | \n",
1293 | "
\n",
1294 | " \n",
1295 | " | 48 | \n",
1296 | " GBP | \n",
1297 | " 51.525872 | \n",
1298 | " -0.067217 | \n",
1299 | " (SP-C)PRIVATE ROOM FOR 4 NEAR SHOREDITCH | \n",
1300 | " 4 | \n",
1301 | " https://a0.muscache.com/im/pictures/a4c8f54c-f... | \n",
1302 | " 25 | \n",
1303 | " nightly | \n",
1304 | " 4.5 | \n",
1305 | " Private room | \n",
1306 | "
\n",
1307 | " \n",
1308 | " | 49 | \n",
1309 | " GBP | \n",
1310 | " 51.517147 | \n",
1311 | " -0.039077 | \n",
1312 | " (CROM-E)PRIVATE ROOM FOR 4 NEAR REGENT'S CANAL | \n",
1313 | " 4 | \n",
1314 | " https://a0.muscache.com/im/pictures/e5d9092c-e... | \n",
1315 | " 25 | \n",
1316 | " nightly | \n",
1317 | " 5.0 | \n",
1318 | " Private room | \n",
1319 | "
\n",
1320 | " \n",
1321 | " | 50 | \n",
1322 | " GBP | \n",
1323 | " 51.513970 | \n",
1324 | " -0.047868 | \n",
1325 | " (WE-1) ROOM FOR 4 NEAR STEPNEY GREEN PARK/GARDEN | \n",
1326 | " 4 | \n",
1327 | " https://a0.muscache.com/im/pictures/fbf383e5-9... | \n",
1328 | " 25 | \n",
1329 | " nightly | \n",
1330 | " 5.0 | \n",
1331 | " Private room | \n",
1332 | "
\n",
1333 | " \n",
1334 | " | 51 | \n",
1335 | " GBP | \n",
1336 | " 51.524006 | \n",
1337 | " -0.063918 | \n",
1338 | " (8RAM-A)Private Room for 4ppl near Victoria Park | \n",
1339 | " 4 | \n",
1340 | " https://a0.muscache.com/im/pictures/06d594e2-a... | \n",
1341 | " 25 | \n",
1342 | " nightly | \n",
1343 | " 4.5 | \n",
1344 | " Private room | \n",
1345 | "
\n",
1346 | " \n",
1347 | " | 52 | \n",
1348 | " GBP | \n",
1349 | " 51.517652 | \n",
1350 | " -0.038606 | \n",
1351 | " (92AST-2)Private rooms up to 4 near Mile End Park | \n",
1352 | " 4 | \n",
1353 | " https://a0.muscache.com/im/pictures/6e5c7acc-d... | \n",
1354 | " 25 | \n",
1355 | " nightly | \n",
1356 | " 4.5 | \n",
1357 | " Private room | \n",
1358 | "
\n",
1359 | " \n",
1360 | " | 53 | \n",
1361 | " GBP | \n",
1362 | " 51.513839 | \n",
1363 | " -0.055826 | \n",
1364 | " (ROB-B)PRIVATE ROOM FOR 4 PPL NEAR RIVERSIDE | \n",
1365 | " 4 | \n",
1366 | " https://a0.muscache.com/im/pictures/29046148-5... | \n",
1367 | " 25 | \n",
1368 | " nightly | \n",
1369 | " 5.0 | \n",
1370 | " Private room | \n",
1371 | "
\n",
1372 | " \n",
1373 | " | 54 | \n",
1374 | " GBP | \n",
1375 | " 51.527791 | \n",
1376 | " -0.068263 | \n",
1377 | " (MCD-D) PRIVATE ROOM UP TO 4 CLOSE TO BRICK LANE | \n",
1378 | " 4 | \n",
1379 | " https://a0.muscache.com/im/pictures/f8d3886c-6... | \n",
1380 | " 25 | \n",
1381 | " nightly | \n",
1382 | " 4.5 | \n",
1383 | " Private room | \n",
1384 | "
\n",
1385 | " \n",
1386 | " | 55 | \n",
1387 | " GBP | \n",
1388 | " 51.517796 | \n",
1389 | " -0.038177 | \n",
1390 | " (92AST-4)PRIVATE ROOM FOR 2 NEAR VICTORIA PARK | \n",
1391 | " 2 | \n",
1392 | " https://a0.muscache.com/im/pictures/9f38c016-a... | \n",
1393 | " 21 | \n",
1394 | " nightly | \n",
1395 | " 5.0 | \n",
1396 | " Private room | \n",
1397 | "
\n",
1398 | " \n",
1399 | " | 56 | \n",
1400 | " GBP | \n",
1401 | " 51.523549 | \n",
1402 | " -0.065682 | \n",
1403 | " (8RAM-C)Private Room for 3ppl near Victoria Park | \n",
1404 | " 3 | \n",
1405 | " https://a0.muscache.com/im/pictures/f9c1ac2d-4... | \n",
1406 | " 21 | \n",
1407 | " nightly | \n",
1408 | " 5.0 | \n",
1409 | " Private room | \n",
1410 | "
\n",
1411 | " \n",
1412 | " | 57 | \n",
1413 | " GBP | \n",
1414 | " 51.520828 | \n",
1415 | " -0.067859 | \n",
1416 | " (KING-D)PRIVATE ROOM FOR 3 PPL IN BRICK LANE | \n",
1417 | " 3 | \n",
1418 | " https://a0.muscache.com/im/pictures/05413b50-5... | \n",
1419 | " 21 | \n",
1420 | " nightly | \n",
1421 | " 4.5 | \n",
1422 | " Private room | \n",
1423 | "
\n",
1424 | " \n",
1425 | " | 58 | \n",
1426 | " GBP | \n",
1427 | " 51.515282 | \n",
1428 | " -0.047395 | \n",
1429 | " (WE-4)PRIVATE ROOM FOR 2 NEAR STEPNEY GREEN PARK | \n",
1430 | " 2 | \n",
1431 | " https://a0.muscache.com/im/pictures/8686ff10-3... | \n",
1432 | " 21 | \n",
1433 | " nightly | \n",
1434 | " 4.5 | \n",
1435 | " Private room | \n",
1436 | "
\n",
1437 | " \n",
1438 | " | 59 | \n",
1439 | " GBP | \n",
1440 | " 51.518104 | \n",
1441 | " -0.068196 | \n",
1442 | " (59CH-1) PRIVATE ROOM FOR 4 BRICK LANE | \n",
1443 | " 4 | \n",
1444 | " https://a0.muscache.com/im/pictures/d8786c02-1... | \n",
1445 | " 25 | \n",
1446 | " nightly | \n",
1447 | " 3.0 | \n",
1448 | " Private room | \n",
1449 | "
\n",
1450 | " \n",
1451 | " | 60 | \n",
1452 | " GBP | \n",
1453 | " 51.513845 | \n",
1454 | " -0.064276 | \n",
1455 | " (HAD-D)PRIVATE ROOM CLOSE TO TOWER BRIDGE | \n",
1456 | " 2 | \n",
1457 | " https://a0.muscache.com/im/pictures/7f1928e5-6... | \n",
1458 | " 21 | \n",
1459 | " nightly | \n",
1460 | " 4.0 | \n",
1461 | " Private room | \n",
1462 | "
\n",
1463 | " \n",
1464 | " | 61 | \n",
1465 | " GBP | \n",
1466 | " 51.509966 | \n",
1467 | " -0.061409 | \n",
1468 | " (BET-A)PRIVATE ROOM FOR 4 NEAR RIVERSIDE | \n",
1469 | " 4 | \n",
1470 | " https://a0.muscache.com/im/pictures/10e3fcc8-c... | \n",
1471 | " 25 | \n",
1472 | " nightly | \n",
1473 | " 4.5 | \n",
1474 | " Private room | \n",
1475 | "
\n",
1476 | " \n",
1477 | " | 62 | \n",
1478 | " GBP | \n",
1479 | " 51.527649 | \n",
1480 | " -0.068401 | \n",
1481 | " (MCD-C) PRIVATE ROOM UP TO 3 CLOSE TO BRICK LANE | \n",
1482 | " 3 | \n",
1483 | " https://a0.muscache.com/im/pictures/59bbf736-6... | \n",
1484 | " 21 | \n",
1485 | " nightly | \n",
1486 | " 5.0 | \n",
1487 | " Private room | \n",
1488 | "
\n",
1489 | " \n",
1490 | " | 63 | \n",
1491 | " GBP | \n",
1492 | " 51.519125 | \n",
1493 | " -0.069282 | \n",
1494 | " (43CHIC-B)PRIVATE ROOM FOR 2 IN BRICK LANE | \n",
1495 | " 2 | \n",
1496 | " https://a0.muscache.com/im/pictures/9298d330-b... | \n",
1497 | " 21 | \n",
1498 | " nightly | \n",
1499 | " 4.5 | \n",
1500 | " Private room | \n",
1501 | "
\n",
1502 | " \n",
1503 | " | 64 | \n",
1504 | " GBP | \n",
1505 | " 51.510940 | \n",
1506 | " -0.018134 | \n",
1507 | " (32GRU-D)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVERSIDE | \n",
1508 | " 2 | \n",
1509 | " https://a0.muscache.com/im/pictures/60424b3b-1... | \n",
1510 | " 21 | \n",
1511 | " nightly | \n",
1512 | " 5.0 | \n",
1513 | " Private room | \n",
1514 | "
\n",
1515 | " \n",
1516 | " | 65 | \n",
1517 | " GBP | \n",
1518 | " 51.511244 | \n",
1519 | " -0.053994 | \n",
1520 | " (73SHAD-5)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVER... | \n",
1521 | " 2 | \n",
1522 | " https://a0.muscache.com/im/pictures/ff71eaa5-b... | \n",
1523 | " 21 | \n",
1524 | " nightly | \n",
1525 | " 4.5 | \n",
1526 | " Private room | \n",
1527 | "
\n",
1528 | " \n",
1529 | " | 66 | \n",
1530 | " GBP | \n",
1531 | " 51.525535 | \n",
1532 | " -0.065950 | \n",
1533 | " (8RAM-B)Private Room for 2ppl near Victoria Park | \n",
1534 | " 2 | \n",
1535 | " https://a0.muscache.com/im/pictures/4d783775-2... | \n",
1536 | " 21 | \n",
1537 | " nightly | \n",
1538 | " 4.5 | \n",
1539 | " Private room | \n",
1540 | "
\n",
1541 | " \n",
1542 | " | 67 | \n",
1543 | " GBP | \n",
1544 | " 51.526277 | \n",
1545 | " -0.072128 | \n",
1546 | " (KR-D)PRIVATE ROOM FOR 2 IN SHOREDITCH | \n",
1547 | " 2 | \n",
1548 | " https://a0.muscache.com/im/pictures/3e7bb23b-a... | \n",
1549 | " 21 | \n",
1550 | " nightly | \n",
1551 | " 4.5 | \n",
1552 | " Private room | \n",
1553 | "
\n",
1554 | " \n",
1555 | " | 68 | \n",
1556 | " GBP | \n",
1557 | " 51.527191 | \n",
1558 | " -0.071177 | \n",
1559 | " (KR-A)PRIVATE ROOM FOR 4 IN SHOREDITCH/BALCONY | \n",
1560 | " 4 | \n",
1561 | " https://a0.muscache.com/im/pictures/f7ffd8de-c... | \n",
1562 | " 25 | \n",
1563 | " nightly | \n",
1564 | " 4.0 | \n",
1565 | " Private room | \n",
1566 | "
\n",
1567 | " \n",
1568 | " | 69 | \n",
1569 | " GBP | \n",
1570 | " 51.511228 | \n",
1571 | " -0.058300 | \n",
1572 | " (26MOR-B)PRIVATE ROOM FOR 4 PPL NEAR RIVERSIDE | \n",
1573 | " 4 | \n",
1574 | " https://a0.muscache.com/im/pictures/7792ef9a-f... | \n",
1575 | " 25 | \n",
1576 | " nightly | \n",
1577 | " 4.5 | \n",
1578 | " Private room | \n",
1579 | "
\n",
1580 | " \n",
1581 | " | 70 | \n",
1582 | " GBP | \n",
1583 | " 51.510086 | \n",
1584 | " -0.061221 | \n",
1585 | " (BET-D)PRIVATE ROOM FOR 2 NEAR RIVERSIDE | \n",
1586 | " 2 | \n",
1587 | " https://a0.muscache.com/im/pictures/2a47f778-0... | \n",
1588 | " 21 | \n",
1589 | " nightly | \n",
1590 | " 4.5 | \n",
1591 | " Private room | \n",
1592 | "
\n",
1593 | " \n",
1594 | " | 71 | \n",
1595 | " GBP | \n",
1596 | " 51.512000 | \n",
1597 | " -0.063476 | \n",
1598 | " (HAL-A)PRIVATE ROOM FOR 4 NEAR TOWER BRIDGE | \n",
1599 | " 4 | \n",
1600 | " https://a0.muscache.com/im/pictures/8e7e39fd-8... | \n",
1601 | " 25 | \n",
1602 | " nightly | \n",
1603 | " 5.0 | \n",
1604 | " Private room | \n",
1605 | "
\n",
1606 | " \n",
1607 | " | 72 | \n",
1608 | " GBP | \n",
1609 | " 51.512355 | \n",
1610 | " -0.065909 | \n",
1611 | " (HAD-C)PRIVATE ROOM CLOSE TO TOWER BRIDGE | \n",
1612 | " 2 | \n",
1613 | " https://a0.muscache.com/im/pictures/2bfbd251-5... | \n",
1614 | " 21 | \n",
1615 | " nightly | \n",
1616 | " 4.5 | \n",
1617 | " Private room | \n",
1618 | "
\n",
1619 | " \n",
1620 | " | 73 | \n",
1621 | " GBP | \n",
1622 | " 51.512548 | \n",
1623 | " -0.018926 | \n",
1624 | " (32GRU-B)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVERSIDE | \n",
1625 | " 2 | \n",
1626 | " https://a0.muscache.com/im/pictures/4030ee3b-8... | \n",
1627 | " 21 | \n",
1628 | " nightly | \n",
1629 | " 4.5 | \n",
1630 | " Private room | \n",
1631 | "
\n",
1632 | " \n",
1633 | " | 74 | \n",
1634 | " GBP | \n",
1635 | " 51.515882 | \n",
1636 | " -0.037798 | \n",
1637 | " (50AST-1)PRIVATE ROOM FOR 4 NEAR MILE END PARK | \n",
1638 | " 4 | \n",
1639 | " https://a0.muscache.com/im/pictures/d7437f04-4... | \n",
1640 | " 21 | \n",
1641 | " nightly | \n",
1642 | " 4.0 | \n",
1643 | " Private room | \n",
1644 | "
\n",
1645 | " \n",
1646 | " | 75 | \n",
1647 | " GBP | \n",
1648 | " 51.523891 | \n",
1649 | " -0.052398 | \n",
1650 | " (BRAI-D)PRIVATE ROOM FOR 4 NEAR BRICK LANE | \n",
1651 | " 4 | \n",
1652 | " https://a0.muscache.com/im/pictures/b11142b5-f... | \n",
1653 | " 25 | \n",
1654 | " nightly | \n",
1655 | " NaN | \n",
1656 | " Private room | \n",
1657 | "
\n",
1658 | " \n",
1659 | " | 76 | \n",
1660 | " GBP | \n",
1661 | " 51.532256 | \n",
1662 | " -0.083022 | \n",
1663 | " (CHA-A) PRIVATE ROOM IN HOXTON WITH BALCONY FOR 4 | \n",
1664 | " 4 | \n",
1665 | " https://a0.muscache.com/im/pictures/6619ba13-b... | \n",
1666 | " 25 | \n",
1667 | " nightly | \n",
1668 | " 5.0 | \n",
1669 | " Private room | \n",
1670 | "
\n",
1671 | " \n",
1672 | " | 77 | \n",
1673 | " GBP | \n",
1674 | " 51.535244 | \n",
1675 | " -0.089366 | \n",
1676 | " (CROP-A)PRIVATE ROOM FOR 4 PEOPLE IN SHOREDITCH | \n",
1677 | " 4 | \n",
1678 | " https://a0.muscache.com/im/pictures/84969292-3... | \n",
1679 | " 25 | \n",
1680 | " nightly | \n",
1681 | " 4.5 | \n",
1682 | " Private room | \n",
1683 | "
\n",
1684 | " \n",
1685 | " | 78 | \n",
1686 | " GBP | \n",
1687 | " 51.535351 | \n",
1688 | " -0.084282 | \n",
1689 | " (FUL-B)PRIVATE ROOM FOR 4 NEAR HOXTON | \n",
1690 | " 4 | \n",
1691 | " https://a0.muscache.com/im/pictures/a7c18ecc-2... | \n",
1692 | " 25 | \n",
1693 | " nightly | \n",
1694 | " 5.0 | \n",
1695 | " Private room | \n",
1696 | "
\n",
1697 | " \n",
1698 | " | 79 | \n",
1699 | " GBP | \n",
1700 | " 51.516583 | \n",
1701 | " -0.063056 | \n",
1702 | " (4AFORD-2)PRIVATE ROOM FOR 4PPL NEAR TOWER BRIDGE | \n",
1703 | " 4 | \n",
1704 | " https://a0.muscache.com/im/pictures/291306f2-4... | \n",
1705 | " 21 | \n",
1706 | " nightly | \n",
1707 | " 4.0 | \n",
1708 | " Private room | \n",
1709 | "
\n",
1710 | " \n",
1711 | " | 80 | \n",
1712 | " GBP | \n",
1713 | " 51.553648 | \n",
1714 | " -0.241361 | \n",
1715 | " Huge & Bright Double Bedroom in Dollis Hill! BR1 | \n",
1716 | " 2 | \n",
1717 | " https://a0.muscache.com/im/pictures/6f44c440-8... | \n",
1718 | " 31 | \n",
1719 | " nightly | \n",
1720 | " 4.5 | \n",
1721 | " Private room | \n",
1722 | "
\n",
1723 | " \n",
1724 | " | 81 | \n",
1725 | " GBP | \n",
1726 | " 51.496208 | \n",
1727 | " -0.071039 | \n",
1728 | " New Central London double with private bathroom | \n",
1729 | " 2 | \n",
1730 | " https://a0.muscache.com/im/pictures/7ddf6bf6-4... | \n",
1731 | " 58 | \n",
1732 | " nightly | \n",
1733 | " 5.0 | \n",
1734 | " Private room | \n",
1735 | "
\n",
1736 | " \n",
1737 | " | 82 | \n",
1738 | " GBP | \n",
1739 | " 51.498859 | \n",
1740 | " -0.086188 | \n",
1741 | " Single box room at london bridge | \n",
1742 | " 1 | \n",
1743 | " https://a0.muscache.com/im/pictures/74d672bc-e... | \n",
1744 | " 22 | \n",
1745 | " nightly | \n",
1746 | " 4.0 | \n",
1747 | " Private room | \n",
1748 | "
\n",
1749 | " \n",
1750 | " | 83 | \n",
1751 | " GBP | \n",
1752 | " 51.518006 | \n",
1753 | " -0.167318 | \n",
1754 | " Cosy 2nd Floor 1 Bed Flat Close To Oxford Street | \n",
1755 | " 4 | \n",
1756 | " https://a0.muscache.com/im/pictures/b9df22db-6... | \n",
1757 | " 67 | \n",
1758 | " nightly | \n",
1759 | " 4.5 | \n",
1760 | " Entire home/flat | \n",
1761 | "
\n",
1762 | " \n",
1763 | " | 84 | \n",
1764 | " GBP | \n",
1765 | " 51.510704 | \n",
1766 | " -0.050297 | \n",
1767 | " (GRD-A)PRIVATE ROOM UP TO 4 WITH CITY VIEWS | \n",
1768 | " 4 | \n",
1769 | " https://a0.muscache.com/im/pictures/39cd4175-5... | \n",
1770 | " 25 | \n",
1771 | " nightly | \n",
1772 | " 4.5 | \n",
1773 | " Private room | \n",
1774 | "
\n",
1775 | " \n",
1776 | " | 85 | \n",
1777 | " GBP | \n",
1778 | " 51.506991 | \n",
1779 | " -0.067635 | \n",
1780 | " * Zone 1 * FLAT * NO CLEANING FEES * | \n",
1781 | " 6 | \n",
1782 | " https://a0.muscache.com/im/pictures/440765d6-8... | \n",
1783 | " 80 | \n",
1784 | " nightly | \n",
1785 | " 5.0 | \n",
1786 | " Entire home/flat | \n",
1787 | "
\n",
1788 | " \n",
1789 | " | 86 | \n",
1790 | " GBP | \n",
1791 | " 51.527842 | \n",
1792 | " -0.071902 | \n",
1793 | " (KR-C)PRIVATE ROOM FOR 4 IN SHOREDITCH | \n",
1794 | " 4 | \n",
1795 | " https://a0.muscache.com/im/pictures/3a9cff13-d... | \n",
1796 | " 25 | \n",
1797 | " nightly | \n",
1798 | " 4.5 | \n",
1799 | " Private room | \n",
1800 | "
\n",
1801 | " \n",
1802 | " | 87 | \n",
1803 | " GBP | \n",
1804 | " 51.521531 | \n",
1805 | " -0.139337 | \n",
1806 | " Room C Brilliant Location | \n",
1807 | " 3 | \n",
1808 | " https://a0.muscache.com/im/pictures/bef5920b-7... | \n",
1809 | " 46 | \n",
1810 | " nightly | \n",
1811 | " 4.5 | \n",
1812 | " Private room | \n",
1813 | "
\n",
1814 | " \n",
1815 | " | 88 | \n",
1816 | " GBP | \n",
1817 | " 51.532407 | \n",
1818 | " -0.063612 | \n",
1819 | " (SBH-C)PRIVATE ROOM FOR 4 PEOPLE NEAR SHOREDITCH | \n",
1820 | " 4 | \n",
1821 | " https://a0.muscache.com/im/pictures/e13000eb-6... | \n",
1822 | " 25 | \n",
1823 | " nightly | \n",
1824 | " 4.5 | \n",
1825 | " Private room | \n",
1826 | "
\n",
1827 | " \n",
1828 | " | 89 | \n",
1829 | " GBP | \n",
1830 | " 51.466348 | \n",
1831 | " -0.192849 | \n",
1832 | " Beautiful 2 bed Fulham apartment | \n",
1833 | " 4 | \n",
1834 | " https://a0.muscache.com/im/pictures/53d529ea-2... | \n",
1835 | " 98 | \n",
1836 | " nightly | \n",
1837 | " 5.0 | \n",
1838 | " Entire home/flat | \n",
1839 | "
\n",
1840 | " \n",
1841 | " | 90 | \n",
1842 | " GBP | \n",
1843 | " 51.490963 | \n",
1844 | " -0.096486 | \n",
1845 | " Room Big Ben + Breakfast (R5b) | \n",
1846 | " 1 | \n",
1847 | " https://a0.muscache.com/im/pictures/1f4b6234-3... | \n",
1848 | " 18 | \n",
1849 | " nightly | \n",
1850 | " 4.5 | \n",
1851 | " Shared room | \n",
1852 | "
\n",
1853 | " \n",
1854 | " | 91 | \n",
1855 | " GBP | \n",
1856 | " 51.594978 | \n",
1857 | " -0.081565 | \n",
1858 | " Bright & Spacious Double Room | \n",
1859 | " 2 | \n",
1860 | " https://a0.muscache.com/im/pictures/72012623/a... | \n",
1861 | " 28 | \n",
1862 | " nightly | \n",
1863 | " 5.0 | \n",
1864 | " Private room | \n",
1865 | "
\n",
1866 | " \n",
1867 | " | 92 | \n",
1868 | " GBP | \n",
1869 | " 51.513420 | \n",
1870 | " -0.065957 | \n",
1871 | " (HAD-A)PRIVATE ROOM CLOSE TO TOWER BRIDGE | \n",
1872 | " 2 | \n",
1873 | " https://a0.muscache.com/im/pictures/27e16557-8... | \n",
1874 | " 21 | \n",
1875 | " nightly | \n",
1876 | " 5.0 | \n",
1877 | " Private room | \n",
1878 | "
\n",
1879 | " \n",
1880 | " | 93 | \n",
1881 | " GBP | \n",
1882 | " 51.534635 | \n",
1883 | " -0.072653 | \n",
1884 | " (GOD-A)PRIVATE ROOM FOR 3 NEAR REGENTS CANAL | \n",
1885 | " 3 | \n",
1886 | " https://a0.muscache.com/im/pictures/d81ee2b9-6... | \n",
1887 | " 21 | \n",
1888 | " nightly | \n",
1889 | " 5.0 | \n",
1890 | " Private room | \n",
1891 | "
\n",
1892 | " \n",
1893 | " | 94 | \n",
1894 | " GBP | \n",
1895 | " 51.492320 | \n",
1896 | " -0.015128 | \n",
1897 | " Double Room - Canary Wharf hp | \n",
1898 | " 2 | \n",
1899 | " https://a0.muscache.com/im/pictures/4cebcd26-b... | \n",
1900 | " 30 | \n",
1901 | " nightly | \n",
1902 | " 4.5 | \n",
1903 | " Private room | \n",
1904 | "
\n",
1905 | " \n",
1906 | " | 95 | \n",
1907 | " GBP | \n",
1908 | " 51.516983 | \n",
1909 | " -0.027006 | \n",
1910 | " (TA-4) Private room for 2 close to Mile End Park | \n",
1911 | " 2 | \n",
1912 | " https://a0.muscache.com/im/pictures/5f8370ba-c... | \n",
1913 | " 21 | \n",
1914 | " nightly | \n",
1915 | " 4.5 | \n",
1916 | " Private room | \n",
1917 | "
\n",
1918 | " \n",
1919 | " | 96 | \n",
1920 | " GBP | \n",
1921 | " 51.503962 | \n",
1922 | " -0.109114 | \n",
1923 | " 5. Lovely Room Free WIFI Centre London | \n",
1924 | " 1 | \n",
1925 | " https://a0.muscache.com/im/pictures/5de10973-b... | \n",
1926 | " 51 | \n",
1927 | " nightly | \n",
1928 | " 4.5 | \n",
1929 | " Private room | \n",
1930 | "
\n",
1931 | " \n",
1932 | " | 97 | \n",
1933 | " GBP | \n",
1934 | " 51.427872 | \n",
1935 | " -0.070807 | \n",
1936 | " Comfy, Clean Room For 1, Central London(FREE W... | \n",
1937 | " 1 | \n",
1938 | " https://a0.muscache.com/im/pictures/ea2c17f9-9... | \n",
1939 | " 22 | \n",
1940 | " nightly | \n",
1941 | " 4.5 | \n",
1942 | " Private room | \n",
1943 | "
\n",
1944 | " \n",
1945 | " | 98 | \n",
1946 | " GBP | \n",
1947 | " 51.532777 | \n",
1948 | " -0.065245 | \n",
1949 | " (SBH-B)PRIVATE ROOM UP TO 4 PEOPLE NEAR SHORED... | \n",
1950 | " 4 | \n",
1951 | " https://a0.muscache.com/im/pictures/c83c76d3-1... | \n",
1952 | " 25 | \n",
1953 | " nightly | \n",
1954 | " 4.5 | \n",
1955 | " Private room | \n",
1956 | "
\n",
1957 | " \n",
1958 | " | 99 | \n",
1959 | " GBP | \n",
1960 | " 51.474246 | \n",
1961 | " -0.194698 | \n",
1962 | " Little Corner of Relaxation & Rest | \n",
1963 | " 3 | \n",
1964 | " https://a0.muscache.com/im/pictures/9a453323-2... | \n",
1965 | " 36 | \n",
1966 | " nightly | \n",
1967 | " 4.5 | \n",
1968 | " Private room | \n",
1969 | "
\n",
1970 | " \n",
1971 | ""
1972 | ],
1973 | "text/plain": [
1974 | ""
1975 | ]
1976 | },
1977 | "execution_count": 20,
1978 | "metadata": {},
1979 | "output_type": "execute_result"
1980 | }
1981 | ],
1982 | "source": [
1983 | "HTML(pd.DataFrame(props).to_html())"
1984 | ]
1985 | },
1986 | {
1987 | "cell_type": "code",
1988 | "execution_count": 12,
1989 | "metadata": {
1990 | "collapsed": false
1991 | },
1992 | "outputs": [
1993 | {
1994 | "name": "stdout",
1995 | "output_type": "stream",
1996 | "text": [
1997 | "will skip the code... but you can find a nice surprise in the source of any Walmart product page :)\n"
1998 | ]
1999 | }
2000 | ],
2001 | "source": [
2002 | "\"\"\"\n",
2003 | "Case Study 4 - JSON embedded in HTML\n",
2004 | "\n",
2005 | "You want to analyse and compare some retail corporations. One of them is Walmart. You want to get as many \n",
2006 | "details about the products they sell as you can. You have investigated your target well, but unfortunately\n",
2007 | "you cannot find any hidden API... You need to get the data from HTML. You have already found nice sitemaps, got the product\n",
2008 | "pages and now you're ready to scrape.\n",
2009 | "\n",
2010 | "What is the most efficient and robust way to get all the details about the products?\n",
2011 | "\"\"\"\n",
2012 | "\n",
2013 | "print(\"will skip the code... but you can find a nice surprise in the source of any Walmart product page :)\")"
2014 | ]
2015 | },
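Case Study 4 leaves the code out. As a minimal sketch of the general idea, many product pages ship their data as a JSON blob inside a `<script>` tag, and that blob can be parsed directly instead of walking the page layout. The HTML snippet, the `item-data` id and the field names below are invented for illustration and are not taken from any real Walmart page.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.blobs.append("".join(self._chunks))
            self._in_json_script = False


def extract_embedded_json(html):
    """Return every JSON document embedded in the page as Python objects."""
    parser = ScriptJSONExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(blob) for blob in parser.blobs]


# Hypothetical page source, made up for this example.
sample_html = """
<html><body>
<h1>Some product</h1>
<script id="item-data" type="application/json">
{"product": {"name": "Example Kettle", "price": 19.99, "in_stock": true}}
</script>
</body></html>
"""

product = extract_embedded_json(sample_html)[0]["product"]
print(product["name"], product["price"])
```

Parsing the embedded JSON tends to survive layout changes better than CSS or XPath selectors, because it does not depend on where the values are rendered on the page.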
2016 | {
2017 | "cell_type": "markdown",
2018 | "metadata": {},
2019 | "source": [
2020 | "## Handling JavaScript\n",
2021 | "\n",
2022 | "### Why...?\n",
2023 | "- Sometimes the content of a webpage is dynamically rendered or altered by JavaScript code\n",
2024 | "- when you download the HTML, it can be completely different from what you see in the browser\n",
2025 | "- you need to perform some sort of interaction with the page\n",
2026 | "- your target has some fancy anti-scraping software that detects that you're a bot\n",
2027 | "\n",
2028 | "### Selenium+PhantomJS\n",
2029 | "- Selenium is a browser automation tool, most often used for testing web applications.\n",
2030 | "- It can be useful while scraping\n",
2031 | "- PhantomJS is just a headless browser (there is no UI and it works in the background)\n",
2032 | "- BTW: you can use Selenium with any other browser (Firefox, Opera etc.)\n",
2033 | "\n",
2034 | "### It's often overkill though!\n",
2035 | "- Scraping with Selenium+PhantomJS is much heavier than using simple Python libraries!\n",
2036 | " + you have to have all the additional libraries and software installed\n",
2037 | " + it may be slower\n",
2038 | " + you have to navigate as if you were a human (e.g. find a button element and click it programmatically)\n",
2039 | " + you rely on page layout (not robust at all...)\n",
2040 | "- Very often you can find a workaround. For example, if you are dealing with infinite scroll,\n",
2041 | " you can investigate what AJAX requests your browser sends while scrolling (and emulate them)"
2042 | ]
2043 | },
2044 | {
2045 | "cell_type": "code",
2046 | "execution_count": 22,
2047 | "metadata": {
2048 | "collapsed": false
2049 | },
2050 | "outputs": [
2051 | {
2052 | "name": "stdout",
2053 | "output_type": "stream",
2054 | "text": [
2055 | "No. of quotes is: 60\n",
2056 | "“A day without sunshine is like, you know, night.”\n"
2057 | ]
2058 | }
2059 | ],
2060 | "source": [
2061 | "\"\"\"\n",
2062 | "Case Study 5 - how (not) to handle infinite scroll\n",
2063 | "\n",
2064 | "You are planning to spam your friends on Facebook with random quotes to show how smart and deep you are.\n",
2065 | "You found an amazing page with quotes, but... it uses infinite scroll?! Don't worry though! \n",
2066 | "As they say: \"You have to look through the rain to see the rainbow.\"\n",
2067 | "\n",
2068 | "Your task is to get all quotes from http://spidyquotes.herokuapp.com/\n",
2069 | "\"\"\"\n",
2070 | "spidyquotes_url = 'http://spidyquotes.herokuapp.com/scroll'\n",
2071 | "# with Selenium+PhantomJS (in general a bad option, but yeah... it may look fancy)\n",
2072 | "driver = webdriver.PhantomJS('/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')\n",
2073 | "driver.get(spidyquotes_url)\n",
2074 | "no_of_scrolls = 5\n",
2075 | "scroll = 0\n",
2076 | "while scroll < no_of_scrolls:\n",
2077 | " # take a fancy screenshot here\n",
2078 | " driver.get_screenshot_as_file('/Users/stulski/Desktop/osobiste/pydata_meetup/shot_{}.jpg'.format(scroll))\n",
2079 | " # scroll down\n",
2080 | " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n",
2081 | " time.sleep(1)\n",
2082 | " scroll += 1\n",
2083 | "quote_elements = driver.find_elements_by_class_name('quote')\n",
2084 | "all_quotes = [element.find_elements_by_class_name('text')[0].text for element in quote_elements]\n",
2085 | "print('No. of quotes is: ', len(all_quotes))\n",
2086 | "print(random.choice(all_quotes))"
2087 | ]
2088 | },
2089 | {
2090 | "cell_type": "code",
2091 | "execution_count": 23,
2092 | "metadata": {
2093 | "collapsed": false
2094 | },
2095 | "outputs": [
2096 | {
2097 | "name": "stdout",
2098 | "output_type": "stream",
2099 | "text": [
2100 | "No. of quotes is: 90\n",
2101 | "“Not all of us can do great things. But we can do small things with great love.”\n"
2102 | ]
2103 | }
2104 | ],
2105 | "source": [
2106 | "# same as above, without the unnecessary hassle\n",
2107 | "p_idx = 1\n",
2108 | "spidyquotes_better_url = 'http://spidyquotes.herokuapp.com/api/quotes?page='\n",
2109 | "r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)\n",
2110 | "time.sleep(1)\n",
2111 | "all_quotes = []\n",
2112 | "while r['has_next']:  # note: the final page (has_next false) is never collected\n",
2113 | " for quote in r['quotes']:\n",
2114 | " all_quotes.append(quote['text'])\n",
2115 | " p_idx += 1\n",
2116 | " r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)\n",
2117 | " time.sleep(1)\n",
2118 | "print('No. of quotes is: ', len(all_quotes))\n",
2119 | "print(random.choice(all_quotes))"
2120 | ]
2121 | },
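The loop above follows a common `has_next` pagination pattern. As a hedged sketch (not the author's code), the same idea can be written as a generator that receives the page-fetching function as an argument. That makes it easy to try offline, and it also collects the final page, the one where `has_next` is false. `fake_api` and its three pages are made up; in real use `fetch_page` would wrap a call such as `requests.get(url + str(page)).json()`.

```python
import time


def iter_quotes(fetch_page, delay=0.0):
    """Yield quote texts from a paginated API until 'has_next' is false."""
    page = 1
    while True:
        payload = fetch_page(page)
        for quote in payload["quotes"]:
            yield quote["text"]
        if not payload["has_next"]:
            break
        page += 1
        time.sleep(delay)  # be polite between requests


# Hypothetical stand-in for the real endpoint: three pages, two quotes each.
def fake_api(page, last_page=3):
    return {
        "quotes": [{"text": "quote {}-{}".format(page, i)} for i in (1, 2)],
        "has_next": page < last_page,
    }


all_quotes = list(iter_quotes(fake_api))
print("No. of quotes is:", len(all_quotes))
```

Checking `has_next` after the quotes are collected, rather than before, is what keeps the last page from being dropped.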
2122 | {
2123 | "cell_type": "markdown",
2124 | "metadata": {},
2125 | "source": [
2126 | "## Keynotes and advice\n",
2127 | "* investigate your target well (sitemaps, hidden APIs, how it works under the hood)\n",
2128 | "* use incognito mode while exploring\n",
2129 | "* use developer tools\n",
2130 | "* think about scraping as a \"hacking\" activity rather than just parsing HTML elements\n",
2131 | "* change your user-agent\n",
2132 | "* add time.sleep if you can afford it\n",
2133 | "* the same data often appears in different places on a website. find the place that is easy to scrape!\n",
2134 | "* if you need to parse HTML and get data from there - try to find something which will not break (avoid finding general elements like DIVs and then picking the Nth of those)\n",
2135 | "* look for commonalities\n",
2136 | "* websites are different and there is no one magical way to get your data\n",
2137 | "* it looks easy when I'm showing it, but sometimes it takes time to reverse engineer websites\n",
2138 | "\n",
2139 | "** if you want production-level scrapers - use a proxy"
2140 | ]
2141 | }
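Two of the tips above, changing the user-agent and sleeping between requests, can be sketched with the standard library alone. This is an illustration under assumptions: the user-agent string, the `example.com` URLs and the helper names are placeholders, and nothing is sent over the network here.

```python
import time
import urllib.request

# Placeholder user-agent string; use whatever suits your project.
USER_AGENT = "Mozilla/5.0 (compatible; my-research-scraper/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def throttled(items, delay):
    """Yield items one by one, sleeping `delay` seconds between them."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay)
        yield item


urls = ["http://example.com/page/1", "http://example.com/page/2"]
requests_to_send = [build_request(u) for u in throttled(urls, delay=0.1)]
# Each request could now be passed to urllib.request.urlopen(...).
print(requests_to_send[0].get_header("User-agent"))
```

Keeping the delay in a small wrapper like `throttled` means the pause applies to every URL source in the same way, whether the URLs come from a sitemap or from a paginated API.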
2142 | ],
2143 | "metadata": {
2144 | "kernelspec": {
2145 | "display_name": "Python 3",
2146 | "language": "python",
2147 | "name": "python3"
2148 | },
2149 | "language_info": {
2150 | "codemirror_mode": {
2151 | "name": "ipython",
2152 | "version": 3
2153 | },
2154 | "file_extension": ".py",
2155 | "mimetype": "text/x-python",
2156 | "name": "python",
2157 | "nbconvert_exporter": "python",
2158 | "pygments_lexer": "ipython3",
2159 | "version": "3.5.1"
2160 | }
2161 | },
2162 | "nbformat": 4,
2163 | "nbformat_minor": 2
2164 | }
2165 |
--------------------------------------------------------------------------------