└── Robust_extraction_of_web_data_with_Python.ipynb /Robust_extraction_of_web_data_with_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Who am I?\n", 8 | "- Slawomir Tulski (Slaw)\n", 9 | "- currently: Big Data Engineer at WorldRemit\n", 10 | "- previously: Python Data Programmer at Import.io (web scraping start-up)\n", 11 | "- linkedin: https://www.linkedin.com/in/slawomir-tulski-091611116/\n", 12 | "- personal website: http://slawomirtulski.com/" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "### My goals for today\n", 20 | "- show how to tackle the problem of web scraping in ways other than the \"standard\" approach\n", 21 | "- present useful tips and tricks for web scraping\n", 22 | "- avoid making a tutorial on popular HTML parsing / scraping libraries\n", 23 | "\n", 24 | "### Plan\n", 25 | "- Quick introduction to scraping\n", 26 | "- Stop crawling, investigate your target instead\n", 27 | " + case study 1: getting all the URLs you need from a website\n", 28 | "- Look for APIs, even if the service does not provide a (public) one\n", 29 | " + case study 2: getting an API key and using a hidden API in a store locator service\n", 30 | " + case study 3: getting available Airbnb properties in London\n", 31 | " + case study 4: API JSON response embedded in HTML\n", 32 | "- Handling JavaScript with Selenium\n", 33 | " + case study 5: handling infinite scroll\n", 34 | "- Keynotes" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## Some basics...\n", 42 | "\n", 43 | "### How does your browser work?\n", 44 | "- The World Wide Web operates on a client/server model\n", 45 | "- A web browser contacts a web server and requests information or resources\n", 46 | "- The server locates and then sends the information (HTML, images etc.) 
back to the web browser\n", 47 | "- The browser displays the results\n", 48 | "- The browser can execute JavaScript code to dynamically \"do things\" (send requests, change the site's appearance and basically everything else)\n", 49 | "- 4 common HTTP request methods: GET and POST (the ones you'll use most often while scraping), PUT and DELETE\n", 50 | "\n", 51 | "### How to see what my browser is doing?\n", 52 | "- web browsers usually have some sort of \"Developer Tools\" (if not, you should think about changing your browser)\n", 53 | "- there should be a 'Network' tab which shows you what is being sent between your browser and the server\n", 54 | "- you can check exactly what types of requests were sent, along with headers, parameters, cookies etc.\n", 55 | "- the Developer Tools also include a console in which you can execute JavaScript\n", 56 | "\n", 57 | "### \"Standard\" scraping approach\n", 58 | "0. I want DATA!!!!\n", 59 | "1. don't scrape... find data somewhere else!\n", 60 | "2. don't scrape... they should provide an API!\n", 61 | "3. ok.. you're screwed. get HTML and parse it!\n", 62 | "4. you need a lot of data from different pages of one web service? - build a crawler and \"catch them all\"" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "## Using sitemaps instead of crawling the whole website\n", 70 | "\n", 71 | "### what is a web \"crawler\"?\n", 72 | "* an automated bot which recursively follows every internal link it finds, beginning at a start page\n", 73 | "* theoretically, it will traverse all URLs on the website\n", 74 | "\n", 75 | "### why is it not the best idea?\n", 76 | "* not precise (it's brute force... 
a lot of requests made and a lot of garbage scraped)\n", 77 | "* you need to write more code and take care of a lot of things (what type of URL is this? can I go there?)\n", 78 | "* assumes a particular page layout and tests whatever it encounters\n", 79 | "* easy to get caught in a trap (honeypots)\n", 80 | "\n", 81 | "### what to use instead?\n", 82 | "* very often a sitemap of the whole website is already available!\n", 83 | "* sitemaps are often hidden! if you can't see one on the page, try **/sitemap.xml** [https://www.skipthedishes.com/]\n", 84 | "* also, information about the sitemap can be found in the **robots.txt** file [https://www.walmart.com/]\n", 85 | "* if there is no sitemap, try to follow the pattern **get categories -> get pages -> get listing -> get item**" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 24, 91 | "metadata": { 92 | "collapsed": true 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "# built-in\n", 97 | "import json\n", 98 | "import random\n", 99 | "import re\n", 100 | "import time\n", 101 | "# 3rd party\n", 102 | "from IPython.display import HTML\n", 103 | "import pandas as pd\n", 104 | "import requests\n", 105 | "from selenium import webdriver" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 15, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails0.xml\n", 120 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails1.xml\n", 121 | "getting properties from: http://www.rightmove.co.uk/sitemap_propertydetails2.xml\n", 122 | "I've got 150000 of urls with properties.\n", 123 | "Some examples:\n", 124 | "\n", 125 | "- http://www.rightmove.co.uk/property-to-rent/property-50480715.html\n", 126 | "\n", 127 | "- http://www.rightmove.co.uk/property-to-rent/property-53775521.html\n", 128 | "\n", 129 | "- 
http://www.rightmove.co.uk/commercial-property-for-sale/property-64919567.html\n", 130 | "\n", 131 | "- http://www.rightmove.co.uk/property-to-rent/property-68904185.html\n", 132 | "\n", 133 | "- http://www.rightmove.co.uk/commercial-property-to-let/property-47279781.html\n", 134 | "\n", 135 | "- http://www.rightmove.co.uk/property-to-rent/property-61726357.html\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "\"\"\"\n", 141 | "Case Study 1: getting all links from a sitemap\n", 142 | "\n", 143 | "You want to analyse the housing market in the UK. The data which interests you most is on http://www.rightmove.co.uk/.\n", 144 | "Unfortunately, there is no API available and you need to get the data from HTML pages. As a first step, before getting your hands\n", 145 | "on the data, you need to know the URLs of all available properties on the website. Later, you will use those links to extract data.\n", 146 | "\n", 147 | "Find all URLs of properties on rightmove.co.uk. Be as precise as possible. Do not build inefficient crawlers.\n", 148 | "\"\"\"\n", 149 | "main_sitemap_url = 'http://www.rightmove.co.uk/sitemap.xml'\n", 150 | "main_sitemap_text = requests.get(main_sitemap_url).text\n", 151 | "properties_sitemaps = re.findall(r'(http://www.rightmove.co.uk/sitemap_propertydetails\\d+.xml)', main_sitemap_text)\n", 152 | "limit_pages = 3\n", 153 | "all_properites = []\n", 154 | "for pmap_url in properties_sitemaps[:limit_pages]:\n", 155 | "    print('getting properties from: ', pmap_url)\n", 156 | "    pmap_text = requests.get(pmap_url).text\n", 157 | "    p_urls = re.findall(r'(http://www.rightmove.co.uk/[\\-a-z]+/property-\\d+.html)', pmap_text)\n", 158 | "    all_properites.extend(p_urls)\n", 159 | "print('I\\'ve got ' + str(len(all_properites)) + ' of urls with properties.\\nSome examples:')\n", 160 | "for url in all_properites[:6]:\n", 161 | "    print('\\n- '+url)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "source": [ 170 | "## Look for APIs - 
even if the service does not provide a (public) one\n", 171 | "\n", 172 | "### why APIs are better (I know... silly question)\n", 173 | "* a website's appearance can change frequently (which will break scrapers dependent on HTML tags), but an API usually stays the same for longer\n", 174 | "* often, responses from an API contain well-structured data (e.g. in JSON or XML format)\n", 175 | "\n", 176 | "### but there is no API available for website 'X' ;(\n", 177 | "* a lot of modern web services use some kind of API internally [https://www.airbnb.co.uk/s/London/homes]\n", 178 | "* to find out if a web service is using an API, watch the network traffic in your browser's developer tools (I like Chrome’s tools, but Firefox, Opera etc. also have nice ones)\n", 179 | "* there are some treasures hidden in requests of type xhr, fetch, json etc.\n", 180 | "* often, you need to supply additional information with your request (like API keys or tokens)\n", 181 | "* API responses can be dynamically embedded in HTML [https://www.walmart.com/]" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 16, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [ 191 | { 192 | "name": "stdout", 193 | "output_type": "stream", 194 | "text": [ 195 | "Got API KEY from main page: 41C97F66-D0FF-11DD-8143-EF6F37ABAA09\n", 196 | "Raw response from API: {'response': {'collectioncount': 16, 'attributes': {'country': 'US', 'province': '', 'postalcode': '20004', 'city': 'WASHINGTON', 'radiusuom': 'mile', 'radius': '40', 'state': 'DC', 'address': '', 'centerpoint': '-77.0255,38.8957'}, 'activeobject': '', 'collection': [{'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 462-3146', 'email': 'truevalue17@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:30 PM', 'google_shared': None, 'uid': 1051076976, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': 
None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.truevalueon17th.com/', 'sun_close_time': '- 6:00 PM', 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '1.32', 'wed_close_time': '- 7:30 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:30 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:30 PM', 'latitude': '38.911917', 'activeshiptostore': '1', 'bho': '[[\"1000\",\"1800\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0800\",\"1930\"],[\"0900\",\"1800\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.03849', 'ja': None, 'country': 'US', 'postalcode': '20009-2433', 'clientkey': 'L4ZK7Q8W-PA4X-4IS6-587Y-FRJLJ8Z5JF84', 'city': 'Washington', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalueon17th.com', 'sat_open_time': '9:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'True Value On 17th', 'address1': '1623 17th St NW', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '9:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 659-8686', 'email': 'info@districthardware.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': 1051078615, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 6:00 PM', 'yelp': None, 'sun_open_time': '11:00 AM', 'tvurl': 'http://www.thebikeshopdc.com', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '9:00 AM', 'tvpaint': None, '_distance': '1.50', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '9:00 AM', 'google_email': 'social@districthardware.com', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 
'thur_open_time': '9:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:00 PM', 'latitude': '38.9038748979592', 'activeshiptostore': '1', 'bho': '[[\"1100\",\"1700\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1900\"],[\"0900\",\"1800\"],[\"0900\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/districthardwareandbike', 'longitude': '-77.0514083673469', 'ja': None, 'country': 'US', 'postalcode': '20037-1432', 'clientkey': '2VMQMTVK-8O6O-8EUD-BA78-S3N6DFM1QESS', 'city': 'Washington', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.districthardware.com', 'sat_open_time': '9:00 AM', 'twitterurl': 'http://www.twitter.com/dchardwarebike', 'wed_open_time': '9:00 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': '1', 'name': 'District Hardware and Bike', 'address1': '1108 24th Street NW', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '5:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(202) 636-1701', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 4:00 PM', 'google_shared': 'Yes', 'uid': 1182202005, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 4:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': None, 'sun_close_time': None, 'tue_open_time': '5:00 AM', 'tvpaint': None, '_distance': '2.79', 'wed_close_time': '- 4:00 PM', 'mon_open_time': '5:00 AM', 'google_email': 'tom@websolutions.net', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 4:00 PM', 'thur_open_time': '5:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 4:00 PM', 'latitude': '38.91536', 'activeshiptostore': None, 'bho': 
'[[\"9999\",\"9999\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0500\",\"1600\"],[\"0700\",\"1100\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.98028', 'ja': None, 'country': 'US', 'postalcode': '20002-1834', 'clientkey': 'WOM3OP7L-SBW9-JV36-8DN0-O7U209J7LRQL', 'city': 'WASHINGTON', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/kamcobldgsply', 'sat_open_time': '7:00 AM', 'twitterurl': None, 'wed_open_time': '5:00 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'KAMCO BUILDING SUPPLY', 'address1': '2100 W VIRGINIA AVENUE NE', 'sat_close_time': '- 11:00 AM', 'jaurl': None, 'state': 'DC'}, {'fri_open_time': '8:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 524-2503', 'email': 'billstruevalue@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': 'YES', 'uid': 1051078648, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.billstruevalue.com', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '8:30 AM', 'tvpaint': '1', '_distance': '5.34', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:30 AM', 'google_email': 'billstruevalue@truevalue.net', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:00 PM', 'latitude': '38.89758', 'activeshiptostore': '1', 'bho': '[[\"1000\",\"1700\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1900\"],[\"0830\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/billstruevalue', 'longitude': '-77.12483', 'ja': None, 'country': 'US', 'postalcode': '22207-2528', 'clientkey': 
'5JGZQATK-3G0D-KXC4-C6E3-TBAS768FPE51', 'city': 'Arlington', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.billstruevalue.com', 'sat_open_time': '8:30 AM', 'twitterurl': None, 'wed_open_time': '8:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Bills True Value', 'address1': '2213 N. Buchanan Street', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '7:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 229-3700', 'email': 'CGEH@TrueValue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': 'Yes', 'uid': 1051076988, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '9:00 AM', 'tvurl': 'http://www.truevalue.com/christophershardware', 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': None, '_distance': '7.98', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '7:00 AM', 'google_email': 'mikec@truevalue.net', 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 8:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=HZUK3INI-P5FB-M95A-U15P-BCWH5QWTK30V&forceview=y&Adref=xx', 'tue_close_time': '- 8:00 PM', 'latitude': '38.96912', 'activeshiptostore': None, 'bho': '[[\"0900\",\"1700\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0830\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/home.php#!/GlenEchoHardware', 'longitude': '-77.14004', 'ja': None, 'country': 'US', 'postalcode': '20816', 'clientkey': 'HZUK3INI-P5FB-M95A-U15P-BCWH5QWTK30V', 'city': 'Bethesda', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 
'http://www.truevalue.com/christophershardware', 'sat_open_time': '8:30 AM', 'twitterurl': 'http://www.twitter.com/GlenEchoHdwe', 'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'Christophers Glen Echo Hardware', 'address1': '7301 Mcarthur Blvd', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '6:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 823-8700', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': None, 'uid': 1051078649, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': None, 'sun_close_time': None, 'tue_open_time': '6:00 AM', 'tvpaint': None, '_distance': '8.98', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '6:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 5:00 PM', 'thur_open_time': '6:00 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 5:00 PM', 'latitude': '38.7984625564754', 'activeshiptostore': None, 'bho': '[[\"9999\",\"9999\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0600\",\"1700\"],[\"0730\",\"1100\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.1362345995378', 'ja': None, 'country': 'US', 'postalcode': '22304-4822', 'clientkey': 'ACULYL3K-A7QA-N6TD-3KO2-763059ZYGSEK', 'city': 'ALEXANDRIA', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/kamcobldg', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '6:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'KAMCO BLDG SPLY & TRUE VALUE', 'address1': '5860 FARINGTON AVE', 'sat_close_time': '- 11:00 AM', 'jaurl': None, 'state': 
'VA'}, {'fri_open_time': '9:30 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(703) 765-4110', 'email': 'hhvs@vacoxmail.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': None, 'uid': 1051078650, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '11:00 AM', 'tvurl': None, 'sun_close_time': '- 6:00 PM', 'tue_open_time': '9:30 AM', 'tvpaint': None, '_distance': '10.64', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '9:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 8:00 PM', 'thur_open_time': '9:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 8:00 PM', 'latitude': '38.7436373198134', 'activeshiptostore': None, 'bho': '[[\"1100\",\"1800\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"],[\"0930\",\"2000\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.0570290532158', 'ja': None, 'country': 'US', 'postalcode': '22308-1203', 'clientkey': 'GKNHJ7J6-3FNY-HHZS-DWCS-HYK8QW3I0JI7', 'city': 'Alexandria', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/hollinhallvarietystore', 'sat_open_time': '9:30 AM', 'twitterurl': None, 'wed_open_time': '9:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Hollin Hall Variety Store', 'address1': '7902 Fort Hunt Rd', 'sat_close_time': '- 8:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 292-1900', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': None, 'uid': 1051076986, 'address2': None, '_distanceuom': 'mile', 'province': None, 
'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.truevalue.com/ford', 'sun_close_time': None, 'tue_open_time': '7:30 AM', 'tvpaint': '1', '_distance': '11.58', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 5:00 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=VB7GDJK3-IDS0-XKHT-PNGL-VK4UG394HK6E&forceview=y&Adref=xx', 'tue_close_time': '- 5:00 PM', 'latitude': '38.730159346246', 'activeshiptostore': None, 'bho': '[[\"9999\",\"9999\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1700\"],[\"0730\",\"1600\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.9922404673855', 'ja': None, 'country': 'US', 'postalcode': '20744-5148', 'clientkey': 'VB7GDJK3-IDS0-XKHT-PNGL-VK4UG394HK6E', 'city': 'FORT WASHINGTON', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/ford', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'FORD LUMBER COMPANY', 'address1': '11616 LIVINGSTON RD', 'sat_close_time': '- 4:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(301) 570-1300', 'email': 'Christophers@ChristophersHW.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:30 PM', 'google_shared': 'Yes', 'uid': 1051076989, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': '1', 'sun_open_time': '9:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '17.50', 'wed_close_time': '- 7:30 PM', 
'mon_open_time': '7:30 AM', 'google_email': 'christophers@christophershw.com', 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:30 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:30 PM', 'latitude': '39.14891', 'activeshiptostore': None, 'bho': '[[\"0900\",\"1700\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0730\",\"1930\"],[\"0800\",\"1800\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.02204', 'ja': None, 'country': 'US', 'postalcode': '20860', 'clientkey': 'YOL39PAC-QNYN-CVFS-QHDO-1I4DOMOT4PCB', 'city': 'Sandy Spring', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.christophershardwarestore.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': '1', 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'Christophers Hardware', 'address1': '500 Olney Sandy Spring Rd', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '8:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(703) 361-3141', 'email': 'jerice@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': 1051076977, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.jericeco.com/', 'sun_close_time': None, 'tue_open_time': '8:00 AM', 'tvpaint': None, '_distance': '25.53', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=7R0OKS9L-1M3V-03T6-T10M-F121Y326BYIU&forceview=y&Adref=xx', 'tue_close_time': '- 
7:00 PM', 'latitude': '38.7579993877551', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/home.php?#!/pages/Manassas-VA/JE-Rice-Co/209185793694?ref=ts&__a=9&ajaxpipe=1', 'longitude': '-77.4656865306122', 'ja': None, 'country': 'US', 'postalcode': '20110', 'clientkey': '7R0OKS9L-1M3V-03T6-T10M-F121Y326BYIU', 'city': 'Manassas', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.jericeco.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'J E Rice Co.', 'address1': '9124 Mathis Ave', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'VA'}, {'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(443) 607-4162', 'email': 'Chesapeakehardware@truevalue.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': -2013484794, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '8:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '26.83', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '8:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 7:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=Q1ZGD4YP-NXGT-H3QM-2RUD-OFD11NRUZYNC&forceview=y&Adref=xx', 'tue_close_time': '- 7:00 PM', 'latitude': '38.8150991440624', 'activeshiptostore': '1', 'bho': 
'[[\"0800\",\"1700\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"],[\"0800\",\"1900\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.5376268933135', 'ja': None, 'country': 'US', 'postalcode': '20733-9639', 'clientkey': 'Q1ZGD4YP-NXGT-H3QM-2RUD-OFD11NRUZYNC', 'city': 'CHURCHTON', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.truevalue.com/chesapeakehardware', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'CHESAPEAKE HARDWARE', 'address1': '5570-C SHADY SIDE ROAD', 'sat_close_time': '- 7:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(410) 647-4611', 'email': 'tony@clementhardware.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 7:00 PM', 'google_shared': None, 'uid': -1478017856, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:00 PM', 'yelp': None, 'sun_open_time': '9:00 AM', 'tvurl': None, 'sun_close_time': '- 5:00 PM', 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '28.61', 'wed_close_time': '- 7:00 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 7:00 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 7:00 PM', 'latitude': '39.08077', 'activeshiptostore': '1', 'bho': '[[\"0900\",\"1700\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0730\",\"1900\"],[\"0700\",\"1800\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/pages/Clement-Hardware/218395654855700', 'longitude': '-76.54892', 'ja': None, 'country': 'US', 'postalcode': '21146-2954', 'clientkey': 
'HCFP6UZL-75MR-TKAO-RP2Z-NMHJAEH1PCNM', 'city': 'SEVERNA PARK', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://clementhardware.com/', 'sat_open_time': '7:00 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'CLEMENT HARDWARE', 'address1': '500 RITCHIE HIGHWAY', 'sat_close_time': '- 6:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:00 AM', 'giftcard': None, 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 253-2131', 'email': None, 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 6:00 PM', 'google_shared': None, 'uid': 1051078622, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 6:00 PM', 'yelp': None, 'sun_open_time': '10:00 AM', 'tvurl': 'http://www.hyattbuildingsupply.com/', 'sun_close_time': '- 2:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': '1', '_distance': '28.77', 'wed_close_time': '- 6:00 PM', 'mon_open_time': '7:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 6:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=N7IVBTZQ-HZ6O-KGV2-ZUEX-QAHBZXTIRS3E&forceview=y&Adref=xx', 'tue_close_time': '- 6:00 PM', 'latitude': '39.2873782648885', 'activeshiptostore': None, 'bho': '[[\"1000\",\"1400\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0700\",\"1800\"],[\"0800\",\"1600\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-77.2075308119523', 'ja': None, 'country': 'US', 'postalcode': '20872-1830', 'clientkey': 'N7IVBTZQ-HZ6O-KGV2-ZUEX-QAHBZXTIRS3E', 'city': 'DAMASCUS', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.hyattbuildingsupply.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 
'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': None, 'fax': None, 'foursquare': None, 'name': 'HYATT TRUE VALUE', 'address1': '26200 RIDGE RD', 'sat_close_time': '- 4:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(410) 268-3939', 'email': 'jared@kbtruevalue.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 8:00 PM', 'google_shared': None, 'uid': 1051078638, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 8:00 PM', 'yelp': None, 'sun_open_time': '8:00 AM', 'tvurl': 'http://www.kbtruevalue.com/', 'sun_close_time': '- 6:00 PM', 'tue_open_time': '7:00 AM', 'tvpaint': '1', '_distance': '28.88', 'wed_close_time': '- 8:00 PM', 'mon_open_time': '7:00 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 8:00 PM', 'thur_open_time': '7:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=CH10ZM2D-T28I-979R-X1UQ-NYSIN4L4XT4P&forceview=y&Adref=xx', 'tue_close_time': '- 8:00 PM', 'latitude': '38.9502181524996', 'activeshiptostore': '1', 'bho': '[[\"0800\",\"1800\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"2000\"],[\"0700\",\"1900\"]]', 'trurl': None, 'facebookurl': 'http://www.facebook.com/kbtruevalue', 'longitude': '-76.4927551456658', 'ja': None, 'country': 'US', 'postalcode': '21403-1756', 'clientkey': 'CH10ZM2D-T28I-979R-X1UQ-NYSIN4L4XT4P', 'city': 'Annapolis', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.kbtruevalue.com', 'sat_open_time': '7:00 AM', 'twitterurl': 'http://twitter.com/kbtruevalue', 'wed_open_time': '7:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'K & B True Value', 'address1': '912 Forest Dr', 
'sat_close_time': '- 7:00 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '7:30 AM', 'giftcard': None, 'icon': 'default', 'tvadv': None, 'cs': None, 'csurl': None, 'phone': '(410) 535-0442', 'email': 'lusbymot@verizon.net', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:30 PM', 'google_shared': None, 'uid': 1051076983, 'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 7:30 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.lusbyhardware.com/', 'sun_close_time': None, 'tue_open_time': '7:30 AM', 'tvpaint': None, '_distance': '34.26', 'wed_close_time': '- 5:30 PM', 'mon_open_time': '7:30 AM', 'google_email': None, 'ds': None, 'hgurl': None, 'localad': None, 'mon_close_time': '- 5:30 PM', 'thur_open_time': '7:30 AM', 'facebook': None, 'tvadvurl': None, 'tue_close_time': '- 5:30 PM', 'latitude': '38.539349691187', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1730\"],[\"0730\",\"1930\"],[\"0730\",\"1730\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.5835104820298', 'ja': None, 'country': 'US', 'postalcode': '20678', 'clientkey': '9MZ7X579-CK0M-3IMU-TT63-OPZF55IIPXFZ', 'city': 'Prince Frederick', 'tvr': None, 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.Lusbyhardware.com', 'sat_open_time': '7:30 AM', 'twitterurl': None, 'wed_open_time': '7:30 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'Lusby Motor Company Inc.', 'address1': '155 Main St', 'sat_close_time': '- 5:30 PM', 'jaurl': None, 'state': 'MD'}, {'fri_open_time': '8:00 AM', 'giftcard': '1', 'icon': 'default', 'tvadv': '1', 'cs': None, 'csurl': None, 'phone': '(301) 259-2540', 'email': 'hungerford@olg.com', 'google_notes': None, 'tv': '1', 'corronado': None, 'thur_close_time': '- 5:00 PM', 'google_shared': 'YES', 'uid': 1051076982, 
'address2': None, '_distanceuom': 'mile', 'province': None, 'fri_close_time': '- 5:00 PM', 'yelp': None, 'sun_open_time': 'closed', 'tvurl': 'http://www.aghungerford.com/', 'sun_close_time': None, 'tue_open_time': '8:00 AM', 'tvpaint': '1', '_distance': '35.93', 'wed_close_time': '- 5:00 PM', 'mon_open_time': '8:00 AM', 'google_email': 'hungerford@olg.com', 'ds': None, 'hgurl': None, 'localad': '1', 'mon_close_time': '- 5:00 PM', 'thur_open_time': '8:00 AM', 'facebook': None, 'tvadvurl': 'http://truevalue.shoplocal.com/truevalue/new_user_entry.aspx?StoreRef=N5VPHAFC-MVJV-T31M-RRXY-8PKZ81EMOPHM&forceview=y&Adref=xx', 'tue_close_time': '- 5:00 PM', 'latitude': '38.3786245621919', 'activeshiptostore': '1', 'bho': '[[\"9999\",\"9999\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"],[\"0800\",\"1700\"]]', 'trurl': None, 'facebookurl': None, 'longitude': '-76.95422029337', 'ja': None, 'country': 'US', 'postalcode': '20664', 'clientkey': 'N5VPHAFC-MVJV-T31M-RRXY-8PKZ81EMOPHM', 'city': 'Newburg', 'tvr': '1', 'dsurl': None, 'grurl': None, 'taylorrental': None, 'gr': None, 'url': 'http://www.aghungerford.com', 'sat_open_time': '8:00 AM', 'twitterurl': None, 'wed_open_time': '8:00 AM', 'google': None, 'main_id': 'TV', 'hg': None, 'creditcard': '1', 'fax': None, 'foursquare': None, 'name': 'A G Hungerford & Son Inc.', 'address1': '12165 Rock Point Rd', 'sat_close_time': '- 5:00 PM', 'jaurl': None, 'state': 'MD'}], 'collectionname': 'poi'}, 'code': 1}\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "\"\"\"\n", 202 | "Case Study 2: getting API_KEY from html and then data from API\n", 203 | "\n", 204 | "The True Value Company is an American retailer-owned hardware cooperative with over 4,000 independent retail \n", 205 | "locations worldwide. Create scraper which gets all available True Value shops given post code. 
Scraper\n", 206 | "should not have any API key hardcoded, as it can change during site lifetime.\n", 207 | "\n", 208 | "Minimum data you should get:\n", 209 | "- address\n", 210 | "- city\n", 211 | "- country\n", 212 | "- latitude\n", 213 | "- longitude\n", 214 | "- name\n", 215 | "- postalcode\n", 216 | "- state\n", 217 | "\"\"\"\n", 218 | "main_page_url = 'http://hosted.where2getit.com/truevalue/index2015.html'\n", 219 | "main_page_text = requests.get(main_page_url).text\n", 220 | "api_key = re.findall(r\"appkey: '([0-9A-Z\\-]+)', \", main_page_text)[0]\n", 221 | "print('Got API KEY from main page: ', api_key)\n", 222 | "api_endpoint = 'http://hosted.where2getit.com/truevalue/rest/locatorsearch'\n", 223 | "POST_CODE = 20004\n", 224 | "body = {\n", 225 | " \"request\": {\n", 226 | " \"appkey\": api_key,\n", 227 | " \"formdata\": {\n", 228 | " \"geoip\": False,\n", 229 | " \"dataview\": \"store_default\",\n", 230 | " \"limit\": 40,\n", 231 | " \"geolocs\": {\n", 232 | " \"geoloc\": [\n", 233 | " {\n", 234 | " \"addressline\": str(POST_CODE)\n", 235 | " }\n", 236 | " ]\n", 237 | " },\n", 238 | " \"searchradius\": \"40|50|80\",\n", 239 | " \"where\": {\n", 240 | " \"and\": {\n", 241 | " \"giftcard\": {\n", 242 | " \"eq\": \"\"\n", 243 | " },\n", 244 | " \"tvpaint\": {\n", 245 | " \"eq\": \"\"\n", 246 | " },\n", 247 | " \"creditcard\": {\n", 248 | " \"eq\": \"\"\n", 249 | " },\n", 250 | " \"localad\": {\n", 251 | " \"eq\": \"\"\n", 252 | " },\n", 253 | " \"ja\": {\n", 254 | " \"eq\": \"\"\n", 255 | " },\n", 256 | " \"tvr\": {\n", 257 | " \"eq\": \"\"\n", 258 | " },\n", 259 | " \"activeshiptostore\": {\n", 260 | " \"eq\": \"\"\n", 261 | " },\n", 262 | " \"main_id\": {\n", 263 | " \"eq\": \"\"\n", 264 | " },\n", 265 | " \"corronado\": {\n", 266 | " \"eq\": \"\"\n", 267 | " },\n", 268 | " \"tv\": {\n", 269 | " \"eq\": \"1\"\n", 270 | " }\n", 271 | " }\n", 272 | " },\n", 273 | " \"false\": \"0\"\n", 274 | " }\n", 275 | " }\n", 276 | "}\n", 277 | "r = 
requests.post(api_endpoint, data=json.dumps(body))\n", 278 | "data = json.loads(r.text)\n", 279 | "print('Raw response from API: ', data)\n", 280 | "shops = [{'name':entry['name'],\n", 281 | " 'address':entry['address1'],\n", 282 | " 'postalcode':entry['postalcode'],\n", 283 | " 'city':entry['city'],\n", 284 | " 'state':entry['state'],\n", 285 | " 'country':entry['country'],\n", 286 | " 'latitude':entry['latitude'],\n", 287 | " 'longitude':entry['longitude']\n", 288 | " } for entry in data['response']['collection']]\n" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 17, 294 | "metadata": { 295 | "collapsed": false 296 | }, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/html": [ 301 | "\n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 
399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | "
| | address | city | country | latitude | longitude | name | postalcode | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 1623 17th St NW | Washington | US | 38.911917 | -77.03849 | True Value On 17th | 20009-2433 | DC |
| 1 | 1108 24th Street NW | Washington | US | 38.9038748979592 | -77.0514083673469 | District Hardware and Bike | 20037-1432 | DC |
| 2 | 2100 W VIRGINIA AVENUE NE | WASHINGTON | US | 38.91536 | -76.98028 | KAMCO BUILDING SUPPLY | 20002-1834 | DC |
| 3 | 2213 N. Buchanan Street | Arlington | US | 38.89758 | -77.12483 | Bills True Value | 22207-2528 | VA |
| 4 | 7301 Mcarthur Blvd | Bethesda | US | 38.96912 | -77.14004 | Christophers Glen Echo Hardware | 20816 | MD |
| 5 | 5860 FARINGTON AVE | ALEXANDRIA | US | 38.7984625564754 | -77.1362345995378 | KAMCO BLDG SPLY & TRUE VALUE | 22304-4822 | VA |
| 6 | 7902 Fort Hunt Rd | Alexandria | US | 38.7436373198134 | -77.0570290532158 | Hollin Hall Variety Store | 22308-1203 | VA |
| 7 | 11616 LIVINGSTON RD | FORT WASHINGTON | US | 38.730159346246 | -76.9922404673855 | FORD LUMBER COMPANY | 20744-5148 | MD |
| 8 | 500 Olney Sandy Spring Rd | Sandy Spring | US | 39.14891 | -77.02204 | Christophers Hardware | 20860 | MD |
| 9 | 9124 Mathis Ave | Manassas | US | 38.7579993877551 | -77.4656865306122 | J E Rice Co. | 20110 | VA |
| 10 | 5570-C SHADY SIDE ROAD | CHURCHTON | US | 38.8150991440624 | -76.5376268933135 | CHESAPEAKE HARDWARE | 20733-9639 | MD |
| 11 | 500 RITCHIE HIGHWAY | SEVERNA PARK | US | 39.08077 | -76.54892 | CLEMENT HARDWARE | 21146-2954 | MD |
| 12 | 26200 RIDGE RD | DAMASCUS | US | 39.2873782648885 | -77.2075308119523 | HYATT TRUE VALUE | 20872-1830 | MD |
| 13 | 912 Forest Dr | Annapolis | US | 38.9502181524996 | -76.4927551456658 | K & B True Value | 21403-1756 | MD |
| 14 | 155 Main St | Prince Frederick | US | 38.539349691187 | -76.5835104820298 | Lusby Motor Company Inc. | 20678 | MD |
| 15 | 12165 Rock Point Rd | Newburg | US | 38.3786245621919 | -76.95422029337 | A G Hungerford & Son Inc. | 20664 | MD |
" 494 | ], 495 | "text/plain": [ 496 | "" 497 | ] 498 | }, 499 | "execution_count": 17, 500 | "metadata": {}, 501 | "output_type": "execute_result" 502 | } 503 | ], 504 | "source": [ 505 | "HTML(pd.DataFrame(shops).to_html())" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 18, 511 | "metadata": { 512 | "collapsed": false 513 | }, 514 | "outputs": [ 515 | { 516 | "name": "stdout", 517 | "output_type": "stream", 518 | "text": [ 519 | "Got listing from: https://www.airbnb.co.uk/api/v2/explore_tabs?metadata_only=false&items_per_grid=20&version=1.2.8&luxury_pre_launch=false&_intents=p1&screen_size=small&locale=en-GB&timezone_offset=60&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&is_new_cards_experiment=false&is_standard_search=true&_format=for_explore_search_web&fetch_filters=true&currency=GBP&supports_for_you_v3=true&location=London&is_guided_search=true&s_tag=DOIPutuT&selected_tab_id=home_tab&section_offset=0&auto_ib=false&allow_override%5B%5D=&refinements%5B%5D=homes\n" 520 | ] 521 | }, 522 | { 523 | "data": { 524 | "text/plain": [ 525 | "'\\nnote:\\nsome useful Airbnb endpoints\\n# get listings\\nhttps://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB\\n# get booking 
details\\nhttps://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB\\n'" 526 | ] 527 | }, 528 | "execution_count": 18, 529 | "metadata": {}, 530 | "output_type": "execute_result" 531 | } 532 | ], 533 | "source": [ 534 | "\"\"\"\n", 535 | "Case Study 3: get available Airbnb properties in London\n", 536 | "\n", 537 | "You want to visit London and Airbnb looks like a nice option for you. As you are a crazy data geek who wants to run\n", 538 | "some fancy algorithms to make a better choice of apartment to rent, you need data! Get all available Airbnb \n", 539 | "properties in London. You are interested in pricing, location, rating, no. of reviews, images and more. \n", 540 | "You also don't like to repeat yourself, so you need to build a scraper which will survive till your next trip.\n", 541 | "\"\"\"\n", 542 | "\n", 543 | "headers = {'accept-encoding': 'gzip, deflate, br',\n", 544 | " 'x-requested-with': 'XMLHttpRequest',\n", 545 | " 'accept-language': 'en-US,en;q=0.8,pl;q=0.6',\n", 546 | " 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',\n", 547 | " 'accept': 'application/json, text/javascript, */*; q=0.01',\n", 548 | " 'referer': 'https://www.airbnb.co.uk/s/London/homes',\n", 549 | " 'authority': 'www.airbnb.co.uk'}\n", 550 | "# get the API key embedded in the html (yes... 
same story again :))\n", 551 | "html = requests.get('https://www.airbnb.co.uk/s/London/homes', headers=headers).text\n", 552 | "api_key = re.findall(r'key\\":\\"([a-zA-Z0-9]*)\\"},\\"deep_link', html)[0]\n", 553 | "# get first listing\n", 554 | "endpoint='https://www.airbnb.co.uk/api/v2/explore_tabs'\n", 555 | "params = {'version':'1.2.8',\n", 556 | " '_format':'for_explore_search_web',\n", 557 | " 'items_per_grid':'20',\n", 558 | " 'fetch_filters':'true',\n", 559 | " 'is_guided_search':'true',\n", 560 | " 'is_new_cards_experiment':'false',\n", 561 | " 'supports_for_you_v3':'true',\n", 562 | " 'screen_size':'small',\n", 563 | " 'timezone_offset':'60',\n", 564 | " 'auto_ib':'false',\n", 565 | " 'luxury_pre_launch':'false',\n", 566 | " 'metadata_only':'false',\n", 567 | " 'is_standard_search':'true',\n", 568 | " 'refinements[]':'homes',\n", 569 | " 'selected_tab_id':'home_tab',\n", 570 | " 'location':'London',\n", 571 | " 'allow_override[]':'',\n", 572 | " 's_tag':'DOIPutuT',\n", 573 | " 'section_offset':'0',\n", 574 | " '_intents':'p1',\n", 575 | " 'key':api_key,\n", 576 | " 'currency':'GBP',\n", 577 | " 'locale':'en-GB'}\n", 578 | "r = requests.get(endpoint, params=params)\n", 579 | "print('Got listing from: ', r.url)\n", 580 | "ds = json.loads(r.text)\n", 581 | "\n", 582 | "\"\"\"\n", 583 | "note:\n", 584 | "some useful Airbnb endpoints\n", 585 | "# get listings\n", 586 | 
"https://www.airbnb.co.uk/api/v2/explore_tabs?version=1.2.8&_format=for_explore_search_web&items_per_grid=18&experiences_per_grid=20&guidebooks_per_grid=20&fetch_filters=true&is_guided_search=true&is_new_cards_experiment=false&supports_for_you_v3=true&screen_size=small&timezone_offset=60&auto_ib=false&luxury_pre_launch=false&metadata_only=false&is_standard_search=true&tab_id=home_tab&location=London&allow_override%5B%5D=&ne_lat=51.599363500119274&ne_lng=-0.06168207198925302&sw_lat=51.47626857868991&sw_lng=-0.289648380583003&zoom=12&search_by_map=true&federated_search_session_id=6d72b1e2-cb68-4877-b27c-8614e11fc5b0&_intents=p1&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=&locale=en-GB\n", 587 | "# get booking details\n", 588 | "https://www.airbnb.co.uk/api/v2/pdp_listing_booking_details?guests=1&listing_id=13575756&_format=for_web_dateless&_interaction_type=pageload&_intents=p3_book_it&_parent_request_uuid=aed9f0ce-1534-4a89-9cf6-c6813adcb95b&_p3_impression_id=p3_1506465875_Q2VDMsV0pLs27%2BtX&show_smart_promotion=0&force_boost_unc_priority_message_type=&number_of_adults=1&number_of_children=0&number_of_infants=0&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=GBP&locale=en-GB\n", 589 | "\"\"\"" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 19, 595 | "metadata": { 596 | "collapsed": false 597 | }, 598 | "outputs": [ 599 | { 600 | "name": "stdout", 601 | "output_type": "stream", 602 | "text": [ 603 | "will get page: 2\n", 604 | "No. of properties: 20\n", 605 | "will get page: 3\n", 606 | "No. of properties: 40\n", 607 | "will get page: 4\n", 608 | "No. of properties: 60\n", 609 | "will get page: 5\n", 610 | "No. of properties: 80\n", 611 | "will get page: 6\n", 612 | "No. 
of properties: 100\n" 613 | ] 614 | } 615 | ], 616 | "source": [ 617 | "# paginate and get more properties\n", 618 | "props = []\n", 619 | "page_no = 1\n", 620 | "page_limit = 3\n", 621 | "while (ds['explore_tabs'][0]['pagination_metadata']['has_next_page'] == True) and (page_no\n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " currency\n", 658 | " latitude\n", 659 | " longitude\n", 660 | " name\n", 661 | " person_capacity\n", 662 | " pic\n", 663 | " price\n", 664 | " price_type\n", 665 | " rating\n", 666 | " room_type\n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " 0\n", 672 | " GBP\n", 673 | " 51.516746\n", 674 | " -0.050351\n", 675 | " (HAR-A)PRIVATE ROOM FOR 5PPL CLOSE TO TOWER BR...\n", 676 | " 5\n", 677 | " https://a0.muscache.com/im/pictures/786aa625-2...\n", 678 | " 25\n", 679 | " nightly\n", 680 | " 5.0\n", 681 | " Private room\n", 682 | " \n", 683 | " \n", 684 | " 1\n", 685 | " GBP\n", 686 | " 51.524362\n", 687 | " -0.116995\n", 688 | " Double Room nr Soho | Russell Square |Kings Cross\n", 689 | " 4\n", 690 | " https://a0.muscache.com/im/pictures/affe8de1-a...\n", 691 | " 62\n", 692 | " nightly\n", 693 | " 4.5\n", 694 | " Private room\n", 695 | " \n", 696 | " \n", 697 | " 2\n", 698 | " GBP\n", 699 | " 51.486756\n", 700 | " -0.104479\n", 701 | " 1 double room in Central London\n", 702 | " 2\n", 703 | " https://a0.muscache.com/im/pictures/6bfb24c2-c...\n", 704 | " 18\n", 705 | " nightly\n", 706 | " 4.5\n", 707 | " Private room\n", 708 | " \n", 709 | " \n", 710 | " 3\n", 711 | " GBP\n", 712 | " 51.511007\n", 713 | " -0.226281\n", 714 | " THE QUEENS HOSTEL, 6 BED MIXED DORM E\n", 715 | " 6\n", 716 | " https://a0.muscache.com/im/pictures/328d9beb-6...\n", 717 | " 21\n", 718 | " nightly\n", 719 | " 4.5\n", 720 | " Private room\n", 721 | " \n", 722 | " \n", 723 | " 4\n", 724 | " GBP\n", 725 | " 51.564283\n", 726 | " -0.120809\n", 727 | " Spacious Double room in Holloway, London\n", 728 | " 2\n", 729 | " 
https://a0.muscache.com/im/pictures/ae707340-f...\n", 730 | " 19\n", 731 | " nightly\n", 732 | " 4.5\n", 733 | " Private room\n", 734 | " \n", 735 | " \n", 736 | " 5\n", 737 | " GBP\n", 738 | " 51.548527\n", 739 | " -0.226324\n", 740 | " Modern room 10 min from Central London\n", 741 | " 2\n", 742 | " https://a0.muscache.com/im/pictures/a0e3a2af-8...\n", 743 | " 40\n", 744 | " nightly\n", 745 | " 5.0\n", 746 | " Private room\n", 747 | " \n", 748 | " \n", 749 | " 6\n", 750 | " GBP\n", 751 | " 51.491728\n", 752 | " -0.014746\n", 753 | " Double Room in Canary Wharf hs\n", 754 | " 2\n", 755 | " https://a0.muscache.com/im/pictures/ebd8e53d-0...\n", 756 | " 25\n", 757 | " nightly\n", 758 | " 4.5\n", 759 | " Private room\n", 760 | " \n", 761 | " \n", 762 | " 7\n", 763 | " GBP\n", 764 | " 51.452218\n", 765 | " -0.025420\n", 766 | " Comfortable, Clean London room\n", 767 | " 2\n", 768 | " https://a0.muscache.com/im/pictures/ae5469b5-9...\n", 769 | " 25\n", 770 | " nightly\n", 771 | " 5.0\n", 772 | " Private room\n", 773 | " \n", 774 | " \n", 775 | " 8\n", 776 | " GBP\n", 777 | " 51.501680\n", 778 | " -0.052571\n", 779 | " Double room, 2min from the station\n", 780 | " 2\n", 781 | " https://a0.muscache.com/im/pictures/4cfc5263-5...\n", 782 | " 57\n", 783 | " nightly\n", 784 | " 4.5\n", 785 | " Private room\n", 786 | " \n", 787 | " \n", 788 | " 9\n", 789 | " GBP\n", 790 | " 51.624089\n", 791 | " -0.054654\n", 792 | " En-Suite Bedroom with Bathroom\n", 793 | " 3\n", 794 | " https://a0.muscache.com/im/pictures/97921358/8...\n", 795 | " 19\n", 796 | " nightly\n", 797 | " 5.0\n", 798 | " Private room\n", 799 | " \n", 800 | " \n", 801 | " 10\n", 802 | " GBP\n", 803 | " 51.483282\n", 804 | " -0.132566\n", 805 | " GORGEOUS RIVERSIDE LUXURY, WITH SPA AND POOL\n", 806 | " 7\n", 807 | " https://a0.muscache.com/im/pictures/088b6118-2...\n", 808 | " 154\n", 809 | " nightly\n", 810 | " 4.5\n", 811 | " Entire home/flat\n", 812 | " \n", 813 | " \n", 814 | " 11\n", 815 | " GBP\n", 816 | " 
51.554006\n", 817 | " -0.242773\n", 818 | " Nice & Equipped Double Bedroom in Dollis Hill!...\n", 819 | " 2\n", 820 | " https://a0.muscache.com/im/pictures/88df04b4-b...\n", 821 | " 31\n", 822 | " nightly\n", 823 | " 4.5\n", 824 | " Private room\n", 825 | " \n", 826 | " \n", 827 | " 12\n", 828 | " GBP\n", 829 | " 51.497491\n", 830 | " -0.060441\n", 831 | " 3. Lovely Room in Centre London+Wifi\n", 832 | " 1\n", 833 | " https://a0.muscache.com/im/pictures/d349ce62-e...\n", 834 | " 36\n", 835 | " nightly\n", 836 | " 4.5\n", 837 | " Private room\n", 838 | " \n", 839 | " \n", 840 | " 13\n", 841 | " GBP\n", 842 | " 51.512330\n", 843 | " -0.066251\n", 844 | " Small Room 10 -near Tower of London and Shored...\n", 845 | " 2\n", 846 | " https://a0.muscache.com/im/pictures/4f4e9ad7-d...\n", 847 | " 40\n", 848 | " nightly\n", 849 | " 4.5\n", 850 | " Private room\n", 851 | " \n", 852 | " \n", 853 | " 14\n", 854 | " GBP\n", 855 | " 51.476615\n", 856 | " -0.132970\n", 857 | " SMALL Stockwell station single room-£17\n", 858 | " 1\n", 859 | " https://a0.muscache.com/im/pictures/35f34bb4-1...\n", 860 | " 18\n", 861 | " nightly\n", 862 | " 4.5\n", 863 | " Private room\n", 864 | " \n", 865 | " \n", 866 | " 15\n", 867 | " GBP\n", 868 | " 51.489384\n", 869 | " -0.099203\n", 870 | " Double Bedroom Apartment with Balcony in Zone 1\n", 871 | " 2\n", 872 | " https://a0.muscache.com/im/pictures/54855d11-1...\n", 873 | " 21\n", 874 | " nightly\n", 875 | " 4.5\n", 876 | " Private room\n", 877 | " \n", 878 | " \n", 879 | " 16\n", 880 | " GBP\n", 881 | " 51.525883\n", 882 | " -0.100857\n", 883 | " Bright and welcoming flat in central London\n", 884 | " 4\n", 885 | " https://a0.muscache.com/im/pictures/86509cc9-0...\n", 886 | " 91\n", 887 | " nightly\n", 888 | " 4.5\n", 889 | " Entire home/flat\n", 890 | " \n", 891 | " \n", 892 | " 17\n", 893 | " GBP\n", 894 | " 51.517944\n", 895 | " -0.069536\n", 896 | " CS11 Nice Single Room Central London\n", 897 | " 1\n", 898 | " 
https://a0.muscache.com/im/pictures/0e601324-0...\n", 899 | " 33\n", 900 | " nightly\n", 901 | " 4.5\n", 902 | " Private room\n", 903 | " \n", 904 | " \n", 905 | " 18\n", 906 | " GBP\n", 907 | " 51.523661\n", 908 | " -0.172201\n", 909 | " Cosy One Bed 3rd Floor Close to Marble Arch\n", 910 | " 4\n", 911 | " https://a0.muscache.com/im/pictures/2666a91d-0...\n", 912 | " 62\n", 913 | " nightly\n", 914 | " 4.5\n", 915 | " Entire home/flat\n", 916 | " \n", 917 | " \n", 918 | " 19\n", 919 | " GBP\n", 920 | " 51.492525\n", 921 | " -0.096608\n", 922 | " Room Big Ben + Breakfast (R5a)\n", 923 | " 1\n", 924 | " https://a0.muscache.com/im/pictures/f589d5df-3...\n", 925 | " 19\n", 926 | " nightly\n", 927 | " 5.0\n", 928 | " Shared room\n", 929 | " \n", 930 | " \n", 931 | " 20\n", 932 | " GBP\n", 933 | " 51.491423\n", 934 | " -0.139555\n", 935 | " Brand new studio in Victoria 11A\n", 936 | " 2\n", 937 | " https://a0.muscache.com/im/pictures/11490113/4...\n", 938 | " 72\n", 939 | " nightly\n", 940 | " 4.5\n", 941 | " Entire home/flat\n", 942 | " \n", 943 | " \n", 944 | " 21\n", 945 | " GBP\n", 946 | " 51.510864\n", 947 | " -0.181942\n", 948 | " 6 Bed Mixed Dormitory Ensuite\n", 949 | " 6\n", 950 | " https://a0.muscache.com/im/pictures/4d8c2dec-0...\n", 951 | " 21\n", 952 | " nightly\n", 953 | " 4.0\n", 954 | " Shared room\n", 955 | " \n", 956 | " \n", 957 | " 22\n", 958 | " GBP\n", 959 | " 51.451279\n", 960 | " 0.015362\n", 961 | " Exceptional room zone 3 by station\n", 962 | " 2\n", 963 | " https://a0.muscache.com/im/pictures/4edf624d-0...\n", 964 | " 25\n", 965 | " nightly\n", 966 | " 4.5\n", 967 | " Private room\n", 968 | " \n", 969 | " \n", 970 | " 23\n", 971 | " GBP\n", 972 | " 51.510506\n", 973 | " -0.129266\n", 974 | " Trafalgar Square, Peaceful Room.\\nFemale frien...\n", 975 | " 2\n", 976 | " https://a0.muscache.com/im/pictures/e0265be9-4...\n", 977 | " 58\n", 978 | " nightly\n", 979 | " 5.0\n", 980 | " Private room\n", 981 | " \n", 982 | " \n", 983 | " 24\n", 984 | " 
GBP\n", 985 | " 51.555259\n", 986 | " -0.252571\n", 987 | " Charming Room, 2 Double beds, ensuite (BA-II)\n", 988 | " 4\n", 989 | " https://a0.muscache.com/im/pictures/ddf9b001-b...\n", 990 | " 33\n", 991 | " nightly\n", 992 | " 4.5\n", 993 | " Private room\n", 994 | " \n", 995 | " \n", 996 | " 25\n", 997 | " GBP\n", 998 | " 51.479652\n", 999 | " -0.169762\n", 1000 | " Luxurious double @ heart of london\n", 1001 | " 2\n", 1002 | " https://a0.muscache.com/im/pictures/1528064c-5...\n", 1003 | " 36\n", 1004 | " nightly\n", 1005 | " 5.0\n", 1006 | " Private room\n", 1007 | " \n", 1008 | " \n", 1009 | " 26\n", 1010 | " GBP\n", 1011 | " 51.580548\n", 1012 | " -0.233645\n", 1013 | " SUPER VALUE, PRETTY & SAFE, EASY ACCESS TO CENTRE\n", 1014 | " 1\n", 1015 | " https://a0.muscache.com/im/pictures/5a878555-6...\n", 1016 | " 24\n", 1017 | " nightly\n", 1018 | " 5.0\n", 1019 | " Private room\n", 1020 | " \n", 1021 | " \n", 1022 | " 27\n", 1023 | " GBP\n", 1024 | " 51.416261\n", 1025 | " -0.101107\n", 1026 | " The Snug - Completely independent studio\n", 1027 | " 3\n", 1028 | " https://a0.muscache.com/im/pictures/a7b7ac9c-6...\n", 1029 | " 43\n", 1030 | " nightly\n", 1031 | " 5.0\n", 1032 | " Entire home/flat\n", 1033 | " \n", 1034 | " \n", 1035 | " 28\n", 1036 | " GBP\n", 1037 | " 51.496717\n", 1038 | " -0.101200\n", 1039 | " Great central London flat\n", 1040 | " 6\n", 1041 | " https://a0.muscache.com/im/pictures/0c6c56c4-a...\n", 1042 | " 119\n", 1043 | " nightly\n", 1044 | " 4.5\n", 1045 | " Entire home/flat\n", 1046 | " \n", 1047 | " \n", 1048 | " 29\n", 1049 | " GBP\n", 1050 | " 51.470231\n", 1051 | " -0.090346\n", 1052 | " Cosy double in creative, social garden house\n", 1053 | " 2\n", 1054 | " https://a0.muscache.com/im/pictures/cb24dc8f-3...\n", 1055 | " 33\n", 1056 | " nightly\n", 1057 | " 4.5\n", 1058 | " Private room\n", 1059 | " \n", 1060 | " \n", 1061 | " 30\n", 1062 | " GBP\n", 1063 | " 51.530786\n", 1064 | " -0.056957\n", 1065 | " - Double Room Shoreditch 
\"Sparrow\", 2min to Train\n", 1066 | " 2\n", 1067 | " https://a0.muscache.com/im/pictures/5d305b15-f...\n", 1068 | " 25\n", 1069 | " nightly\n", 1070 | " 4.5\n", 1071 | " Private room\n", 1072 | " \n", 1073 | " \n", 1074 | " 31\n", 1075 | " GBP\n", 1076 | " 51.521552\n", 1077 | " -0.045261\n", 1078 | " Central bedroom, close to underground, perfect!\n", 1079 | " 3\n", 1080 | " https://a0.muscache.com/im/pictures/c66b07ea-e...\n", 1081 | " 36\n", 1082 | " nightly\n", 1083 | " 4.5\n", 1084 | " Private room\n", 1085 | " \n", 1086 | " \n", 1087 | " 32\n", 1088 | " GBP\n", 1089 | " 51.499814\n", 1090 | " -0.113586\n", 1091 | " **AMAZING CITY CENTRE APARTMENT**\n", 1092 | " 2\n", 1093 | " https://a0.muscache.com/im/pictures/79e6ba4f-0...\n", 1094 | " 71\n", 1095 | " nightly\n", 1096 | " 5.0\n", 1097 | " Private room\n", 1098 | " \n", 1099 | " \n", 1100 | " 33\n", 1101 | " GBP\n", 1102 | " 51.492054\n", 1103 | " -0.096903\n", 1104 | " Safestay London Elephant & Castle\n", 1105 | " 6\n", 1106 | " https://a0.muscache.com/im/pictures/d59e71e6-4...\n", 1107 | " 15\n", 1108 | " nightly\n", 1109 | " 4.5\n", 1110 | " Shared room\n", 1111 | " \n", 1112 | " \n", 1113 | " 34\n", 1114 | " GBP\n", 1115 | " 51.519988\n", 1116 | " -0.041878\n", 1117 | " (ION-B) Private room for 4 in London\n", 1118 | " 4\n", 1119 | " https://a0.muscache.com/im/pictures/58980ef0-1...\n", 1120 | " 25\n", 1121 | " nightly\n", 1122 | " 5.0\n", 1123 | " Private room\n", 1124 | " \n", 1125 | " \n", 1126 | " 35\n", 1127 | " GBP\n", 1128 | " 51.516530\n", 1129 | " -0.062436\n", 1130 | " (4BFORD-3)PRIVATE ROOM FOR 2 CLOSE TO TOWER BR...\n", 1131 | " 2\n", 1132 | " https://a0.muscache.com/im/pictures/cf9a8ff6-5...\n", 1133 | " 21\n", 1134 | " nightly\n", 1135 | " 5.0\n", 1136 | " Private room\n", 1137 | " \n", 1138 | " \n", 1139 | " 36\n", 1140 | " GBP\n", 1141 | " 51.489336\n", 1142 | " -0.230869\n", 1143 | " The Muse Haus II - Riverside Room\n", 1144 | " 3\n", 1145 | " 
https://a0.muscache.com/im/pictures/da06a628-0...\n", 1146 | " 49\n", 1147 | " nightly\n", 1148 | " 5.0\n", 1149 | " Private room\n", 1150 | " \n", 1151 | " \n", 1152 | " 37\n", 1153 | " GBP\n", 1154 | " 51.553402\n", 1155 | " -0.241072\n", 1156 | " Large Balcony Double Bedroom in Dollis Hill! BR5\n", 1157 | " 2\n", 1158 | " https://a0.muscache.com/im/pictures/64093166-7...\n", 1159 | " 31\n", 1160 | " nightly\n", 1161 | " 4.5\n", 1162 | " Private room\n", 1163 | " \n", 1164 | " \n", 1165 | " 38\n", 1166 | " GBP\n", 1167 | " 51.512739\n", 1168 | " -0.226962\n", 1169 | " THE QUEENS HOSTEL , 6 BED MIXED DORM C\n", 1170 | " 6\n", 1171 | " https://a0.muscache.com/im/pictures/d5752e25-1...\n", 1172 | " 21\n", 1173 | " nightly\n", 1174 | " 4.5\n", 1175 | " Shared room\n", 1176 | " \n", 1177 | " \n", 1178 | " 39\n", 1179 | " GBP\n", 1180 | " 51.539388\n", 1181 | " -0.202903\n", 1182 | " Double Room in Contemporary Flat\n", 1183 | " 2\n", 1184 | " https://a0.muscache.com/im/pictures/26516a3a-1...\n", 1185 | " 40\n", 1186 | " nightly\n", 1187 | " 4.5\n", 1188 | " Private room\n", 1189 | " \n", 1190 | " \n", 1191 | " 40\n", 1192 | " GBP\n", 1193 | " 51.490277\n", 1194 | " -0.097092\n", 1195 | " Room Big Ben + Breakfast. 
Zone 1 (Q3)\n", 1196 | " 2\n", 1197 | " https://a0.muscache.com/im/pictures/82998940/6...\n", 1198 | " 30\n", 1199 | " nightly\n", 1200 | " 4.0\n", 1201 | " Private room\n", 1202 | " \n", 1203 | " \n", 1204 | " 41\n", 1205 | " GBP\n", 1206 | " 51.492041\n", 1207 | " -0.017314\n", 1208 | " Large Double with Private Bathroom-Canary Whar...\n", 1209 | " 3\n", 1210 | " https://a0.muscache.com/im/pictures/6df596d1-4...\n", 1211 | " 31\n", 1212 | " nightly\n", 1213 | " 4.5\n", 1214 | " Private room\n", 1215 | " \n", 1216 | " \n", 1217 | " 42\n", 1218 | " GBP\n", 1219 | " 51.533898\n", 1220 | " -0.130299\n", 1221 | " Double in Kings Cross Houseshare\n", 1222 | " 2\n", 1223 | " https://a0.muscache.com/im/pictures/57128460/a...\n", 1224 | " 88\n", 1225 | " nightly\n", 1226 | " 4.5\n", 1227 | " Private room\n", 1228 | " \n", 1229 | " \n", 1230 | " 43\n", 1231 | " GBP\n", 1232 | " 51.497440\n", 1233 | " -0.009651\n", 1234 | " Room next to Canary Wharf and Greenwich\n", 1235 | " 2\n", 1236 | " https://a0.muscache.com/im/pictures/e747b1f1-1...\n", 1237 | " 31\n", 1238 | " nightly\n", 1239 | " 4.5\n", 1240 | " Private room\n", 1241 | " \n", 1242 | " \n", 1243 | " 44\n", 1244 | " GBP\n", 1245 | " 51.551801\n", 1246 | " -0.239277\n", 1247 | " Garden Facing Double Bedroom in Dollis Hill! 
BR4\n", 1248 | " 2\n", 1249 | " https://a0.muscache.com/im/pictures/da6da5bc-7...\n", 1250 | " 31\n", 1251 | " nightly\n", 1252 | " 4.5\n", 1253 | " Private room\n", 1254 | " \n", 1255 | " \n", 1256 | " 45\n", 1257 | " GBP\n", 1258 | " 51.500920\n", 1259 | " -0.114502\n", 1260 | " LUXURY CITY CENTRE APARTMENT\n", 1261 | " 2\n", 1262 | " https://a0.muscache.com/im/pictures/050fc4a8-1...\n", 1263 | " 51\n", 1264 | " nightly\n", 1265 | " 5.0\n", 1266 | " Private room\n", 1267 | " \n", 1268 | " \n", 1269 | " 46\n", 1270 | " GBP\n", 1271 | " 51.548562\n", 1272 | " -0.225982\n", 1273 | " Clean Double Room 5min walk to Underground Sta...\n", 1274 | " 2\n", 1275 | " https://a0.muscache.com/im/pictures/bfe713da-b...\n", 1276 | " 38\n", 1277 | " nightly\n", 1278 | " 5.0\n", 1279 | " Private room\n", 1280 | " \n", 1281 | " \n", 1282 | " 47\n", 1283 | " GBP\n", 1284 | " 51.492564\n", 1285 | " -0.096076\n", 1286 | " Nice Room Big Ben + Breakfast (R7a)\n", 1287 | " 1\n", 1288 | " https://a0.muscache.com/im/pictures/60cf80f1-8...\n", 1289 | " 22\n", 1290 | " nightly\n", 1291 | " 4.5\n", 1292 | " Shared room\n", 1293 | " \n", 1294 | " \n", 1295 | " 48\n", 1296 | " GBP\n", 1297 | " 51.525872\n", 1298 | " -0.067217\n", 1299 | " (SP-C)PRIVATE ROOM FOR 4 NEAR SHOREDITCH\n", 1300 | " 4\n", 1301 | " https://a0.muscache.com/im/pictures/a4c8f54c-f...\n", 1302 | " 25\n", 1303 | " nightly\n", 1304 | " 4.5\n", 1305 | " Private room\n", 1306 | " \n", 1307 | " \n", 1308 | " 49\n", 1309 | " GBP\n", 1310 | " 51.517147\n", 1311 | " -0.039077\n", 1312 | " (CROM-E)PRIVATE ROOM FOR 4 NEAR REGENT'S CANAL\n", 1313 | " 4\n", 1314 | " https://a0.muscache.com/im/pictures/e5d9092c-e...\n", 1315 | " 25\n", 1316 | " nightly\n", 1317 | " 5.0\n", 1318 | " Private room\n", 1319 | " \n", 1320 | " \n", 1321 | " 50\n", 1322 | " GBP\n", 1323 | " 51.513970\n", 1324 | " -0.047868\n", 1325 | " (WE-1) ROOM FOR 4 NEAR STEPNEY GREEN PARK/GARDEN\n", 1326 | " 4\n", 1327 | " 
https://a0.muscache.com/im/pictures/fbf383e5-9...\n", 1328 | " 25\n", 1329 | " nightly\n", 1330 | " 5.0\n", 1331 | " Private room\n", 1332 | " \n", 1333 | " \n", 1334 | " 51\n", 1335 | " GBP\n", 1336 | " 51.524006\n", 1337 | " -0.063918\n", 1338 | " (8RAM-A)Private Room for 4ppl near Victoria Park\n", 1339 | " 4\n", 1340 | " https://a0.muscache.com/im/pictures/06d594e2-a...\n", 1341 | " 25\n", 1342 | " nightly\n", 1343 | " 4.5\n", 1344 | " Private room\n", 1345 | " \n", 1346 | " \n", 1347 | " 52\n", 1348 | " GBP\n", 1349 | " 51.517652\n", 1350 | " -0.038606\n", 1351 | " (92AST-2)Private rooms up to 4 near Mile End Park\n", 1352 | " 4\n", 1353 | " https://a0.muscache.com/im/pictures/6e5c7acc-d...\n", 1354 | " 25\n", 1355 | " nightly\n", 1356 | " 4.5\n", 1357 | " Private room\n", 1358 | " \n", 1359 | " \n", 1360 | " 53\n", 1361 | " GBP\n", 1362 | " 51.513839\n", 1363 | " -0.055826\n", 1364 | " (ROB-B)PRIVATE ROOM FOR 4 PPL NEAR RIVERSIDE\n", 1365 | " 4\n", 1366 | " https://a0.muscache.com/im/pictures/29046148-5...\n", 1367 | " 25\n", 1368 | " nightly\n", 1369 | " 5.0\n", 1370 | " Private room\n", 1371 | " \n", 1372 | " \n", 1373 | " 54\n", 1374 | " GBP\n", 1375 | " 51.527791\n", 1376 | " -0.068263\n", 1377 | " (MCD-D) PRIVATE ROOM UP TO 4 CLOSE TO BRICK LANE\n", 1378 | " 4\n", 1379 | " https://a0.muscache.com/im/pictures/f8d3886c-6...\n", 1380 | " 25\n", 1381 | " nightly\n", 1382 | " 4.5\n", 1383 | " Private room\n", 1384 | " \n", 1385 | " \n", 1386 | " 55\n", 1387 | " GBP\n", 1388 | " 51.517796\n", 1389 | " -0.038177\n", 1390 | " (92AST-4)PRIVATE ROOM FOR 2 NEAR VICTORIA PARK\n", 1391 | " 2\n", 1392 | " https://a0.muscache.com/im/pictures/9f38c016-a...\n", 1393 | " 21\n", 1394 | " nightly\n", 1395 | " 5.0\n", 1396 | " Private room\n", 1397 | " \n", 1398 | " \n", 1399 | " 56\n", 1400 | " GBP\n", 1401 | " 51.523549\n", 1402 | " -0.065682\n", 1403 | " (8RAM-C)Private Room for 3ppl near Victoria Park\n", 1404 | " 3\n", 1405 | " 
https://a0.muscache.com/im/pictures/f9c1ac2d-4...\n", 1406 | " 21\n", 1407 | " nightly\n", 1408 | " 5.0\n", 1409 | " Private room\n", 1410 | " \n", 1411 | " \n", 1412 | " 57\n", 1413 | " GBP\n", 1414 | " 51.520828\n", 1415 | " -0.067859\n", 1416 | " (KING-D)PRIVATE ROOM FOR 3 PPL IN BRICK LANE\n", 1417 | " 3\n", 1418 | " https://a0.muscache.com/im/pictures/05413b50-5...\n", 1419 | " 21\n", 1420 | " nightly\n", 1421 | " 4.5\n", 1422 | " Private room\n", 1423 | " \n", 1424 | " \n", 1425 | " 58\n", 1426 | " GBP\n", 1427 | " 51.515282\n", 1428 | " -0.047395\n", 1429 | " (WE-4)PRIVATE ROOM FOR 2 NEAR STEPNEY GREEN PARK\n", 1430 | " 2\n", 1431 | " https://a0.muscache.com/im/pictures/8686ff10-3...\n", 1432 | " 21\n", 1433 | " nightly\n", 1434 | " 4.5\n", 1435 | " Private room\n", 1436 | " \n", 1437 | " \n", 1438 | " 59\n", 1439 | " GBP\n", 1440 | " 51.518104\n", 1441 | " -0.068196\n", 1442 | " (59CH-1) PRIVATE ROOM FOR 4 BRICK LANE\n", 1443 | " 4\n", 1444 | " https://a0.muscache.com/im/pictures/d8786c02-1...\n", 1445 | " 25\n", 1446 | " nightly\n", 1447 | " 3.0\n", 1448 | " Private room\n", 1449 | " \n", 1450 | " \n", 1451 | " 60\n", 1452 | " GBP\n", 1453 | " 51.513845\n", 1454 | " -0.064276\n", 1455 | " (HAD-D)PRIVATE ROOM CLOSE TO TOWER BRIDGE\n", 1456 | " 2\n", 1457 | " https://a0.muscache.com/im/pictures/7f1928e5-6...\n", 1458 | " 21\n", 1459 | " nightly\n", 1460 | " 4.0\n", 1461 | " Private room\n", 1462 | " \n", 1463 | " \n", 1464 | " 61\n", 1465 | " GBP\n", 1466 | " 51.509966\n", 1467 | " -0.061409\n", 1468 | " (BET-A)PRIVATE ROOM FOR 4 NEAR RIVERSIDE\n", 1469 | " 4\n", 1470 | " https://a0.muscache.com/im/pictures/10e3fcc8-c...\n", 1471 | " 25\n", 1472 | " nightly\n", 1473 | " 4.5\n", 1474 | " Private room\n", 1475 | " \n", 1476 | " \n", 1477 | " 62\n", 1478 | " GBP\n", 1479 | " 51.527649\n", 1480 | " -0.068401\n", 1481 | " (MCD-C) PRIVATE ROOM UP TO 3 CLOSE TO BRICK LANE\n", 1482 | " 3\n", 1483 | " https://a0.muscache.com/im/pictures/59bbf736-6...\n", 1484 | " 
21\n", 1485 | " nightly\n", 1486 | " 5.0\n", 1487 | " Private room\n", 1488 | " \n", 1489 | " \n", 1490 | " 63\n", 1491 | " GBP\n", 1492 | " 51.519125\n", 1493 | " -0.069282\n", 1494 | " (43CHIC-B)PRIVATE ROOM FOR 2 IN BRICK LANE\n", 1495 | " 2\n", 1496 | " https://a0.muscache.com/im/pictures/9298d330-b...\n", 1497 | " 21\n", 1498 | " nightly\n", 1499 | " 4.5\n", 1500 | " Private room\n", 1501 | " \n", 1502 | " \n", 1503 | " 64\n", 1504 | " GBP\n", 1505 | " 51.510940\n", 1506 | " -0.018134\n", 1507 | " (32GRU-D)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVERSIDE\n", 1508 | " 2\n", 1509 | " https://a0.muscache.com/im/pictures/60424b3b-1...\n", 1510 | " 21\n", 1511 | " nightly\n", 1512 | " 5.0\n", 1513 | " Private room\n", 1514 | " \n", 1515 | " \n", 1516 | " 65\n", 1517 | " GBP\n", 1518 | " 51.511244\n", 1519 | " -0.053994\n", 1520 | " (73SHAD-5)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVER...\n", 1521 | " 2\n", 1522 | " https://a0.muscache.com/im/pictures/ff71eaa5-b...\n", 1523 | " 21\n", 1524 | " nightly\n", 1525 | " 4.5\n", 1526 | " Private room\n", 1527 | " \n", 1528 | " \n", 1529 | " 66\n", 1530 | " GBP\n", 1531 | " 51.525535\n", 1532 | " -0.065950\n", 1533 | " (8RAM-B)Private Room for 2ppl near Victoria Park\n", 1534 | " 2\n", 1535 | " https://a0.muscache.com/im/pictures/4d783775-2...\n", 1536 | " 21\n", 1537 | " nightly\n", 1538 | " 4.5\n", 1539 | " Private room\n", 1540 | " \n", 1541 | " \n", 1542 | " 67\n", 1543 | " GBP\n", 1544 | " 51.526277\n", 1545 | " -0.072128\n", 1546 | " (KR-D)PRIVATE ROOM FOR 2 IN SHOREDITCH\n", 1547 | " 2\n", 1548 | " https://a0.muscache.com/im/pictures/3e7bb23b-a...\n", 1549 | " 21\n", 1550 | " nightly\n", 1551 | " 4.5\n", 1552 | " Private room\n", 1553 | " \n", 1554 | " \n", 1555 | " 68\n", 1556 | " GBP\n", 1557 | " 51.527191\n", 1558 | " -0.071177\n", 1559 | " (KR-A)PRIVATE ROOM FOR 4 IN SHOREDITCH/BALCONY\n", 1560 | " 4\n", 1561 | " https://a0.muscache.com/im/pictures/f7ffd8de-c...\n", 1562 | " 25\n", 1563 | " nightly\n", 1564 | " 4.0\n", 1565 | " 
Private room\n", 1566 | " \n", 1567 | " \n", 1568 | " 69\n", 1569 | " GBP\n", 1570 | " 51.511228\n", 1571 | " -0.058300\n", 1572 | " (26MOR-B)PRIVATE ROOM FOR 4 PPL NEAR RIVERSIDE\n", 1573 | " 4\n", 1574 | " https://a0.muscache.com/im/pictures/7792ef9a-f...\n", 1575 | " 25\n", 1576 | " nightly\n", 1577 | " 4.5\n", 1578 | " Private room\n", 1579 | " \n", 1580 | " \n", 1581 | " 70\n", 1582 | " GBP\n", 1583 | " 51.510086\n", 1584 | " -0.061221\n", 1585 | " (BET-D)PRIVATE ROOM FOR 2 NEAR RIVERSIDE\n", 1586 | " 2\n", 1587 | " https://a0.muscache.com/im/pictures/2a47f778-0...\n", 1588 | " 21\n", 1589 | " nightly\n", 1590 | " 4.5\n", 1591 | " Private room\n", 1592 | " \n", 1593 | " \n", 1594 | " 71\n", 1595 | " GBP\n", 1596 | " 51.512000\n", 1597 | " -0.063476\n", 1598 | " (HAL-A)PRIVATE ROOM FOR 4 NEAR TOWER BRIDGE\n", 1599 | " 4\n", 1600 | " https://a0.muscache.com/im/pictures/8e7e39fd-8...\n", 1601 | " 25\n", 1602 | " nightly\n", 1603 | " 5.0\n", 1604 | " Private room\n", 1605 | " \n", 1606 | " \n", 1607 | " 72\n", 1608 | " GBP\n", 1609 | " 51.512355\n", 1610 | " -0.065909\n", 1611 | " (HAD-C)PRIVATE ROOM CLOSE TO TOWER BRIDGE\n", 1612 | " 2\n", 1613 | " https://a0.muscache.com/im/pictures/2bfbd251-5...\n", 1614 | " 21\n", 1615 | " nightly\n", 1616 | " 4.5\n", 1617 | " Private room\n", 1618 | " \n", 1619 | " \n", 1620 | " 73\n", 1621 | " GBP\n", 1622 | " 51.512548\n", 1623 | " -0.018926\n", 1624 | " (32GRU-B)PRIVATE ROOM FOR 2 PEOPLE NEAR RIVERSIDE\n", 1625 | " 2\n", 1626 | " https://a0.muscache.com/im/pictures/4030ee3b-8...\n", 1627 | " 21\n", 1628 | " nightly\n", 1629 | " 4.5\n", 1630 | " Private room\n", 1631 | " \n", 1632 | " \n", 1633 | " 74\n", 1634 | " GBP\n", 1635 | " 51.515882\n", 1636 | " -0.037798\n", 1637 | " (50AST-1)PRIVATE ROOM FOR 4 NEAR MILE END PARK\n", 1638 | " 4\n", 1639 | " https://a0.muscache.com/im/pictures/d7437f04-4...\n", 1640 | " 21\n", 1641 | " nightly\n", 1642 | " 4.0\n", 1643 | " Private room\n", 1644 | " \n", 1645 | " \n", 1646 | " 75\n", 
1647 | " GBP\n", 1648 | " 51.523891\n", 1649 | " -0.052398\n", 1650 | " (BRAI-D)PRIVATE ROOM FOR 4 NEAR BRICK LANE\n", 1651 | " 4\n", 1652 | " https://a0.muscache.com/im/pictures/b11142b5-f...\n", 1653 | " 25\n", 1654 | " nightly\n", 1655 | " NaN\n", 1656 | " Private room\n", 1657 | " \n", 1658 | " \n", 1659 | " 76\n", 1660 | " GBP\n", 1661 | " 51.532256\n", 1662 | " -0.083022\n", 1663 | " (CHA-A) PRIVATE ROOM IN HOXTON WITH BALCONY FOR 4\n", 1664 | " 4\n", 1665 | " https://a0.muscache.com/im/pictures/6619ba13-b...\n", 1666 | " 25\n", 1667 | " nightly\n", 1668 | " 5.0\n", 1669 | " Private room\n", 1670 | " \n", 1671 | " \n", 1672 | " 77\n", 1673 | " GBP\n", 1674 | " 51.535244\n", 1675 | " -0.089366\n", 1676 | " (CROP-A)PRIVATE ROOM FOR 4 PEOPLE IN SHOREDITCH\n", 1677 | " 4\n", 1678 | " https://a0.muscache.com/im/pictures/84969292-3...\n", 1679 | " 25\n", 1680 | " nightly\n", 1681 | " 4.5\n", 1682 | " Private room\n", 1683 | " \n", 1684 | " \n", 1685 | " 78\n", 1686 | " GBP\n", 1687 | " 51.535351\n", 1688 | " -0.084282\n", 1689 | " (FUL-B)PRIVATE ROOM FOR 4 NEAR HOXTON\n", 1690 | " 4\n", 1691 | " https://a0.muscache.com/im/pictures/a7c18ecc-2...\n", 1692 | " 25\n", 1693 | " nightly\n", 1694 | " 5.0\n", 1695 | " Private room\n", 1696 | " \n", 1697 | " \n", 1698 | " 79\n", 1699 | " GBP\n", 1700 | " 51.516583\n", 1701 | " -0.063056\n", 1702 | " (4AFORD-2)PRIVATE ROOM FOR 4PPL NEAR TOWER BRIDGE\n", 1703 | " 4\n", 1704 | " https://a0.muscache.com/im/pictures/291306f2-4...\n", 1705 | " 21\n", 1706 | " nightly\n", 1707 | " 4.0\n", 1708 | " Private room\n", 1709 | " \n", 1710 | " \n", 1711 | " 80\n", 1712 | " GBP\n", 1713 | " 51.553648\n", 1714 | " -0.241361\n", 1715 | " Huge & Bright Double Bedroom in Dollis Hill! 
BR1\n", 1716 | " 2\n", 1717 | " https://a0.muscache.com/im/pictures/6f44c440-8...\n", 1718 | " 31\n", 1719 | " nightly\n", 1720 | " 4.5\n", 1721 | " Private room\n", 1722 | " \n", 1723 | " \n", 1724 | " 81\n", 1725 | " GBP\n", 1726 | " 51.496208\n", 1727 | " -0.071039\n", 1728 | " New Central London double with private bathroom\n", 1729 | " 2\n", 1730 | " https://a0.muscache.com/im/pictures/7ddf6bf6-4...\n", 1731 | " 58\n", 1732 | " nightly\n", 1733 | " 5.0\n", 1734 | " Private room\n", 1735 | " \n", 1736 | " \n", 1737 | " 82\n", 1738 | " GBP\n", 1739 | " 51.498859\n", 1740 | " -0.086188\n", 1741 | " Single box room at london bridge\n", 1742 | " 1\n", 1743 | " https://a0.muscache.com/im/pictures/74d672bc-e...\n", 1744 | " 22\n", 1745 | " nightly\n", 1746 | " 4.0\n", 1747 | " Private room\n", 1748 | " \n", 1749 | " \n", 1750 | " 83\n", 1751 | " GBP\n", 1752 | " 51.518006\n", 1753 | " -0.167318\n", 1754 | " Cosy 2nd Floor 1 Bed Flat Close To Oxford Street\n", 1755 | " 4\n", 1756 | " https://a0.muscache.com/im/pictures/b9df22db-6...\n", 1757 | " 67\n", 1758 | " nightly\n", 1759 | " 4.5\n", 1760 | " Entire home/flat\n", 1761 | " \n", 1762 | " \n", 1763 | " 84\n", 1764 | " GBP\n", 1765 | " 51.510704\n", 1766 | " -0.050297\n", 1767 | " (GRD-A)PRIVATE ROOM UP TO 4 WITH CITY VIEWS\n", 1768 | " 4\n", 1769 | " https://a0.muscache.com/im/pictures/39cd4175-5...\n", 1770 | " 25\n", 1771 | " nightly\n", 1772 | " 4.5\n", 1773 | " Private room\n", 1774 | " \n", 1775 | " \n", 1776 | " 85\n", 1777 | " GBP\n", 1778 | " 51.506991\n", 1779 | " -0.067635\n", 1780 | " * Zone 1 * FLAT * NO CLEANING FEES *\n", 1781 | " 6\n", 1782 | " https://a0.muscache.com/im/pictures/440765d6-8...\n", 1783 | " 80\n", 1784 | " nightly\n", 1785 | " 5.0\n", 1786 | " Entire home/flat\n", 1787 | " \n", 1788 | " \n", 1789 | " 86\n", 1790 | " GBP\n", 1791 | " 51.527842\n", 1792 | " -0.071902\n", 1793 | " (KR-C)PRIVATE ROOM FOR 4 IN SHOREDITCH\n", 1794 | " 4\n", 1795 | " 
https://a0.muscache.com/im/pictures/3a9cff13-d...\n", 1796 | " 25\n", 1797 | " nightly\n", 1798 | " 4.5\n", 1799 | " Private room\n", 1800 | " \n", 1801 | " \n", 1802 | " 87\n", 1803 | " GBP\n", 1804 | " 51.521531\n", 1805 | " -0.139337\n", 1806 | " Room C Brilliant Location\n", 1807 | " 3\n", 1808 | " https://a0.muscache.com/im/pictures/bef5920b-7...\n", 1809 | " 46\n", 1810 | " nightly\n", 1811 | " 4.5\n", 1812 | " Private room\n", 1813 | " \n", 1814 | " \n", 1815 | " 88\n", 1816 | " GBP\n", 1817 | " 51.532407\n", 1818 | " -0.063612\n", 1819 | " (SBH-C)PRIVATE ROOM FOR 4 PEOPLE NEAR SHOREDITCH\n", 1820 | " 4\n", 1821 | " https://a0.muscache.com/im/pictures/e13000eb-6...\n", 1822 | " 25\n", 1823 | " nightly\n", 1824 | " 4.5\n", 1825 | " Private room\n", 1826 | " \n", 1827 | " \n", 1828 | " 89\n", 1829 | " GBP\n", 1830 | " 51.466348\n", 1831 | " -0.192849\n", 1832 | " Beautiful 2 bed Fulham apartment\n", 1833 | " 4\n", 1834 | " https://a0.muscache.com/im/pictures/53d529ea-2...\n", 1835 | " 98\n", 1836 | " nightly\n", 1837 | " 5.0\n", 1838 | " Entire home/flat\n", 1839 | " \n", 1840 | " \n", 1841 | " 90\n", 1842 | " GBP\n", 1843 | " 51.490963\n", 1844 | " -0.096486\n", 1845 | " Room Big Ben + Breakfast (R5b)\n", 1846 | " 1\n", 1847 | " https://a0.muscache.com/im/pictures/1f4b6234-3...\n", 1848 | " 18\n", 1849 | " nightly\n", 1850 | " 4.5\n", 1851 | " Shared room\n", 1852 | " \n", 1853 | " \n", 1854 | " 91\n", 1855 | " GBP\n", 1856 | " 51.594978\n", 1857 | " -0.081565\n", 1858 | " Bright & Spacious Double Room\n", 1859 | " 2\n", 1860 | " https://a0.muscache.com/im/pictures/72012623/a...\n", 1861 | " 28\n", 1862 | " nightly\n", 1863 | " 5.0\n", 1864 | " Private room\n", 1865 | " \n", 1866 | " \n", 1867 | " 92\n", 1868 | " GBP\n", 1869 | " 51.513420\n", 1870 | " -0.065957\n", 1871 | " (HAD-A)PRIVATE ROOM CLOSE TO TOWER BRIDGE\n", 1872 | " 2\n", 1873 | " https://a0.muscache.com/im/pictures/27e16557-8...\n", 1874 | " 21\n", 1875 | " nightly\n", 1876 | " 5.0\n", 1877 | " 
Private room\n", 1878 | " \n", 1879 | " \n", 1880 | " 93\n", 1881 | " GBP\n", 1882 | " 51.534635\n", 1883 | " -0.072653\n", 1884 | " (GOD-A)PRIVATE ROOM FOR 3 NEAR REGENTS CANAL\n", 1885 | " 3\n", 1886 | " https://a0.muscache.com/im/pictures/d81ee2b9-6...\n", 1887 | " 21\n", 1888 | " nightly\n", 1889 | " 5.0\n", 1890 | " Private room\n", 1891 | " \n", 1892 | " \n", 1893 | " 94\n", 1894 | " GBP\n", 1895 | " 51.492320\n", 1896 | " -0.015128\n", 1897 | " Double Room - Canary Wharf hp\n", 1898 | " 2\n", 1899 | " https://a0.muscache.com/im/pictures/4cebcd26-b...\n", 1900 | " 30\n", 1901 | " nightly\n", 1902 | " 4.5\n", 1903 | " Private room\n", 1904 | " \n", 1905 | " \n", 1906 | " 95\n", 1907 | " GBP\n", 1908 | " 51.516983\n", 1909 | " -0.027006\n", 1910 | " (TA-4) Private room for 2 close to Mile End Park\n", 1911 | " 2\n", 1912 | " https://a0.muscache.com/im/pictures/5f8370ba-c...\n", 1913 | " 21\n", 1914 | " nightly\n", 1915 | " 4.5\n", 1916 | " Private room\n", 1917 | " \n", 1918 | " \n", 1919 | " 96\n", 1920 | " GBP\n", 1921 | " 51.503962\n", 1922 | " -0.109114\n", 1923 | " 5. 
Lovely Room Free WIFI Centre London\n", 1924 | " 1\n", 1925 | " https://a0.muscache.com/im/pictures/5de10973-b...\n", 1926 | " 51\n", 1927 | " nightly\n", 1928 | " 4.5\n", 1929 | " Private room\n", 1930 | " \n", 1931 | " \n", 1932 | " 97\n", 1933 | " GBP\n", 1934 | " 51.427872\n", 1935 | " -0.070807\n", 1936 | " Comfy, Clean Room For 1, Central London(FREE W...\n", 1937 | " 1\n", 1938 | " https://a0.muscache.com/im/pictures/ea2c17f9-9...\n", 1939 | " 22\n", 1940 | " nightly\n", 1941 | " 4.5\n", 1942 | " Private room\n", 1943 | " \n", 1944 | " \n", 1945 | " 98\n", 1946 | " GBP\n", 1947 | " 51.532777\n", 1948 | " -0.065245\n", 1949 | " (SBH-B)PRIVATE ROOM UP TO 4 PEOPLE NEAR SHORED...\n", 1950 | " 4\n", 1951 | " https://a0.muscache.com/im/pictures/c83c76d3-1...\n", 1952 | " 25\n", 1953 | " nightly\n", 1954 | " 4.5\n", 1955 | " Private room\n", 1956 | " \n", 1957 | " \n", 1958 | " 99\n", 1959 | " GBP\n", 1960 | " 51.474246\n", 1961 | " -0.194698\n", 1962 | " Little Corner of Relaxation & Rest\n", 1963 | " 3\n", 1964 | " https://a0.muscache.com/im/pictures/9a453323-2...\n", 1965 | " 36\n", 1966 | " nightly\n", 1967 | " 4.5\n", 1968 | " Private room\n", 1969 | " \n", 1970 | " \n", 1971 | "" 1972 | ], 1973 | "text/plain": [ 1974 | "" 1975 | ] 1976 | }, 1977 | "execution_count": 20, 1978 | "metadata": {}, 1979 | "output_type": "execute_result" 1980 | } 1981 | ], 1982 | "source": [ 1983 | "HTML(pd.DataFrame(props).to_html())" 1984 | ] 1985 | }, 1986 | { 1987 | "cell_type": "code", 1988 | "execution_count": 12, 1989 | "metadata": { 1990 | "collapsed": false 1991 | }, 1992 | "outputs": [ 1993 | { 1994 | "name": "stdout", 1995 | "output_type": "stream", 1996 | "text": [ 1997 | "will skip code... but you can find a nice surprise in any Walmart product page source :)\n" 1998 | ] 1999 | } 2000 | ], 2001 | "source": [ 2002 | "\"\"\"\n", 2003 | "Case Study 4 - JSON embedded in HTML\n", 2004 | "\n", 2005 | "You want to analyse and compare some retail corporations. One of them is Walmart. 
You want to get as many \n", 2006 | "details about the products they sell as you can. You have investigated your target well, but unfortunately\n", 2007 | "you cannot find any hidden API... You need to get data from HTML. You already found nice sitemaps, got product\n", 2008 | "pages and now you're ready to scrape.\n", 2009 | "\n", 2010 | "What will be the most efficient and robust way to get all details about products?\n", 2011 | "\"\"\"\n", 2012 | "\n", 2013 | "print(\"will skip code... but you can find a nice surprise in any Walmart product page source :)\")" 2014 | ] 2015 | }, 2016 | { 2017 | "cell_type": "markdown", 2018 | "metadata": {}, 2019 | "source": [ 2020 | "## Handling JavaScript\n", 2021 | "\n", 2022 | "### Why...?\n", 2023 | "- sometimes the content of a webpage is dynamically presented/altered by JavaScript code\n", 2024 | "- when you download the HTML, it can be completely different from what you see in the browser\n", 2025 | "- you need to perform some sort of interaction with the page\n", 2026 | "- your target has some fancy anti-scraping software detecting that you're a bot\n", 2027 | "\n", 2028 | "### Selenium+PhantomJS\n", 2029 | "- Selenium is a browser automation tool most often used for testing web applications.\n", 2030 | "- It can be useful while scraping\n", 2031 | "- PhantomJS is just a headless browser (there is no UI and it works in the background)\n", 2032 | "- BTW: you can use Selenium with any other browser (Firefox, Opera etc.)\n", 2033 | "\n", 2034 | "### It's often overkill though!\n", 2035 | "- Scraping with Selenium+PhantomJS is much heavier than using simple Python libraries!\n", 2036 | "    + you need all the additional libraries and software installed\n", 2037 | "    + it may be slower\n", 2038 | "    + you have to navigate as if you were a human (e.g. find a button element and click it programmatically)\n", 2039 | "    + you rely on the page layout (not robust at all...)\n", 2040 | "- Very often you can find a workaround. 
For example, if you have to deal with infinite scroll,\n", 2041 | "  you can investigate what AJAX requests your browser is sending while scrolling (and emulate them)" 2042 | ] 2043 | }, 2044 | { 2045 | "cell_type": "code", 2046 | "execution_count": 22, 2047 | "metadata": { 2048 | "collapsed": false 2049 | }, 2050 | "outputs": [ 2051 | { 2052 | "name": "stdout", 2053 | "output_type": "stream", 2054 | "text": [ 2055 | "No. of quotes is: 60\n", 2056 | "“A day without sunshine is like, you know, night.”\n" 2057 | ] 2058 | } 2059 | ], 2060 | "source": [ 2061 | "\"\"\"\n", 2062 | "Case Study 5 - how (not) to handle infinite scroll\n", 2063 | "\n", 2064 | "You are planning to spam your friends on Facebook with random quotes to show how smart and deep you are.\n", 2065 | "You found an amazing page with quotes, but... it uses infinite scroll?! Don't worry though! \n", 2066 | "As they say: \"You have to look through the rain to see the rainbow.\"\n", 2067 | "\n", 2068 | "Your task is to get all quotes from http://spidyquotes.herokuapp.com/\n", 2069 | "\"\"\"\n", 2070 | "spidyquotes_url = 'http://spidyquotes.herokuapp.com/scroll'\n", 2071 | "# with Selenium+PhantomJS (in general - a bad option. but yeah... may be fancy)\n", 2072 | "driver = webdriver.PhantomJS('/Applications/phantomjs-2.1.1-macosx/bin/phantomjs')\n", 2073 | "driver.get(spidyquotes_url)\n", 2074 | "no_of_scrolls = 5\n", 2075 | "scroll = 0\n", 2076 | "while scroll < no_of_scrolls:\n", 2077 | "    # do a fancy screenshot here\n", 2078 | "    driver.get_screenshot_as_file('/Users/stulski/Desktop/osobiste/pydata_meetup/shot_{}.jpg'.format(scroll))\n", 2079 | "    # scroll down\n", 2080 | "    driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", 2081 | "    time.sleep(1)\n", 2082 | "    scroll += 1\n", 2083 | "quote_elements = driver.find_elements_by_class_name('quote')\n", 2084 | "all_quotes = [element.find_elements_by_class_name('text')[0].text for element in quote_elements]\n", 2085 | "print('No. 
of quotes is: ', len(all_quotes))\n", 2086 | "print(random.choice(all_quotes))" 2087 | ] 2088 | }, 2089 | { 2090 | "cell_type": "code", 2091 | "execution_count": 23, 2092 | "metadata": { 2093 | "collapsed": false 2094 | }, 2095 | "outputs": [ 2096 | { 2097 | "name": "stdout", 2098 | "output_type": "stream", 2099 | "text": [ 2100 | "No. of quotes is: 90\n", 2101 | "“Not all of us can do great things. But we can do small things with great love.”\n" 2102 | ] 2103 | } 2104 | ], 2105 | "source": [ 2106 | "# same as above, without unnecessary hassle\n", 2107 | "p_idx = 1\n", 2108 | "spidyquotes_better_url = 'http://spidyquotes.herokuapp.com/api/quotes?page='\n", 2109 | "r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)\n", 2110 | "time.sleep(1)\n", 2111 | "all_quotes = []\n", 2112 | "while r['has_next'] == True:  # note: stops before appending the last page (where has_next is False)\n", 2113 | "    for quote in r['quotes']:\n", 2114 | "        all_quotes.append(quote['text'])\n", 2115 | "    p_idx += 1\n", 2116 | "    r = json.loads(requests.get(spidyquotes_better_url+str(p_idx)).text)\n", 2117 | "    time.sleep(1)\n", 2118 | "print('No. of quotes is: ', len(all_quotes))\n", 2119 | "print(random.choice(all_quotes))" 2120 | ] 2121 | }, 2122 | { 2123 | "cell_type": "markdown", 2124 | "metadata": {}, 2125 | "source": [ 2126 | "## Keynotes and advice\n", 2127 | "* investigate your target well (sitemaps, hidden APIs, how it works under the hood)\n", 2128 | "* use incognito mode while exploring\n", 2129 | "* use developer tools\n", 2130 | "* think about scraping as a \"hacking\" activity rather than just parsing and getting HTML elements\n", 2131 | "* change your user-agent\n", 2132 | "* add time.sleep if you can afford it\n", 2133 | "* the same data is often in different places on the website. 
find the ones that are easy to scrape!\n", 2134 | "* if you need to parse HTML and get data from there - try to find something which will not break (avoid locating generic elements like DIVs and then picking the Nth of those)\n", 2135 | "* look for commonalities\n", 2136 | "* websites are different and there is no one magical way to get your data\n", 2137 | "* it looks easy when I'm showing it, but sometimes it takes time to reverse engineer websites\n", 2138 | "\n", 2139 | "** if you want production-level scrapers - use a proxy" 2140 | ] 2141 | } 2142 | ], 2143 | "metadata": { 2144 | "kernelspec": { 2145 | "display_name": "Python 3", 2146 | "language": "python", 2147 | "name": "python3" 2148 | }, 2149 | "language_info": { 2150 | "codemirror_mode": { 2151 | "name": "ipython", 2152 | "version": 3 2153 | }, 2154 | "file_extension": ".py", 2155 | "mimetype": "text/x-python", 2156 | "name": "python", 2157 | "nbconvert_exporter": "python", 2158 | "pygments_lexer": "ipython3", 2159 | "version": "3.5.1" 2160 | } 2161 | }, 2162 | "nbformat": 4, 2163 | "nbformat_minor": 2 2164 | } 2165 | --------------------------------------------------------------------------------