├── .DS_Store
├── .vscode
│   └── settings.json
├── README.md
├── airbnb_scraper
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   ├── items.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-36.pyc
│       │   ├── airbnb.cpython-36.pyc
│       │   └── airbnb_single_test.cpython-36.pyc
│       └── airbnb.py
├── cancun.sh
├── new_york.sh
├── scrapy.cfg
├── testing
│   └── airbnb_testing.py
└── toronto.sh
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/.DS_Store
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 |     "python.pythonPath": "/Users/kailu/.pyenv/versions/3.6.5/bin/python"
3 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # airbnb_scraper :spider:
2 | 
3 | Spider built with Scrapy and scrapy-splash to crawl Airbnb listings
4 | 
5 | ## Checklist
6 | 
7 | *This checklist is for personal use and isn't relevant to using the scraper.*
8 | 
9 | - [x] Spider can successfully parse one page of listings
10 | - [x] Spider can successfully parse multiple/all pages of the designated location
11 | - [x] Spider can take price ranges as arguments (`price_lb` and `price_ub`)
12 | - [x] Spider can take location as an argument
13 | 
14 | ## Set up
15 | 
16 | Since Airbnb uses JavaScript to render content, Scrapy on its own is sometimes not enough. We also need Splash, a JavaScript rendering service that integrates nicely with Scrapy through the scrapy-splash plugin.
17 | 
18 | **To install Splash, we need to do several things:**
19 | 1. Install [Docker](https://docs.docker.com/install/), create a Docker account (if you don't already have one), and run the Splash container in the background before crawling with
20 | 
21 | ```
22 | docker run -p 8050:8050 scrapinghub/splash
23 | ```
24 | It might take a few minutes to pull the image the first time you do this. When it is done, you can type `localhost:8050` in your browser to check that it's working. If an interface opens up, you are good to go.
25 | 
26 | 2. Install scrapy-splash using pip
27 | 
28 | ```
29 | pip install scrapy-splash
30 | ```
31 | 
32 | See [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash) if you run into any issues.
33 | 
34 | ## Crawling
35 | 
36 | Run the spider with `scrapy crawl airbnb -o {filename}.json -a city='{cityname}' -a price_lb='{pricelowerbound}' -a price_ub='{priceupperbound}'`
37 | 
38 | `cityname` refers to a valid city name
39 | 
40 | `pricelowerbound` refers to a lower bound for price, from 0 to 999
41 | 
42 | `priceupperbound` refers to an upper bound for price, from 0 to 999. The spider will close if `priceupperbound` is less than
43 | `pricelowerbound`.
44 | **Note: Airbnb only returns a maximum of ~300 listings per specific filter (price range). To get more listings, I recommend scraping multiple times using small increments in price and concatenating the datasets.**
45 | 
46 | If you would like to do multiple scrapes over a wide price range (e.g. 10-spaced intervals from 20 to 990), see `cancun.sh`, which I used to crawl a large number of listings for Cancún.
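
Each pass of such a script writes one JSON file per price band (e.g. `cancun_data/15to25.json`). Below is a minimal sketch of the recommended concatenation step — it assumes the `{lb}to{ub}.json` naming and output directory used by the shell scripts, deduplicates on `url` because adjacent price bands can return overlapping listings, and uses an arbitrary name for the combined file:

```
import glob
import json

listings, seen = [], set()
for path in sorted(glob.glob('cancun_data/*.json')):  # one file per scraped price band
    with open(path) as f:
        for row in json.load(f):                      # each file is a JSON array of listing items
            if row.get('url') not in seen:            # skip listings already seen in another band
                seen.add(row.get('url'))
                listings.append(row)

with open('cancun_combined.json', 'w') as f:          # arbitrary name for the merged dataset
    json.dump(listings, f)
```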
47 | 
48 | ## Acknowledgements
49 | 
50 | I would like to thank **Ahmed Rafik** for his guidance and teachings.
51 | 
--------------------------------------------------------------------------------
/airbnb_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__init__.py
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/items.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/items.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/settings.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/settings.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/items.html
7 | 
8 | import scrapy
9 | from scrapy.loader.processors import MapCompose, TakeFirst, Join
10 | 
11 | 
12 | def remove_unicode(value):
13 |     return value.replace(u"\u201c", '').replace(u"\u201d", '').replace(u"\u2764", '').replace(u"\ufe0f", '')
14 | 
15 | class AirbnbScraperItem(scrapy.Item):
16 | 
17 |     # Host Fields
18 |     is_superhost = scrapy.Field()
19 |     host_id = scrapy.Field()
20 | 
21 |     # Room Fields
22 |     price = scrapy.Field()
23 |     url = scrapy.Field()
24 |     is_business_travel_ready = scrapy.Field()
25 |     is_fully_refundable = scrapy.Field()
26 |     is_new_listing = scrapy.Field()
27 |     lat = scrapy.Field()
28 |     lng = scrapy.Field()
29 |     localized_city = scrapy.Field()
30 |     localized_neighborhood = scrapy.Field()
31 |     listing_name = scrapy.Field(input_processor=MapCompose(remove_unicode))
32 |     person_capacity = scrapy.Field()
33 |     picture_count = scrapy.Field()
34 |     reviews_count = scrapy.Field()
35 |     room_type_category = scrapy.Field()
36 |     star_rating = scrapy.Field() # Rounded to .5 or .0 Avg Rating
37 |     avg_rating = scrapy.Field()
38 |     can_instant_book = scrapy.Field()
39 |     monthly_price_factor = scrapy.Field()
40 |     currency = scrapy.Field()
41 |     amt_w_service = scrapy.Field()
42 |     rate_type = scrapy.Field()
43 |     weekly_price_factor = scrapy.Field()
44 |     bathrooms = scrapy.Field()
45 |     bedrooms = scrapy.Field()
46 |     num_beds = scrapy.Field()
47 |     accuracy = scrapy.Field()
48 |     communication = scrapy.Field()
49 |     cleanliness = scrapy.Field()
50 |     location = scrapy.Field()
51 |     checkin = scrapy.Field()
52 |     value = scrapy.Field()
53 |     guest_satisfication = scrapy.Field()
54 |     host_reviews = scrapy.Field()
55 |     response_rate = scrapy.Field()
56 | 
response_time = scrapy.Field() 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /airbnb_scraper/middlewares.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy import signals 9 | 10 | 11 | class AirbnbScraperSpiderMiddleware(object): 12 | # Not all methods need to be defined. If a method is not defined, 13 | # scrapy acts as if the spider middleware does not modify the 14 | # passed objects. 15 | 16 | @classmethod 17 | def from_crawler(cls, crawler): 18 | # This method is used by Scrapy to create your spiders. 19 | s = cls() 20 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 21 | return s 22 | 23 | def process_spider_input(self, response, spider): 24 | # Called for each response that goes through the spider 25 | # middleware and into the spider. 26 | 27 | # Should return None or raise an exception. 28 | return None 29 | 30 | def process_spider_output(self, response, result, spider): 31 | # Called with the results returned from the Spider, after 32 | # it has processed the response. 33 | 34 | # Must return an iterable of Request, dict or Item objects. 35 | for i in result: 36 | yield i 37 | 38 | def process_spider_exception(self, response, exception, spider): 39 | # Called when a spider or process_spider_input() method 40 | # (from other spider middleware) raises an exception. 41 | 42 | # Should return either None or an iterable of Response, dict 43 | # or Item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class AirbnbScraperDownloaderMiddleware(object): 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /airbnb_scraper/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class AirbnbScraperPipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /airbnb_scraper/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for airbnb_scraper project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # https://doc.scrapy.org/en/latest/topics/settings.html 9 | # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 10 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'airbnb_scraper' 13 | 14 | SPIDER_MODULES = ['airbnb_scraper.spiders'] 15 | NEWSPIDER_MODULE = 'airbnb_scraper.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | #USER_AGENT = 'airbnb_scraper (+http://www.yourdomain.com)' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = True 23 | 24 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 25 | #CONCURRENT_REQUESTS = 32 26 | 27 | # Configure a delay for requests for the same website (default: 0) 28 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay 29 | # See also autothrottle settings and docs 30 | DOWNLOAD_DELAY = 3 31 | # The download delay setting will honor only one of: 32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 33 | #CONCURRENT_REQUESTS_PER_IP = 16 34 | 35 | # Disable cookies (enabled by default) 36 | #COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | #DEFAULT_REQUEST_HEADERS = { 43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 44 | # 'Accept-Language': 'en', 45 | #} 46 | 47 | # Enable or disable spider middlewares 48 | # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html 49 | #SPIDER_MIDDLEWARES = { 50 | # 'airbnb_scraper.middlewares.AirbnbScraperSpiderMiddleware': 543, 51 | #} 52 | SPIDER_MIDDLEWARES = { 53 | 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 54 | } 55 | 56 | # Enable or disable downloader middlewares 57 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 58 | #DOWNLOADER_MIDDLEWARES = { 59 | # 'airbnb_scraper.middlewares.AirbnbScraperDownloaderMiddleware': 543, 60 | #} 61 | DOWNLOADER_MIDDLEWARES = { 62 | 'scrapy_splash.SplashCookiesMiddleware': 723, 63 | 'scrapy_splash.SplashMiddleware': 725, 64 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 65 | } 66 | 67 | # Enable or disable 
extensions 68 | # See https://doc.scrapy.org/en/latest/topics/extensions.html 69 | #EXTENSIONS = { 70 | # 'scrapy.extensions.telnet.TelnetConsole': None, 71 | #} 72 | 73 | # Configure item pipelines 74 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html 75 | #ITEM_PIPELINES = { 76 | # 'airbnb_scraper.pipelines.AirbnbScraperPipeline': 300, 77 | #} 78 | 79 | # Enable and configure the AutoThrottle extension (disabled by default) 80 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html 81 | # AUTOTHROTTLE_ENABLED = True 82 | # The initial download delay 83 | #AUTOTHROTTLE_START_DELAY = 5 84 | # The maximum download delay to be set in case of high latencies 85 | #AUTOTHROTTLE_MAX_DELAY = 60 86 | # The average number of requests Scrapy should be sending in parallel to 87 | # each remote server 88 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 89 | # Enable showing throttling stats for every response received: 90 | #AUTOTHROTTLE_DEBUG = False 91 | 92 | # Enable and configure HTTP caching (disabled by default) 93 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 94 | #HTTPCACHE_ENABLED = True 95 | #HTTPCACHE_EXPIRATION_SECS = 0 96 | #HTTPCACHE_DIR = 'httpcache' 97 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 98 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 99 | SPLASH_URL = 'http://localhost:8050' -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/airbnb.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/airbnb.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/airbnb_single_test.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/airbnb_single_test.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/airbnb.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | import collections 4 | import re 5 | import numpy as np 6 | import logging 7 | import sys 8 | import scrapy 9 | from scrapy_splash import SplashRequest 10 | from scrapy.exceptions import CloseSpider 11 | from airbnb_scraper.items import AirbnbScraperItem 12 | 13 | 14 | # ******************************************************************************************** 15 | # Important: Run -> docker run -p 8050:8050 scrapinghub/splash in background before crawling * 16 | # ******************************************************************************************** 17 | 18 | 19 | # ********************************************************************************************* 20 | # Run crawler with -> scrapy crawl airbnb -o 21to25.json -a price_lb='' -a price_ub='' * 21 | # ********************************************************************************************* 22 | 23 | class AirbnbSpider(scrapy.Spider): 24 | name = 'airbnb' 25 | allowed_domains = ['www.airbnb.com'] 26 | 27 | ''' 28 | You don't have to override __init__ each time and can simply use self.parameter (See https://bit.ly/2Wxbkd9), 29 | but I find this way much more readable. 
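    Example invocation (illustrative values, mirroring toronto.sh):
        scrapy crawl airbnb -o 50to55.json -a city='Toronto' -a price_lb='50' -a price_ub='55'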
30 |     '''
31 |     def __init__(self, city='',price_lb='', price_ub='', *args,**kwargs):
32 |         super(AirbnbSpider, self).__init__(*args, **kwargs)
33 |         self.city = city
34 |         self.price_lb = price_lb
35 |         self.price_ub = price_ub
36 | 
37 |     def start_requests(self):
38 |         '''Sends a scrapy request to the designated url price range
39 | 
40 |         Args:
41 |         Returns:
42 |         '''
43 | 
44 |         url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
45 |                '&allow_override%5B%5D=&auto_ib=false&client_session_id='
46 |                '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20&fetch_filters=true'
47 |                '&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
48 |                '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
49 |                '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false&'
50 |                'query={2}'
51 |                '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
52 |                '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
53 |                '&timezone_offset=-240&version=1.5.6'
54 |                '&price_min={0}&price_max={1}')
55 |         new_url = url.format(self.price_lb, self.price_ub, self.city)
56 | 
57 | 
58 |         if (int(self.price_lb) >= 990):
59 |             url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
60 |                    '&allow_override%5B%5D=&auto_ib=false&client_session_id='
61 |                    '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20&fetch_filters=true'
62 |                    '&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
63 |                    '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
64 |                    '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false&'
65 |                    'query={1}'
66 |                    '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
67 |                    '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
68 |                    '&timezone_offset=-240&version=1.5.6'
69 |                    '&price_min={0}')
70 |             new_url = url.format(self.price_lb, self.city)
71 | 
72 |         yield scrapy.Request(url=new_url, callback=self.parse_id, dont_filter=True)
73 | 
74 | 
75 |     def parse_id(self, response):
76 |         '''Parses all the URLs/ids/available fields from the initial json object and stores into dictionary
77 | 
78 |         Args:
79 |             response: Json object from explore_tabs
80 |         Returns:
81 |         '''
82 | 
83 |         # Fetch and Write the response data
84 |         data = json.loads(response.body)
85 | 
86 |         # Return a List of all homes
87 |         homes = data.get('explore_tabs')[0].get('sections')[0].get('listings')
88 | 
89 |         if homes is None:
90 |             try:
91 |                 homes = data.get('explore_tabs')[0].get('sections')[3].get('listings')
92 |             except IndexError:
93 |                 try:
94 |                     homes = data.get('explore_tabs')[0].get('sections')[2].get('listings')
95 |                 except:
96 |                     raise CloseSpider("No homes available in the city and price parameters")
97 | 
98 |         base_url = 'https://www.airbnb.com/rooms/'
99 |         data_dict = collections.defaultdict(dict) # Create Dictionary to put all currently available fields in
100 | 
101 |         for home in homes:
102 |             room_id = str(home.get('listing').get('id'))
103 |             url = base_url + str(home.get('listing').get('id'))
104 |             data_dict[room_id]['url'] = url
105 |             data_dict[room_id]['price'] = home.get('pricing_quote').get('rate').get('amount')
106 |             data_dict[room_id]['bathrooms'] = home.get('listing').get('bathrooms')
107 |             data_dict[room_id]['bedrooms'] = home.get('listing').get('bedrooms')
108 |             data_dict[room_id]['host_languages'] = home.get('listing').get('host_languages')
109 |             data_dict[room_id]['is_business_travel_ready'] = home.get('listing').get('is_business_travel_ready')
110 |             data_dict[room_id]['is_fully_refundable'] = home.get('listing').get('is_fully_refundable')
111 |             data_dict[room_id]['is_new_listing'] = home.get('listing').get('is_new_listing')
112 |             data_dict[room_id]['is_superhost'] = home.get('listing').get('is_superhost')
113 |             data_dict[room_id]['lat'] = home.get('listing').get('lat')
114 |             data_dict[room_id]['lng'] = home.get('listing').get('lng')
115 |             data_dict[room_id]['localized_city'] = home.get('listing').get('localized_city')
116 |             data_dict[room_id]['localized_neighborhood'] = home.get('listing').get('localized_neighborhood')
117 |             data_dict[room_id]['listing_name'] = home.get('listing').get('name')
118 |             data_dict[room_id]['person_capacity'] = home.get('listing').get('person_capacity')
119 |             data_dict[room_id]['picture_count'] = home.get('listing').get('picture_count')
120 |             data_dict[room_id]['reviews_count'] = home.get('listing').get('reviews_count')
121 |             data_dict[room_id]['room_type_category'] = home.get('listing').get('room_type_category')
122 |             data_dict[room_id]['star_rating'] = home.get('listing').get('star_rating')
123 |             data_dict[room_id]['host_id'] = home.get('listing').get('user').get('id')
124 |             data_dict[room_id]['avg_rating'] = home.get('listing').get('avg_rating')
125 |             data_dict[room_id]['can_instant_book'] = home.get('pricing_quote').get('can_instant_book')
126 |             data_dict[room_id]['monthly_price_factor'] = home.get('pricing_quote').get('monthly_price_factor')
127 |             data_dict[room_id]['currency'] = home.get('pricing_quote').get('rate').get('currency')
128 |             data_dict[room_id]['amt_w_service'] = home.get('pricing_quote').get('rate_with_service_fee').get('amount')
129 |             data_dict[room_id]['rate_type'] = home.get('pricing_quote').get('rate_type')
130 |             data_dict[room_id]['weekly_price_factor'] = home.get('pricing_quote').get('weekly_price_factor')
131 | 
132 | 
133 |         # Iterate through dictionary of URLs in the single page to send a SplashRequest for each
134 |         for room_id in data_dict:
135 |             yield SplashRequest(url=base_url+room_id, callback=self.parse_details,
136 |                                 meta=data_dict.get(room_id),
137 |                                 endpoint="render.html",
138 |                                 args={'wait': '0.5'})
139 | 
140 |         # After scraping entire listings page, check if more pages
141 |         pagination_metadata = data.get('explore_tabs')[0].get('pagination_metadata')
142 |         if pagination_metadata.get('has_next_page'):
143 | 
144 |             items_offset = pagination_metadata.get('items_offset')
145 |             section_offset = pagination_metadata.get('section_offset')
146 | 
147 |             new_url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
148 |                        '&allow_override%5B%5D=&auto_ib=false&client_session_id='
149 |                        '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20'
150 |                        '&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
151 |                        '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
152 |                        '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false'
153 |                        '&query={4}'
154 |                        '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
155 |                        '&satori_version=1.1.9&screen_height=797&screen_size=medium&screen_width=885'
156 |                        '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
157 |                        '&timezone_offset=-240&version=1.5.6'
158 |                        '&items_offset={0}&section_offset={1}&price_min={2}&price_max={3}')
159 |             new_url = new_url.format(items_offset, section_offset, self.price_lb, self.price_ub, self.city)
160 | 
161 |             if (int(self.price_lb) >= 990):
162 |                 url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
163 |                        '&allow_override%5B%5D=&auto_ib=false&client_session_id='
164 |                        '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20'
165 |                        '&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
166 |                        '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
167 |                        '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false'
168 |                        '&query={3}'
169 |                        '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
170 |                        '&satori_version=1.1.9&screen_height=797&screen_size=medium&screen_width=885'
171 |                        '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
172 |                        '&timezone_offset=-240&version=1.5.6'
173 |                        '&items_offset={0}&section_offset={1}&price_min={2}')
174 |                 new_url = url.format(items_offset, section_offset, self.price_lb, self.city)
175 | 
176 |             # If there is a next page, update url and scrape from next page
177 |             yield scrapy.Request(url=new_url, callback=self.parse_id)
178 | 
179 |     def parse_details(self, response):
180 |         '''Parses details for a single listing page and stores into AirbnbScraperItem object
181 | 
182 |         Args:
183 |             response: The response from the page (same as inspecting page source)
184 |         Returns:
185 |             An AirbnbScraperItem object containing the set of fields pertaining to the listing
186 |         '''
187 |         # New Instance
188 |         listing = AirbnbScraperItem()
189 | 
190 |         # Fill in fields for Instance from initial scrapy call
191 |         listing['is_superhost'] = response.meta['is_superhost']
192 |         listing['host_id'] = str(response.meta['host_id'])
193 |         listing['price'] = response.meta['price']
194 |         listing['url'] = response.meta['url']
195 |         listing['bathrooms'] = response.meta['bathrooms']
196 |         listing['bedrooms'] = response.meta['bedrooms']
197 |         listing['is_business_travel_ready'] = response.meta['is_business_travel_ready']
198 |         listing['is_fully_refundable'] = response.meta['is_fully_refundable']
199 |         listing['is_new_listing'] = response.meta['is_new_listing']
200 |         listing['lat'] = response.meta['lat']
201 |         listing['lng'] = response.meta['lng']
202 |         listing['localized_city'] = response.meta['localized_city']
203 |         listing['localized_neighborhood'] = response.meta['localized_neighborhood']
204 |         listing['listing_name'] = response.meta['listing_name']
205 |         listing['person_capacity'] = response.meta['person_capacity']
206 |         listing['picture_count'] = response.meta['picture_count']
207 |         listing['reviews_count'] = response.meta['reviews_count']
208 |         listing['room_type_category'] = response.meta['room_type_category']
209 |         listing['star_rating'] = response.meta['star_rating']
210 |         listing['avg_rating'] = response.meta['avg_rating']
211 |         listing['can_instant_book'] = response.meta['can_instant_book']
212 |         listing['monthly_price_factor'] = response.meta['monthly_price_factor']
213 |         listing['weekly_price_factor'] = response.meta['weekly_price_factor']
214 |         listing['currency'] = response.meta['currency']
215 |         listing['amt_w_service'] = response.meta['amt_w_service']
216 |         listing['rate_type'] = response.meta['rate_type']
217 | 
218 |         # Other fields scraped from html
response.text using regex (some might fail hence try/catch) 219 | try: 220 | listing['num_beds'] = int((re.search('"bed_label":"(.).*","bedroom_label"', response.text)).group(1)) 221 | except: 222 | listing['num_beds'] = 0 223 | 224 | try: 225 | listing['host_reviews'] = int((re.search(r'"badges":\[{"count":(.*?),"id":"reviews"', 226 | response.text)).group(1)) 227 | except: 228 | listing['host_reviews'] = 0 229 | 230 | # Main six rating metrics + overall_guest_satisfication 231 | try: 232 | listing['accuracy'] = int((re.search('"accuracy_rating":(.*?),"', response.text)).group(1)) 233 | listing['checkin'] = int((re.search('"checkin_rating":(.*?),"', response.text)).group(1)) 234 | listing['cleanliness'] = int((re.search('"cleanliness_rating":(.*?),"', response.text)).group(1)) 235 | listing['communication'] = int((re.search('"communication_rating":(.*?),"', response.text)).group(1)) 236 | listing['value'] = int((re.search('"value_rating":(.*?),"', response.text)).group(1)) 237 | listing['location'] = int((re.search('"location_rating":(.*?),"', response.text)).group(1)) 238 | listing['guest_satisfication'] = int((re.search('"guest_satisfaction_overall":(.*?),"', 239 | response.text)).group(1)) 240 | except: 241 | listing['accuracy'] = 0 242 | listing['checkin'] = 0 243 | listing['cleanliness'] = 0 244 | listing['communication'] = 0 245 | listing['value'] = 0 246 | listing['location'] = 0 247 | listing['guest_satisfication'] = 0 248 | 249 | # Extra Host Fields 250 | try: 251 | listing['response_rate'] = int((re.search('"response_rate_without_na":"(.*?)%",', response.text)).group(1)) 252 | listing['response_time'] = (re.search('"response_time_without_na":"(.*?)",', response.text)).group(1) 253 | except: 254 | listing['response_rate'] = 0 255 | listing['response_time'] = '' 256 | 257 | # Finally return the object 258 | yield listing -------------------------------------------------------------------------------- /cancun.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 15 10 995` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 10` 8 | 9 | # Fixed Variables 10 | CITY="Cancun" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="cancun_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | 23 | -------------------------------------------------------------------------------- /new_york.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 15 10 995` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 10` 8 | 9 | # Fixed Variables 10 | CITY="New%20York%" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="new_york_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | 23 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # 
https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = airbnb_scraper.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = airbnb_scraper 12 | -------------------------------------------------------------------------------- /testing/airbnb_testing.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import scrapy 3 | from scrapy.exceptions import CloseSpider 4 | import json 5 | import time 6 | import pprint 7 | import collections 8 | 9 | with open('data.json', 'r') as file: 10 | data = json.load(file) 11 | 12 | # Need Try/Catch for first page as sometimes has airbnb plus 13 | 14 | #homes = data.get('explore_tabs')[0].get('sections')[0].get('listings') 15 | 16 | #if homes is None: 17 | homes = data.get('explore_tabs')[0].get('sections')[3].get('listings') 18 | 19 | print(homes) 20 | # data_dict = collections.defaultdict(dict) 21 | 22 | # base_url = 'https://www.airbnb.com/rooms/' 23 | # for home in homes: 24 | # room_id = str(home.get('listing').get('id')) 25 | # url = base_url + str(home.get('listing').get('id')) 26 | 27 | # # Map price for specific home to each url 28 | # data_dict[room_id]['url'] = url 29 | # data_dict[room_id]['price'] = home.get('pricing_quote').get('rate').get('amount') 30 | # data_dict[room_id]['bathrooms'] = home.get('listing').get('bathrooms') 31 | 32 | # # Copy below 33 | # data_dict[room_id]['bedrooms'] = home.get('listing').get('bedrooms') 34 | # data_dict[room_id]['host_languages'] = home.get('listing').get('host_languages') 35 | # data_dict[room_id]['is_business_travel_ready'] = home.get('listing').get('is_business_travel_ready') 36 | # data_dict[room_id]['is_fully_refundable'] = home.get('listing').get('is_fully_refundable') 37 | # data_dict[room_id]['is_new_listing'] = home.get('listing').get('is_new_listing') 38 | # data_dict[room_id]['is_superhost'] = home.get('listing').get('is_superhost') 39 | # data_dict[room_id]['lat'] = home.get('listing').get('lat') 40 | # data_dict[room_id]['lng'] = home.get('listing').get('lng') 41 | # data_dict[room_id]['localized_city'] = home.get('listing').get('localized_city') 42 | # data_dict[room_id]['localized_neighborhood'] = home.get('listing').get('localized_neighborhood') 43 | # data_dict[room_id]['listing_name'] = home.get('listing').get('name') 44 | # data_dict[room_id]['person_capacity'] = home.get('listing').get('person_capacity') 45 | # data_dict[room_id]['picture_count'] = home.get('listing').get('picture_count') 46 | # data_dict[room_id]['reviews_count'] = home.get('listing').get('reviews_count') 47 | # data_dict[room_id]['room_type_category'] = home.get('listing').get('room_type_category') 48 | # data_dict[room_id]['star_rating'] = home.get('listing').get('star_rating') 49 | # data_dict[room_id]['host_id'] = home.get('listing').get('user').get('id') 50 | # data_dict[room_id]['avg_rating'] = home.get('listing').get('avg_rating') 51 | 52 | 53 | # data_dict[room_id]['can_instant_book'] = home.get('pricing_quote').get('can_instant_book') 54 | # data_dict[room_id]['monthly_price_factor'] = home.get('pricing_quote').get('monthly_price_factor') 55 | # data_dict[room_id]['currency'] = home.get('pricing_quote').get('rate').get('currency') 56 | # data_dict[room_id]['amount_with_service_fee'] = home.get('pricing_quote').get('rate_with_service_fee').get('amount') 57 | # data_dict[room_id]['rate_type'] = home.get('pricing_quote').get('rate_type') 58 | # data_dict[room_id]['weekly_price_factor'] = 
home.get('pricing_quote').get('weekly_price_factor') 59 | 60 | 61 | 62 | # printer = pprint.PrettyPrinter() 63 | # printer.pprint(data_dict) -------------------------------------------------------------------------------- /toronto.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 10 10 150` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 5` 8 | 9 | # Fixed Variables 10 | CITY="Toronto" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="toronto_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | --------------------------------------------------------------------------------
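A usage note for the sweep scripts above: each iteration ends with `mv $filename $DATA_LOCATION`, which assumes the destination directory already exists. Before the first run, create it and make the script executable, e.g. `mkdir -p toronto_data && chmod +x toronto.sh && ./toronto.sh`; the same applies to `cancun_data` and `new_york_data` for the other two scripts.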