├── .DS_Store
├── .vscode
│   └── settings.json
├── README.md
├── airbnb_scraper
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   ├── items.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-36.pyc
│       │   ├── airbnb.cpython-36.pyc
│       │   └── airbnb_single_test.cpython-36.pyc
│       └── airbnb.py
├── cancun.sh
├── new_york.sh
├── scrapy.cfg
├── testing
│   └── airbnb_testing.py
└── toronto.sh
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/.DS_Store
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 |     "python.pythonPath": "/Users/kailu/.pyenv/versions/3.6.5/bin/python"
3 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # airbnb_scraper :spider:
2 | 
3 | Spider built with Scrapy and scrapy-splash to crawl Airbnb listings
4 | 
5 | ## Checklist
6 | 
7 | *This checklist is for personal use and isn't relevant to using the scraper.*
8 | 
9 | - [x] Spider can successfully parse one page of listings
10 | - [x] Spider can successfully parse multiple/all pages of the designated location
11 | - [x] Spider can take price ranges as arguments (`price_lb` and `price_ub`)
12 | - [x] Spider can take location as an argument
13 | 
14 | ## Set up
15 | 
16 | Since Airbnb uses JavaScript to render content, Scrapy on its own is sometimes not enough. We also need Splash, a JavaScript rendering service that integrates nicely with Scrapy through the scrapy-splash plugin.
17 | 
18 | **To install Splash, we need to do several things:**
19 | 1. Install [Docker](https://docs.docker.com/install/), create a Docker account (if you don't already have one), and run the Splash container in the background before crawling with
20 | 
21 | ```
22 | docker run -p 8050:8050 scrapinghub/splash
23 | ```
24 | It might take a few minutes to pull the image the first time you do this. When it is done, you can type `localhost:8050` in your browser to check that it's working. If an interface opens up, you are good to go.
25 | 
26 | 2. Install scrapy-splash using pip
27 | 
28 | ```
29 | pip install scrapy-splash
30 | ```
31 | 
32 | See [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash) if you run into any issues.
33 | 
34 | ## Crawling
35 | 
36 | Run the spider with `scrapy crawl airbnb -o {filename}.json -a city='{cityname}' -a price_lb='{pricelowerbound}' -a price_ub='{priceupperbound}'`
37 | 
38 | `cityname` refers to a valid city name
39 | 
40 | `pricelowerbound` refers to a lower bound for price, from 0 to 999
41 | 
42 | `priceupperbound` refers to an upper bound for price, from 0 to 999. The spider will close if `priceupperbound` is less than
43 | `pricelowerbound`.
44 | **Note: Airbnb only returns a maximum of ~300 listings per specific filter (price range). To get more listings, I recommend scraping multiple times using small increments in price and concatenating the datasets.**
45 | 
46 | If you would like to do multiple scrapes over a wide price range (e.g. 10-spaced intervals from 20 to 990), see `cancun.sh`, which I used to crawl a large number of listings for Cancún.
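
Each pass of such a script writes one JSON file per price band (e.g. `cancun_data/15to25.json`). Below is a minimal sketch of the recommended concatenation step — it assumes the `{lb}to{ub}.json` naming and output directory used by the shell scripts, deduplicates on `url` because adjacent price bands can return overlapping listings, and uses an arbitrary name for the combined file:

```
import glob
import json

listings, seen = [], set()
for path in sorted(glob.glob('cancun_data/*.json')):  # one file per scraped price band
    with open(path) as f:
        for row in json.load(f):                      # each file is a JSON array of listing items
            if row.get('url') not in seen:            # skip listings already seen in another band
                seen.add(row.get('url'))
                listings.append(row)

with open('cancun_combined.json', 'w') as f:          # arbitrary name for the merged dataset
    json.dump(listings, f)
```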
47 | 
48 | ## Acknowledgements
49 | 
50 | I would like to thank **Ahmed Rafik** for his guidance and teachings.
51 | 
--------------------------------------------------------------------------------
/airbnb_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__init__.py
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/__init__.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/items.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/items.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/__pycache__/settings.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/__pycache__/settings.cpython-36.pyc
--------------------------------------------------------------------------------
/airbnb_scraper/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/items.html
7 | 
8 | import scrapy
9 | from scrapy.loader.processors import MapCompose, TakeFirst, Join
10 | 
11 | 
12 | def remove_unicode(value):
13 |     return value.replace(u"\u201c", '').replace(u"\u201d", '').replace(u"\u2764", '').replace(u"\ufe0f", '')
14 | 
15 | class AirbnbScraperItem(scrapy.Item):
16 | 
17 |     # Host Fields
18 |     is_superhost = scrapy.Field()
19 |     host_id = scrapy.Field()
20 | 
21 |     # Room Fields
22 |     price = scrapy.Field()
23 |     url = scrapy.Field()
24 |     is_business_travel_ready = scrapy.Field()
25 |     is_fully_refundable = scrapy.Field()
26 |     is_new_listing = scrapy.Field()
27 |     lat = scrapy.Field()
28 |     lng = scrapy.Field()
29 |     localized_city = scrapy.Field()
30 |     localized_neighborhood = scrapy.Field()
31 |     listing_name = scrapy.Field(input_processor=MapCompose(remove_unicode))
32 |     person_capacity = scrapy.Field()
33 |     picture_count = scrapy.Field()
34 |     reviews_count = scrapy.Field()
35 |     room_type_category = scrapy.Field()
36 |     star_rating = scrapy.Field() # Rounded to .5 or .0 Avg Rating
37 |     avg_rating = scrapy.Field()
38 |     can_instant_book = scrapy.Field()
39 |     monthly_price_factor = scrapy.Field()
40 |     currency = scrapy.Field()
41 |     amt_w_service = scrapy.Field()
42 |     rate_type = scrapy.Field()
43 |     weekly_price_factor = scrapy.Field()
44 |     bathrooms = scrapy.Field()
45 |     bedrooms = scrapy.Field()
46 |     num_beds = scrapy.Field()
47 |     accuracy = scrapy.Field()
48 |     communication = scrapy.Field()
49 |     cleanliness = scrapy.Field()
50 |     location = scrapy.Field()
51 |     checkin = scrapy.Field()
52 |     value = scrapy.Field()
53 |     guest_satisfication = scrapy.Field()
54 |     host_reviews = scrapy.Field()
55 |     response_rate = scrapy.Field()
56 | 
response_time = scrapy.Field() 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /airbnb_scraper/middlewares.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy import signals 9 | 10 | 11 | class AirbnbScraperSpiderMiddleware(object): 12 | # Not all methods need to be defined. If a method is not defined, 13 | # scrapy acts as if the spider middleware does not modify the 14 | # passed objects. 15 | 16 | @classmethod 17 | def from_crawler(cls, crawler): 18 | # This method is used by Scrapy to create your spiders. 19 | s = cls() 20 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 21 | return s 22 | 23 | def process_spider_input(self, response, spider): 24 | # Called for each response that goes through the spider 25 | # middleware and into the spider. 26 | 27 | # Should return None or raise an exception. 28 | return None 29 | 30 | def process_spider_output(self, response, result, spider): 31 | # Called with the results returned from the Spider, after 32 | # it has processed the response. 33 | 34 | # Must return an iterable of Request, dict or Item objects. 35 | for i in result: 36 | yield i 37 | 38 | def process_spider_exception(self, response, exception, spider): 39 | # Called when a spider or process_spider_input() method 40 | # (from other spider middleware) raises an exception. 41 | 42 | # Should return either None or an iterable of Response, dict 43 | # or Item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class AirbnbScraperDownloaderMiddleware(object): 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /airbnb_scraper/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class AirbnbScraperPipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /airbnb_scraper/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for airbnb_scraper project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # https://doc.scrapy.org/en/latest/topics/settings.html 9 | # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 10 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'airbnb_scraper' 13 | 14 | SPIDER_MODULES = ['airbnb_scraper.spiders'] 15 | NEWSPIDER_MODULE = 'airbnb_scraper.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | #USER_AGENT = 'airbnb_scraper (+http://www.yourdomain.com)' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = True 23 | 24 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 25 | #CONCURRENT_REQUESTS = 32 26 | 27 | # Configure a delay for requests for the same website (default: 0) 28 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay 29 | # See also autothrottle settings and docs 30 | DOWNLOAD_DELAY = 3 31 | # The download delay setting will honor only one of: 32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 33 | #CONCURRENT_REQUESTS_PER_IP = 16 34 | 35 | # Disable cookies (enabled by default) 36 | #COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | #DEFAULT_REQUEST_HEADERS = { 43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 44 | # 'Accept-Language': 'en', 45 | #} 46 | 47 | # Enable or disable spider middlewares 48 | # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html 49 | #SPIDER_MIDDLEWARES = { 50 | # 'airbnb_scraper.middlewares.AirbnbScraperSpiderMiddleware': 543, 51 | #} 52 | SPIDER_MIDDLEWARES = { 53 | 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 54 | } 55 | 56 | # Enable or disable downloader middlewares 57 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 58 | #DOWNLOADER_MIDDLEWARES = { 59 | # 'airbnb_scraper.middlewares.AirbnbScraperDownloaderMiddleware': 543, 60 | #} 61 | DOWNLOADER_MIDDLEWARES = { 62 | 'scrapy_splash.SplashCookiesMiddleware': 723, 63 | 'scrapy_splash.SplashMiddleware': 725, 64 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 65 | } 66 | 67 | # Enable or disable 
extensions 68 | # See https://doc.scrapy.org/en/latest/topics/extensions.html 69 | #EXTENSIONS = { 70 | # 'scrapy.extensions.telnet.TelnetConsole': None, 71 | #} 72 | 73 | # Configure item pipelines 74 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html 75 | #ITEM_PIPELINES = { 76 | # 'airbnb_scraper.pipelines.AirbnbScraperPipeline': 300, 77 | #} 78 | 79 | # Enable and configure the AutoThrottle extension (disabled by default) 80 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html 81 | # AUTOTHROTTLE_ENABLED = True 82 | # The initial download delay 83 | #AUTOTHROTTLE_START_DELAY = 5 84 | # The maximum download delay to be set in case of high latencies 85 | #AUTOTHROTTLE_MAX_DELAY = 60 86 | # The average number of requests Scrapy should be sending in parallel to 87 | # each remote server 88 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 89 | # Enable showing throttling stats for every response received: 90 | #AUTOTHROTTLE_DEBUG = False 91 | 92 | # Enable and configure HTTP caching (disabled by default) 93 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 94 | #HTTPCACHE_ENABLED = True 95 | #HTTPCACHE_EXPIRATION_SECS = 0 96 | #HTTPCACHE_DIR = 'httpcache' 97 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 98 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 99 | SPLASH_URL = 'http://localhost:8050' -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/airbnb.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/airbnb.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/__pycache__/airbnb_single_test.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kailu3/airbnb-scraper/c0181559b1473d32514ab0faf6ecaaed44f6e087/airbnb_scraper/spiders/__pycache__/airbnb_single_test.cpython-36.pyc -------------------------------------------------------------------------------- /airbnb_scraper/spiders/airbnb.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | import collections 4 | import re 5 | import numpy as np 6 | import logging 7 | import sys 8 | import scrapy 9 | from scrapy_splash import SplashRequest 10 | from scrapy.exceptions import CloseSpider 11 | from airbnb_scraper.items import AirbnbScraperItem 12 | 13 | 14 | # ******************************************************************************************** 15 | # Important: Run -> docker run -p 8050:8050 scrapinghub/splash in background before crawling * 16 | # ******************************************************************************************** 17 | 18 | 19 | # ********************************************************************************************* 20 | # Run crawler with -> scrapy crawl airbnb -o 21to25.json -a price_lb='' -a price_ub='' * 21 | # ********************************************************************************************* 22 | 23 | class AirbnbSpider(scrapy.Spider): 24 | name = 'airbnb' 25 | allowed_domains = ['www.airbnb.com'] 26 | 27 | ''' 28 | You don't have to override __init__ each time and can simply use self.parameter (See https://bit.ly/2Wxbkd9), 29 | but I find this way much more readable. 
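    Example invocation (illustrative values, mirroring toronto.sh):
        scrapy crawl airbnb -o 50to55.json -a city='Toronto' -a price_lb='50' -a price_ub='55'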
30 |     '''
31 |     def __init__(self, city='',price_lb='', price_ub='', *args,**kwargs):
32 |         super(AirbnbSpider, self).__init__(*args, **kwargs)
33 |         self.city = city
34 |         self.price_lb = price_lb
35 |         self.price_ub = price_ub
36 | 
37 |     def start_requests(self):
38 |         '''Sends a scrapy request to the designated url price range
39 | 
40 |         Args:
41 |         Returns:
42 |         '''
43 | 
44 |         url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
45 |                '&allow_override%5B%5D=&auto_ib=false&client_session_id='
46 |                '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20&fetch_filters=true'
47 |                '&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
48 |                '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
49 |                '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false&'
50 |                'query={2}'
51 |                '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
52 |                '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
53 |                '&timezone_offset=-240&version=1.5.6'
54 |                '&price_min={0}&price_max={1}')
55 |         new_url = url.format(self.price_lb, self.price_ub, self.city)
56 | 
57 | 
58 |         if (int(self.price_lb) >= 990):
59 |             url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
60 |                    '&allow_override%5B%5D=&auto_ib=false&client_session_id='
61 |                    '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20&fetch_filters=true'
62 |                    '&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
63 |                    '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
64 |                    '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false&'
65 |                    'query={1}'
66 |                    '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
67 |                    '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
68 |                    '&timezone_offset=-240&version=1.5.6'
69 |                    '&price_min={0}')
70 |             new_url = url.format(self.price_lb, self.city)
71 | 
72 |         yield scrapy.Request(url=new_url, callback=self.parse_id, dont_filter=True)
73 | 
74 | 
75 |     def parse_id(self, response):
76 |         '''Parses all the URLs/ids/available fields from the initial json object and stores into dictionary
77 | 
78 |         Args:
79 |             response: Json object from explore_tabs
80 |         Returns:
81 |         '''
82 | 
83 |         # Fetch and Write the response data
84 |         data = json.loads(response.body)
85 | 
86 |         # Return a List of all homes
87 |         homes = data.get('explore_tabs')[0].get('sections')[0].get('listings')
88 | 
89 |         if homes is None:
90 |             try:
91 |                 homes = data.get('explore_tabs')[0].get('sections')[3].get('listings')
92 |             except IndexError:
93 |                 try:
94 |                     homes = data.get('explore_tabs')[0].get('sections')[2].get('listings')
95 |                 except:
96 |                     raise CloseSpider("No homes available in the city and price parameters")
97 | 
98 |         base_url = 'https://www.airbnb.com/rooms/'
99 |         data_dict = collections.defaultdict(dict) # Create Dictionary to put all currently available fields in
100 | 
101 |         for home in homes:
102 |             room_id = str(home.get('listing').get('id'))
103 |             url = base_url + str(home.get('listing').get('id'))
104 |             data_dict[room_id]['url'] = url
105 |             data_dict[room_id]['price'] = home.get('pricing_quote').get('rate').get('amount')
106 |             data_dict[room_id]['bathrooms'] = home.get('listing').get('bathrooms')
107 |             data_dict[room_id]['bedrooms'] = home.get('listing').get('bedrooms')
108 |             data_dict[room_id]['host_languages'] = home.get('listing').get('host_languages')
109 |             data_dict[room_id]['is_business_travel_ready'] = home.get('listing').get('is_business_travel_ready')
110 |             data_dict[room_id]['is_fully_refundable'] = home.get('listing').get('is_fully_refundable')
111 |             data_dict[room_id]['is_new_listing'] = home.get('listing').get('is_new_listing')
112 |             data_dict[room_id]['is_superhost'] = home.get('listing').get('is_superhost')
113 |             data_dict[room_id]['lat'] = home.get('listing').get('lat')
114 |             data_dict[room_id]['lng'] = home.get('listing').get('lng')
115 |             data_dict[room_id]['localized_city'] = home.get('listing').get('localized_city')
116 |             data_dict[room_id]['localized_neighborhood'] = home.get('listing').get('localized_neighborhood')
117 |             data_dict[room_id]['listing_name'] = home.get('listing').get('name')
118 |             data_dict[room_id]['person_capacity'] = home.get('listing').get('person_capacity')
119 |             data_dict[room_id]['picture_count'] = home.get('listing').get('picture_count')
120 |             data_dict[room_id]['reviews_count'] = home.get('listing').get('reviews_count')
121 |             data_dict[room_id]['room_type_category'] = home.get('listing').get('room_type_category')
122 |             data_dict[room_id]['star_rating'] = home.get('listing').get('star_rating')
123 |             data_dict[room_id]['host_id'] = home.get('listing').get('user').get('id')
124 |             data_dict[room_id]['avg_rating'] = home.get('listing').get('avg_rating')
125 |             data_dict[room_id]['can_instant_book'] = home.get('pricing_quote').get('can_instant_book')
126 |             data_dict[room_id]['monthly_price_factor'] = home.get('pricing_quote').get('monthly_price_factor')
127 |             data_dict[room_id]['currency'] = home.get('pricing_quote').get('rate').get('currency')
128 |             data_dict[room_id]['amt_w_service'] = home.get('pricing_quote').get('rate_with_service_fee').get('amount')
129 |             data_dict[room_id]['rate_type'] = home.get('pricing_quote').get('rate_type')
130 |             data_dict[room_id]['weekly_price_factor'] = home.get('pricing_quote').get('weekly_price_factor')
131 | 
132 | 
133 |         # Iterate through dictionary of URLs in the single page to send a SplashRequest for each
134 |         for room_id in data_dict:
135 |             yield SplashRequest(url=base_url+room_id, callback=self.parse_details,
136 |                                 meta=data_dict.get(room_id),
137 |                                 endpoint="render.html",
138 |                                 args={'wait': '0.5'})
139 | 
140 |         # After scraping entire listings page, check if more pages
141 |         pagination_metadata = data.get('explore_tabs')[0].get('pagination_metadata')
142 |         if pagination_metadata.get('has_next_page'):
143 | 
144 |             items_offset = pagination_metadata.get('items_offset')
145 |             section_offset = pagination_metadata.get('section_offset')
146 | 
147 |             new_url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
148 |                        '&allow_override%5B%5D=&auto_ib=false&client_session_id='
149 |                        '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20'
150 |                        '&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
151 |                        '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
152 |                        '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false'
153 |                        '&query={4}'
154 |                        '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
155 |                        '&satori_version=1.1.9&screen_height=797&screen_size=medium&screen_width=885'
156 |                        '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
157 |                        '&timezone_offset=-240&version=1.5.6'
158 |                        '&items_offset={0}&section_offset={1}&price_min={2}&price_max={3}')
159 |             new_url = new_url.format(items_offset, section_offset, self.price_lb, self.price_ub, self.city)
160 | 
161 |             if (int(self.price_lb) >= 990):
162 |                 url = ('https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&_intents=p1'
163 |                        '&allow_override%5B%5D=&auto_ib=false&client_session_id='
164 |                        '621cf853-d03e-4108-b717-c14962b6ab8b&currency=CAD&experiences_per_grid=20'
165 |                        '&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&is_guided_search=true'
166 |                        '&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18'
167 |                        '&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&luxury_pre_launch=false&metadata_only=false'
168 |                        '&query={3}'
169 |                        '&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=QLb9RB7g'
170 |                        '&satori_version=1.1.9&screen_height=797&screen_size=medium&screen_width=885'
171 |                        '&search_type=FILTER_CHANGE&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true'
172 |                        '&timezone_offset=-240&version=1.5.6'
173 |                        '&items_offset={0}&section_offset={1}&price_min={2}')
174 |                 new_url = url.format(items_offset, section_offset, self.price_lb, self.city)
175 | 
176 |             # If there is a next page, update url and scrape from next page
177 |             yield scrapy.Request(url=new_url, callback=self.parse_id)
178 | 
179 |     def parse_details(self, response):
180 |         '''Parses details for a single listing page and stores into AirbnbScraperItem object
181 | 
182 |         Args:
183 |             response: The response from the page (same as inspecting page source)
184 |         Returns:
185 |             An AirbnbScraperItem object containing the set of fields pertaining to the listing
186 |         '''
187 |         # New Instance
188 |         listing = AirbnbScraperItem()
189 | 
190 |         # Fill in fields for Instance from initial scrapy call
191 |         listing['is_superhost'] = response.meta['is_superhost']
192 |         listing['host_id'] = str(response.meta['host_id'])
193 |         listing['price'] = response.meta['price']
194 |         listing['url'] = response.meta['url']
195 |         listing['bathrooms'] = response.meta['bathrooms']
196 |         listing['bedrooms'] = response.meta['bedrooms']
197 |         listing['is_business_travel_ready'] = response.meta['is_business_travel_ready']
198 |         listing['is_fully_refundable'] = response.meta['is_fully_refundable']
199 |         listing['is_new_listing'] = response.meta['is_new_listing']
200 |         listing['lat'] = response.meta['lat']
201 |         listing['lng'] = response.meta['lng']
202 |         listing['localized_city'] = response.meta['localized_city']
203 |         listing['localized_neighborhood'] = response.meta['localized_neighborhood']
204 |         listing['listing_name'] = response.meta['listing_name']
205 |         listing['person_capacity'] = response.meta['person_capacity']
206 |         listing['picture_count'] = response.meta['picture_count']
207 |         listing['reviews_count'] = response.meta['reviews_count']
208 |         listing['room_type_category'] = response.meta['room_type_category']
209 |         listing['star_rating'] = response.meta['star_rating']
210 |         listing['avg_rating'] = response.meta['avg_rating']
211 |         listing['can_instant_book'] = response.meta['can_instant_book']
212 |         listing['monthly_price_factor'] = response.meta['monthly_price_factor']
213 |         listing['weekly_price_factor'] = response.meta['weekly_price_factor']
214 |         listing['currency'] = response.meta['currency']
215 |         listing['amt_w_service'] = response.meta['amt_w_service']
216 |         listing['rate_type'] = response.meta['rate_type']
217 | 
218 |         # Other fields scraped from html
response.text using regex (some might fail hence try/catch) 219 | try: 220 | listing['num_beds'] = int((re.search('"bed_label":"(.).*","bedroom_label"', response.text)).group(1)) 221 | except: 222 | listing['num_beds'] = 0 223 | 224 | try: 225 | listing['host_reviews'] = int((re.search(r'"badges":\[{"count":(.*?),"id":"reviews"', 226 | response.text)).group(1)) 227 | except: 228 | listing['host_reviews'] = 0 229 | 230 | # Main six rating metrics + overall_guest_satisfication 231 | try: 232 | listing['accuracy'] = int((re.search('"accuracy_rating":(.*?),"', response.text)).group(1)) 233 | listing['checkin'] = int((re.search('"checkin_rating":(.*?),"', response.text)).group(1)) 234 | listing['cleanliness'] = int((re.search('"cleanliness_rating":(.*?),"', response.text)).group(1)) 235 | listing['communication'] = int((re.search('"communication_rating":(.*?),"', response.text)).group(1)) 236 | listing['value'] = int((re.search('"value_rating":(.*?),"', response.text)).group(1)) 237 | listing['location'] = int((re.search('"location_rating":(.*?),"', response.text)).group(1)) 238 | listing['guest_satisfication'] = int((re.search('"guest_satisfaction_overall":(.*?),"', 239 | response.text)).group(1)) 240 | except: 241 | listing['accuracy'] = 0 242 | listing['checkin'] = 0 243 | listing['cleanliness'] = 0 244 | listing['communication'] = 0 245 | listing['value'] = 0 246 | listing['location'] = 0 247 | listing['guest_satisfication'] = 0 248 | 249 | # Extra Host Fields 250 | try: 251 | listing['response_rate'] = int((re.search('"response_rate_without_na":"(.*?)%",', response.text)).group(1)) 252 | listing['response_time'] = (re.search('"response_time_without_na":"(.*?)",', response.text)).group(1) 253 | except: 254 | listing['response_rate'] = 0 255 | listing['response_time'] = '' 256 | 257 | # Finally return the object 258 | yield listing -------------------------------------------------------------------------------- /cancun.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 15 10 995` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 10` 8 | 9 | # Fixed Variables 10 | CITY="Cancun" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="cancun_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | 23 | -------------------------------------------------------------------------------- /new_york.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 15 10 995` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 10` 8 | 9 | # Fixed Variables 10 | CITY="New%20York%" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="new_york_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | 23 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # 
https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = airbnb_scraper.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = airbnb_scraper 12 | -------------------------------------------------------------------------------- /testing/airbnb_testing.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import scrapy 3 | from scrapy.exceptions import CloseSpider 4 | import json 5 | import time 6 | import pprint 7 | import collections 8 | 9 | with open('data.json', 'r') as file: 10 | data = json.load(file) 11 | 12 | # Need Try/Catch for first page as sometimes has airbnb plus 13 | 14 | #homes = data.get('explore_tabs')[0].get('sections')[0].get('listings') 15 | 16 | #if homes is None: 17 | homes = data.get('explore_tabs')[0].get('sections')[3].get('listings') 18 | 19 | print(homes) 20 | # data_dict = collections.defaultdict(dict) 21 | 22 | # base_url = 'https://www.airbnb.com/rooms/' 23 | # for home in homes: 24 | # room_id = str(home.get('listing').get('id')) 25 | # url = base_url + str(home.get('listing').get('id')) 26 | 27 | # # Map price for specific home to each url 28 | # data_dict[room_id]['url'] = url 29 | # data_dict[room_id]['price'] = home.get('pricing_quote').get('rate').get('amount') 30 | # data_dict[room_id]['bathrooms'] = home.get('listing').get('bathrooms') 31 | 32 | # # Copy below 33 | # data_dict[room_id]['bedrooms'] = home.get('listing').get('bedrooms') 34 | # data_dict[room_id]['host_languages'] = home.get('listing').get('host_languages') 35 | # data_dict[room_id]['is_business_travel_ready'] = home.get('listing').get('is_business_travel_ready') 36 | # data_dict[room_id]['is_fully_refundable'] = home.get('listing').get('is_fully_refundable') 37 | # data_dict[room_id]['is_new_listing'] = home.get('listing').get('is_new_listing') 38 | # data_dict[room_id]['is_superhost'] = home.get('listing').get('is_superhost') 39 | # data_dict[room_id]['lat'] = home.get('listing').get('lat') 40 | # data_dict[room_id]['lng'] = home.get('listing').get('lng') 41 | # data_dict[room_id]['localized_city'] = home.get('listing').get('localized_city') 42 | # data_dict[room_id]['localized_neighborhood'] = home.get('listing').get('localized_neighborhood') 43 | # data_dict[room_id]['listing_name'] = home.get('listing').get('name') 44 | # data_dict[room_id]['person_capacity'] = home.get('listing').get('person_capacity') 45 | # data_dict[room_id]['picture_count'] = home.get('listing').get('picture_count') 46 | # data_dict[room_id]['reviews_count'] = home.get('listing').get('reviews_count') 47 | # data_dict[room_id]['room_type_category'] = home.get('listing').get('room_type_category') 48 | # data_dict[room_id]['star_rating'] = home.get('listing').get('star_rating') 49 | # data_dict[room_id]['host_id'] = home.get('listing').get('user').get('id') 50 | # data_dict[room_id]['avg_rating'] = home.get('listing').get('avg_rating') 51 | 52 | 53 | # data_dict[room_id]['can_instant_book'] = home.get('pricing_quote').get('can_instant_book') 54 | # data_dict[room_id]['monthly_price_factor'] = home.get('pricing_quote').get('monthly_price_factor') 55 | # data_dict[room_id]['currency'] = home.get('pricing_quote').get('rate').get('currency') 56 | # data_dict[room_id]['amount_with_service_fee'] = home.get('pricing_quote').get('rate_with_service_fee').get('amount') 57 | # data_dict[room_id]['rate_type'] = home.get('pricing_quote').get('rate_type') 58 | # data_dict[room_id]['weekly_price_factor'] = 
home.get('pricing_quote').get('weekly_price_factor') 59 | 60 | 61 | 62 | # printer = pprint.PrettyPrinter() 63 | # printer.pprint(data_dict) -------------------------------------------------------------------------------- /toronto.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for var in `seq 10 10 150` 4 | do 5 | 6 | lower_bound=$var 7 | upper_bound=`expr $var + 5` 8 | 9 | # Fixed Variables 10 | CITY="Toronto" 11 | CONJUNCTION="to" 12 | FORMAT=".json" 13 | DATA_LOCATION="toronto_data" 14 | 15 | filename="$lower_bound$CONJUNCTION$upper_bound$FORMAT" 16 | 17 | # Run scraper on specific range 18 | scrapy crawl airbnb -o $filename -a city=$CITY -a price_lb=$lower_bound -a price_ub=$upper_bound 19 | mv $filename $DATA_LOCATION 20 | 21 | done 22 | --------------------------------------------------------------------------------
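A usage note for the sweep scripts above: each iteration ends with `mv $filename $DATA_LOCATION`, which assumes the destination directory already exists. Before the first run, create it and make the script executable, e.g. `mkdir -p toronto_data && chmod +x toronto.sh && ./toronto.sh`; the same applies to `cancun_data` and `new_york_data` for the other two scripts.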