├── .gitignore
├── README.md
├── restaurants-boston-yellowpages-scraped-data.csv
└── yellow_pages.py


/.gitignore:
--------------------------------------------------------------------------------
/.idea/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Yellow Pages Business Details Scraper

A Yellowpages.com web scraper written in Python with LXML. It extracts the business details listed for a given category and location.

To learn more about this scraper, see the blog post 'How to Scrape Business Details from Yellow Pages using Python and LXML' - https://www.scrapehero.com/how-to-scrape-business-details-from-yellowpages-com-using-python-and-lxml/

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Fields to Extract

This yellow pages scraper can extract the fields below:

1. Rank
2. Business Name
3. Phone Number
4. Business Page
5. Category
6. Website
7. Rating
8. Street name
9. Locality
10. Region
11. Zipcode
12. URL

### Prerequisites

This scraper runs on Python 3 and needs a few packages for downloading and parsing the HTML.
Below are the package requirements:

- lxml
- requests

### Installation

Use pip to install the packages below (https://pip.pypa.io/en/stable/installing/):

- Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)
- Python LXML, to parse the HTML tree structure using XPaths (http://lxml.de/installation.html)

## Running the scraper
Run the script with the positional arguments **keyword** and **place**. Here is an example
that finds the business details for restaurants in Boston, MA.

```
python3 yellow_pages.py restaurants Boston,MA
```
## Sample Output

This creates a CSV file named `<keyword>-<place>-yellowpages-scraped-data.csv` in the current directory:

[Sample Output](https://raw.githubusercontent.com/scrapehero/yellow_pages/master/restaurants-boston-yellowpages-scraped-data.csv)


--------------------------------------------------------------------------------
/restaurants-boston-yellowpages-scraped-data.csv:
--------------------------------------------------------------------------------
"rank","business_name","telephone","business_page","category","website","rating","street","locality","region","zipcode","listing_url"
"1","The Four's Restaurant & Sports Bar","(617) 830-1204","https://www.yellowpages.com/boston-ma/mip/the-fours-restaurant-sports-bar-2475158?lid=209428619","Restaurants,Caterers,Bars","http://www.thefours.com","","166 Canal St","Boston","MA","02114","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"2","No Name Restaurant","(617) 229-6084","https://www.yellowpages.com/boston-ma/mip/no-name-restaurant-8648730?lid=1001715956847","Restaurants,Take Out Restaurants","http://nonamerestaurant.com","5","15 Fish Pier St E","Boston","MA","02210","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"3","Aria Trattoria","(617) 580-5141","https://www.yellowpages.com/boston-ma/mip/aria-trattoria-480202715?lid=1000380204864","Restaurants,Italian Restaurants","http://arianorthend.com/menu/","","253 Hanover St","Boston","MA","02113","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"4","The Four's Restaurant & Sports Bar","(617) 830-4504","https://www.yellowpages.com/boston-ma/mip/the-fours-restaurant-sports-bar-460711092?lid=1000273712008","Bars,Bar & Grills,Caterers,Restaurants","http://www.thefours.com","2","166 Canal St","Boston","MA","02114","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"5","B&G Oysters","(617) 423-0550","https://www.yellowpages.com/boston-ma/mip/b-g-oysters-2650359?lid=2650359","Seafood Restaurants,American Restaurants","http://www.bandgoysters.com","31","550 Tremont St","Boston","MA","02116","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"6","Morton's The Steakhouse","(617) 266-5858","https://www.yellowpages.com/boston-ma/mip/mortons-the-steakhouse-22589061?lid=22589061","Steak Houses,Restaurants","http://www.mortons.com","26","699 Boylston St","Boston","MA","02116","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"7","Abe & Louie's","(617) 536-6300","https://www.yellowpages.com/boston-ma/mip/abe-louies-2198151?lid=2198151","Steak Houses,American Restaurants","http://www.abeandlouies.com","2","793 Boylston St","Boston","MA","02116","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"8","Ruth's Chris Steak House","(617) 742-8401","https://www.yellowpages.com/boston-ma/mip/ruths-chris-steak-house-7154837?lid=7154837","Steak Houses,Restaurants","","","45 School St Ste 1d","Boston","MA","02108","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"9","Island Creek Oyster Bar","(617) 532-5300","https://www.yellowpages.com/boston-ma/mip/island-creek-oyster-bar-460080280?lid=460080280","Seafood Restaurants,American Restaurants","http://www.islandcreekoysterbar.com","1","500 Commonwealth Ave","Boston","MA","02215","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"10","The Oceanaire Seafood Room","(617) 742-2277","https://www.yellowpages.com/boston-ma/mip/the-oceanaire-seafood-room-455904020?lid=455904020","Seafood Restaurants","http://www.theoceanaire.com/Locations/Boston/Locations.aspx?utm_source=Yext&utm_medium=website&utm_campaign=Yext","","40 Court St","Boston","MA","02108","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"11","Joe's American Bar & Grill","(617) 367-8700","https://www.yellowpages.com/boston-ma/mip/joes-american-bar-grill-1011023?lid=1011023","American Restaurants,Bars","http://www.joesamerican.com","15","100 Atlantic Ave","Boston","MA","02110","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"12","Fire & Ice","(617) 482-3473","https://www.yellowpages.com/boston-ma/mip/fire-ice-7382808?lid=7382808","Family Style Restaurants","http://www.fire-ice.com","7","205 Berkeley St","Boston","MA","02116","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"13","Houlihan's","(617) 367-6377","https://www.yellowpages.com/boston-ma/mip/houlihans-467872658?lid=467872658","American Restaurants,Restaurants","http://www.houlihans.com","","60 State St","Boston","MA","02109","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"14","Tremont 647","(617) 266-4600","https://www.yellowpages.com/boston-ma/mip/tremont-647-5731916?lid=5731916","American Restaurants,Caterers","http://tremont647.com","","647 Tremont St","Boston","MA","02118","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"15","Flames Restaurant II","(617) 734-1911","https://www.yellowpages.com/boston-ma/mip/flames-restaurant-ii-3289242?lid=3289242","Latin American Restaurants","","2","746 Huntington Ave","Boston","MA","02115","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"16","McCormick & Schmick's Seafood & Steaks","(617) 720-5522","https://www.yellowpages.com/boston-ma/mip/mccormick-schmicks-seafood-steaks-467887008?lid=467887008","Seafood Restaurants,American Restaurants","http://www.mccormickandschmicks.com/Locations/boston-massachusetts/boston-massachusetts/faneuil-hall-marketplace.aspx?utm_campaign=Yext&utm_source=Yext&utm_medium=Website&utm_content=MSFH","1","1 Faneuil Hall Market Pl","Boston","MA","02109","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"17","Citizen Public House & Oyster Bar","(617) 450-9000","https://www.yellowpages.com/boston-ma/mip/citizen-public-house-oyster-bar-458900906?lid=458900906","American Restaurants,Seafood Restaurants","http://www.citizenpub.com","3","1310 Boylston St","Boston","MA","02215","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"18","Artu","(617) 227-9023","https://www.yellowpages.com/boston-ma/mip/artu-3849315?lid=3849315","Italian Restaurants","http://artuboston.com","1","89 Charles St","Boston","MA","02114","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"19","The North Star","(617) 723-3222","https://www.yellowpages.com/boston-ma/mip/the-north-star-3217193?lid=3217193","American Restaurants,Pizza,Restaurants","http://www.northstarboston.com","","222 Friend St","Boston","MA","02114","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"20","Oceana At The Marriott Long Wharf","(617) 227-3838","https://www.yellowpages.com/boston-ma/mip/oceana-at-the-marriott-long-wharf-471102654?lid=471102654","Seafood Restaurants,Bars,Night Clubs","http://www.marriott.com/hotel-restaurants/boslw-boston-marriott-long-wharf/waterline/71533/home-page.mi","","296 State St","Boston","MA","02109","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"21","Top of the Hub","(617) 536-1775","https://www.yellowpages.com/boston-ma/mip/top-of-the-hub-465512780?lid=465512780","American Restaurants","http://topofthehub.net","64","800 Boylston St Fl 52","Boston","MA","02199","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"22","Terramia Ristorante","(617) 523-3112","https://www.yellowpages.com/boston-ma/mip/terramia-ristorante-13507328?lid=13507328","American Restaurants,Italian Restaurants","http://www.terramiaristorante.com","23","98 Salem St","Boston","MA","02113","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"23","Neptune Oyster","(617) 742-3474","https://www.yellowpages.com/boston-ma/mip/neptune-oyster-3678122?lid=3678122","Seafood Restaurants,American Restaurants","http://www.neptuneoyster.com","30","63 Salem St Ste 1","Boston","MA","02113","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"24","Harvard Gardens","(617) 765-0605","https://www.yellowpages.com/boston-ma/mip/harvard-gardens-22868738?lid=1001361722502","American Restaurants,Restaurants,Bars","http://harvardgardens.com/?reference_id=0&publisher=yellowpages&placement=ypwebsitesrpr&action_target=listing_website","15","316 Cambridge St","Boston","MA","02114","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"25","Blackjack Pasta Bar","(617) 266-1313","https://www.yellowpages.com/boston-ma/mip/blackjack-pasta-bar-7016994?lid=7016994","Italian Restaurants","http://www.blackjack-pasta.com","3","52 Queensberry St","Boston","MA","02215","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"26","21st Amendment","(617) 227-7100","https://www.yellowpages.com/boston-ma/mip/21st-amendment-3466595?lid=3466595","American Restaurants,Brew Pubs,Bars","http://www.21stboston.com","","150 Bowdoin St","Boston","MA","02108","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"27","Boston Chowda Co","(617) 742-7279","https://www.yellowpages.com/boston-ma/mip/boston-chowda-co-4099713?lid=4099713","American Restaurants,Seafood Restaurants","http://www.bostonchowda.com","","1 Faneuil Hall Market Pl","Boston","MA","02109","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"28","Woody's Grill & Tap","(617) 375-9663","https://www.yellowpages.com/boston-ma/mip/woodys-grill-tap-671276?lid=671276","American Restaurants,Bars,Bar & Grills","http://www.woodysboston.com","","58 Hemenway St","Boston","MA","02115","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"29","The Daily Catch","(617) 523-8567","https://www.yellowpages.com/boston-ma/mip/the-daily-catch-10766579?lid=10766579","Seafood Restaurants","http://thedailycatch.com/northend.html","12","323 Hanover St","Boston","MA","02113","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"
"30","Atlantic Fish Co","(617) 267-4000","https://www.yellowpages.com/boston-ma/mip/atlantic-fish-co-8673013?lid=8673013","Seafood Restaurants,American Restaurants","http://www.atlanticfishco.com","4","761 Boylston St","Boston","MA","02116","https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=boston"

--------------------------------------------------------------------------------
/yellow_pages.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
# The standard-library csv module handles Unicode on Python 3,
# so the extra unicodecsv dependency is not needed.
import csv

import requests
from lxml import html

BASE_URL = "https://www.yellowpages.com"
MAX_RETRIES = 10

# XPath for the listing cards, followed by XPaths evaluated relative to each card
XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
XPATH_STREET = ".//div[@class='street-address']//text()"
XPATH_LOCALITY = ".//div[@class='locality']//text()"
XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"


def parse_listing(keyword, place):
    """
    Function to process yellowpage listing page
    :param keyword: search query
    :param place: place name
    :return: list of dicts, one per business; empty list on failure
    """
    url = BASE_URL + "/search"
    # Passing the terms as params lets requests URL-encode spaces, commas and '&'
    params = {'search_terms': keyword, 'geo_location_terms': place}

    print("retrieving", url, params)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               # 'br' is left out: requests only decodes Brotli when an optional package is installed
               'Accept-Encoding': 'gzip, deflate',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
               }

    # Retry on network errors and unexpected status codes
    for retry in range(MAX_RETRIES):
        try:
            response = requests.get(url, params=params, headers=headers, timeout=30)
        except requests.RequestException as error:
            print("Request failed (attempt %d of %d): %s" % (retry + 1, MAX_RETRIES, error))
            continue

        if response.status_code == 404:
            print("Could not find a location matching", place)
            # no need to retry for non existing page
            return []
        if response.status_code != 200:
            print("Failed to process page, status code", response.status_code)
            continue

        print("parsing page")
        parser = html.fromstring(response.text)
        # making links absolute
        parser.make_links_absolute(BASE_URL)

        scraped_results = []
        for results in parser.xpath(XPATH_LISTINGS):
            raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
            raw_business_telephone = results.xpath(XPATH_TELEPHONE)
            raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
            raw_categories = results.xpath(XPATH_CATEGORIES)
            raw_website = results.xpath(XPATH_WEBSITE)
            raw_rating = results.xpath(XPATH_RATING)
            raw_street = results.xpath(XPATH_STREET)
            raw_locality = results.xpath(XPATH_LOCALITY)
            raw_region = results.xpath(XPATH_REGION)
            raw_zip_code = results.xpath(XPATH_ZIP_CODE)
            raw_rank = results.xpath(XPATH_RANK)

            business_name = ''.join(raw_business_name).strip() if raw_business_name else None
            telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
            business_page = ''.join(raw_business_page).strip() if raw_business_page else None
            rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
            category = ','.join(raw_categories).strip() if raw_categories else None
            website = ''.join(raw_website).strip() if raw_website else None
            rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
            street = ''.join(raw_street).strip() if raw_street else None
            locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
            # Prefer the dedicated region and zip code elements when the page has them
            region = ''.join(raw_region).strip() if raw_region else None
            zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None
            # Otherwise split a combined "City, ST 12345" locality string
            if locality and ',' in locality:
                locality, _, remainder = locality.partition(',')
                locality = locality.strip()
                parts = remainder.split()
                region = region or (parts[0] if parts else None)
                zipcode = zipcode or (parts[1] if len(parts) > 1 else None)

            business_details = {
                'business_name': business_name,
                'telephone': telephone,
                'business_page': business_page,
                'rank': rank,
                'category': category,
                'website': website,
                'rating': rating,
                'street': street,
                'locality': locality,
                'region': region,
                'zipcode': zipcode,
                'listing_url': response.url
            }
            scraped_results.append(business_details)

        return scraped_results

    # Every attempt failed
    return []


if __name__ == "__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')

    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        filename = '%s-%s-yellowpages-scraped-data.csv' % (keyword, place)
        print("Writing scraped data to", filename)
        # newline='' stops the csv module from writing blank rows on Windows
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)

--------------------------------------------------------------------------------
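
Note on locality parsing: `parse_listing` in `yellow_pages.py` derives the city, region and zip code from the text of the `locality` element. The sketch below isolates that step in a standalone function that can be tested without a network request. `split_locality` and `LOCALITY_PATTERN` are hypothetical names that are not part of the repository, and the "City, ST 12345" layout is an assumption about the page text rather than something the site guarantees.

```python
import re

# Hypothetical helper, not part of the original repository.
# Assumes locality text shaped like "City, ST 12345"; region and zip code are optional.
LOCALITY_PATTERN = re.compile(
    r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<region>[A-Za-z]{2})?\s*(?P<zipcode>\d{5}(?:-\d{4})?)?\s*$"
)


def split_locality(text):
    """Return (city, region, zipcode); parts that cannot be found are None."""
    if not text:
        return None, None, None
    # Treat non-breaking spaces like ordinary spaces before matching
    match = LOCALITY_PATTERN.match(text.replace("\xa0", " "))
    if match is None:
        # No comma-separated layout: keep whatever text there is as the city
        return text.strip() or None, None, None
    return match.group("city"), match.group("region"), match.group("zipcode")


if __name__ == "__main__":
    for sample in ("Boston, MA 02114", "Boston,\xa0", "Boston", None):
        print(repr(sample), "->", split_locality(sample))
```

Returning `None` for missing parts keeps a listing with an unusual address in the output instead of discarding the whole page because one value failed to unpack.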