├── README.md
├── beagle_scraper.py
├── middlewares.py
└── start_scraper.py

/README.md:
--------------------------------------------------------------------------------
# Beagle Scraper

Building the largest open-source e-commerce scraper with Python and BeautifulSoup4

## Usage

No installation or setup required

1. Download the source code into a folder
2. Create a **urls.txt** file with the product category pages to be scraped, like this [Amazon page](https://www.amazon.com/TVs-HDTVs-Audio-Video/b/ref=tv_nav_tvs?ie=UTF8&node=172659&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-leftnav&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_t=101&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_i=1266092011)
3. Run the command
```
$ python start_scraper.py
```

### Output

Beagle Scraper will export all scraped data in JSON format into a job sub-folder

## Currently supported e-commerce stores

* Amazon.com
* BestBuy.com
* HomeDepot.com

### Beagle Scraper tutorial - how to use and run the scraper

https://www.bestproxyproviders.com/blog/beagle-scraper-tutorial-how-to-scrape-e-commerce-websites-and-modify-the-scraper/

## Getting Started

Beagle Scraper requires a machine with Python 2.7 and BeautifulSoup4

Install BeautifulSoup4
```
$ pip install beautifulsoup4
```

### Prerequisites - extra Python packages required

The following packages are not included in the default Python 2.7 install and require installation

* tldextract
```
$ sudo pip install tldextract
```
* selenium
```
$ pip install selenium
```
If another package is missing, run the command

```
$ pip install [missing package name]
```

## Using proxies to scrape

Beagle Scraper doesn't support proxies natively at the moment, but [proxychains](https://github.com/haad/proxychains) can be used to send requests through different proxies

After installing proxychains, run this command to make the scraper use proxies
```
$ proxychains python start_scraper.py
```

## Test Beagle Scraper

Here's a short test for Beagle Scraper

1. Download Beagle Scraper
2. Create a **urls.txt** file and insert the following product category pages (each link on a different line)

   * https://www.amazon.com/TVs-HDTVs-Audio-Video/b/ref=tv_nav_tvs?ie=UTF8&node=172659&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-leftnav&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_t=101&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_i=1266092011
   * https://www.bestbuy.com/site/tvs/75-inch-tvs/pcmcat1514910595284.c?id=pcmcat1514910595284

3. Run Beagle Scraper

```
$ python start_scraper.py
```
Example output files for the above URLs (each file holds a list of product records; a sample record is shown below):

* amazon_products_job_[date].json
* bestbuy_products_job_[date].json
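Each record carries the fields that the scrapers in `beagle_scraper.py` collect (title, url, price, rating, domain). The values below are purely illustrative:

```
{
    "title": "Example 55-Inch 4K Ultra HD Smart LED TV",
    "url": "https://www.amazon.com/dp/EXAMPLE",
    "price": "329",
    "rating": "4.5 out of 5 stars",
    "domain": "amazon.com"
}
```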
## Built With

* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - Scraping library
* [Python 2.7](https://www.python.org/) - Language and runtime

## How to contribute

All you have to do is create a scraper function like **amazon_scraper()** from [beagle_scraper.py](https://github.com/ChrisRoark/beagle_scraper/blob/master/beagle_scraper.py) and submit it here.

When creating a scraper function, there are three things to consider (a minimal skeleton is sketched below):
1. The HTML wrapper and class/id for each product listed on the page
2. The product details HTML tags and classes
3. Pagination setup
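As a rough starting point, here is a minimal sketch of the shape a new scraper function takes. The store, the `product-tile`/`product-title`/`product-price`/`next-page` classes and the `example.com` URLs are placeholders, not a real site's markup - swap in the wrapper, detail tags and pagination selector of the store you are adding:

```
import urllib2
from bs4 import BeautifulSoup

from middlewares import output_file, time_out, pagination_timeout

example_products_list = []

def example_scraper(category_url):
    #1. Request the category page with a browser-like User-Agent
    request = urllib2.Request(str(category_url))
    request.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
    soup = BeautifulSoup(urllib2.urlopen(request), 'html.parser')

    #2. Loop over the product wrappers and extract the details
    for div in soup.find_all('div', {'class': 'product-tile'}):
        example_products_list.append({
            'title': div.find('h2', {'class': 'product-title'}).text.strip(),
            'url': div.find('a', href=True)['href'],
            'price': div.find('span', {'class': 'product-price'}).text.strip(),
            'domain': 'example.com'})
    time_out(6)

    #3. Export the data and follow the next page link, if there is one
    output_file('example', example_products_list)
    next_page = soup.find('a', {'class': 'next-page'})
    if next_page:
        pagination_timeout(5)
        example_scraper('https://www.example.com' + str(next_page['href']))
    return
```

Once the selectors work for the store you are adding, place the function in `beagle_scraper.py` and hook its domain into the checks in `start_scraper.py`.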
## Authors

* **Chris Roark** - *Initial work* - [ChrisRoark](https://github.com/ChrisRoark)

## License

GPL-3.0 license
--------------------------------------------------------------------------------
/beagle_scraper.py:
--------------------------------------------------------------------------------
"""
### BEAGLE SCRAPER ###

### The product category scraper ###

"""
import urllib2
from bs4 import BeautifulSoup
import csv
import json
import re
from random import randint
import time
from datetime import datetime, date
import os
import tldextract
from selenium import webdriver
import cookielib
from middlewares import error_log, output_file, time_out, pagination_timeout

"""
Scraper functions for each e-commerce store

Start scraper options:

1. Call the scraper function on a product category URL, e.g. amazon_scraper('amazon.com/smartphones')
2. Insert multiple product category URLs into a urls.txt file and run "python start_scraper.py"
"""
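# For a quick, single-category run, the scrapers can also be called directly
# from a Python shell; the URL below is a placeholder for a real category page:
#
#   >>> import beagle_scraper
#   >>> beagle_scraper.amazon_scraper('https://www.amazon.com/<product-category-page>')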
scrape_page_log = []
amazon_products_list = []
bestbuy_products_list = []
homedepot_products_list = []


def amazon_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #to avoid pages with pagination issues, check that the page wasn't scraped already
    if page_link not in scrape_page_log:
        #append the url to the pages that were scraped
        scrape_page_log.append(page_link)
        print 'Start scraping '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        shop_page_req_head = urllib2.Request(page_link)
        shop_page_req_head.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
        #load the page and create the soup
        shop_page = urllib2.urlopen(shop_page_req_head)
        shop_soup = BeautifulSoup(shop_page, 'html.parser')

        #get the domain of the url for the export file name
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #DATA EXTRACTION
        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 's-item-container'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            try:
                #the price markup differs between the first and the following category pages
                if div.find('span', {'class': 'sx-price-whole'}):
                    price = str(div.find('span', {'class': 'sx-price-whole'}).text.strip())
                else:
                    price = str(div.find('span', {'class': 'a-size-base a-color-base'}).text.strip())
                #verify if the product is rated
                if div.find('span', {'class': 'a-icon-alt'}):
                    rating = str(div.find('span', {'class': 'a-icon-alt'}).text.strip())
                else:
                    rating = 'Not rated yet'

                #append the data to the list of items
                amazon_products_list.append({
                    'title' : div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['title'],
                    'url' : div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['href'],
                    'price' : str(price),
                    'rating' : str(rating),
                    'domain' : domain_name})

                print div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['title']

            except:
                pass

        #random time delay of 1 to several seconds (change 6 to the max seconds for the delay)
        time_out(6)
        #END DATA EXTRACTION
        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #PAGINATION AND CHANGING TO THE NEXT PAGE
        #check if the current page is the last one by looking for the next page button link
        if shop_soup.find('a', {'id': 'pagnNextLink'}):

            #load the next page button link
            next_page_button = shop_soup.find('a', {'id': 'pagnNextLink'})['href']
            next_page_button_href = 'https://www.amazon.com' + str(next_page_button)
            #write the scraped data to the json file
            output_file(file_name, amazon_products_list)
            #change 5 to the max seconds to pause before changing to the next page
            pagination_timeout(5)
            amazon_scraper(next_page_button_href)
        else:
            #write the scraped data to the json file
            output_file(file_name, amazon_products_list)
            print 'Category scrape completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        #END PAGINATION

    else:
        #log page issues such as a missing next page button or infinite loops
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #the domain is re-extracted here because file_name is only set on the scraping branch
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, amazon_products_list)
        print 'Category scrape completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return

    return
def bestbuy_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #check if the page was scraped already and start scraping if it wasn't
    if page_link not in scrape_page_log:
        #append the page to the log so it isn't scraped again during the current job
        scrape_page_log.append(page_link)

        print 'Start scraping '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        #open a browser to access the page - Bestbuy.com doesn't allow scraping without headers
        driver = webdriver.Firefox()
        #open the page in the browser
        driver.get(page_link)
        #create the soup for scraping the data
        html = driver.page_source
        shop_soup = BeautifulSoup(html, 'html.parser')

        #get the domain of the given page
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 'list-item'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            #check if a product review exists
            if div.find('span', {'class': 'c-review-average'}):
                rating = div.find('span', {'class': 'c-review-average'}).text.strip()
            elif div.find('span', {'class': 'c-reviews-none'}):
                rating = div.find('span', {'class': 'c-reviews-none'}).text.strip()
            else:
                rating = 'u/n'
            #get the product title and url details
            if div.find('div', {'class': 'sku-title'}):
                product_info = div.find('div', {'class': 'sku-title'})
                product_title = product_info.text.strip()
                product_url = product_info.find('a', href=True)['href']
            else:
                #skip list items that are not products
                continue
            #get the price if present
            if div.find('div', {'class': 'pb-hero-price pb-purchase-price'}):
                #get the price and remove the $ sign before the actual price
                price = str(div.find('div', {'class': 'pb-hero-price pb-purchase-price'}).text.strip())[1:]
            else:
                price = 'u/n'

            #append all data to the product list
            bestbuy_products_list.append({
                'title' : product_title,
                'url' : 'https://www.bestbuy.com' + str(product_url),
                'price' : price,
                'rating' : rating,
                'domain' : domain_name})

            print product_title

        #random time delay of 1 to several seconds
        time_out(6)

        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #get the pagination div to change the page or stop the job
        if shop_soup.find('div', {'class': 'results-pagination'}):
            #extract the chevron tiles for the previous and next page
            next_page_click = shop_soup.find_all('a', {'class': 'btn btn-primary btn-sm btn-ficon '})

            #the next page is the last chevron; compare its last digit with the current page's last digit to check if there is a next page
            while next_page_click[-1]['href'][-1] > page_link[-1]:
                try:
                    next_page_button_href = next_page_click[-1]['href']
                    #write the scraped data to the json file
                    output_file(file_name, bestbuy_products_list)
                    pagination_timeout(5)
                    #close the browser window before opening the next page
                    driver.quit()
                    bestbuy_scraper(next_page_button_href)
                    break
                except:
                    break

        else:
            #write the scraped data to the json file
            output_file(file_name, bestbuy_products_list)
            driver.quit()
            print 'Category scrape completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

    else:
        #append any page with issues to log_error_[job_date].csv
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #no browser was opened on this branch, so only export the data collected so far
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, bestbuy_products_list)
        print 'Category scrape completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return
    return
def homedepot_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #to avoid pages with pagination issues, check that the page wasn't scraped already
    if page_link not in scrape_page_log:
        #append the url to the pages that were scraped
        scrape_page_log.append(page_link)
        print 'Start scraping '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        shop_page_req_head = urllib2.Request(page_link)
        shop_page_req_head.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
        #load the page and create the soup
        shop_page = urllib2.urlopen(shop_page_req_head)
        shop_soup = BeautifulSoup(shop_page, 'html.parser')

        #get the domain of the url for the export file name
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #DATA EXTRACTION
        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 'pod-inner'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            try:
                #append the data to the list of items
                homedepot_products_list.append({
                    'title' : div.find('div', {'class': 'pod-plp__description js-podclick-analytics'}).text.strip(),
                    'url' : 'https://www.homedepot.com'+str(div.find('a', {'class': 'js-podclick-analytics'})['href']),
                    'price' : div.find('div', {'class': 'price'}).text.strip()[1:-2],
                    'domain' : domain_name})

                print div.find('div', {'class': 'pod-plp__description js-podclick-analytics'}).text.strip()
            except:
                pass

        #random time delay of 1 to several seconds (change 6 to the max seconds for the delay)
        time_out(6)
        #END DATA EXTRACTION
        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #PAGINATION
        #check if the current page is the last one by looking for the pagination buttons
        if shop_soup.find('li', {'class': 'hd-pagination__item hd-pagination__button'}):

            #load the next page button link
            next_page_click = shop_soup.find_all('a', {'class': 'hd-pagination__link'})
            while next_page_click[-1]['title'] == 'Next':
                next_page_button_href = 'https://www.homedepot.com'+str(next_page_click[-1]['href'])
                try:
                    #write the scraped data to the json file
                    output_file(file_name, homedepot_products_list)
                    #change 5 to the max seconds to pause before changing to the next page
                    pagination_timeout(5)
                    homedepot_scraper(next_page_button_href)
                    break
                except:
                    break
        else:
            #write the scraped data to the json file
            output_file(file_name, homedepot_products_list)
            print 'Category scrape completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        #END PAGINATION
    else:
        #append any page with issues to log_error_[job_date].csv
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #the domain is re-extracted here because file_name is only set on the scraping branch
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, homedepot_products_list)
        print 'Category scrape completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return
    return
--------------------------------------------------------------------------------
/middlewares.py:
--------------------------------------------------------------------------------
"""
### BEAGLE SCRAPER ###

### Helper functions used by the scrapers ###

"""
import urllib2
from bs4 import BeautifulSoup
import csv
import json
import re
from random import randint
import time
from datetime import datetime, date
import os
import tldextract

"""
Functions used in the scrapers
"""


#log the last link accessed, whether it was scraped or not
def scrape_log(open_link):

    now = datetime.now()
    today = date.today()
    directory = str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)

    if not os.path.exists(directory):
        os.makedirs(directory)

    logged_link = str(open_link)[:-1]
    log_csv = str(today), str(now.hour)+':'+str(now.minute)+':'+str(now.second), logged_link
    log_file = open(str(directory)+'/log_'+str(directory)+'.csv', 'a')
    with log_file:
        writer = csv.writer(log_file)
        writer.writerow(log_csv)
    return

#log errors into the log_error_[job_date].csv file
def error_log(url, error_message):
    now = datetime.now()
    today = date.today()

    #create the folder to save today's scraped data
    directory = 'job_'+str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)
    if not os.path.exists(directory):
        os.makedirs(directory)

    logged_link = str(url)
    error_type = str(error_message)
    error_csv = [str(today), str(now.hour)+':'+str(now.minute)+':'+str(now.second), str(logged_link), str(error_type)]
    error_file = open(str(directory)+'/log_error_'+str(directory)+'.csv', 'a')
    with error_file:
        writer = csv.writer(error_file)
        writer.writerow(error_csv)
    return
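
#For reference, a job run on (say) 1 Jan 2018 writes its files along these lines
#(the dates are illustrative):
#   1_jan_2018/log_1_jan_2018.csv                       - scrape_log(): every link opened
#   job_1_jan_2018/log_error_job_1_jan_2018.csv         - error_log(): pages with issues
#   job_1_jan_2018/amazon_products_job_1_jan_2018.json  - output_file(): the scraped products
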
def output_file(domain, products_list):
    print "Exporting your products to file"
    today = datetime.now()

    #create the folder to save today's scraped data
    directory = 'job_'+str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)
    if not os.path.exists(directory):
        os.makedirs(directory)

    output_file_json = str(directory)+'/'+str(domain)+'_products_'+str(directory)+'.json'
    with open(output_file_json, 'w') as outfile:
        json.dump(products_list, outfile)
    return

def time_out(seconds):
    #timeout during the scraping job - pass the preferred max seconds for pausing the scraper
    max_timeout = int(seconds)
    print "Timeout for a few seconds..."
    time.sleep(randint(2, max_timeout))
    return

def pagination_timeout(seconds):
    #timeout after scraping one page and before moving to the next
    print "Changing the page, please wait a few seconds..."
    max_timeout = int(seconds)
    time.sleep(randint(2, max_timeout))
    return
--------------------------------------------------------------------------------
/start_scraper.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

"""
### BEAGLE SCRAPER ###

### The scraping job starter ###

"""

import beagle_scraper as scrape
import urllib2
from bs4 import BeautifulSoup
import tldextract
from datetime import datetime, date
import time
import csv
from middlewares import error_log, output_file, scrape_log

"""
Opens the links from a text file and feeds them to the scrapers

USAGE:
1. Paste product category links in a file "urls.txt"
2. Run: python start_scraper.py
"""

today = datetime.now()
directory = str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)

print 'Opening: urls.txt'
time.sleep(3)

with open('urls.txt') as links_file:
    links = links_file.readlines()

#get each url from urls.txt to start the scraping job
for category_url in links:

    category_url = str(category_url)
    #get the domain of the link from the given list
    domain_name = tldextract.extract(category_url)
    #join to also add the domain suffix, like domain.com
    domain_ext = '.'.join(domain_name[1:])

    print 'Start scraper for ' + str(domain_ext)
    time.sleep(2)

    if domain_ext == 'amazon.com':
        try:
            scrape.amazon_scraper(category_url)
        except:
            pass

    elif domain_ext == 'bestbuy.com':
        try:
            scrape.bestbuy_scraper(category_url)
        except:
            pass

    elif domain_ext == 'homedepot.com':
        try:
            scrape.homedepot_scraper(category_url)
        except:
            pass

    else:
        print 'No scraper for domain: ' + str(domain_name.domain)
        print ''

    #log the last link accessed, whether it was scraped or not
    scrape_log(category_url)
--------------------------------------------------------------------------------