├── README.md
├── beagle_scraper.py
├── middlewares.py
└── start_scraper.py

/README.md:
--------------------------------------------------------------------------------
# Beagle Scraper

Building the largest open-source e-commerce scraper with Python and BeautifulSoup4

## Usage

No installation or setup required

1. Download the source code into a folder
2. Create a **urls.txt** file with the product category pages to be scraped, like this [Amazon page](https://www.amazon.com/TVs-HDTVs-Audio-Video/b/ref=tv_nav_tvs?ie=UTF8&node=172659&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-leftnav&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_t=101&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_i=1266092011)
3. Run the command
```
$ python start_scraper.py
```

### Output

Beagle Scraper will export all scraped data in JSON format into a job sub-folder

## Currently supported e-commerce stores

* Amazon.com
* BestBuy.com
* HomeDepot.com

### Beagle Scraper tutorial - how to use and run the scraper

https://www.bestproxyproviders.com/blog/beagle-scraper-tutorial-how-to-scrape-e-commerce-websites-and-modify-the-scraper/

## Getting Started

Beagle Scraper requires a machine with Python 2.7 and BeautifulSoup4

Install BeautifulSoup4
```
$ pip install beautifulsoup4
```

### Prerequisites - extra Python packages required

The following packages are not included in the default Python 2.7 install and require installation

* tldextract
```
$ sudo pip install tldextract
```
* selenium
```
$ pip install selenium
```
If another package is missing, run the command

```
$ pip install [missing package name]
```

## Using proxies to scrape

Beagle Scraper doesn't support proxies natively at the moment, but [proxychains](https://github.com/haad/proxychains) can be used to send requests through different proxies

After installing proxychains, run this command to make the scraper use proxies
```
$ proxychains python start_scraper.py
```

## Test Beagle Scraper

Here's a short test for Beagle Scraper

1. Download Beagle Scraper
2. Create a **urls.txt** file and insert the following product category pages (each link on a different line)

   * https://www.amazon.com/TVs-HDTVs-Audio-Video/b/ref=tv_nav_tvs?ie=UTF8&node=172659&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-leftnav&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_r=WQG6T4RDNW1YMS15T8Q8&pf_rd_t=101&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_p=2905dcbf-1f2a-4de6-9aa1-c71f689a0780&pf_rd_i=1266092011
   * https://www.bestbuy.com/site/tvs/75-inch-tvs/pcmcat1514910595284.c?id=pcmcat1514910595284

3. Run Beagle Scraper

```
$ python start_scraper.py
```
Example output files for the above URLs (each file holds a list of product records; a sample record is shown below):

* amazon_products_job_[date].json
* bestbuy_products_job_[date].json
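Each record carries the fields that the scrapers in `beagle_scraper.py` collect (title, url, price, rating, domain). The values below are purely illustrative:

```
{
    "title": "Example 55-Inch 4K Ultra HD Smart LED TV",
    "url": "https://www.amazon.com/dp/EXAMPLE",
    "price": "329",
    "rating": "4.5 out of 5 stars",
    "domain": "amazon.com"
}
```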
## Built With

* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - Scraping library
* [Python 2.7](https://www.python.org/) - Language and runtime

## How to contribute

All you have to do is create a scraper function like **amazon_scraper()** from [beagle_scraper.py](https://github.com/ChrisRoark/beagle_scraper/blob/master/beagle_scraper.py) and submit it here.

When creating a scraper function, there are three things to consider (a minimal skeleton is sketched below):
1. The HTML wrapper and class/id for each product listed on the page
2. The product details HTML tags and classes
3. Pagination setup
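As a rough starting point, here is a minimal sketch of the shape a new scraper function takes. The store, the `product-tile`/`product-title`/`product-price`/`next-page` classes and the `example.com` URLs are placeholders, not a real site's markup - swap in the wrapper, detail tags and pagination selector of the store you are adding:

```
import urllib2
from bs4 import BeautifulSoup

from middlewares import output_file, time_out, pagination_timeout

example_products_list = []

def example_scraper(category_url):
    #1. Request the category page with a browser-like User-Agent
    request = urllib2.Request(str(category_url))
    request.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
    soup = BeautifulSoup(urllib2.urlopen(request), 'html.parser')

    #2. Loop over the product wrappers and extract the details
    for div in soup.find_all('div', {'class': 'product-tile'}):
        example_products_list.append({
            'title': div.find('h2', {'class': 'product-title'}).text.strip(),
            'url': div.find('a', href=True)['href'],
            'price': div.find('span', {'class': 'product-price'}).text.strip(),
            'domain': 'example.com'})
    time_out(6)

    #3. Export the data and follow the next page link, if there is one
    output_file('example', example_products_list)
    next_page = soup.find('a', {'class': 'next-page'})
    if next_page:
        pagination_timeout(5)
        example_scraper('https://www.example.com' + str(next_page['href']))
    return
```

Once the selectors work for the store you are adding, place the function in `beagle_scraper.py` and hook its domain into the checks in `start_scraper.py`.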
## Authors

* **Chris Roark** - *Initial work* - [ChrisRoark](https://github.com/ChrisRoark)

## License

GPL-3.0 license
--------------------------------------------------------------------------------
/beagle_scraper.py:
--------------------------------------------------------------------------------
"""
### BEAGLE SCRAPER ###

### The product category scraper ###

"""
import urllib2
from bs4 import BeautifulSoup
import csv
import json
import re
from random import randint
import time
from datetime import datetime, date
import os
import tldextract
from selenium import webdriver
import cookielib
from middlewares import error_log, output_file, time_out, pagination_timeout

"""
Scraper functions for each e-commerce store

Start scraper options:

1. Call the scraper function on a product category URL, e.g. amazon_scraper('amazon.com/smartphones')
2. Insert multiple product category URLs into a urls.txt file and run "python start_scraper.py"
"""
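# For a quick, single-category run, the scrapers can also be called directly
# from a Python shell; the URL below is a placeholder for a real category page:
#
#   >>> import beagle_scraper
#   >>> beagle_scraper.amazon_scraper('https://www.amazon.com/<product-category-page>')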
scrape_page_log = []
amazon_products_list = []
bestbuy_products_list = []
homedepot_products_list = []


def amazon_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #to avoid pages with pagination issues, check that the page wasn't scraped already
    if page_link not in scrape_page_log:
        #append the url to the pages that were scraped
        scrape_page_log.append(page_link)
        print 'Start scraping '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        shop_page_req_head = urllib2.Request(page_link)
        shop_page_req_head.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
        #load the page and create the soup
        shop_page = urllib2.urlopen(shop_page_req_head)
        shop_soup = BeautifulSoup(shop_page, 'html.parser')

        #get the domain of the url for the export file name
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #DATA EXTRACTION
        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 's-item-container'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            try:
                #the price markup differs between the first and the following category pages
                if div.find('span', {'class': 'sx-price-whole'}):
                    price = str(div.find('span', {'class': 'sx-price-whole'}).text.strip())
                else:
                    price = str(div.find('span', {'class': 'a-size-base a-color-base'}).text.strip())
                #verify if the product is rated
                if div.find('span', {'class': 'a-icon-alt'}):
                    rating = str(div.find('span', {'class': 'a-icon-alt'}).text.strip())
                else:
                    rating = 'Not rated yet'

                #append the data to the list of items
                amazon_products_list.append({
                    'title' : div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['title'],
                    'url' : div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['href'],
                    'price' : str(price),
                    'rating' : str(rating),
                    'domain' : domain_name})

                print div.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})['title']

            except:
                pass

        #random time delay of 1 to several seconds (change 6 to the max seconds for the delay)
        time_out(6)
        #END DATA EXTRACTION
        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #PAGINATION AND CHANGING TO THE NEXT PAGE
        #check if the current page is the last one by looking for the next page button link
        if shop_soup.find('a', {'id': 'pagnNextLink'}):

            #load the next page button link
            next_page_button = shop_soup.find('a', {'id': 'pagnNextLink'})['href']
            next_page_button_href = 'https://www.amazon.com' + str(next_page_button)
            #write the scraped data to the json file
            output_file(file_name, amazon_products_list)
            #change 5 to the max seconds to pause before changing to the next page
            pagination_timeout(5)
            amazon_scraper(next_page_button_href)
        else:
            #write the scraped data to the json file
            output_file(file_name, amazon_products_list)
            print 'Category scrape completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        #END PAGINATION

    else:
        #log page issues such as a missing next page button or infinite loops
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #the domain is re-extracted here because file_name is only set on the scraping branch
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, amazon_products_list)
        print 'Category scrape completed '+'Items: '+str(len(amazon_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return

    return
def bestbuy_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #check if the page was scraped already and start scraping if it wasn't
    if page_link not in scrape_page_log:
        #append the page to the log so it isn't scraped again during the current job
        scrape_page_log.append(page_link)

        print 'Start scraping '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        #open a browser to access the page - Bestbuy.com doesn't allow scraping without headers
        driver = webdriver.Firefox()
        #open the page in the browser
        driver.get(page_link)
        #create the soup for scraping the data
        html = driver.page_source
        shop_soup = BeautifulSoup(html, 'html.parser')

        #get the domain of the given page
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 'list-item'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            #check if a product review exists
            if div.find('span', {'class': 'c-review-average'}):
                rating = div.find('span', {'class': 'c-review-average'}).text.strip()
            elif div.find('span', {'class': 'c-reviews-none'}):
                rating = div.find('span', {'class': 'c-reviews-none'}).text.strip()
            else:
                rating = 'u/n'
            #get the product title and url details
            if div.find('div', {'class': 'sku-title'}):
                product_info = div.find('div', {'class': 'sku-title'})
                product_title = product_info.text.strip()
                product_url = product_info.find('a', href=True)['href']
            else:
                #skip list items that are not products
                continue
            #get the price if present
            if div.find('div', {'class': 'pb-hero-price pb-purchase-price'}):
                #get the price and remove the $ sign before the actual price
                price = str(div.find('div', {'class': 'pb-hero-price pb-purchase-price'}).text.strip())[1:]
            else:
                price = 'u/n'

            #append all data to the product list
            bestbuy_products_list.append({
                'title' : product_title,
                'url' : 'https://www.bestbuy.com' + str(product_url),
                'price' : price,
                'rating' : rating,
                'domain' : domain_name})

            print product_title

        #random time delay of 1 to several seconds
        time_out(6)

        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #get the pagination div to change the page or stop the job
        if shop_soup.find('div', {'class': 'results-pagination'}):
            #extract the chevron tiles for the previous and next page
            next_page_click = shop_soup.find_all('a', {'class': 'btn btn-primary btn-sm btn-ficon '})

            #the next page is the last chevron; compare its last digit with the current page's last digit to check if there is a next page
            while next_page_click[-1]['href'][-1] > page_link[-1]:
                try:
                    next_page_button_href = next_page_click[-1]['href']
                    #write the scraped data to the json file
                    output_file(file_name, bestbuy_products_list)
                    pagination_timeout(5)
                    #close the browser window before opening the next page
                    driver.quit()
                    bestbuy_scraper(next_page_button_href)
                    break
                except:
                    break

        else:
            #write the scraped data to the json file
            output_file(file_name, bestbuy_products_list)
            driver.quit()
            print 'Category scrape completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

    else:
        #append any page with issues to log_error_[job_date].csv
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #no browser was opened on this branch, so only export the data collected so far
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, bestbuy_products_list)
        print 'Category scrape completed '+'Items: '+str(len(bestbuy_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return
    return
def homedepot_scraper(category_url):

    page_link = str(category_url)

    now = datetime.now()
    print ''
    #to avoid pages with pagination issues, check that the page wasn't scraped already
    if page_link not in scrape_page_log:
        #append the url to the pages that were scraped
        scrape_page_log.append(page_link)
        print 'Start scraping '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)+' Page: '+str(page_link)

        shop_page_req_head = urllib2.Request(page_link)
        shop_page_req_head.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0')
        #load the page and create the soup
        shop_page = urllib2.urlopen(shop_page_req_head)
        shop_soup = BeautifulSoup(shop_page, 'html.parser')

        #get the domain of the url for the export file name
        domain_url = tldextract.extract(page_link)
        #join to also add the domain suffix, like domain.com
        domain_name = '.'.join(domain_url[1:])
        file_name = domain_url.domain

        #DATA EXTRACTION
        #store all product divs from the page in a list
        items_div = shop_soup.find_all('div', {'class': 'pod-inner'})

        #loop over the scraped products list and extract the required data
        for div in items_div:
            try:
                #append the data to the list of items
                homedepot_products_list.append({
                    'title' : div.find('div', {'class': 'pod-plp__description js-podclick-analytics'}).text.strip(),
                    'url' : 'https://www.homedepot.com'+str(div.find('a', {'class': 'js-podclick-analytics'})['href']),
                    'price' : div.find('div', {'class': 'price'}).text.strip()[1:-2],
                    'domain' : domain_name})

                print div.find('div', {'class': 'pod-plp__description js-podclick-analytics'}).text.strip()
            except:
                pass

        #random time delay of 1 to several seconds (change 6 to the max seconds for the delay)
        time_out(6)
        #END DATA EXTRACTION
        now = datetime.now()
        print 'Page completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)

        #PAGINATION
        #check if the current page is the last one by looking for the pagination buttons
        if shop_soup.find('li', {'class': 'hd-pagination__item hd-pagination__button'}):

            #load the next page button link
            next_page_click = shop_soup.find_all('a', {'class': 'hd-pagination__link'})
            while next_page_click[-1]['title'] == 'Next':
                next_page_button_href = 'https://www.homedepot.com'+str(next_page_click[-1]['href'])
                try:
                    #write the scraped data to the json file
                    output_file(file_name, homedepot_products_list)
                    #change 5 to the max seconds to pause before changing to the next page
                    pagination_timeout(5)
                    homedepot_scraper(next_page_button_href)
                    break
                except:
                    break
        else:
            #write the scraped data to the json file
            output_file(file_name, homedepot_products_list)
            print 'Category scrape completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        #END PAGINATION
    else:
        #append any page with issues to log_error_[job_date].csv
        error_log(page_link, 'Pagination issues')
        print ''
        print 'ERROR! Page '+str(page_link)+' already scraped. See error log'
        #the domain is re-extracted here because file_name is only set on the scraping branch
        file_name = tldextract.extract(page_link).domain
        output_file(file_name, homedepot_products_list)
        print 'Category scrape completed '+'Items: '+str(len(homedepot_products_list))+' At: '+str(now.hour)+':'+str(now.minute)+':'+str(now.second)
        time.sleep(6)
        return
    return
--------------------------------------------------------------------------------
/middlewares.py:
--------------------------------------------------------------------------------
"""
### BEAGLE SCRAPER ###

### Helper functions used by the scrapers ###

"""
import urllib2
from bs4 import BeautifulSoup
import csv
import json
import re
from random import randint
import time
from datetime import datetime, date
import os
import tldextract

"""
Functions used in the scrapers
"""


#log the last link accessed, whether it was scraped or not
def scrape_log(open_link):

    now = datetime.now()
    today = date.today()
    directory = str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)

    if not os.path.exists(directory):
        os.makedirs(directory)

    logged_link = str(open_link)[:-1]
    log_csv = str(today), str(now.hour)+':'+str(now.minute)+':'+str(now.second), logged_link
    log_file = open(str(directory)+'/log_'+str(directory)+'.csv', 'a')
    with log_file:
        writer = csv.writer(log_file)
        writer.writerow(log_csv)
    return

#log errors into the log_error_[job_date].csv file
def error_log(url, error_message):
    now = datetime.now()
    today = date.today()

    #create the folder to save today's scraped data
    directory = 'job_'+str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)
    if not os.path.exists(directory):
        os.makedirs(directory)

    logged_link = str(url)
    error_type = str(error_message)
    error_csv = [str(today), str(now.hour)+':'+str(now.minute)+':'+str(now.second), str(logged_link), str(error_type)]
    error_file = open(str(directory)+'/log_error_'+str(directory)+'.csv', 'a')
    with error_file:
        writer = csv.writer(error_file)
        writer.writerow(error_csv)
    return
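
#For reference, a job run on (say) 1 Jan 2018 writes its files along these lines
#(the dates are illustrative):
#   1_jan_2018/log_1_jan_2018.csv                       - scrape_log(): every link opened
#   job_1_jan_2018/log_error_job_1_jan_2018.csv         - error_log(): pages with issues
#   job_1_jan_2018/amazon_products_job_1_jan_2018.json  - output_file(): the scraped products
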
def output_file(domain, products_list):
    print "Exporting your products to file"
    today = datetime.now()

    #create the folder to save today's scraped data
    directory = 'job_'+str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)
    if not os.path.exists(directory):
        os.makedirs(directory)

    output_file_json = str(directory)+'/'+str(domain)+'_products_'+str(directory)+'.json'
    with open(output_file_json, 'w') as outfile:
        json.dump(products_list, outfile)
    return

def time_out(seconds):
    #timeout during the scraping job - pass the preferred max seconds for pausing the scraper
    max_timeout = int(seconds)
    print "Timeout for a few seconds..."
    time.sleep(randint(2, max_timeout))
    return

def pagination_timeout(seconds):
    #timeout after scraping one page and before moving to the next
    print "Changing the page, please wait a few seconds..."
    max_timeout = int(seconds)
    time.sleep(randint(2, max_timeout))
    return
--------------------------------------------------------------------------------
/start_scraper.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

"""
### BEAGLE SCRAPER ###

### The scraping job starter ###

"""

import beagle_scraper as scrape
import urllib2
from bs4 import BeautifulSoup
import tldextract
from datetime import datetime, date
import time
import csv
from middlewares import error_log, output_file, scrape_log

"""
Opens the links from a text file and feeds them to the scrapers

USAGE:
1. Paste product category links in a file "urls.txt"
2. Run: python start_scraper.py
"""

today = datetime.now()
directory = str(today.day)+'_'+str(today.strftime("%b")).lower()+'_'+str(today.year)

print 'Opening: urls.txt'
time.sleep(3)

with open('urls.txt') as links_file:
    links = links_file.readlines()

#get each url from urls.txt to start the scraping job
for category_url in links:

    category_url = str(category_url)
    #get the domain of the link from the given list
    domain_name = tldextract.extract(category_url)
    #join to also add the domain suffix, like domain.com
    domain_ext = '.'.join(domain_name[1:])

    print 'Start scraper for ' + str(domain_ext)
    time.sleep(2)

    if domain_ext == 'amazon.com':
        try:
            scrape.amazon_scraper(category_url)
        except:
            pass

    elif domain_ext == 'bestbuy.com':
        try:
            scrape.bestbuy_scraper(category_url)
        except:
            pass

    elif domain_ext == 'homedepot.com':
        try:
            scrape.homedepot_scraper(category_url)
        except:
            pass

    else:
        print 'No scraper for domain: ' + str(domain_name.domain)
        print ''

    #log the last link accessed, whether it was scraped or not
    scrape_log(category_url)
--------------------------------------------------------------------------------