├── README.md
└── scraper.py

/README.md:
--------------------------------------------------------------------------------
## Etsy Sales Scraper

This Python script scrapes Etsy listing pages for information about sellers and their products. It uses the `requests` and `BeautifulSoup` libraries to fetch and parse HTML, and the `concurrent.futures` library to process the per-listing requests in parallel for faster output.

This script functions properly as of February 2023.

### Usage

To parse a page, pass its URL as a string to `get_listing_info()`. Run the program, and it will print information about every listing on that page:
- the seller's name
- the number of sales the seller has made
- the link to the listing

Here is an example:
```
url = "https://www.etsy.com/"
get_listing_info(url)
```

To parse multiple URLs, create a list of URLs and call `get_listing_info()` for each one. It is advisable to add a delay between calls to avoid overloading the server with requests.
Here is an example:
```
import time

urls = [
    "https://www.etsy.com/ca/",
    "https://www.etsy.com/ca/c/home-and-living",
    "https://www.etsy.com/ca/c/clothing-and-shoes",
]

for url in urls:
    get_listing_info(url)
    time.sleep(10)
```
This calls `get_listing_info()` for each URL in the list, with a 10-second delay between calls.

### Output
The `get_listing_info()` function prints information about each seller in the following format:
```
Seller: [seller name]
[number of sales]
[listing URL]
----------------------------
```
If the script isn't returning the seller name or the number of sales, you may need to increase the delay.
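The parallel fetching described above can be sketched as follows. This is a minimal, self-contained illustration of the `concurrent.futures` pattern the script uses; `fetch_listing` is a hypothetical stand-in for the real per-listing request and parsing:
```
import concurrent.futures

def fetch_listing(listing_id):
    # Hypothetical stand-in for the per-listing HTTP request and parsing.
    return "Seller info for listing " + listing_id

listing_ids = ["111", "222", "333"]

# Submit one task per listing ID; the executor runs them in worker threads.
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(fetch_listing, i) for i in listing_ids]

# Collect the results in submission order, so output stays deterministic
# even though the requests complete in arbitrary order.
results = [future.result() for future in futures]
for line in results:
    print(line)
```
Because the work is network-bound, threads give a real speedup here despite Python's GIL.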
--------------------------------------------------------------------------------
/scraper.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
import requests
import concurrent.futures
import re

# Parses listings with multiple threads for faster output, processing the
# per-listing requests in parallel.


def get_listing_info(url):
    listing_url = "https://www.etsy.com/ca/listing/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    # Collect the listing ID of every listing linked from the page.
    id_array = []
    for tag in soup.find_all("a"):
        if tag.has_attr("data-listing-id"):
            id_array.append(tag["data-listing-id"])

    def get_item_info(listing_id):
        item = listing_url + listing_id
        r = requests.get(item)
        s = BeautifulSoup(r.text, "lxml")
        num_of_sales = ""
        seller = ""
        # The sales count appears in a caption span containing the word "sales".
        regex = re.compile(r'\bsales\b')
        span_tag = s.find('span', class_='wt-text-caption', string=regex)
        if span_tag is not None:
            num_of_sales = span_tag.getText().strip()
        # The seller's name appears in the title of the shop's RSS <link> tag.
        link_tag = s.find('link', attrs={'rel': 'alternate', 'type': 'application/rss+xml'})
        if link_tag is not None:
            match = re.search(r'Shop RSS for (\w+) on Etsy', link_tag.get('title', ''))
            if match is not None:
                seller = match.group(1)
        return "Seller: " + seller + "\n" + num_of_sales + "\n" + item + "\n----------------------------"

    # Fetch the listing pages in parallel, then print the results in order.
    results = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for i in id_array:
            results.append(executor.submit(get_item_info, i))

    for result in results:
        print(result.result())


def main():
    # The call below prints info about every seller on the home page
    # (works on any Etsy URL). Uncomment it to run the scraper.
    # get_listing_info("https://www.etsy.com/ca/")
    pass


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------