├── README.md
└── scraper.py

/README.md:
--------------------------------------------------------------------------------
## Etsy Sales Scraper

This Python script scrapes Etsy listing pages for information about sellers and their products. It uses the `requests` and `BeautifulSoup` libraries to fetch and parse HTML, and the `concurrent.futures` library to process the per-listing requests in parallel for faster output.

This script functions properly as of February 2023.

### Usage

To parse a page, pass its URL as a string to `get_listing_info()`. Run the program, and it will print information about every listing on that page:
- the seller's name
- the number of sales the seller has made
- the link to the listing

Here is an example:
```
url = "https://www.etsy.com/"
get_listing_info(url)
```

To parse multiple URLs, create a list of URLs and call `get_listing_info()` for each one. It is advisable to add a delay between calls to avoid overloading the server with requests.
Here is an example:
```
import time

urls = [
    "https://www.etsy.com/ca/",
    "https://www.etsy.com/ca/c/home-and-living",
    "https://www.etsy.com/ca/c/clothing-and-shoes",
]

for url in urls:
    get_listing_info(url)
    time.sleep(10)
```
This calls `get_listing_info()` for each URL in the list, with a 10-second delay between calls.

### Output
The `get_listing_info()` function prints information about each seller in the following format:
```
Seller: [seller name]
[number of sales]
[listing URL]
----------------------------
```
If the script isn't returning the seller name or the number of sales, you may need to increase the delay.
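The parallel fetching described above can be sketched as follows. This is a minimal, self-contained illustration of the `concurrent.futures` pattern the script uses; `fetch_listing` is a hypothetical stand-in for the real per-listing request and parsing:
```
import concurrent.futures

def fetch_listing(listing_id):
    # Hypothetical stand-in for the per-listing HTTP request and parsing.
    return "Seller info for listing " + listing_id

listing_ids = ["111", "222", "333"]

# Submit one task per listing ID; the executor runs them in worker threads.
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(fetch_listing, i) for i in listing_ids]

# Collect the results in submission order, so output stays deterministic
# even though the requests complete in arbitrary order.
results = [future.result() for future in futures]
for line in results:
    print(line)
```
Because the work is network-bound, threads give a real speedup here despite Python's GIL.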
--------------------------------------------------------------------------------
/scraper.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
import requests
import concurrent.futures
import re

# Parses listings with multiple threads for faster output, processing the
# per-listing requests in parallel.


def get_listing_info(url):
    listing_url = "https://www.etsy.com/ca/listing/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    # Collect the listing ID of every listing linked from the page.
    id_array = []
    for tag in soup.find_all("a"):
        if tag.has_attr("data-listing-id"):
            id_array.append(tag["data-listing-id"])

    def get_item_info(listing_id):
        item = listing_url + listing_id
        r = requests.get(item)
        s = BeautifulSoup(r.text, "lxml")
        num_of_sales = ""
        seller = ""
        # The sales count appears in a caption span containing the word "sales".
        regex = re.compile(r'\bsales\b')
        span_tag = s.find('span', class_='wt-text-caption', string=regex)
        if span_tag is not None:
            num_of_sales = span_tag.getText().strip()
        # The seller's name appears in the title of the shop's RSS <link> tag.
        link_tag = s.find('link', attrs={'rel': 'alternate', 'type': 'application/rss+xml'})
        if link_tag is not None:
            match = re.search(r'Shop RSS for (\w+) on Etsy', link_tag.get('title', ''))
            if match is not None:
                seller = match.group(1)
        return "Seller: " + seller + "\n" + num_of_sales + "\n" + item + "\n----------------------------"

    # Fetch the listing pages in parallel, then print the results in order.
    results = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for i in id_array:
            results.append(executor.submit(get_item_info, i))

    for result in results:
        print(result.result())


def main():
    # The call below prints info about every seller on the home page
    # (works on any Etsy URL). Uncomment it to run the scraper.
    # get_listing_info("https://www.etsy.com/ca/")
    pass


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------