├── LICENSE
├── README.md
└── Task.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Abbireddy Venkata Chandu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Amazon Best Sellers Web Scraper

## Overview
This Python script uses Selenium to scrape product information from Amazon's Best Sellers section. It targets products offering discounts greater than 50% across 10 categories and saves the data in a structured format (CSV or JSON). The script automates login with valid Amazon credentials and extracts key product details from each category.

---

## Features
- **Authentication:** Logs in to Amazon using the provided credentials.
- **Data Collection:** Scrapes details of up to 1500 best-selling products from each category:
  - Product Name
  - Product Price
  - Sale Discount
  - Best Seller Rating
  - Ship From
  - Sold By
  - Rating
  - Product Description
  - Number Bought in the Past Month (if available)
  - Category Name
  - All Available Images
- **Error Handling:** Robust handling of missing elements, timeouts, and page-load issues.
- **Data Storage:** Saves scraped data to a CSV or JSON file for analysis.

---

## Prerequisites
1. **Python:** Install Python 3.7 or later.
2. **Libraries:**
   - Selenium: Install using `pip install selenium`.
3. **WebDriver:**
   - Download the appropriate WebDriver (e.g., [ChromeDriver](https://chromedriver.chromium.org/downloads)) and ensure it is in your system PATH.
4. **Amazon Account:** Provide valid Amazon credentials for authentication.
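Before running the full scraper, you can confirm the Selenium and WebDriver setup with a short smoke test. This is a minimal sketch (the file name `smoke_test.py` is only a suggestion); note that Selenium 4.6+ ships with Selenium Manager, which can fetch a matching driver automatically if none is on your PATH:

```python
# smoke_test.py -- minimal check that Selenium can launch Chrome and load a page.
from selenium import webdriver

driver = webdriver.Chrome()  # Selenium 4.6+ can resolve the driver automatically
try:
    driver.get("https://www.amazon.in/")
    print(driver.title)      # a non-empty title means the setup is working
finally:
    driver.quit()
```

If this prints a page title and the browser closes cleanly, the environment is ready for the setup steps below.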
---

## Setup Instructions

1. **Clone the Repository:**
   ```bash
   git clone <repository-url>
   cd amazon-scraper
   ```

2. **Install Dependencies:**
   ```bash
   pip install selenium
   ```

3. **Download WebDriver:**
   - Download ChromeDriver from [here](https://chromedriver.chromium.org/downloads).
   - Place it in your system PATH or the script directory.

4. **Update Credentials:**
   - Replace `your_email@example.com` and `your_password` in the script with your Amazon login credentials.

5. **Run the Script:**
   ```bash
   python Task.py
   ```

---

## How It Works

1. **Authentication:**
   - The script navigates to the Amazon login page and authenticates using the provided email and password.

2. **Category Navigation:**
   - Visits the URLs of the 10 specified Best Seller categories.

3. **Data Extraction:**
   - Collects product details, including the name, price, rating, and more.
   - Skips products with missing or inaccessible data.

4. **Data Storage:**
   - Saves the scraped data as `amazon_best_sellers.csv` or `amazon_best_sellers.json` in the script's directory.

---

## Output Format
- **CSV File:**
  - Columns include `Name`, `Price`, `Discount`, `Rating`, `Ship From`, `Sold By`, etc.
- **JSON File:**
  - Structured JSON with the same details.

---

## Example URLs
- **Best Seller Section:**
  - [Best Sellers](https://www.amazon.in/gp/bestsellers/?ref_=nav_em_cs_bestsellers_0_1_1_2)
- **Sample Categories:**
  - [Kitchen](https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0)
  - [Shoes](https://www.amazon.in/gp/bestsellers/shoes/ref=zg_bs_nav_shoes_0)
  - [Computers](https://www.amazon.in/gp/bestsellers/computers/ref=zg_bs_nav_computers_0)
  - [Electronics](https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_electronics_0)

---

## Notes
- Scraping Amazon may violate their [Terms of Service](https://www.amazon.in/gp/help/customer/display.html). Ensure you comply with their policies.
- If the page structure changes, you may need to update the script's XPath or CSS selectors.

---

## Troubleshooting
1. **Login Issues:**
   - Ensure your credentials are correct.
   - Check for CAPTCHA prompts during login.

2. **Missing WebDriver:**
   - Verify that ChromeDriver is installed and in your PATH.

3. **Slow Page Load:**
   - Increase wait times using Selenium's `WebDriverWait` (see the first sketch after this list).

4. **Blocked Requests:**
   - Reduce the scraping speed to avoid being flagged by Amazon (see the second sketch after this list).
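For item 3, an explicit wait blocks until the product grid has actually rendered instead of sleeping for a fixed interval. A minimal sketch, reusing the product-card XPath already in `Task.py` (the function name `wait_for_products` is illustrative):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_products(driver, timeout=30):
    """Block until at least one product card is present, up to `timeout` seconds."""
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//div[contains(@class, 'zg-grid-general-faceout')]")
        )
    )
```

For item 4, a randomized pause between page loads is one simple way to slow the scraper down (again, only a sketch; the interval bounds are arbitrary):

```python
import random
import time

def polite_pause(low=2.0, high=5.0):
    """Sleep for a random interval to space out requests."""
    time.sleep(random.uniform(low, high))
```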
---

## License
This project is for educational purposes only. Use responsibly and adhere to Amazon's terms of service.

---

## Contact
For questions or suggestions, reach out to: `chanduabbireddy247@gmail.com`.

--------------------------------------------------------------------------------
/Task.py:
--------------------------------------------------------------------------------
import sys
import time
import csv
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException


def configure_driver():
    """Create a Chrome driver with a maximized window and notifications disabled."""
    options = webdriver.ChromeOptions()
    options.add_argument('--start-maximized')
    options.add_argument('--disable-notifications')
    driver = webdriver.Chrome(options=options)
    return driver


def authenticate(driver, email, password):
    """Log in to Amazon with the supplied credentials."""
    driver.get("https://www.amazon.in/ap/signin")
    try:
        email_field = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "ap_email"))
        )
        email_field.send_keys(email)
        driver.find_element(By.ID, "continue").click()

        password_field = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "ap_password"))
        )
        password_field.send_keys(password)
        driver.find_element(By.ID, "signInSubmit").click()
    except TimeoutException:
        print("Login elements not found. Check your credentials or internet connection.")
        driver.quit()
        sys.exit(1)


def scrape_category(driver, category_url):
    """Scrape product details from one Best Sellers category page."""
    driver.get(category_url)
    time.sleep(5)  # crude wait for the grid to render; WebDriverWait would be more robust

    scraped_data = []

    try:
        products = driver.find_elements(By.XPATH, "//div[contains(@class, 'zg-grid-general-faceout')]")
        # A single page renders far fewer than 1500 cards; pagination would be
        # needed to actually reach the 1500-product cap per category.
        for product in products[:1500]:
            try:
                name = product.find_element(By.CSS_SELECTOR, ".p13n-sc-truncate-desktop-type2").text
                price = product.find_element(By.CSS_SELECTOR, ".p13n-sc-price").text
                rating = product.find_element(By.CSS_SELECTOR, "span.a-icon-alt").text
                # The fields below live on the product detail page, not the
                # category grid, and are left as placeholders here.
                discount = None
                ship_from = None
                sold_by = None
                description = None
                num_bought = None
                images = None  # see the extract_card_images sketch below for one approach

                # Append the collected data for this product
                scraped_data.append({
                    "Name": name,
                    "Price": price,
                    "Discount": discount,
                    "Rating": rating,
                    "Ship From": ship_from,
                    "Sold By": sold_by,
                    "Description": description,
                    "Number Bought": num_bought,
                    "Images": images,
                    "Category": category_url,  # record which category the product came from
                })
            except NoSuchElementException:
                # Skip cards that are missing any of the required elements.
                continue
    except TimeoutException:
        print("Failed to load products in the category. Check your connection or the URL.")

    return scraped_data
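
# --- Hedged sketch: collecting image URLs from a product card ----------------
# The "Images" field above is left as None because the full gallery lives on
# the product detail page. As a lightweight alternative, the helper below
# (illustrative only -- it is not wired into scrape_category) grabs whatever
# <img> tags appear on the grid card itself.
def extract_card_images(product):
    # find_elements returns an empty list rather than raising when there are
    # no matches, so no exception handling is needed here.
    return [img.get_attribute("src") for img in product.find_elements(By.TAG_NAME, "img")]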

def save_data(data, file_format="csv", filename="amazon_best_sellers"):
    """Write the scraped records to a CSV or JSON file."""
    if not data:
        # Guard against an empty result set: data[0] below would raise IndexError.
        print("No data to save.")
        return
    if file_format == "csv":
        keys = data[0].keys()
        with open(f"{filename}.csv", "w", newline="", encoding="utf-8") as output_file:
            dict_writer = csv.DictWriter(output_file, fieldnames=keys)
            dict_writer.writeheader()
            dict_writer.writerows(data)
    elif file_format == "json":
        with open(f"{filename}.json", "w", encoding="utf-8") as output_file:
            json.dump(data, output_file, indent=4)


if __name__ == "__main__":
    # Amazon credentials -- replace with your own before running.
    email = "your_email@example.com"
    password = "your_password"

    # Extend this list to the full 10 categories described in the README.
    categories = [
        "https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0",
        "https://www.amazon.in/gp/bestsellers/shoes/ref=zg_bs_nav_shoes_0",
        "https://www.amazon.in/gp/bestsellers/computers/ref=zg_bs_nav_computers_0",
        "https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_electronics_0",
    ]

    driver = configure_driver()
    try:
        authenticate(driver, email, password)
        all_scraped_data = []

        for category_url in categories[:10]:  # limit to 10 categories
            print(f"Scraping category: {category_url}")
            category_data = scrape_category(driver, category_url)
            all_scraped_data.extend(category_data)

        save_data(all_scraped_data, file_format="csv", filename="amazon_best_sellers")
        print("Scraping completed successfully. Data saved to file.")
    finally:
        driver.quit()

--------------------------------------------------------------------------------