├── LICENSE
├── README.md
└── Task.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Abbireddy Venkata Chandu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Amazon Best Sellers Web Scraper

## Overview
This Python script uses Selenium to scrape product information from Amazon's Best Sellers section. It targets products offering discounts greater than 50% across 10 categories and saves the data in a structured format (CSV or JSON). The script automates login with valid Amazon credentials and extracts key product details from each category.

---

## Features
- **Authentication:** Logs in to Amazon using the provided credentials.
- **Data Collection:** Scrapes details of up to 1500 best-selling products from each category:
  - Product Name
  - Product Price
  - Sale Discount
  - Best Seller Rating
  - Ship From
  - Sold By
  - Rating
  - Product Description
  - Number Bought in the Past Month (if available)
  - Category Name
  - All Available Images
- **Error Handling:** Robust handling of missing elements, timeouts, and page-load issues.
- **Data Storage:** Saves scraped data to a CSV or JSON file for analysis.

---

## Prerequisites
1. **Python:** Install Python 3.7 or later.
2. **Libraries:**
   - Selenium: Install using `pip install selenium`.
3. **WebDriver:**
   - Download the appropriate WebDriver (e.g., [ChromeDriver](https://chromedriver.chromium.org/downloads)) and ensure it is in your system PATH.
4. **Amazon Account:** Provide valid Amazon credentials for authentication.
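Before running the full scraper, you can confirm the Selenium and WebDriver setup with a short smoke test. This is a minimal sketch (the file name `smoke_test.py` is only a suggestion); note that Selenium 4.6+ ships with Selenium Manager, which can fetch a matching driver automatically if none is on your PATH:

```python
# smoke_test.py -- minimal check that Selenium can launch Chrome and load a page.
from selenium import webdriver

driver = webdriver.Chrome()  # Selenium 4.6+ can resolve the driver automatically
try:
    driver.get("https://www.amazon.in/")
    print(driver.title)      # a non-empty title means the setup is working
finally:
    driver.quit()
```

If this prints a page title and the browser closes cleanly, the environment is ready for the setup steps below.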
---

## Setup Instructions

1. **Clone the Repository:**
   ```bash
   git clone <repository-url>
   cd amazon-scraper
   ```

2. **Install Dependencies:**
   ```bash
   pip install selenium
   ```

3. **Download WebDriver:**
   - Download ChromeDriver from [here](https://chromedriver.chromium.org/downloads).
   - Place it in your system PATH or the script directory.

4. **Update Credentials:**
   - Replace `your_email@example.com` and `your_password` in the script with your Amazon login credentials.

5. **Run the Script:**
   ```bash
   python Task.py
   ```

---

## How It Works

1. **Authentication:**
   - The script navigates to the Amazon login page and authenticates using the provided email and password.

2. **Category Navigation:**
   - Visits the URLs of the 10 specified Best Seller categories.

3. **Data Extraction:**
   - Collects product details, including the name, price, rating, and more.
   - Skips products with missing or inaccessible data.

4. **Data Storage:**
   - Saves the scraped data as `amazon_best_sellers.csv` or `amazon_best_sellers.json` in the script's directory.

---

## Output Format
- **CSV File:**
  - Columns include `Name`, `Price`, `Discount`, `Rating`, `Ship From`, `Sold By`, etc.
- **JSON File:**
  - Structured JSON with the same details.

---

## Example URLs
- **Best Seller Section:**
  - [Best Sellers](https://www.amazon.in/gp/bestsellers/?ref_=nav_em_cs_bestsellers_0_1_1_2)
- **Sample Categories:**
  - [Kitchen](https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0)
  - [Shoes](https://www.amazon.in/gp/bestsellers/shoes/ref=zg_bs_nav_shoes_0)
  - [Computers](https://www.amazon.in/gp/bestsellers/computers/ref=zg_bs_nav_computers_0)
  - [Electronics](https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_electronics_0)

---

## Notes
- Scraping Amazon may violate their [Terms of Service](https://www.amazon.in/gp/help/customer/display.html). Ensure you comply with their policies.
- If the page structure changes, you may need to update the script's XPath or CSS selectors.

---

## Troubleshooting
1. **Login Issues:**
   - Ensure your credentials are correct.
   - Check for CAPTCHA prompts during login.

2. **Missing WebDriver:**
   - Verify that ChromeDriver is installed and in your PATH.

3. **Slow Page Load:**
   - Increase wait times using Selenium's `WebDriverWait` (see the first sketch after this list).

4. **Blocked Requests:**
   - Reduce the scraping speed to avoid being flagged by Amazon (see the second sketch after this list).
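For item 3, an explicit wait blocks until the product grid has actually rendered instead of sleeping for a fixed interval. A minimal sketch, reusing the product-card XPath already in `Task.py` (the function name `wait_for_products` is illustrative):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_products(driver, timeout=30):
    """Block until at least one product card is present, up to `timeout` seconds."""
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//div[contains(@class, 'zg-grid-general-faceout')]")
        )
    )
```

For item 4, a randomized pause between page loads is one simple way to slow the scraper down (again, only a sketch; the interval bounds are arbitrary):

```python
import random
import time

def polite_pause(low=2.0, high=5.0):
    """Sleep for a random interval to space out requests."""
    time.sleep(random.uniform(low, high))
```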
---

## License
This project is for educational purposes only. Use responsibly and adhere to Amazon's terms of service.

---

## Contact
For questions or suggestions, reach out to: `chanduabbireddy247@gmail.com`.

--------------------------------------------------------------------------------
/Task.py:
--------------------------------------------------------------------------------
import sys
import time
import csv
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException


def configure_driver():
    """Create a Chrome driver with a maximized window and notifications disabled."""
    options = webdriver.ChromeOptions()
    options.add_argument('--start-maximized')
    options.add_argument('--disable-notifications')
    driver = webdriver.Chrome(options=options)
    return driver


def authenticate(driver, email, password):
    """Log in to Amazon with the supplied credentials."""
    driver.get("https://www.amazon.in/ap/signin")
    try:
        email_field = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "ap_email"))
        )
        email_field.send_keys(email)
        driver.find_element(By.ID, "continue").click()

        password_field = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "ap_password"))
        )
        password_field.send_keys(password)
        driver.find_element(By.ID, "signInSubmit").click()
    except TimeoutException:
        print("Login elements not found. Check your credentials or internet connection.")
        driver.quit()
        sys.exit(1)


def scrape_category(driver, category_url):
    """Scrape product details from one Best Sellers category page."""
    driver.get(category_url)
    time.sleep(5)  # crude wait for the grid to render; WebDriverWait would be more robust

    scraped_data = []

    try:
        products = driver.find_elements(By.XPATH, "//div[contains(@class, 'zg-grid-general-faceout')]")
        # A single page renders far fewer than 1500 cards; pagination would be
        # needed to actually reach the 1500-product cap per category.
        for product in products[:1500]:
            try:
                name = product.find_element(By.CSS_SELECTOR, ".p13n-sc-truncate-desktop-type2").text
                price = product.find_element(By.CSS_SELECTOR, ".p13n-sc-price").text
                rating = product.find_element(By.CSS_SELECTOR, "span.a-icon-alt").text
                # The fields below live on the product detail page, not the
                # category grid, and are left as placeholders here.
                discount = None
                ship_from = None
                sold_by = None
                description = None
                num_bought = None
                images = None  # see the extract_card_images sketch below for one approach

                # Append the collected data for this product
                scraped_data.append({
                    "Name": name,
                    "Price": price,
                    "Discount": discount,
                    "Rating": rating,
                    "Ship From": ship_from,
                    "Sold By": sold_by,
                    "Description": description,
                    "Number Bought": num_bought,
                    "Images": images,
                    "Category": category_url,  # record which category the product came from
                })
            except NoSuchElementException:
                # Skip cards that are missing any of the required elements.
                continue
    except TimeoutException:
        print("Failed to load products in the category. Check your connection or the URL.")

    return scraped_data
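
# --- Hedged sketch: collecting image URLs from a product card ----------------
# The "Images" field above is left as None because the full gallery lives on
# the product detail page. As a lightweight alternative, the helper below
# (illustrative only -- it is not wired into scrape_category) grabs whatever
# <img> tags appear on the grid card itself.
def extract_card_images(product):
    # find_elements returns an empty list rather than raising when there are
    # no matches, so no exception handling is needed here.
    return [img.get_attribute("src") for img in product.find_elements(By.TAG_NAME, "img")]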

def save_data(data, file_format="csv", filename="amazon_best_sellers"):
    """Write the scraped records to a CSV or JSON file."""
    if not data:
        # Guard against an empty result set: data[0] below would raise IndexError.
        print("No data to save.")
        return
    if file_format == "csv":
        keys = data[0].keys()
        with open(f"{filename}.csv", "w", newline="", encoding="utf-8") as output_file:
            dict_writer = csv.DictWriter(output_file, fieldnames=keys)
            dict_writer.writeheader()
            dict_writer.writerows(data)
    elif file_format == "json":
        with open(f"{filename}.json", "w", encoding="utf-8") as output_file:
            json.dump(data, output_file, indent=4)


if __name__ == "__main__":
    # Amazon credentials -- replace with your own before running.
    email = "your_email@example.com"
    password = "your_password"

    # Extend this list to the full 10 categories described in the README.
    categories = [
        "https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_kitchen_0",
        "https://www.amazon.in/gp/bestsellers/shoes/ref=zg_bs_nav_shoes_0",
        "https://www.amazon.in/gp/bestsellers/computers/ref=zg_bs_nav_computers_0",
        "https://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_electronics_0",
    ]

    driver = configure_driver()
    try:
        authenticate(driver, email, password)
        all_scraped_data = []

        for category_url in categories[:10]:  # limit to 10 categories
            print(f"Scraping category: {category_url}")
            category_data = scrape_category(driver, category_url)
            all_scraped_data.extend(category_data)

        save_data(all_scraped_data, file_format="csv", filename="amazon_best_sellers")
        print("Scraping completed successfully. Data saved to file.")
    finally:
        driver.quit()

--------------------------------------------------------------------------------