├── .gitignore ├── LICENSE ├── README.md ├── requirements.txt └── wbparser.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .python-version 3 | .vscode 4 | *.json 5 | *.xlsx 6 | dev_reqs.txt -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023 Kirill Ignatyev (https://kirillignatyev.github.io) 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WildBerries Parser 2 | 3 | [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) 4 | 5 | ## Overview 6 | The WildBerries Parser is a Python script that collects information on items from the [Wildberries](https://wildberries.ru) website and saves it into an xlsx file. The parser provides two modes: scanning all items in a specific category of the marketplace, or parsing all items in the search results for a given keyword. For each item it collects the link, ID, name, brand name, brand ID, regular price, discounted price, rating, number of reviews, and total sales. 7 | 8 | ## Features 9 | - Retrieve item data from Wildberries based on categories or search keywords 10 | - Extract information such as link, ID, name, brand, pricing, rating, reviews, and sales 11 | - Save the collected data in xlsx format for further analysis 12 | 13 | ## Installation 14 | 1. Clone this repository to your local machine. 15 | 2. Navigate to the project directory. 16 | 17 | ## Prerequisites 18 | - Python 3.x 19 | - The `requests` and `pandas` libraries; install both with 20 | `pip install -r requirements.txt` (or `pip install requests pandas`) 21 | 22 | ## Usage 23 | 1. Open a terminal or command prompt. 24 | 2. Navigate to the project directory. 25 | 3. Run the script using the following command: 26 | `python wbparser.py` 27 | 4. Follow the on-screen instructions to choose the desired parsing mode and provide the necessary input. 28 | 29 | ## Examples 30 | To parse a specific category: 31 | - Choose the parsing mode for a category. 32 | - Enter the category name or URL. 33 | - The script will retrieve all products in the category, collect sales data, and save the parsed data to an xlsx file. 34 | 35 | To parse by keywords: 36 | - Choose the parsing mode for keywords.
37 | - Enter the search query. 38 | - The script will retrieve all products in the search results, collect sales data, and save the parsed data to an xlsx file. 39 | 40 | ## Contributing 41 | Contributions are welcome! If you find any bugs or have suggestions for improvements, please open an issue or submit a pull request. 42 | 43 | ## License 44 | This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. 45 | 46 | ## Acknowledgements 47 | Special thanks to [Timerlan Nalimov](https://github.com/Timur1991) for inspiring this project with his [initial parser](https://github.com/Timur1991/project_wildberries). I appreciate his work and its contribution to the development of this parser. 48 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==2.1.1 2 | Requests==2.31.0 3 | -------------------------------------------------------------------------------- /wbparser.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Collect info on items from wildberries.ru and save it into an xlsx file. 3 | 4 | This script is designed to extract data from the wildberries.ru website 5 | using two main modes: 6 | 1. Scanning items in a specific category of the marketplace (e.g., books). 7 | 2. Parsing all items in the search results based on a given keyword. 8 | 9 | The script collects the following data from each item in the category 10 | or search results, which is then saved in xlsx format: 11 | - Link 12 | - ID 13 | - Name 14 | - Brand name 15 | - Brand ID 16 | - Regular price 17 | - Discounted price 18 | - Rating 19 | - Number of reviews 20 | - Total sales 21 | 22 | The parser is under active development, and new features may be added 23 | in the future. 24 | It was inspired by a parser by Timerlan Nalimov (https://github.com/Timur1991).
25 | 26 | The script is distributed under the MIT license. 27 | 28 | --- 29 | 30 | Class: WildBerriesParser 31 | 32 | Methods: 33 | - __init__: Initialize the parser object. 34 | - download_current_catalogue: Download the current catalogue in JSON format. 35 | - traverse_json: Recursively traverse the JSON catalogue 36 | and flatten it to a list. 37 | - process_catalogue: Process the locally saved JSON catalogue 38 | into a list of dictionaries. 39 | - extract_category_data: Extract category data from the processed catalogue. 40 | - get_products_on_page: Parse one page of category or search results 41 | and return a list with product data. 42 | - add_data_from_page: Add data on products from a page to the class's list. 43 | - get_all_products_in_category: Retrieve all products in a category 44 | by going through all pages. 45 | - get_sales_data: Parse additional sales data for the product cards. 46 | - save_to_excel: Save the parsed data in xlsx format and return its path. 47 | - get_all_products_in_search_result: Retrieve all products in the search 48 | result by going through all pages. 49 | - run_parser: Run the whole script for parsing and data processing. 50 | 51 | --- 52 | 53 | Note: This script utilizes the requests library 54 | and requires an active internet connection to function properly. 55 | 56 | """ 57 | 58 | __author__ = "Kirill Ignatyev" 59 | __copyright__ = "Copyright (c) 2023, Kirill Ignatyev" 60 | __license__ = "MIT" 61 | __status__ = "Development" 62 | __version__ = "1.3" 63 | 64 | import json 65 | from datetime import date 66 | from os import path 67 | 68 | 69 | import pandas as pd 70 | import requests 71 | 72 | 73 | class WildBerriesParser: 74 | """ 75 | A parser object for extracting data from wildberries.ru. 76 | 77 | Attributes: 78 | headers (dict): HTTP headers for the parser. 79 | run_date (datetime.date): The date when the parser is run. 80 | product_cards (list): A list to store the parsed product cards. 
81 | directory (str): The directory path where the script is located. 82 | """ 83 | 84 | def __init__(self): 85 | """ 86 | Initialize a new instance of the WildBerriesParser class. 87 | 88 | This constructor sets up the parser object with default values 89 | for its attributes. 90 | 91 | Args: 92 | None 93 | 94 | Returns: 95 | None 96 | """ 97 | self.headers = {'Accept': "*/*", 98 | 'User-Agent': "Chrome/51.0.2704.103 Safari/537.36"} 99 | self.run_date = date.today() 100 | self.product_cards = [] 101 | self.directory = path.dirname(__file__) 102 | 103 | def download_current_catalogue(self) -> str: 104 | """ 105 | Download the catalogue from wildberries.ru and save it in JSON format. 106 | 107 | If an up-to-date catalogue already exists in the script's directory, 108 | it uses that instead. 109 | 110 | Returns: 111 | str: The path to the downloaded catalogue file. 112 | """ 113 | local_catalogue_path = path.join(self.directory, 'wb_catalogue.json') 114 | if (not path.exists(local_catalogue_path) 115 | or date.fromtimestamp(int(path.getmtime(local_catalogue_path))) 116 | < self.run_date): 117 | url = ('https://static-basket-01.wb.ru/vol0/data/' 118 | 'main-menu-ru-ru-v2.json') 119 | response = requests.get(url, headers=self.headers).json() 120 | with open(local_catalogue_path, 'w', encoding='UTF-8') as my_file: 121 | json.dump(response, my_file, indent=2, ensure_ascii=False) 122 | return local_catalogue_path 123 | 124 | def traverse_json(self, parent_category: list, flattened_catalogue: list): 125 | """ 126 | Recursively traverse the JSON catalogue and flatten it to a list. 127 | 128 | This function runs recursively through the locally saved JSON 129 | catalogue and appends relevant information to the flattened_catalogue 130 | list. 131 | It handles KeyError exceptions that might occur due to inconsistencies 132 | in the keys of the JSON catalogue. 133 | 134 | Args: 135 | parent_category (list): A list containing the current category 136 | to traverse.
137 | flattened_catalogue (list): A list to store the flattened 138 | catalogue. 139 | 140 | Returns: 141 | None 142 | """ 143 | for category in parent_category: 144 | try: 145 | flattened_catalogue.append({ 146 | 'name': category['name'], 147 | 'url': category['url'], 148 | 'shard': category['shard'], 149 | 'query': category['query'] 150 | }) 151 | except KeyError: 152 | continue 153 | if 'childs' in category: 154 | self.traverse_json(category['childs'], flattened_catalogue) 155 | 156 | def process_catalogue(self, local_catalogue_path: str) -> list: 157 | """ 158 | Process the locally saved JSON catalogue into a list of dictionaries. 159 | 160 | This function reads the locally saved JSON catalogue file, 161 | invokes the traverse_json method to flatten the catalogue, 162 | and returns the resulting catalogue as a list of dictionaries. 163 | 164 | Args: 165 | local_catalogue_path (str): The path to the locally saved 166 | JSON catalogue file. 167 | 168 | Returns: 169 | list: A list of dictionaries representing the processed catalogue. 170 | """ 171 | catalogue = [] 172 | with open(local_catalogue_path, 'r', encoding='UTF-8') as my_file: 173 | self.traverse_json(json.load(my_file), catalogue) 174 | return catalogue 175 | 176 | def extract_category_data(self, catalogue: list, user_input: str) -> tuple: 177 | """ 178 | Extract category data from the processed catalogue. 179 | 180 | This function searches for a matching category based 181 | on the user input, which can be either a URL or a category name. 182 | If a match is found, it returns a tuple containing the category name, 183 | shard, and query; otherwise it returns None. 184 | 185 | Args: 186 | catalogue (list): The processed catalogue as a list 187 | of dictionaries. 188 | user_input (str): The user input, which can be a URL 189 | or a category name. 190 | 191 | Returns: 192 | tuple: The category name, shard, and query, or None if no match is found.
193 | """ 194 | for category in catalogue: 195 | if (user_input.split("https://www.wildberries.ru")[-1] 196 | == category['url'] or user_input == category['name']): 197 | return category['name'], category['shard'], category['query'] 198 | 199 | def get_products_on_page(self, page_data: dict) -> list: 200 | """ 201 | Parse one page of results and return a list with product data. 202 | 203 | This function takes a dictionary containing the data of a page from 204 | wildberries.ru, specifically the 'data' key with a list of 'products'. 205 | It iterates over each item in the 'products' list and extracts 206 | relevant information to create a dictionary representing a product. 207 | The dictionaries are then appended to the 'products_on_page' list. 208 | 209 | Args: 210 | page_data (dict): A dictionary containing the data 211 | of a page from wildberries.ru. 212 | 213 | Returns: 214 | list: A list of dictionaries representing the products 215 | on the page, where each dictionary contains information 216 | such as the link, article number, name, brand, price, discounted 217 | price, rating, and number of reviews. 218 | 219 | """ 220 | products_on_page = [] 221 | for item in page_data['data']['products']: 222 | products_on_page.append({ 223 | 'Ссылка': f"https://www.wildberries.ru/catalog/" 224 | f"{item['id']}/detail.aspx", 225 | 'Артикул': item['id'], 226 | 'Наименование': item['name'], 227 | 'Бренд': item['brand'], 228 | 'ID бренда': item['brandId'], 229 | 'Цена': int(item['priceU'] / 100), 230 | 'Цена со скидкой': int(item['salePriceU'] / 100), 231 | 'Рейтинг': item['rating'], 232 | 'Отзывы': item['feedbacks'] 233 | }) 234 | return products_on_page 235 | 236 | def add_data_from_page(self, url: str): 237 | """ 238 | Add data on products from a page to the class's list. 239 | 240 | This function makes a GET request to the specified URL using 241 | the provided headers, expecting a JSON response. 
The page data is then 242 | passed to the get_products_on_page method to extract the relevant 243 | product information. If there are products on the page, 244 | they are appended to the product_cards list in the class, 245 | and the number of added products is printed. If there are no products 246 | on the page, it prints a message and returns True to indicate the end 247 | of product loading. 248 | 249 | Args: 250 | url (str): The URL of the page to retrieve the product data from. 251 | 252 | Returns: 253 | bool or None: Returns True if there are no products on the page, 254 | indicating the end of product loading. Otherwise, returns None. 255 | """ 256 | response = requests.get(url, headers=self.headers).json() 257 | page_data = self.get_products_on_page(response) 258 | if len(page_data) > 0: 259 | self.product_cards.extend(page_data) 260 | print(f"Добавлено товаров: {len(page_data)}") 261 | else: 262 | print('Загрузка товаров завершена') 263 | return True 264 | 265 | def get_all_products_in_category(self, category_data: tuple): 266 | """ 267 | Retrieve all products in a category by going through all pages. 268 | 269 | This function iterates over page numbers from 1 to 100, constructing 270 | the URL for each page in the specified category. It then calls the 271 | add_data_from_page method to retrieve and add the product data from 272 | each page to the class's product_cards list. If the add_data_from_page 273 | method returns True, indicating the end of product loading, 274 | the loop breaks. 275 | 276 | Note: 277 | The wildberries.ru website currently limits the maximum number of 278 | pages that can be parsed to 100. 279 | 280 | Args: 281 | category_data (tuple): A tuple containing the category name, 282 | shard, and query. 
283 | 284 | Returns: 285 | None 286 | """ 287 | for page in range(1, 101): 288 | print(f"Загружаю товары со страницы {page}") 289 | url = (f"https://catalog.wb.ru/catalog/{category_data[1]}/" 290 | f"catalog?appType=1&{category_data[2]}&curr=rub" 291 | f"&dest=-1257786&page={page}&sort=popular&spp=24") 292 | if self.add_data_from_page(url): 293 | break 294 | 295 | def get_sales_data(self): 296 | """ 297 | Parse additional sales data for the product cards. 298 | 299 | This function iterates over each product card in the product_cards 300 | list and makes a request to retrieve the sales data for the 301 | corresponding product. The sales data is then added to the product 302 | card dictionary with the key 'Продано'. If the request fails or the 303 | response contains no sales data, the value is set to 304 | 'нет данных'. Progress information is printed during the iteration. 305 | 306 | Returns: 307 | None 308 | """ 309 | for number, card in enumerate(self.product_cards, start=1): 310 | url = (f"https://product-order-qnt.wildberries.ru/by-nm/" 311 | f"?nm={card['Артикул']}") 312 | try: 313 | response = requests.get(url, headers=self.headers).json() 314 | card['Продано'] = response[0]['qnt'] 315 | except (requests.RequestException, KeyError, IndexError): 316 | card['Продано'] = 'нет данных' 317 | print(f"Собрано карточек: {number}" 318 | f" из {len(self.product_cards)}") 319 | 320 | def save_to_excel(self, file_name: str) -> str: 321 | """ 322 | Save the parsed data in xlsx format and return its path. 323 | 324 | This function takes the parsed data stored in the product_cards list 325 | and converts it into a Pandas DataFrame. It then saves the DataFrame 326 | as an xlsx file with the specified file name and the current run date 327 | appended to it. The resulting file path is returned. 328 | 329 | Args: 330 | file_name (str): The desired file name for the saved xlsx file. 331 | 332 | Returns: 333 | str: The path of the saved xlsx file.
334 | """ 335 | data = pd.DataFrame(self.product_cards) 336 | result_path = (f"{path.join(self.directory, file_name)}_" 337 | f"{self.run_date.strftime('%Y-%m-%d')}.xlsx") 338 | writer = pd.ExcelWriter(result_path) 339 | data.to_excel(writer, sheet_name='data', index=False) 340 | writer.close() 341 | return result_path 342 | 343 | def get_all_products_in_search_result(self, key_word: str): 344 | """ 345 | Retrieve all products in the search result by going through all pages. 346 | 347 | This function iterates over page numbers from 1 to 100, constructing 348 | the URL for each page in the search result based on the provided 349 | keyword. It then calls the add_data_from_page method to retrieve and 350 | add the product data from each page to the class's product_cards list. 351 | If the add_data_from_page method returns True, indicating the end of 352 | product loading, the loop breaks. 353 | 354 | Args: 355 | key_word (str): The keyword to search for in the 356 | wildberries.ru search. 357 | 358 | Returns: 359 | None 360 | """ 361 | for page in range(1, 101): 362 | print(f"Загружаю товары со страницы {page}") 363 | url = (f"https://search.wb.ru/exactmatch/ru/common/v4/search?" 364 | f"appType=1&curr=rub&dest=-1257786&page={page}" 365 | f"&query={'%20'.join(key_word.split())}&resultset=catalog" 366 | f"&sort=popular&spp=24&suppressSpellcheck=false") 367 | if self.add_data_from_page(url): 368 | break 369 | 370 | def run_parser(self): 371 | """ 372 | Run the whole script for parsing and data processing. 373 | 374 | This function runs the entire script by prompting the user to choose 375 | a parsing mode: either parsing a category entirely or parsing by 376 | keywords. Based on the user's choice, it executes the corresponding 377 | sequence of steps. For parsing a category, it downloads the current 378 | catalogue, processes it, extracts the category data, retrieves all 379 | products in the category, collects sales data, and saves the parsed 380 | data to an Excel file.
For parsing by keywords, it prompts for 381 | a search query, retrieves all products in the search result, collects 382 | sales data, and saves the parsed data to an Excel file. 383 | 384 | Returns: 385 | None 386 | """ 387 | instructions = """Введите 1 для парсинга категории целиком, 388 | 2 — по ключевым словам: """ 389 | mode = input(instructions) 390 | if mode == '1': 391 | local_catalogue_path = self.download_current_catalogue() 392 | print(f"Каталог сохранен: {local_catalogue_path}") 393 | processed_catalogue = self.process_catalogue(local_catalogue_path) 394 | input_category = input("Введите название категории или ссылку: ") 395 | category_data = self.extract_category_data(processed_catalogue, 396 | input_category) 397 | if category_data is None: 398 | print("Категория не найдена") 399 | else: 400 | print(f"Найдена категория: {category_data[0]}") 401 | self.get_all_products_in_category(category_data) 402 | self.get_sales_data() 403 | print(f"Данные сохранены в {self.save_to_excel(category_data[0])}") 404 | elif mode == '2': 405 | key_word = input("Введите запрос для поиска: ") 406 | self.get_all_products_in_search_result(key_word) 407 | self.get_sales_data() 408 | print(f"Данные сохранены в {self.save_to_excel(key_word)}") 409 | 410 | 411 | if __name__ == '__main__': 412 | app = WildBerriesParser() 413 | app.run_parser() 414 | --------------------------------------------------------------------------------
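Appendix: the field mapping performed by `get_products_on_page` in `wbparser.py` can be exercised in isolation, which is handy for understanding the row layout before any network calls. The sketch below reproduces that mapping as a standalone function; the `sample` payload is an invented, minimal stand-in for a real `catalog.wb.ru` response, which carries many more fields per product and may change shape without notice.

```python
# Standalone sketch of the mapping done by get_products_on_page.
# The sample payload below is invented for illustration; real
# catalog.wb.ru responses contain many more fields per product.

def products_from_page(page_data: dict) -> list:
    """Flatten page_data['data']['products'] into Russian-keyed rows."""
    rows = []
    for item in page_data['data']['products']:
        rows.append({
            'Ссылка': (f"https://www.wildberries.ru/catalog/"
                       f"{item['id']}/detail.aspx"),
            'Артикул': item['id'],
            'Наименование': item['name'],
            'Бренд': item['brand'],
            'ID бренда': item['brandId'],
            # priceU and salePriceU are integer prices in kopecks,
            # hence the division by 100 to get rubles
            'Цена': int(item['priceU'] / 100),
            'Цена со скидкой': int(item['salePriceU'] / 100),
            'Рейтинг': item['rating'],
            'Отзывы': item['feedbacks'],
        })
    return rows


sample = {'data': {'products': [{
    'id': 123456, 'name': 'Книга', 'brand': 'Издательство',
    'brandId': 42, 'priceU': 99900, 'salePriceU': 79900,
    'rating': 4.8, 'feedbacks': 150,
}]}}

rows = products_from_page(sample)
print(rows[0]['Цена'], rows[0]['Цена со скидкой'])  # prints: 999 799
```

Feeding such rows to `pandas.DataFrame`, as `save_to_excel` does, yields one spreadsheet column per dictionary key, in insertion order.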