├── .gitignore ├── LICENSE ├── README.md ├── requirements.txt └── wbparser.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .python-version 3 | .vscode 4 | *.json 5 | *.xlsx 6 | dev_reqs.txt -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023 Kirill Ignatyev (https://kirillignatyev.github.io) 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WildBerries Parser 2 | 3 | [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) 4 | 5 | ## Overview 6 | The WildBerries Parser is a Python script that collects information on items from the [Wildberries](https://wildberries.ru) website and saves it into an xlsx file. The parser provides two modes: scanning all items in a specific category of the marketplace, or parsing all items in the search results for a given keyword. For each item it collects the link, ID, name, brand name, brand ID, regular price, discounted price, rating, number of reviews, and total sales. 7 | 8 | ## Features 9 | - Retrieve item data from Wildberries based on categories or search keywords 10 | - Extract information such as link, ID, name, brand, pricing, rating, reviews, and sales 11 | - Save the collected data in xlsx format for further analysis 12 | 13 | ## Installation 14 | 1. Clone this repository to your local machine. 15 | 2. Navigate to the project directory. 16 | 17 | ## Prerequisites 18 | - Python 3.x 19 | - The `requests` and `pandas` libraries; install both with 20 | `pip install -r requirements.txt` (or `pip install requests pandas`) 21 | 22 | ## Usage 23 | 1. Open a terminal or command prompt. 24 | 2. Navigate to the project directory. 25 | 3. Run the script using the following command: 26 | `python wbparser.py` 27 | 4. Follow the on-screen instructions to choose the desired parsing mode and provide the necessary input. 28 | 29 | ## Examples 30 | To parse a specific category: 31 | - Choose the parsing mode for a category. 32 | - Enter the category name or URL. 33 | - The script will retrieve all products in the category, collect sales data, and save the parsed data to an xlsx file. 34 | 35 | To parse by keywords: 36 | - Choose the parsing mode for keywords.
37 | - Enter the search query. 38 | - The script will retrieve all products in the search results, collect sales data, and save the parsed data to an xlsx file. 39 | 40 | ## Contributing 41 | Contributions are welcome! If you find any bugs or have suggestions for improvements, please open an issue or submit a pull request. 42 | 43 | ## License 44 | This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. 45 | 46 | ## Acknowledgements 47 | Special thanks to [Timerlan Nalimov](https://github.com/Timur1991) for inspiring this project with his [initial parser](https://github.com/Timur1991/project_wildberries). I appreciate his work and its contribution to the development of this parser. 48 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==2.1.1 2 | Requests==2.31.0 3 | -------------------------------------------------------------------------------- /wbparser.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """Collect info on items from wildberries.ru and save it into an xlsx file. 3 | 4 | This script is designed to extract data from the wildberries.ru website 5 | using two main modes: 6 | 1. Scanning items in a specific category of the marketplace (e.g., books). 7 | 2. Parsing all items in the search results based on a given keyword. 8 | 9 | The script collects the following data from each item in the category 10 | or search results, which is then saved in xlsx format: 11 | - Link 12 | - ID 13 | - Name 14 | - Brand name 15 | - Brand ID 16 | - Regular price 17 | - Discounted price 18 | - Rating 19 | - Number of reviews 20 | - Total sales 21 | 22 | The parser is under active development, and new features may be added 23 | in the future. 24 | It was inspired by a parser by Timerlan Nalimov (https://github.com/Timur1991).
25 | 26 | The script is distributed under the MIT license. 27 | 28 | --- 29 | 30 | Class: WildBerriesParser 31 | 32 | Methods: 33 | - __init__: Initialize the parser object. 34 | - download_current_catalogue: Download the current catalogue in JSON format. 35 | - traverse_json: Recursively traverse the JSON catalogue 36 | and flatten it to a list. 37 | - process_catalogue: Process the locally saved JSON catalogue 38 | into a list of dictionaries. 39 | - extract_category_data: Extract category data from the processed catalogue. 40 | - get_products_on_page: Parse one page of category or search results 41 | and return a list with product data. 42 | - add_data_from_page: Add data on products from a page to the class's list. 43 | - get_all_products_in_category: Retrieve all products in a category 44 | by going through all pages. 45 | - get_sales_data: Parse additional sales data for the product cards. 46 | - save_to_excel: Save the parsed data in xlsx format and return its path. 47 | - get_all_products_in_search_result: Retrieve all products in the search 48 | result by going through all pages. 49 | - run_parser: Run the whole script for parsing and data processing. 50 | 51 | --- 52 | 53 | Note: This script utilizes the requests library 54 | and requires an active internet connection to function properly. 55 | 56 | """ 57 | 58 | __author__ = "Kirill Ignatyev" 59 | __copyright__ = "Copyright (c) 2023, Kirill Ignatyev" 60 | __license__ = "MIT" 61 | __status__ = "Development" 62 | __version__ = "1.3" 63 | 64 | import json 65 | from datetime import date 66 | from os import path 67 | 68 | 69 | import pandas as pd 70 | import requests 71 | 72 | 73 | class WildBerriesParser: 74 | """ 75 | A parser object for extracting data from wildberries.ru. 76 | 77 | Attributes: 78 | headers (dict): HTTP headers for the parser. 79 | run_date (datetime.date): The date when the parser is run. 80 | product_cards (list): A list to store the parsed product cards. 
81 | directory (str): The directory path where the script is located. 82 | """ 83 | 84 | def __init__(self): 85 | """ 86 | Initialize a new instance of the WildBerriesParser class. 87 | 88 | This constructor sets up the parser object with default values 89 | for its attributes. 90 | 91 | Args: 92 | None 93 | 94 | Returns: 95 | None 96 | """ 97 | self.headers = {'Accept': "*/*", 98 | 'User-Agent': "Chrome/51.0.2704.103 Safari/537.36"} 99 | self.run_date = date.today() 100 | self.product_cards = [] 101 | self.directory = path.dirname(__file__) 102 | 103 | def download_current_catalogue(self) -> str: 104 | """ 105 | Download the catalogue from wildberries.ru and save it in JSON format. 106 | 107 | If an up-to-date catalogue already exists in the script's directory, 108 | it uses that instead. 109 | 110 | Returns: 111 | str: The path to the downloaded catalogue file. 112 | """ 113 | local_catalogue_path = path.join(self.directory, 'wb_catalogue.json') 114 | if (not path.exists(local_catalogue_path) 115 | or date.fromtimestamp(int(path.getmtime(local_catalogue_path))) 116 | < self.run_date): 117 | url = ('https://static-basket-01.wb.ru/vol0/data/' 118 | 'main-menu-ru-ru-v2.json') 119 | response = requests.get(url, headers=self.headers).json() 120 | with open(local_catalogue_path, 'w', encoding='UTF-8') as my_file: 121 | json.dump(response, my_file, indent=2, ensure_ascii=False) 122 | return local_catalogue_path 123 | 124 | def traverse_json(self, parent_category: list, flattened_catalogue: list): 125 | """ 126 | Recursively traverse the JSON catalogue and flatten it to a list. 127 | 128 | This function runs recursively through the locally saved JSON 129 | catalogue and appends relevant information to the flattened_catalogue 130 | list. 131 | It handles KeyError exceptions that might occur due to inconsistencies 132 | in the keys of the JSON catalogue. 133 | 134 | Args: 135 | parent_category (list): A list containing the current category 136 | to traverse.
137 | flattened_catalogue (list): A list to store the flattened 138 | catalogue. 139 | 140 | Returns: 141 | None 142 | """ 143 | for category in parent_category: 144 | try: 145 | flattened_catalogue.append({ 146 | 'name': category['name'], 147 | 'url': category['url'], 148 | 'shard': category['shard'], 149 | 'query': category['query'] 150 | }) 151 | except KeyError: 152 | continue 153 | if 'childs' in category: 154 | self.traverse_json(category['childs'], flattened_catalogue) 155 | 156 | def process_catalogue(self, local_catalogue_path: str) -> list: 157 | """ 158 | Process the locally saved JSON catalogue into a list of dictionaries. 159 | 160 | This function reads the locally saved JSON catalogue file, 161 | invokes the traverse_json method to flatten the catalogue, 162 | and returns the resulting catalogue as a list of dictionaries. 163 | 164 | Args: 165 | local_catalogue_path (str): The path to the locally saved 166 | JSON catalogue file. 167 | 168 | Returns: 169 | list: A list of dictionaries representing the processed catalogue. 170 | """ 171 | catalogue = [] 172 | with open(local_catalogue_path, 'r', encoding='UTF-8') as my_file: 173 | self.traverse_json(json.load(my_file), catalogue) 174 | return catalogue 175 | 176 | def extract_category_data(self, catalogue: list, user_input: str) -> tuple: 177 | """ 178 | Extract category data from the processed catalogue. 179 | 180 | This function searches for a matching category based 181 | on the user input, which can be either a URL or a category name. 182 | If a match is found, it returns a tuple containing the category name, 183 | shard, and query; otherwise it returns None. 184 | 185 | Args: 186 | catalogue (list): The processed catalogue as a list 187 | of dictionaries. 188 | user_input (str): The user input, which can be a URL 189 | or a category name. 190 | 191 | Returns: 192 | tuple: The category name, shard, and query, or None if no match is found.
193 | """ 194 | for category in catalogue: 195 | if (user_input.split("https://www.wildberries.ru")[-1] 196 | == category['url'] or user_input == category['name']): 197 | return category['name'], category['shard'], category['query'] 198 | 199 | def get_products_on_page(self, page_data: dict) -> list: 200 | """ 201 | Parse one page of results and return a list with product data. 202 | 203 | This function takes a dictionary containing the data of a page from 204 | wildberries.ru, specifically the 'data' key with a list of 'products'. 205 | It iterates over each item in the 'products' list and extracts 206 | relevant information to create a dictionary representing a product. 207 | The dictionaries are then appended to the 'products_on_page' list. 208 | 209 | Args: 210 | page_data (dict): A dictionary containing the data 211 | of a page from wildberries.ru. 212 | 213 | Returns: 214 | list: A list of dictionaries representing the products 215 | on the page, where each dictionary contains information 216 | such as the link, article number, name, brand, price, discounted 217 | price, rating, and number of reviews. 218 | 219 | """ 220 | products_on_page = [] 221 | for item in page_data['data']['products']: 222 | products_on_page.append({ 223 | 'Ссылка': f"https://www.wildberries.ru/catalog/" 224 | f"{item['id']}/detail.aspx", 225 | 'Артикул': item['id'], 226 | 'Наименование': item['name'], 227 | 'Бренд': item['brand'], 228 | 'ID бренда': item['brandId'], 229 | 'Цена': int(item['priceU'] / 100), 230 | 'Цена со скидкой': int(item['salePriceU'] / 100), 231 | 'Рейтинг': item['rating'], 232 | 'Отзывы': item['feedbacks'] 233 | }) 234 | return products_on_page 235 | 236 | def add_data_from_page(self, url: str): 237 | """ 238 | Add data on products from a page to the class's list. 239 | 240 | This function makes a GET request to the specified URL using 241 | the provided headers, expecting a JSON response. 
The page data is then 242 | passed to the get_products_on_page method to extract the relevant 243 | product information. If there are products on the page, 244 | they are appended to the product_cards list in the class, 245 | and the number of added products is printed. If there are no products 246 | on the page, it prints a message and returns True to indicate the end 247 | of product loading. 248 | 249 | Args: 250 | url (str): The URL of the page to retrieve the product data from. 251 | 252 | Returns: 253 | bool or None: Returns True if there are no products on the page, 254 | indicating the end of product loading. Otherwise, returns None. 255 | """ 256 | response = requests.get(url, headers=self.headers).json() 257 | page_data = self.get_products_on_page(response) 258 | if len(page_data) > 0: 259 | self.product_cards.extend(page_data) 260 | print(f"Добавлено товаров: {len(page_data)}") 261 | else: 262 | print('Загрузка товаров завершена') 263 | return True 264 | 265 | def get_all_products_in_category(self, category_data: tuple): 266 | """ 267 | Retrieve all products in a category by going through all pages. 268 | 269 | This function iterates over page numbers from 1 to 100, constructing 270 | the URL for each page in the specified category. It then calls the 271 | add_data_from_page method to retrieve and add the product data from 272 | each page to the class's product_cards list. If the add_data_from_page 273 | method returns True, indicating the end of product loading, 274 | the loop breaks. 275 | 276 | Note: 277 | The wildberries.ru website currently limits the maximum number of 278 | pages that can be parsed to 100. 279 | 280 | Args: 281 | category_data (tuple): A tuple containing the category name, 282 | shard, and query. 
283 | 284 | Returns: 285 | None 286 | """ 287 | for page in range(1, 101): 288 | print(f"Загружаю товары со страницы {page}") 289 | url = (f"https://catalog.wb.ru/catalog/{category_data[1]}/" 290 | f"catalog?appType=1&{category_data[2]}&curr=rub" 291 | f"&dest=-1257786&page={page}&sort=popular&spp=24") 292 | if self.add_data_from_page(url): 293 | break 294 | 295 | def get_sales_data(self): 296 | """ 297 | Parse additional sales data for the product cards. 298 | 299 | This function iterates over each product card in the product_cards 300 | list and makes a request to retrieve the sales data for the 301 | corresponding product. The sales data is then added to the product 302 | card dictionary with the key 'Продано'. If the request fails or the 303 | response contains no sales data, the value is set to 304 | 'нет данных'. Progress information is printed during the iteration. 305 | 306 | Returns: 307 | None 308 | """ 309 | for number, card in enumerate(self.product_cards, start=1): 310 | url = (f"https://product-order-qnt.wildberries.ru/by-nm/" 311 | f"?nm={card['Артикул']}") 312 | try: 313 | response = requests.get(url, headers=self.headers).json() 314 | card['Продано'] = response[0]['qnt'] 315 | except (requests.RequestException, KeyError, IndexError): 316 | card['Продано'] = 'нет данных' 317 | print(f"Собрано карточек: {number}" 318 | f" из {len(self.product_cards)}") 319 | 320 | def save_to_excel(self, file_name: str) -> str: 321 | """ 322 | Save the parsed data in xlsx format and return its path. 323 | 324 | This function takes the parsed data stored in the product_cards list 325 | and converts it into a Pandas DataFrame. It then saves the DataFrame 326 | as an xlsx file with the specified file name and the current run date 327 | appended to it. The resulting file path is returned. 328 | 329 | Args: 330 | file_name (str): The desired file name for the saved xlsx file. 331 | 332 | Returns: 333 | str: The path of the saved xlsx file.
334 | """ 335 | data = pd.DataFrame(self.product_cards) 336 | result_path = (f"{path.join(self.directory, file_name)}_" 337 | f"{self.run_date.strftime('%Y-%m-%d')}.xlsx") 338 | writer = pd.ExcelWriter(result_path) 339 | data.to_excel(writer, sheet_name='data', index=False) 340 | writer.close() 341 | return result_path 342 | 343 | def get_all_products_in_search_result(self, key_word: str): 344 | """ 345 | Retrieve all products in the search result by going through all pages. 346 | 347 | This function iterates over page numbers from 1 to 100, constructing 348 | the URL for each page in the search result based on the provided 349 | keyword. It then calls the add_data_from_page method to retrieve and 350 | add the product data from each page to the class's product_cards list. 351 | If the add_data_from_page method returns True, indicating the end of 352 | product loading, the loop breaks. 353 | 354 | Args: 355 | key_word (str): The keyword to search for in the 356 | wildberries.ru search. 357 | 358 | Returns: 359 | None 360 | """ 361 | for page in range(1, 101): 362 | print(f"Загружаю товары со страницы {page}") 363 | url = (f"https://search.wb.ru/exactmatch/ru/common/v4/search?" 364 | f"appType=1&curr=rub&dest=-1257786&page={page}" 365 | f"&query={'%20'.join(key_word.split())}&resultset=catalog" 366 | f"&sort=popular&spp=24&suppressSpellcheck=false") 367 | if self.add_data_from_page(url): 368 | break 369 | 370 | def run_parser(self): 371 | """ 372 | Run the whole script for parsing and data processing. 373 | 374 | This function runs the entire script by prompting the user to choose 375 | a parsing mode: either parsing a category entirely or parsing by 376 | keywords. Based on the user's choice, it executes the corresponding 377 | sequence of steps. For parsing a category, it downloads the current 378 | catalogue, processes it, extracts the category data, retrieves all 379 | products in the category, collects sales data, and saves the parsed 380 | data to an Excel file.
For parsing by keywords, it prompts for 381 | a search query, retrieves all products in the search result, collects 382 | sales data, and saves the parsed data to an Excel file. 383 | 384 | Returns: 385 | None 386 | """ 387 | instructions = """Введите 1 для парсинга категории целиком, 388 | 2 — по ключевым словам: """ 389 | mode = input(instructions) 390 | if mode == '1': 391 | local_catalogue_path = self.download_current_catalogue() 392 | print(f"Каталог сохранен: {local_catalogue_path}") 393 | processed_catalogue = self.process_catalogue(local_catalogue_path) 394 | input_category = input("Введите название категории или ссылку: ") 395 | category_data = self.extract_category_data(processed_catalogue, 396 | input_category) 397 | if category_data is None: 398 | print("Категория не найдена") 399 | else: 400 | print(f"Найдена категория: {category_data[0]}") 401 | self.get_all_products_in_category(category_data) 402 | self.get_sales_data() 403 | print(f"Данные сохранены в {self.save_to_excel(category_data[0])}") 404 | elif mode == '2': 405 | key_word = input("Введите запрос для поиска: ") 406 | self.get_all_products_in_search_result(key_word) 407 | self.get_sales_data() 408 | print(f"Данные сохранены в {self.save_to_excel(key_word)}") 409 | 410 | 411 | if __name__ == '__main__': 412 | app = WildBerriesParser() 413 | app.run_parser() 414 | --------------------------------------------------------------------------------
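Appendix: the field mapping performed by `get_products_on_page` in `wbparser.py` can be exercised in isolation, which is handy for understanding the row layout before any network calls. The sketch below reproduces that mapping as a standalone function; the `sample` payload is an invented, minimal stand-in for a real `catalog.wb.ru` response, which carries many more fields per product and may change shape without notice.

```python
# Standalone sketch of the mapping done by get_products_on_page.
# The sample payload below is invented for illustration; real
# catalog.wb.ru responses contain many more fields per product.

def products_from_page(page_data: dict) -> list:
    """Flatten page_data['data']['products'] into Russian-keyed rows."""
    rows = []
    for item in page_data['data']['products']:
        rows.append({
            'Ссылка': (f"https://www.wildberries.ru/catalog/"
                       f"{item['id']}/detail.aspx"),
            'Артикул': item['id'],
            'Наименование': item['name'],
            'Бренд': item['brand'],
            'ID бренда': item['brandId'],
            # priceU and salePriceU are integer prices in kopecks,
            # hence the division by 100 to get rubles
            'Цена': int(item['priceU'] / 100),
            'Цена со скидкой': int(item['salePriceU'] / 100),
            'Рейтинг': item['rating'],
            'Отзывы': item['feedbacks'],
        })
    return rows


sample = {'data': {'products': [{
    'id': 123456, 'name': 'Книга', 'brand': 'Издательство',
    'brandId': 42, 'priceU': 99900, 'salePriceU': 79900,
    'rating': 4.8, 'feedbacks': 150,
}]}}

rows = products_from_page(sample)
print(rows[0]['Цена'], rows[0]['Цена со скидкой'])  # prints: 999 799
```

Feeding such rows to `pandas.DataFrame`, as `save_to_excel` does, yields one spreadsheet column per dictionary key, in insertion order.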