├── .gitignore
├── README.md
├── scrapy.cfg
└── walmart_scraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── walmart.py

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
venv/


## Custom
data/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# walmart-python-scrapy-scraper
Python Scrapy spider that scrapes product data from [Walmart.com](https://www.walmart.com/).

The spider extracts the following fields from Walmart product pages:

- Product Type
- Product Name
- Brand
- Average Rating
- Manufacturer Name
- Description
- Image URL
- Price
- Currency
- Etc.

The following article goes through, in detail, how this Walmart spider was developed; you can use it to understand the spider and edit it for your own use case.

[Python Scrapy: Build A Walmart.com Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-walmart-scraper/)

## ScrapeOps Proxy
This Walmart spider uses [ScrapeOps Proxy](https://scrapeops.io/proxy-aggregator/) as the proxy solution. ScrapeOps has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, and it can easily be scaled up to millions of pages per month if need be.

You can [sign up for a free API key here](https://scrapeops.io/app/register/main).

To use the ScrapeOps Proxy you first need to install the proxy middleware:

```

pip install scrapeops-scrapy-proxy-sdk

```

Then activate the ScrapeOps Proxy by adding your API key to the `SCRAPEOPS_API_KEY` setting in the `settings.py` file.

```python

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

```


## ScrapeOps Monitoring
To monitor the scraper, this spider uses the [ScrapeOps Monitor](https://scrapeops.io/monitoring-scheduling/), a free monitoring tool specifically designed for web scraping.

**Live demo here:** [ScrapeOps Demo](https://scrapeops.io/app/login/demo)

![ScrapeOps Dashboard](https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png)

To use the ScrapeOps Monitor you first need to install the monitoring SDK:

```

pip install scrapeops-scrapy

```


Then activate the ScrapeOps Monitor by adding your API key to the `SCRAPEOPS_API_KEY` setting in the `settings.py` file.

```python

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

# Add In The ScrapeOps Monitoring Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


DOWNLOADER_MIDDLEWARES = {

    ## ScrapeOps Monitor
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    ## Proxy Middleware
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

```

If you are using both the ScrapeOps Proxy & Monitoring then you only need to enter the API key once.


## Running The Scrapers
Make sure Scrapy and the ScrapeOps Monitor are installed:

```

pip install scrapy scrapeops-scrapy

```

To run the Walmart spider you should first set the search queries you want to scrape by updating the `keyword_list` list in the spider:

```python

def start_requests(self):
    keyword_list = ['laptop']
    for keyword in keyword_list:
        payload = {'q': keyword, 'sort': 'best_seller', 'page': 1, 'affinityOverride': 'default'}
        walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
        yield scrapy.Request(url=walmart_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': 1})

```

Then to run the spider, enter the following command:

```

scrapy crawl walmart

```
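
If you just want a quick one-off export without editing the feed settings, you can also pass an output file directly to the crawl command. This is a minimal example (the file name is arbitrary, and the `-O` overwrite flag requires Scrapy 2.4 or newer):

```

scrapy crawl walmart -O data/walmart_products.csv

```

The output format is inferred from the file extension, so a `.json` or `.jsonl` path works the same way.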


## Customizing The Walmart Scraper
The following are instructions on how to modify the Walmart scraper for your particular use case.

Check out this [guide to building a Walmart.com Scrapy spider](https://scrapeops.io/python-scrapy-playbook/python-scrapy-walmart-scraper/) if you need any more information.

### Configuring Walmart Product Search
To change the query parameters for the product search, just change the keywords in the `keyword_list` list in the spider.

For example:

```python

def start_requests(self):
    keyword_list = ['laptop', 'ipad']
    for keyword in keyword_list:
        payload = {'q': keyword, 'sort': 'best_seller', 'page': 1, 'affinityOverride': 'default'}
        walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
        yield scrapy.Request(url=walmart_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': 1})

```

You can also change the sorting algorithm to one of: `best_seller`, `best_match`, `price_low` or `price_high`.

### Extract More/Different Data
The JSON blobs the spider extracts the product data from are pretty big, so the spider is configured to only parse some of the data.
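
If you want to see every field that is available before deciding what to extract, one option is to temporarily dump the full product record to a file from inside `parse_product_data` and inspect it. This is just a debugging sketch that reuses the `json_blob` the spider already loads from the `__NEXT_DATA__` script tag; the output file name is arbitrary:

```python

import json

# Inside parse_product_data, after json_blob has been loaded:
raw_product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]

# Dump the full product record so you can browse all of its keys.
with open('raw_product_data.json', 'w') as f:
    json.dump(raw_product_data, f, indent=2)

```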

You can expand or change the data that gets extracted by changing the yield statements:

```python

yield {
    'keyword': response.meta['keyword'],
    'page': response.meta['page'],
    'position': response.meta['position'],
    'id': raw_product_data.get('id'),
    'type': raw_product_data.get('type'),
    'name': raw_product_data.get('name'),
    'brand': raw_product_data.get('brand'),
    'averageRating': raw_product_data.get('averageRating'),
    'manufacturerName': raw_product_data.get('manufacturerName'),
    'shortDescription': raw_product_data.get('shortDescription'),
    'thumbnailUrl': raw_product_data['imageInfo'].get('thumbnailUrl'),
    'price': raw_product_data['priceInfo']['currentPrice'].get('price'),
    'currencyUnit': raw_product_data['priceInfo']['currentPrice'].get('currencyUnit'),
}

```

### Speeding Up The Crawl
The spider is set to use only 1 concurrent thread in the `settings.py` file, as the ScrapeOps Free Proxy Plan only gives you 1 concurrent thread.

However, if you upgrade to a paid ScrapeOps Proxy plan you will have more concurrent threads. You can then increase the concurrency limit in your scraper by updating the `CONCURRENT_REQUESTS` value in your `settings.py` file.

```python
# settings.py

CONCURRENT_REQUESTS = 10

```

### Storing Data
The spider is set to save the scraped data to a CSV file in a `data` folder using [Scrapy's Feed Export functionality](https://docs.scrapy.org/en/latest/topics/feed-exports.html).

```python

custom_settings = {
    'FEEDS': { 'data/%(name)s_%(time)s.csv': { 'format': 'csv',}}
}

```
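
If you prefer a different output format, the same `FEEDS` setting can point at any of Scrapy's built-in feed exporters. For example, here is a minimal sketch that writes newline-delimited JSON instead of CSV (the file path is just an example):

```python

custom_settings = {
    'FEEDS': { 'data/%(name)s_%(time)s.jsonl': { 'format': 'jsonlines',}}
}

```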

If you would like to save your CSV files to an AWS S3 bucket then check out our [Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here](https://scrapeops.io//python-scrapy-playbook/scrapy-save-aws-s3).

Or if you would like to save your data in another format or database then be sure to check out these guides:

- [Saving Data to JSON](https://scrapeops.io/python-scrapy-playbook/scrapy-save-json-files)
- [Saving Data to SQLite Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-sqlite)
- [Saving Data to MySQL Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-mysql)
- [Saving Data to Postgres Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-postgres)

### Deactivating ScrapeOps Proxy & Monitor
To deactivate the ScrapeOps Proxy & Monitor, simply comment out the following code in your `settings.py` file:

```python
# settings.py

# ## Enable ScrapeOps Proxy
# SCRAPEOPS_PROXY_ENABLED = True

# # Add In The ScrapeOps Monitoring Extension
# EXTENSIONS = {
#     'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
# }


# DOWNLOADER_MIDDLEWARES = {

#     ## ScrapeOps Monitor
#     'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
#     'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

#     ## Proxy Middleware
#     'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
# }

```

--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = walmart_scraper.settings

[deploy]
#url = http://localhost:6800/
project = walmart_scraper

--------------------------------------------------------------------------------
/walmart_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-scrapy-playbook/walmart-python-scrapy-scraper/05ef6dadb3bd42b7f75407c062bf198ebe471aa3/walmart_scraper/__init__.py
--------------------------------------------------------------------------------
/walmart_scraper/items.py:
--------------------------------------------------------------------------------
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WalmartScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

--------------------------------------------------------------------------------
/walmart_scraper/middlewares.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-scrapy-playbook/walmart-python-scrapy-scraper/05ef6dadb3bd42b7f75407c062bf198ebe471aa3/walmart_scraper/middlewares.py
--------------------------------------------------------------------------------
/walmart_scraper/pipelines.py:
--------------------------------------------------------------------------------
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WalmartScraperPipeline:
    def process_item(self, item, spider):
        return item

--------------------------------------------------------------------------------
/walmart_scraper/settings.py:
--------------------------------------------------------------------------------
# Scrapy settings for walmart_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'walmart_scraper'

SPIDER_MODULES = ['walmart_scraper.spiders']
NEWSPIDER_MODULE = 'walmart_scraper.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

SCRAPEOPS_PROXY_ENABLED = True

# Add In The ScrapeOps Monitoring Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


DOWNLOADER_MIDDLEWARES = {

    ## ScrapeOps Monitor
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    ## Proxy Middleware
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

# Max Concurrency On ScrapeOps Proxy Free Plan is 1 thread
CONCURRENT_REQUESTS = 1

--------------------------------------------------------------------------------
/walmart_scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

--------------------------------------------------------------------------------
/walmart_scraper/spiders/walmart.py:
--------------------------------------------------------------------------------
import json
import math
import scrapy
from urllib.parse import urlencode

class WalmartSpider(scrapy.Spider):
    name = "walmart"

    custom_settings = {
        'FEEDS': { 'data/%(name)s_%(time)s.csv': { 'format': 'csv',}}
    }

    def start_requests(self):
        keyword_list = ['laptop']
        for keyword in keyword_list:
            payload = {'q': keyword, 'sort': 'best_seller', 'page': 1, 'affinityOverride': 'default'}
            walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
            yield scrapy.Request(url=walmart_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': 1})

    def parse_search_results(self, response):
        page = response.meta['page']
        keyword = response.meta['keyword']
        script_tag = response.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
        if script_tag is not None:
            json_blob = json.loads(script_tag)

            ## Request Product Page
            product_list = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
            for idx, product in enumerate(product_list):
                walmart_product_url = 'https://www.walmart.com' + product.get('canonicalUrl', '').split('?')[0]
                yield scrapy.Request(url=walmart_product_url, callback=self.parse_product_data, meta={'keyword': keyword, 'page': page, 'position': idx + 1})

            ## Request Next Page
            if page == 1:
                total_product_count = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
                # Walmart shows 40 results per page; request pages 2..max_pages, capped at 5 pages
                max_pages = math.ceil(total_product_count / 40)
                if max_pages > 5:
                    max_pages = 5
                for p in range(2, max_pages + 1):
                    payload = {'q': keyword, 'sort': 'best_seller', 'page': p, 'affinityOverride': 'default'}
                    walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
                    yield scrapy.Request(url=walmart_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': p})


    def parse_product_data(self, response):
        script_tag = response.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
        if script_tag is not None:
            json_blob = json.loads(script_tag)
            raw_product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]
            yield {
                'keyword': response.meta['keyword'],
                'page': response.meta['page'],
                'position': response.meta['position'],
                'id': raw_product_data.get('id'),
                'type': raw_product_data.get('type'),
                'name': raw_product_data.get('name'),
                'brand': raw_product_data.get('brand'),
                'averageRating': raw_product_data.get('averageRating'),
                'manufacturerName': raw_product_data.get('manufacturerName'),
                'shortDescription': raw_product_data.get('shortDescription'),
                'thumbnailUrl': raw_product_data['imageInfo'].get('thumbnailUrl'),
                'price': raw_product_data['priceInfo']['currentPrice'].get('price'),
                'currencyUnit': raw_product_data['priceInfo']['currentPrice'].get('currencyUnit'),
            }

--------------------------------------------------------------------------------