├── README.md
├── google_scraper
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-37.pyc
│       │   └── google.cpython-37.pyc
│       └── google.py
└── scrapy.cfg

/README.md:
--------------------------------------------------------------------------------
# google-scraper-python-scrapy

A Python Scrapy spider that searches Google for a particular keyword and extracts all data from the SERP results. The spider iterates through every results page returned by the keyword query. These are the fields the spider scrapes from the Google SERP page:

* Title
* Link
* Related links
* Description
* Snippet
* Images
* Thumbnails
* Sources

This Google SERP spider uses Scraper API as the proxy solution. Scraper API has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, and it can easily be scaled up to millions of pages per month if need be.

Scraper API also offers auto-parsing functionality for Google free of charge, so by adding `&autoparse=true` to the request the API returns all the SERP data in JSON format.

To monitor the scraper, this project uses [ScrapeOps](https://scrapeops.io/). **Live demo here:** [ScrapeOps Demo](https://scrapeops.io/app/login/demo)

![ScrapeOps Dashboard](https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png)

This spider can easily be customised for your particular search requirements. In this case, it allows you to refine your search queries by specifying a keyword, the geographic region, the language, the number of results, results from a particular domain, or even to only return safe results.

The full tutorial can be found here: [Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider)](https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4lol-temp-slug-8957520?preview=f73a488815c3cc75236c79ea4bfadbe21121c6edd4d54095bac81832859c6be9464bc9d34bcc32f2f82792d4e97af36ef1836db8b3d20a1009ddf5d1)

## Using the Google Spider

Make sure Scrapy is installed:

```
pip install scrapy
```

Set the keywords you want to search in Google in the spider's `queries` list:

```
queries = ['scrapy', 'beautifulsoup']
```

### Setting Up ScraperAPI
Sign up for [Scraper API](https://www.scraperapi.com/signup) and get your free API key, which allows you to scrape 1,000 pages per month for free. Enter your API key into the `API_KEY` variable:

```
API_KEY = 'YOUR_API_KEY'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
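
The keywords from `queries` are converted into Google search URLs by the `create_google_url` helper in `google_scraper/spiders/google.py` before being wrapped by `get_url`. The original helper only sets the query (`q`), the number of results (`num`) and, optionally, a domain restriction (`as_sitesearch`). If you want the other refinements mentioned above (language, region, safe search), it can be extended with the corresponding Google query parameters — the sketch below is illustrative, and the `hl`, `gl` and `safe` parameters are additions that are not part of the original spider:

```
from urllib.parse import urlencode, urlparse

def create_google_url(query, site='', language='en', country='us', num_results=100, safe_search=False):
    """Build a Google search URL. Only q, num and as_sitesearch are handled by the
    original helper; hl, gl and safe are illustrative refinement parameters."""
    google_dict = {'q': query, 'num': num_results, 'hl': language, 'gl': country}
    if safe_search:
        google_dict['safe'] = 'active'  # only return safe results
    if site:
        # restrict results to a particular domain, e.g. site='https://scrapy.org'
        google_dict['as_sitesearch'] = urlparse(site).netloc
    return 'http://www.google.com/search?' + urlencode(google_dict)
```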

By default, the spider is set to a maximum of 5 concurrent requests, as this is the maximum concurrency allowed on Scraper API's free plan. If you have a plan with higher concurrency, make sure to increase the max concurrency in the spider's `custom_settings` dictionary.

```
custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                   'CONCURRENT_REQUESTS_PER_DOMAIN': 5,
                   'RETRY_TIMES': 5}
```

`RETRY_TIMES` is also set (to 5 here) so that Scrapy retries any failed requests. Make sure that `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` aren't enabled, as these will lower your concurrency and are not needed with Scraper API.

```
## settings.py

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
```

### Integrating ScrapeOps
[ScrapeOps](https://scrapeops.io/) is already integrated into the scraper via the `settings.py` file. However, to use it you must:

Install the [ScrapeOps Scrapy SDK](https://github.com/ScrapeOps/scrapeops-scrapy-sdk) on your machine.

```
pip install scrapeops-scrapy
```

And sign up for a [free ScrapeOps account here](https://scrapeops.io/app/register) so you can insert your **API Key** into the `settings.py` file:

```
## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
```

From there, your scraping stats will be automatically logged and shipped to your dashboard.

### Running The Spider
To run the spider, use:

```
scrapy crawl google -o test.csv
```
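
This writes the fields yielded by the spider (`title`, `snippet`, `link`, `position`, `date`) to `test.csv`. If you prefer to launch the crawl from a script rather than the CLI, Scrapy's `CrawlerProcess` can be used; the snippet below is a minimal sketch (the `run.py` filename is illustrative and the script is not part of this repository):

```
# run.py -- optional alternative to the `scrapy crawl` command (illustrative sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from google_scraper.spiders.google import GoogleSpider

if __name__ == '__main__':
    # get_project_settings() picks up settings.py (ScrapeOps extension, retries, etc.)
    process = CrawlerProcess(get_project_settings())
    process.crawl(GoogleSpider)
    process.start()  # blocks until the crawl finishes
```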

--------------------------------------------------------------------------------
/google_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__init__.py
--------------------------------------------------------------------------------
/google_scraper/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__pycache__/__init__.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/__pycache__/settings.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__pycache__/settings.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/items.py:
--------------------------------------------------------------------------------
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GoogleScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
--------------------------------------------------------------------------------
/google_scraper/middlewares.py:
--------------------------------------------------------------------------------
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class GoogleScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class GoogleScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
--------------------------------------------------------------------------------
/google_scraper/pipelines.py:
--------------------------------------------------------------------------------
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class GoogleScraperPipeline:
    def process_item(self, item, spider):
        return item
--------------------------------------------------------------------------------
/google_scraper/settings.py:
--------------------------------------------------------------------------------
# Scrapy settings for google_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'google_scraper'

SPIDER_MODULES = ['google_scraper.spiders']
NEWSPIDER_MODULE = 'google_scraper.spiders'

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'  ## get free API key at https://scrapeops.io/app/register

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'google_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

RETRY_TIMES = 5

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'google_scraper.middlewares.GoogleScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'google_scraper.middlewares.GoogleScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'google_scraper.pipelines.GoogleScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
--------------------------------------------------------------------------------
/google_scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
--------------------------------------------------------------------------------
/google_scraper/spiders/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/spiders/__pycache__/__init__.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/spiders/__pycache__/google.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/spiders/__pycache__/google.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/spiders/google.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_API_KEY'  ## Insert your Scraper API key here. Sign up for a free trial with 5,000 requests: https://www.scraperapi.com/signup


def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


def create_google_url(query, site=''):
    # Build the Google search URL for a query, optionally restricted to a single site.
    google_dict = {'q': query, 'num': 100}
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
    return 'http://www.google.com/search?' + urlencode(google_dict)


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 5,
                       'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']  ## Enter keywords here ['keyword1', 'keyword2', 'etc']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        # Scraper API's autoparse option returns the SERP as JSON.
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = google_scraper.settings

[deploy]
#url = http://localhost:6800/
project = google_scraper

--------------------------------------------------------------------------------
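
A final note on `/google_scraper/spiders/google.py` above: the URL helpers can be exercised on their own before running a full crawl, which is useful for checking the query URL that will be sent to Scraper API. A minimal sketch, assuming it is run from the project root (this script is not part of the repository):

```
# Quick sanity check of the URL helpers in google_scraper/spiders/google.py.
# API_KEY can stay as the placeholder for this check.
from google_scraper.spiders.google import create_google_url, get_url

google_url = create_google_url('scrapy', site='https://scrapy.org')
print(google_url)           # http://www.google.com/search?q=scrapy&num=100&as_sitesearch=scrapy.org
print(get_url(google_url))  # the same URL wrapped in the Scraper API proxy endpoint
```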