├── README.md
├── google_scraper
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-37.pyc
│       │   └── google.cpython-37.pyc
│       └── google.py
└── scrapy.cfg

/README.md:
--------------------------------------------------------------------------------
# google-scraper-python-scrapy

A Python Scrapy spider that searches Google for a particular keyword and extracts all data from the SERP results. The spider iterates through every results page returned by the keyword query. These are the fields the spider scrapes from the Google SERP page:

* Title
* Link
* Related links
* Description
* Snippet
* Images
* Thumbnails
* Sources

This Google SERP spider uses Scraper API as the proxy solution. Scraper API has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, and it can easily be scaled up to millions of pages per month if need be.

Scraper API also offers auto-parsing functionality for Google free of charge, so by adding `&autoparse=true` to the request the API returns all the SERP data in JSON format.

To monitor the scraper, this project uses [ScrapeOps](https://scrapeops.io/). **Live demo here:** [ScrapeOps Demo](https://scrapeops.io/app/login/demo)

![ScrapeOps Dashboard](https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png)

This spider can easily be customised for your particular search requirements. In this case, it allows you to refine your search queries by specifying a keyword, the geographic region, the language, the number of results, results from a particular domain, or even to only return safe results.

The full tutorial can be found here: [Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider)](https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4lol-temp-slug-8957520?preview=f73a488815c3cc75236c79ea4bfadbe21121c6edd4d54095bac81832859c6be9464bc9d34bcc32f2f82792d4e97af36ef1836db8b3d20a1009ddf5d1)

## Using the Google Spider

Make sure Scrapy is installed:

```
pip install scrapy
```

Set the keywords you want to search in Google in the spider's `queries` list:

```
queries = ['scrapy', 'beautifulsoup']
```

### Setting Up ScraperAPI
Sign up for [Scraper API](https://www.scraperapi.com/signup) and get your free API key, which allows you to scrape 1,000 pages per month for free. Enter your API key into the `API_KEY` variable:

```
API_KEY = 'YOUR_API_KEY'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
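
The keywords from `queries` are converted into Google search URLs by the `create_google_url` helper in `google_scraper/spiders/google.py` before being wrapped by `get_url`. The original helper only sets the query (`q`), the number of results (`num`) and, optionally, a domain restriction (`as_sitesearch`). If you want the other refinements mentioned above (language, region, safe search), it can be extended with the corresponding Google query parameters — the sketch below is illustrative, and the `hl`, `gl` and `safe` parameters are additions that are not part of the original spider:

```
from urllib.parse import urlencode, urlparse

def create_google_url(query, site='', language='en', country='us', num_results=100, safe_search=False):
    """Build a Google search URL. Only q, num and as_sitesearch are handled by the
    original helper; hl, gl and safe are illustrative refinement parameters."""
    google_dict = {'q': query, 'num': num_results, 'hl': language, 'gl': country}
    if safe_search:
        google_dict['safe'] = 'active'  # only return safe results
    if site:
        # restrict results to a particular domain, e.g. site='https://scrapy.org'
        google_dict['as_sitesearch'] = urlparse(site).netloc
    return 'http://www.google.com/search?' + urlencode(google_dict)
```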

By default, the spider is set to a maximum of 5 concurrent requests, as this is the maximum concurrency allowed on Scraper API's free plan. If you have a plan with higher concurrency, make sure to increase the max concurrency in the spider's `custom_settings` dictionary.

```
custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                   'CONCURRENT_REQUESTS_PER_DOMAIN': 5,
                   'RETRY_TIMES': 5}
```

`RETRY_TIMES` is also set (to 5 here) so that Scrapy retries any failed requests. Make sure that `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` aren't enabled, as these will lower your concurrency and are not needed with Scraper API.

```
## settings.py

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
```

### Integrating ScrapeOps
[ScrapeOps](https://scrapeops.io/) is already integrated into the scraper via the `settings.py` file. However, to use it you must:

Install the [ScrapeOps Scrapy SDK](https://github.com/ScrapeOps/scrapeops-scrapy-sdk) on your machine.

```
pip install scrapeops-scrapy
```

And sign up for a [free ScrapeOps account here](https://scrapeops.io/app/register) so you can insert your **API Key** into the `settings.py` file:

```
## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
```

From there, your scraping stats will be automatically logged and shipped to your dashboard.

### Running The Spider
To run the spider, use:

```
scrapy crawl google -o test.csv
```
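
This writes the fields yielded by the spider (`title`, `snippet`, `link`, `position`, `date`) to `test.csv`. If you prefer to launch the crawl from a script rather than the CLI, Scrapy's `CrawlerProcess` can be used; the snippet below is a minimal sketch (the `run.py` filename is illustrative and the script is not part of this repository):

```
# run.py -- optional alternative to the `scrapy crawl` command (illustrative sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from google_scraper.spiders.google import GoogleSpider

if __name__ == '__main__':
    # get_project_settings() picks up settings.py (ScrapeOps extension, retries, etc.)
    process = CrawlerProcess(get_project_settings())
    process.crawl(GoogleSpider)
    process.start()  # blocks until the crawl finishes
```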

--------------------------------------------------------------------------------
/google_scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__init__.py
--------------------------------------------------------------------------------
/google_scraper/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__pycache__/__init__.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/__pycache__/settings.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/__pycache__/settings.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/items.py:
--------------------------------------------------------------------------------
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GoogleScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
--------------------------------------------------------------------------------
/google_scraper/middlewares.py:
--------------------------------------------------------------------------------
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class GoogleScraperSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class GoogleScraperDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
--------------------------------------------------------------------------------
/google_scraper/pipelines.py:
--------------------------------------------------------------------------------
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class GoogleScraperPipeline:
    def process_item(self, item, spider):
        return item
--------------------------------------------------------------------------------
/google_scraper/settings.py:
--------------------------------------------------------------------------------
# Scrapy settings for google_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'google_scraper'

SPIDER_MODULES = ['google_scraper.spiders']
NEWSPIDER_MODULE = 'google_scraper.spiders'

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'  ## get free API key at https://scrapeops.io/app/register

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'google_scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

RETRY_TIMES = 5

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'google_scraper.middlewares.GoogleScraperSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'google_scraper.middlewares.GoogleScraperDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'google_scraper.pipelines.GoogleScraperPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
--------------------------------------------------------------------------------
/google_scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
--------------------------------------------------------------------------------
/google_scraper/spiders/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/spiders/__pycache__/__init__.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/spiders/__pycache__/google.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ian-kerins/google-scraper-python-scrapy/7ed48cb95f1907d7fccf871a8cb6619d65589f38/google_scraper/spiders/__pycache__/google.cpython-37.pyc
--------------------------------------------------------------------------------
/google_scraper/spiders/google.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_API_KEY'  ## Insert your Scraper API key here. Sign up for a free trial with 5,000 requests: https://www.scraperapi.com/signup


def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


def create_google_url(query, site=''):
    # Build the Google search URL for a query, optionally restricted to a single site.
    google_dict = {'q': query, 'num': 100}
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
    return 'http://www.google.com/search?' + urlencode(google_dict)


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 5,
                       'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']  ## Enter keywords here ['keyword1', 'keyword2', 'etc']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        # Scraper API's autoparse option returns the SERP as JSON.
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = google_scraper.settings

[deploy]
#url = http://localhost:6800/
project = google_scraper

--------------------------------------------------------------------------------
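
A final note on `/google_scraper/spiders/google.py` above: the URL helpers can be exercised on their own before running a full crawl, which is useful for checking the query URL that will be sent to Scraper API. A minimal sketch, assuming it is run from the project root (this script is not part of the repository):

```
# Quick sanity check of the URL helpers in google_scraper/spiders/google.py.
# API_KEY can stay as the placeholder for this check.
from google_scraper.spiders.google import create_google_url, get_url

google_url = create_google_url('scrapy', site='https://scrapy.org')
print(google_url)           # http://www.google.com/search?q=scrapy&num=100&as_sitesearch=scrapy.org
print(get_url(google_url))  # the same URL wrapped in the Scraper API proxy endpoint
```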