├── README.md ├── input.csv ├── requirements.txt ├── scrapy.cfg └── zoominfo ├── __init__.py ├── __pycache__ ├── __init__.cpython-36.pyc ├── pipelines.cpython-36.pyc └── settings.cpython-36.pyc ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders ├── __init__.py ├── __pycache__ ├── __init__.cpython-36.pyc └── zoominfo_scraper.cpython-36.pyc └── zoominfo_scraper.py /README.md: -------------------------------------------------------------------------------- 1 | # zoominfo-scraper 2 | A simple scraper that collects data from the https://www.zoominfo.com/ website about the companies 3 | specified in the `input.csv` file. 4 | 5 | ## About 6 | The scraper collects the following data about companies: 7 | 8 | * Headquarters 9 | * Phone 10 | * Revenue 11 | * Number of employees 12 | * Website 13 | 14 | As a result, it generates a separate CSV file for every company from the input list. These files are 15 | located in the `output` folder. Each file is named after the company - `{company_name}.csv`. 16 | For example - `Amazon.csv`, `Google.csv`, etc. 17 | 18 | ## Technologies 19 | * Scrapy 2.3.0 20 | * scrapy-rotating-proxies 0.6.2 21 | * rotating-free-proxies 0.1.2 22 | 23 | ## How to install and run 24 | 1. Clone the repo: `git clone https://github.com/dfesenko/zoominfo_scraper.git`. 25 | Go inside the `zoominfo_scraper` folder: `cd zoominfo_scraper`. 26 | 2. Create a virtual environment: `python -m venv venv`. 27 | 3. Activate the virtual environment: `source venv/bin/activate`. 28 | 4. Install the dependencies into the virtual environment: 29 | `pip install -r requirements.txt`. 30 | 5. Change directory: `cd zoominfo`. 31 | 6. Create or edit the `input.csv` file. List there the names of the companies you want 32 | to scrape (one name per line). 33 | 7. Issue the following command: `scrapy crawl zoominfo`. 34 | 8. The scraper should now start. An `output` directory should 35 | appear in the current directory, and the script populates it with files (one CSV file per company). 36 | 9. If you want to change some scraper parameters, you can explore the 37 | `zoominfo/settings.py` file. 38 | 39 | ## Notes about proxies 40 | The scraper uses the `scrapy-rotating-proxies` and `rotating-free-proxies` packages to fetch the list of 41 | available free proxies and rotate them automatically. You can turn off this feature by commenting out 42 | the `DOWNLOADER_MIDDLEWARES` setting in the `settings.py` file, as shown below. While running, these libraries also 43 | create a `proxies.txt` file in the root of the project, where the fetched free proxies are stored.
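44 | 45 | For reference, these are the proxy-related settings as they appear in `zoominfo/settings.py` (the inline comments are added here for explanation; commenting out the whole `DOWNLOADER_MIDDLEWARES` block disables proxy rotation): 46 | ```python 47 | DOWNLOADER_MIDDLEWARES = { 48 | 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 610, 49 | 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 620, 50 | } 51 | 52 | ROTATING_PROXY_LIST_PATH = 'proxies.txt' # where the fetched free proxies are stored 53 | NUMBER_OF_PROXIES_TO_FETCH = 10 # how many free proxies to fetch 54 | ``` 55 | 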
-------------------------------------------------------------------------------- /input.csv: -------------------------------------------------------------------------------- 1 | Amazon 2 | Google 3 | Facebook 4 | Airbnb 5 | Uber 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | attrs==20.1.0 2 | Automat==20.2.0 3 | beautifulsoup4==4.8.2 4 | certifi==2020.6.20 5 | cffi==1.14.2 6 | chardet==3.0.4 7 | constantly==15.1.0 8 | cryptography==3.1 9 | cssselect==1.1.0 10 | hyperlink==20.0.1 11 | idna==2.10 12 | incremental==17.5.0 13 | itemadapter==0.1.0 14 | itemloaders==1.0.2 15 | jmespath==0.10.0 16 | lxml==4.5.2 17 | parsel==1.6.0 18 | Protego==0.1.16 19 | pyasn1==0.4.8 20 | pyasn1-modules==0.2.8 21 | pycparser==2.20 22 | PyDispatcher==2.0.5 23 | PyHamcrest==2.0.2 24 | pyOpenSSL==19.1.0 25 | queuelib==1.5.0 26 | requests==2.24.0 27 | rotating-free-proxies==0.1.2 28 | Scrapy==2.3.0 29 | scrapy-rotating-proxies==0.6.2 30 | service-identity==18.1.0 31 | six==1.15.0 32 | soupsieve==2.0.1 33 | Twisted==20.3.0 34 | typing==3.7.4.3 35 | urllib3==1.25.10 36 | w3lib==1.22.0 37 | zope.interface==5.1.0 38 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = zoominfo.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = zoominfo 12 | -------------------------------------------------------------------------------- /zoominfo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__init__.py -------------------------------------------------------------------------------- /zoominfo/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/__pycache__/pipelines.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/pipelines.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/__pycache__/settings.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/settings.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import scrapy 7 | 8 | 9 | class ZoominfoItem(scrapy.Item): 10 | # define the fields for your item here like: 11 | # 
name = scrapy.Field() 12 | pass 13 | -------------------------------------------------------------------------------- /zoominfo/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class ZoominfoSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class ZoominfoDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /zoominfo/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | from itemadapter import ItemAdapter 8 | from scrapy.exporters import CsvItemExporter 9 | 10 | import os 11 | 12 | 13 | class ZoominfoPipeline: 14 | """Stores items in multiple CSV files according to the names of companies""" 15 | 16 | def open_spider(self, spider): 17 | if not os.path.exists('output'): 18 | os.makedirs('output') 19 | 20 | def process_item(self, item, spider): 21 | adapter = ItemAdapter(item) 22 | with open(f'output/{adapter["company"]}.csv', 'wb') as f: 23 | exporter = CsvItemExporter(f, fields_to_export=['headquarters', 'phone', 'revenue', 'employees_num', 'website']) 24 | exporter.start_exporting() 25 | exporter.export_item(item) 26 | exporter.finish_exporting() 27 | return item 28 | 29 | -------------------------------------------------------------------------------- /zoominfo/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for zoominfo project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'zoominfo' 11 | 12 | SPIDER_MODULES = ['zoominfo.spiders'] 13 | NEWSPIDER_MODULE = 'zoominfo.spiders' 14 | 15 | 16 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 | #USER_AGENT = 'zoominfo (+http://www.yourdomain.com)' 18 | USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36' 19 | 20 | # Obey robots.txt rules 21 | ROBOTSTXT_OBEY = False 22 | 23 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 24 | #CONCURRENT_REQUESTS = 32 25 | 26 | # Configure a delay for requests for the same website (default: 0) 27 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 28 | # See also autothrottle settings and docs 29 | DOWNLOAD_DELAY = 0.2 30 | # The download delay setting will honor only one of: 31 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 32 | #CONCURRENT_REQUESTS_PER_IP = 16 33 | 34 | # Disable cookies (enabled by default) 35 | #COOKIES_ENABLED = False 36 | 37 | # Disable Telnet Console (enabled by default) 38 | #TELNETCONSOLE_ENABLED = False 39 | 40 | # Override the default request headers: 41 | #DEFAULT_REQUEST_HEADERS = { 42 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 43 | # 'Accept-Language': 'en', 44 | #} 45 | 46 | # Enable or disable spider middlewares 47 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 48 | #SPIDER_MIDDLEWARES = { 49 | #
'zoominfo.middlewares.ZoominfoSpiderMiddleware': 543, 50 | #} 51 | 52 | # Enable or disable downloader middlewares 53 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 54 | #DOWNLOADER_MIDDLEWARES = { 55 | # 'zoominfo.middlewares.ZoominfoDownloaderMiddleware': 543, 56 | #} 57 | 58 | # DOWNLOADER_MIDDLEWARES = { 59 | # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1 60 | # } 61 | 62 | ROTATING_PROXY_LIST_PATH = 'proxies.txt' 63 | NUMBER_OF_PROXIES_TO_FETCH = 10 64 | 65 | DOWNLOADER_MIDDLEWARES = { 66 | 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 610, 67 | 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 620, 68 | } 69 | 70 | # Enable or disable extensions 71 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 72 | #EXTENSIONS = { 73 | # 'scrapy.extensions.telnet.TelnetConsole': None, 74 | #} 75 | 76 | # Configure item pipelines 77 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 78 | ITEM_PIPELINES = { 79 | 'zoominfo.pipelines.ZoominfoPipeline': 300, 80 | } 81 | 82 | # Enable and configure the AutoThrottle extension (disabled by default) 83 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 84 | #AUTOTHROTTLE_ENABLED = True 85 | # The initial download delay 86 | #AUTOTHROTTLE_START_DELAY = 5 87 | # The maximum download delay to be set in case of high latencies 88 | #AUTOTHROTTLE_MAX_DELAY = 60 89 | # The average number of requests Scrapy should be sending in parallel to 90 | # each remote server 91 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 92 | # Enable showing throttling stats for every response received: 93 | #AUTOTHROTTLE_DEBUG = False 94 | 95 | # Enable and configure HTTP caching (disabled by default) 96 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 97 | #HTTPCACHE_ENABLED = True 98 | #HTTPCACHE_EXPIRATION_SECS = 0 99 | #HTTPCACHE_DIR = 'httpcache' 100 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 101 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 102 | -------------------------------------------------------------------------------- /zoominfo/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /zoominfo/spiders/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/spiders/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/spiders/__pycache__/zoominfo_scraper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/spiders/__pycache__/zoominfo_scraper.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/spiders/zoominfo_scraper.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | 4 | class ZoominfoSpider(scrapy.Spider): 5 | name = 'zoominfo' 6 | 7 | def start_requests(self): 8 | # Read the company names from input.csv and search Google for each company's ZoomInfo overview page 9 | with open("input.csv", 'r') as input_file: 10 | for company_name in input_file: 11 | company = company_name.strip() 12 | url = f"https://www.google.com.ua/search?q={company}+zoominfo+overview" 13 | req = scrapy.Request(url=url, callback=self.parse_google_results, cb_kwargs={"company": company}) 14 | # the Google search request is made with no proxy assigned 15 | req.meta['proxy'] = None 16 | yield req 17 | 18 | def parse_google_results(self, response, **kwargs): 19 | # Follow the first link that points to a ZoomInfo company page 20 | all_links = response.css("a::attr(href)").getall() 21 | zoomlinks = [link for link in all_links if "www.zoominfo.com/c/" in link] 22 | if not zoomlinks: 23 | self.logger.warning("No ZoomInfo link found for %s", kwargs["company"]) 24 | return 25 | yield scrapy.Request(url=zoomlinks[0], callback=self.parse, cb_kwargs=kwargs) 26 | 27 | def parse(self, response, **kwargs): 28 | # Extract the company details from the ZoomInfo company overview page 29 | yield { 30 | 'company': kwargs['company'], 31 | 'headquarters': response.xpath("//h3[text()='Headquarters:']/following-sibling::div/span/text()").get(), 32 | 'phone': response.xpath("//h3[text()='Phone:']/following-sibling::div/span/text()").get(), 33 | 'revenue': response.xpath("//h3[text()='Revenue:']/following-sibling::div/span/text()").get(), 34 | 'employees_num': response.xpath("//h3[text()='Employees:']/following-sibling::div/span/text()").get(), 35 | 'website': response.xpath("//h3[text()='Website:']/following-sibling::a/text()").get() 36 | } 37 | --------------------------------------------------------------------------------