├── README.md ├── input.csv ├── requirements.txt ├── scrapy.cfg └── zoominfo ├── __init__.py ├── __pycache__ ├── __init__.cpython-36.pyc ├── pipelines.cpython-36.pyc └── settings.cpython-36.pyc ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders ├── __init__.py ├── __pycache__ ├── __init__.cpython-36.pyc └── zoominfo_scraper.cpython-36.pyc └── zoominfo_scraper.py /README.md: -------------------------------------------------------------------------------- 1 | # zoominfo-scraper 2 | A simple scraper that collects data from the https://www.zoominfo.com/ website about the companies 3 | specified in the `input.csv` file. 4 | 5 | ## About 6 | The scraper collects the following data about companies: 7 | 8 | * Headquarters 9 | * Phone 10 | * Revenue 11 | * Number of employees 12 | * Website 13 | 14 | As a result, it generates a separate CSV file for every company from the input list. These files are 15 | located in the `output` folder. Each file is named after the company - `{company_name}.csv`. 16 | For example - `Amazon.csv`, `Google.csv`, etc. 17 | 18 | ## Technologies 19 | * Scrapy 2.3.0 20 | * scrapy-rotating-proxies 0.6.2 21 | * rotating-free-proxies 0.1.2 22 | 23 | ## How to install and run 24 | 1. Clone the repo: `git clone https://github.com/dfesenko/zoominfo_scraper.git`. 25 | Go inside the `zoominfo_scraper` folder: `cd zoominfo_scraper`. 26 | 2. Create a virtual environment: `python -m venv venv`. 27 | 3. Activate the virtual environment: `source venv/bin/activate`. 28 | 4. Install the dependencies into the virtual environment: 29 | `pip install -r requirements.txt`. 30 | 5. Change directory: `cd zoominfo`. 31 | 6. Create or edit the `input.csv` file. List there the names of the companies you want 32 | to scrape (one name per line). 33 | 7. Issue the following command: `scrapy crawl zoominfo`. 34 | 8. The scraper should now start. An `output` directory should 35 | appear in the current directory, and the script populates it with files (one CSV file per company). 36 | 9. If you want to change some scraper parameters, you can explore the 37 | `zoominfo/settings.py` file. 38 | 39 | ## Notes about proxies 40 | The scraper uses the `scrapy-rotating-proxies` and `rotating-free-proxies` packages to fetch the list of 41 | available free proxies and rotate them automatically. You can turn off this feature by commenting out 42 | the `DOWNLOADER_MIDDLEWARES` setting in the `settings.py` file, as shown below. While running, these libraries also 43 | create a `proxies.txt` file in the root of the project, where the fetched free proxies are stored.
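44 | 45 | For reference, these are the proxy-related settings as they appear in `zoominfo/settings.py` (the inline comments are added here for explanation; commenting out the whole `DOWNLOADER_MIDDLEWARES` block disables proxy rotation): 46 | ```python 47 | DOWNLOADER_MIDDLEWARES = { 48 | 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 610, 49 | 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 620, 50 | } 51 | 52 | ROTATING_PROXY_LIST_PATH = 'proxies.txt' # where the fetched free proxies are stored 53 | NUMBER_OF_PROXIES_TO_FETCH = 10 # how many free proxies to fetch 54 | ``` 55 | 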
-------------------------------------------------------------------------------- /input.csv: -------------------------------------------------------------------------------- 1 | Amazon 2 | Google 3 | Facebook 4 | Airbnb 5 | Uber 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | attrs==20.1.0 2 | Automat==20.2.0 3 | beautifulsoup4==4.8.2 4 | certifi==2020.6.20 5 | cffi==1.14.2 6 | chardet==3.0.4 7 | constantly==15.1.0 8 | cryptography==3.1 9 | cssselect==1.1.0 10 | hyperlink==20.0.1 11 | idna==2.10 12 | incremental==17.5.0 13 | itemadapter==0.1.0 14 | itemloaders==1.0.2 15 | jmespath==0.10.0 16 | lxml==4.5.2 17 | parsel==1.6.0 18 | Protego==0.1.16 19 | pyasn1==0.4.8 20 | pyasn1-modules==0.2.8 21 | pycparser==2.20 22 | PyDispatcher==2.0.5 23 | PyHamcrest==2.0.2 24 | pyOpenSSL==19.1.0 25 | queuelib==1.5.0 26 | requests==2.24.0 27 | rotating-free-proxies==0.1.2 28 | Scrapy==2.3.0 29 | scrapy-rotating-proxies==0.6.2 30 | service-identity==18.1.0 31 | six==1.15.0 32 | soupsieve==2.0.1 33 | Twisted==20.3.0 34 | typing==3.7.4.3 35 | urllib3==1.25.10 36 | w3lib==1.22.0 37 | zope.interface==5.1.0 38 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = zoominfo.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = zoominfo 12 | -------------------------------------------------------------------------------- /zoominfo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__init__.py -------------------------------------------------------------------------------- /zoominfo/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/__pycache__/pipelines.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/pipelines.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/__pycache__/settings.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/__pycache__/settings.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import scrapy 7 | 8 | 9 | class ZoominfoItem(scrapy.Item): 10 | # define the fields for your item here like: 11 | # 
name = scrapy.Field() 12 | pass 13 | -------------------------------------------------------------------------------- /zoominfo/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class ZoominfoSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class ZoominfoDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /zoominfo/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | from itemadapter import ItemAdapter 8 | from scrapy.exporters import CsvItemExporter 9 | 10 | import os 11 | 12 | 13 | class ZoominfoPipeline: 14 | """Stores items in multiple CSV files according to the names of companies""" 15 | 16 | def open_spider(self, spider): 17 | if not os.path.exists('output'): 18 | os.makedirs('output') 19 | 20 | def process_item(self, item, spider): 21 | adapter = ItemAdapter(item) 22 | with open(f'output/{adapter["company"]}.csv', 'wb') as f: 23 | exporter = CsvItemExporter(f, fields_to_export=['headquarters', 'phone', 'revenue', 'employees_num', 'website']) 24 | exporter.start_exporting() 25 | exporter.export_item(item) 26 | exporter.finish_exporting() 27 | return item 28 | 29 | -------------------------------------------------------------------------------- /zoominfo/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for zoominfo project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'zoominfo' 11 | 12 | SPIDER_MODULES = ['zoominfo.spiders'] 13 | NEWSPIDER_MODULE = 'zoominfo.spiders' 14 | 15 | 16 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 | #USER_AGENT = 'zoominfo (+http://www.yourdomain.com)' 18 | USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36' 19 | 20 | # Obey robots.txt rules 21 | ROBOTSTXT_OBEY = False 22 | 23 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 24 | #CONCURRENT_REQUESTS = 32 25 | 26 | # Configure a delay for requests for the same website (default: 0) 27 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 28 | # See also autothrottle settings and docs 29 | DOWNLOAD_DELAY = 0.2 30 | # The download delay setting will honor only one of: 31 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 32 | #CONCURRENT_REQUESTS_PER_IP = 16 33 | 34 | # Disable cookies (enabled by default) 35 | #COOKIES_ENABLED = False 36 | 37 | # Disable Telnet Console (enabled by default) 38 | #TELNETCONSOLE_ENABLED = False 39 | 40 | # Override the default request headers: 41 | #DEFAULT_REQUEST_HEADERS = { 42 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 43 | # 'Accept-Language': 'en', 44 | #} 45 | 46 | # Enable or disable spider middlewares 47 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 48 | #SPIDER_MIDDLEWARES = { 49 | #
'zoominfo.middlewares.ZoominfoSpiderMiddleware': 543, 50 | #} 51 | 52 | # Enable or disable downloader middlewares 53 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 54 | #DOWNLOADER_MIDDLEWARES = { 55 | # 'zoominfo.middlewares.ZoominfoDownloaderMiddleware': 543, 56 | #} 57 | 58 | # DOWNLOADER_MIDDLEWARES = { 59 | # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1 60 | # } 61 | 62 | ROTATING_PROXY_LIST_PATH = 'proxies.txt' 63 | NUMBER_OF_PROXIES_TO_FETCH = 10 64 | 65 | DOWNLOADER_MIDDLEWARES = { 66 | 'rotating_free_proxies.middlewares.RotatingProxyMiddleware': 610, 67 | 'rotating_free_proxies.middlewares.BanDetectionMiddleware': 620, 68 | } 69 | 70 | # Enable or disable extensions 71 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 72 | #EXTENSIONS = { 73 | # 'scrapy.extensions.telnet.TelnetConsole': None, 74 | #} 75 | 76 | # Configure item pipelines 77 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 78 | ITEM_PIPELINES = { 79 | 'zoominfo.pipelines.ZoominfoPipeline': 300, 80 | } 81 | 82 | # Enable and configure the AutoThrottle extension (disabled by default) 83 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 84 | #AUTOTHROTTLE_ENABLED = True 85 | # The initial download delay 86 | #AUTOTHROTTLE_START_DELAY = 5 87 | # The maximum download delay to be set in case of high latencies 88 | #AUTOTHROTTLE_MAX_DELAY = 60 89 | # The average number of requests Scrapy should be sending in parallel to 90 | # each remote server 91 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 92 | # Enable showing throttling stats for every response received: 93 | #AUTOTHROTTLE_DEBUG = False 94 | 95 | # Enable and configure HTTP caching (disabled by default) 96 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 97 | #HTTPCACHE_ENABLED = True 98 | #HTTPCACHE_EXPIRATION_SECS = 0 99 | #HTTPCACHE_DIR = 'httpcache' 100 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 101 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 102 | -------------------------------------------------------------------------------- /zoominfo/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /zoominfo/spiders/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/spiders/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/spiders/__pycache__/zoominfo_scraper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dfesenko/zoominfo_scraper/d6e40fe0fa3184e1767d49de13811eb10dbf59cd/zoominfo/spiders/__pycache__/zoominfo_scraper.cpython-36.pyc -------------------------------------------------------------------------------- /zoominfo/spiders/zoominfo_scraper.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | 4 | class ZoominfoSpider(scrapy.Spider): 5 | name = 'zoominfo' 6 | 7 | def start_requests(self): 8 | # Read the company names from input.csv and search Google for each company's ZoomInfo overview page 9 | with open("input.csv", 'r') as input_file: 10 | for company_name in input_file: 11 | company = company_name.strip() 12 | url = f"https://www.google.com.ua/search?q={company}+zoominfo+overview" 13 | req = scrapy.Request(url=url, callback=self.parse_google_results, cb_kwargs={"company": company}) 14 | # the Google search request is made with no proxy assigned 15 | req.meta['proxy'] = None 16 | yield req 17 | 18 | def parse_google_results(self, response, **kwargs): 19 | # Follow the first link that points to a ZoomInfo company page 20 | all_links = response.css("a::attr(href)").getall() 21 | zoomlinks = [link for link in all_links if "www.zoominfo.com/c/" in link] 22 | if not zoomlinks: 23 | self.logger.warning("No ZoomInfo link found for %s", kwargs["company"]) 24 | return 25 | yield scrapy.Request(url=zoomlinks[0], callback=self.parse, cb_kwargs=kwargs) 26 | 27 | def parse(self, response, **kwargs): 28 | # Extract the company details from the ZoomInfo company overview page 29 | yield { 30 | 'company': kwargs['company'], 31 | 'headquarters': response.xpath("//h3[text()='Headquarters:']/following-sibling::div/span/text()").get(), 32 | 'phone': response.xpath("//h3[text()='Phone:']/following-sibling::div/span/text()").get(), 33 | 'revenue': response.xpath("//h3[text()='Revenue:']/following-sibling::div/span/text()").get(), 34 | 'employees_num': response.xpath("//h3[text()='Employees:']/following-sibling::div/span/text()").get(), 35 | 'website': response.xpath("//h3[text()='Website:']/following-sibling::a/text()").get() 36 | } 37 | --------------------------------------------------------------------------------