├── README.md
├── html-test.html
├── scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── thorough_spider.py
├── scrapy.cfg
└── test_spider.py
/README.md:
--------------------------------------------------------------------------------
1 | # email-scraper
2 | A general-purpose utility written in Python (v3.0+) for crawling websites to extract email addresses.
3 |
4 | ## Overview
5 | I implemented this using the popular Python web-crawling framework **Scrapy**. I had never used it before, so this is probably not the most elegant implementation of a Scrapy-based email scraper (say that three times fast!). The project consists of a single spider, ***ThoroughSpider***, which takes 'domain' as an argument and begins crawling there. Two optional arguments add further tuning capability (a direct-instantiation example follows this list):
6 | * subdomain_exclusions - optional list of subdomains to exclude from the crawl
7 | * crawl_js - optional boolean [default=False], whether or not to follow links to javascript files and also search them for urls
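
For reference, the spider can also be instantiated directly, which is how the unit test in test_spider.py drives it; the domain and exclusion values below are just placeholders:

```
# direct instantiation, mirroring test_spider.py; the argument values are placeholders
from scraper.spiders.thorough_spider import ThoroughSpider

spider = ThoroughSpider(domain="your.domain.name",
                        subdomain_exclusions=['blog', 'technology'],
                        crawl_js=True)
```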
8 |
9 | The **crawl_js** parameter is needed only when a "perfect storm" of conditions exists:
10 |
11 | 1. there is no sitemap.xml available
12 |
13 | 2. there are clickable menu items that are **not** contained in ```<a>``` tags
14 |
15 | 3. those menu items execute javascript code that loads pages but the urls are in the .js file itself
16 |
17 | Normally such links would not be followed and scraped for further urls. However, if a single-page AngularJS-based site with no sitemap is crawled and the menu links live in ```<span>``` elements rather than ```<a>``` tags, the pages will not be discovered unless the ng-app's javascript is crawled as well to extract the destinations of the menu items.
18 |
19 | A possible workaround to parsing the .js files would be to use Selenium WebDriver to automate the crawl. I decided against this because Selenium needs to know how to find the menu using a CSS selector, class name, etc., and these are specific to the site itself. A Selenium-based solution would therefore not be as general-purpose, but for an AngularJS-based menu, something like the following would work:
20 | ```
21 | from selenium import webdriver
22 | driver = webdriver.Firefox()
23 | driver.get("http://angularjs-site.com")
24 | menuElement = driver.find_element_by_class_name("trigger")
25 | menuElement.click()
26 | driver.find_elements_by_xpath('//li')[2].click()  # for example, click the third item in the discovered menu list
27 | ```
28 |
29 | For this solution I opted to design the crawler ***without*** Selenium, which means occasionally crawling JS files to root out further links.
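
Instead, when ```crawl_js``` is enabled the spider simply pulls candidate relative urls out of the raw .js text with a regex rather than executing the script. A minimal sketch of that extraction, using the relative-url pattern that ```ThoroughSpider.parse``` applies via ```selector.re()``` and a made-up snippet of js source:

```
import re

# relative urls hidden inside double quotes in js source
RELATIVE_URL_RE = re.compile(r'"(/[-\w/._#?]+?)"')

js_source = '{ templateUrl: "/partials/contact.html", api: "/api/v1/users" }'  # made-up example
print(RELATIVE_URL_RE.findall(js_source))
# -> ['/partials/contact.html', '/api/v1/users']
```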
30 |
31 | ### Implemented classes:
32 | * ```ThoroughSpider```: the spider itself
33 | * ```DeDupePipeline```: simple de-duplicator so that email addresses are only printed out once even if they are discovered multiple times
34 | * ```SubdomainBlockerMiddleware```: drops requests for any subdomain on the exclusion list so the crawl skips them
35 | * ```EmailAddressItem```: holds the email addresses as scrapy.Items. Allows the scrapy framework to output items into a number of
36 | convenient formats (csv, json, etc.) without additional dev work on my end
37 |
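The middleware and pipeline are registered with Scrapy in ```scraper/settings.py```:

```
# excerpt from scraper/settings.py
DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SubdomainBlockerMiddleware': 100,
}

ITEM_PIPELINES = {
    'scraper.pipelines.DeDupePipeline': 100,
}
```
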
38 | ### Required Modules
39 | This project depends on the following Python modules:
40 | * scrapy
41 | * urllib.parse (part of the Python 3 standard library, so no separate install is required)
42 |
43 | ### How to Run
44 | Since it is Scrapy-based, it must be invoked in the standard "scrapy way":
45 | ```
46 | scrapy crawl spider -a domain="your.domain.name" -o emails-found.csv
47 | ```
48 | Or with optional command line arguments like:
49 | ```
50 | scrapy crawl spider -a domain="your.domain.name" -a subdomain_exclusions="['blog', 'technology']" -a crawl_js=True -o emails-found.csv
51 | ```
52 |
53 |
54 | ## Testing
55 | I tested this mainly against my own ***incomplete*** blog www.quackquaponics.com and personal website www.tpetz.com, where I hid email addresses around the sites and also hid links to various JS files to test how well the parsing worked. There is also a simple unit test for the ThoroughSpider.parse method in test_spider.py. It uses a static file, html-test.html, which can be extended with various hard-to-find email addresses to see whether they are discovered. To run the unit test, do
56 | ```
57 | python3 test_spider.py
58 | ```
59 |
60 |
61 | ## Known Issues
62 | 1. The email regex used to find addresses is not RFC 822 compliant. That is easily fixed if needed, but in practice most real-world validators are not RFC 822 compliant either, and the regex used here is *much* simpler than a fully compliant one (see the short demonstration below).
63 |
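To make the gap concrete, here is a small self-contained check against the exact pattern the spider uses; the quoted-local-part address is valid under RFC 822 but is skipped entirely:

```
import re

# the pattern used in ThoroughSpider.parse
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

assert EMAIL_RE.findall('contact bob@bob.com for details') == ['bob@bob.com']
# RFC 822 permits quoted local parts, which this simpler pattern does not match at all
assert EMAIL_RE.findall('"quoted local"@example.com') == []
```
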
64 | ## Future Work
65 | * Add PDF-parsing support via a PDF -> text transform library to extract email addresses from PDFs (e.g. whitepapers, resumes, etc.); a rough sketch of the idea follows below
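
A rough sketch of how that could work, assuming a PDF-to-text dependency such as ```pdfminer.six``` were added; nothing below exists in the project yet and the helper name is made up:

```
# hypothetical future helper -- not part of the project; assumes pdfminer.six is installed
import re
from pdfminer.high_level import extract_text

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

def emails_from_pdf(path):
    # convert the PDF to plain text, then reuse the spider's email pattern
    return set(EMAIL_RE.findall(extract_text(path)))
```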
--------------------------------------------------------------------------------
/html-test.html:
--------------------------------------------------------------------------------
1 |
2 | bob@bob.com
3 |
6 |
--------------------------------------------------------------------------------
/scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apetz/email-scraper/e56741fa4dc865c71b561665fe556bc52554883b/scraper/__init__.py
--------------------------------------------------------------------------------
/scraper/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # http://doc.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class EmailAddressItem(scrapy.Item):
12 | email_address = scrapy.Field()
13 |
--------------------------------------------------------------------------------
/scraper/middlewares.py:
--------------------------------------------------------------------------------
1 | # very basic subdomain-blocking middleware that intercepts requests and drops them if the url is for
2 | # a subdomain that is on the blocked list. This is inspired by various blog.*.com subdomains which,
3 | # if crawled, result in an explosion of urls due to the many pages and embedded links
4 | # within the blogs
5 |
6 | import logging
7 | import re
8 | from scrapy.exceptions import IgnoreRequest
9 |
10 |
11 | class SubdomainBlockerMiddleware:
12 | name = "SubdomainBlockerMiddleware"
13 |
14 | def process_request(self, request, spider):
15 | # grab subdomain if it exists and check if it's on the exclusion list
16 |         subdomain = re.match(r'(?:https?://)?([-\w]+)\.', request.url).group(1)
17 | if subdomain in spider.subdomain_exclusions:
18 | logging.warning("Dropped request for excluded subdomain {}".format(subdomain))
19 | raise IgnoreRequest()
20 | else:
21 | return None
22 |
--------------------------------------------------------------------------------
/scraper/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 |
8 | # based on the Scrapy Tutorial on item pipelines
9 | from scrapy.exceptions import DropItem
10 |
11 |
12 | class DeDupePipeline(object):
13 | def __init__(self):
14 | self.email_addresses_seen = set()
15 |
16 | def process_item(self, item, spider):
17 | if item['email_address'] in self.email_addresses_seen:
18 | raise DropItem("Duplicate item found: %s" % item)
19 | else:
20 | self.email_addresses_seen.add(item['email_address'])
21 | return item
--------------------------------------------------------------------------------
/scraper/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Scrapy settings for scraper project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | # http://doc.scrapy.org/en/latest/topics/settings.html
9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
11 |
12 | BOT_NAME = 'scraper'
13 |
14 | SPIDER_MODULES = ['scraper.spiders']
15 | NEWSPIDER_MODULE = 'scraper.spiders'
16 |
17 |
18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 | #USER_AGENT = 'scraper (+http://www.yourdomain.com)'
20 |
21 | # Obey robots.txt rules
22 | ROBOTSTXT_OBEY = True
23 |
24 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 | #CONCURRENT_REQUESTS = 32
26 |
27 | # Configure a delay for requests for the same website (default: 0)
28 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
29 | # See also autothrottle settings and docs
30 | #DOWNLOAD_DELAY = 3
31 | # The download delay setting will honor only one of:
32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 | #CONCURRENT_REQUESTS_PER_IP = 16
34 |
35 | # Disable cookies (enabled by default)
36 | #COOKIES_ENABLED = False
37 |
38 | # Disable Telnet Console (enabled by default)
39 | #TELNETCONSOLE_ENABLED = False
40 |
41 | # Override the default request headers:
42 | #DEFAULT_REQUEST_HEADERS = {
43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
44 | # 'Accept-Language': 'en',
45 | #}
46 |
47 | # Enable or disable spider middlewares
48 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
49 | #SPIDER_MIDDLEWARES = {
50 | # 'scraper.middlewares.MyCustomSpiderMiddleware': 543,
51 | #}
52 |
53 | # Enable or disable downloader middlewares
54 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
55 | DOWNLOADER_MIDDLEWARES = {
56 | 'scraper.middlewares.SubdomainBlockerMiddleware': 100,
57 | }
58 |
59 | # Enable or disable extensions
60 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
61 | #EXTENSIONS = {
62 | # 'scrapy.extensions.telnet.TelnetConsole': None,
63 | #}
64 |
65 | # Configure item pipelines
66 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
67 | ITEM_PIPELINES = {
68 | 'scraper.pipelines.DeDupePipeline': 100,
69 | }
70 |
71 | # Enable and configure the AutoThrottle extension (disabled by default)
72 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
73 | #AUTOTHROTTLE_ENABLED = True
74 | # The initial download delay
75 | #AUTOTHROTTLE_START_DELAY = 5
76 | # The maximum download delay to be set in case of high latencies
77 | #AUTOTHROTTLE_MAX_DELAY = 60
78 | # The average number of requests Scrapy should be sending in parallel to
79 | # each remote server
80 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 | # Enable showing throttling stats for every response received:
82 | #AUTOTHROTTLE_DEBUG = False
83 |
84 | # Enable and configure HTTP caching (disabled by default)
85 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 | #HTTPCACHE_ENABLED = True
87 | #HTTPCACHE_EXPIRATION_SECS = 0
88 | #HTTPCACHE_DIR = 'httpcache'
89 | #HTTPCACHE_IGNORE_HTTP_CODES = []
90 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 |
--------------------------------------------------------------------------------
/scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------
/scraper/spiders/thorough_spider.py:
--------------------------------------------------------------------------------
1 | # implementation of the thorough spider
2 |
3 | import re
4 | from urllib.parse import urljoin, urlparse
5 | import scrapy
6 | from scrapy.linkextractors import IGNORED_EXTENSIONS
7 | from scraper.items import EmailAddressItem
8 |
9 | # scrapy.linkextractors has a good list of binary extensions, only slight tweaks needed
10 | IGNORED_EXTENSIONS.extend(['ico', 'tgz', 'gz', 'bz2'])
11 |
12 |
13 | def get_extension_ignore_url_params(url):
14 | path = urlparse(url).path # conveniently lops off all params leaving just the path
15 |     extension = re.search(r'\.([a-zA-Z0-9]+$)', path)
16 | if extension is not None:
17 | return extension.group(1)
18 | else:
19 | return "none" # don't want to return NoneType, it will break comparisons later
20 |
21 |
22 | class ThoroughSpider(scrapy.Spider):
23 | name = "spider"
24 |
25 |     def __init__(self, domain=None, subdomain_exclusions=None, crawl_js=False):
26 |         self.allowed_domains = [domain]
27 |         start_url = "http://" + domain
28 |         self.start_urls = [start_url]
29 |         # list arguments arrive from the command line as a string like "['blog', 'technology']",
30 |         # so pull the subdomain names back out before the middleware compares against them
31 |         if isinstance(subdomain_exclusions, str):
32 |             subdomain_exclusions = re.findall(r'[-\w]+', subdomain_exclusions)
33 |         self.subdomain_exclusions = subdomain_exclusions or []
34 |         # boolean command line parameters are not converted from strings automatically,
35 |         # so accept the common truthy spellings for crawl_js
36 |         self.crawl_js = str(crawl_js).lower() in ['true', 't', 'yes', 'y', '1']
37 |
38 | def parse(self, response):
39 | # print("Parsing ", response.url)
40 | all_urls = set()
41 |
42 |         # use xpath selectors to find all the links; this proved to be more effective than using
43 |         # the scrapy-provided LinkExtractor during testing
44 | selector = scrapy.Selector(response)
45 |
46 | # grab all hrefs from the page
47 | # print(selector.xpath('//a/@href').extract())
48 | all_urls.update(selector.xpath('//a/@href').extract())
49 |         # also grab all sources; this will yield a bunch of binary files which we will filter out
50 |         # below, but it has the useful property that it also grabs links to javascript files,
51 |         # which we need to scrape for urls to uncover js code that yields up urls when
52 |         # executed! An alternative here would be to drive the scraper via selenium to execute the js
53 |         # as we go, but this seems slightly simpler
54 | all_urls.update(selector.xpath('//@src').extract())
55 |
56 |         # custom regex that works on javascript files to extract relative urls hidden in quotes.
57 |         # This is a workaround for sites that need js executed in order to follow links -- aka
58 |         # single-page angularJS type designs whose clickable menu items are not rendered as
59 |         # <a> elements but rather as clickable <span> elements - e.g. jana.com
60 |         all_urls.update(selector.re(r'"(/[-\w/._#?]+?)"'))
61 |
62 |         for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'):
63 | item = EmailAddressItem()
64 | item['email_address'] = found_address
65 | yield item
66 |
67 | for url in all_urls:
68 | # ignore commonly ignored binary extensions - might want to put PDFs back in list and
69 | # parse with a pdf->txt extraction library to strip emails from whitepapers, resumes,
70 | # etc.
71 | extension = get_extension_ignore_url_params(url)
72 | if extension in IGNORED_EXTENSIONS:
73 | continue
74 | # convert all relative paths to absolute paths
75 |             if not url.startswith('http'):
76 |                 url = urljoin(response.url, url)
77 |             if extension.lower() != 'js' or self.crawl_js:
78 |                 yield scrapy.Request(url, callback=self.parse)
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html
5 |
6 | [settings]
7 | default = scraper.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = scraper
12 |
--------------------------------------------------------------------------------
/test_spider.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | from scrapy.http import Request, TextResponse
3 | from scraper.spiders import thorough_spider
4 |
5 |
6 | class SpiderTest(unittest.TestCase):
7 |
8 | def test_simple_parse(self):
9 | url = "http://test-url1.com"
10 | request = Request(url=url)
11 | file_content = open("html-test.html").read()
12 | response = TextResponse(url=url, request=request, body=file_content, encoding='utf-8')
13 | spider = thorough_spider.ThoroughSpider("test-url1.com")
14 | items = list(spider.parse(response))
15 | self.assertEqual(2, len(items), "Should have length = 2")
16 | self.assertEqual('bob@bob.com', (items[0]['email_address']), "should be bob@bob.com")
17 |
18 | if __name__ == '__main__':
19 | unittest.main()
--------------------------------------------------------------------------------