├── README.md
├── html-test.html
├── scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── thorough_spider.py
├── scrapy.cfg
└── test_spider.py
/README.md:
--------------------------------------------------------------------------------
1 | # email-scraper
2 | A general-purpose utility written in Python (v3.0+) for crawling websites to extract email addresses.
3 |
4 | ## Overview
5 | I implemented this using the popular Python web-crawling framework **Scrapy**. I had never used it before, so this is probably not the most elegant implementation of a Scrapy-based email scraper (say that three times fast!). The project consists of a single spider, ***ThoroughSpider***, which takes 'domain' as an argument and begins crawling there. Two optional arguments add further tuning capability (a direct-instantiation example follows this list):
6 | * subdomain_exclusions - optional list of subdomains to exclude from the crawl
7 | * crawl_js - optional boolean [default=False], whether or not to follow links to javascript files and also search them for urls
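
For reference, the spider can also be instantiated directly, which is how the unit test in test_spider.py drives it; the domain and exclusion values below are just placeholders:

```
# direct instantiation, mirroring test_spider.py; the argument values are placeholders
from scraper.spiders.thorough_spider import ThoroughSpider

spider = ThoroughSpider(domain="your.domain.name",
                        subdomain_exclusions=['blog', 'technology'],
                        crawl_js=True)
```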
8 |
9 | The **crawl_js** parameter is needed only when a "perfect storm" of conditions exists:
10 |
11 | 1. there is no sitemap.xml available
12 |
13 | 2. there are clickable menu items that are **not** contained in ```<a>``` tags
14 |
15 | 3. those menu items execute javascript code that loads pages but the urls are in the .js file itself
16 |
17 | Normally such links would not be followed and scraped for further urls. However, if a single-page AngularJS-based site with no sitemap is crawled and the menu links live in ```<span>``` elements rather than ```<a>``` tags, the pages will not be discovered unless the ng-app's javascript is crawled as well to extract the destinations of the menu items.
18 |
19 | A possible workaround to parsing the .js files would be to use Selenium WebDriver to automate the crawl. I decided against this because Selenium needs to know how to find the menu using a CSS selector, class name, etc., and these are specific to the site itself. A Selenium-based solution would therefore not be as general-purpose, but for an AngularJS-based menu, something like the following would work:
20 | ```
21 | from selenium import webdriver
22 | driver = webdriver.Firefox()
23 | driver.get("http://angularjs-site.com")
24 | menuElement = driver.find_element_by_class_name("trigger")
25 | menuElement.click()
26 | driver.find_elements_by_xpath('//li')[2].click()  # for example, click the third item in the discovered menu list
27 | ```
28 |
29 | For this solution I opted to design the crawler ***without*** Selenium, which means occasionally crawling JS files to root out further links.
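
Instead, when ```crawl_js``` is enabled the spider simply pulls candidate relative urls out of the raw .js text with a regex rather than executing the script. A minimal sketch of that extraction, using the relative-url pattern that ```ThoroughSpider.parse``` applies via ```selector.re()``` and a made-up snippet of js source:

```
import re

# relative urls hidden inside double quotes in js source
RELATIVE_URL_RE = re.compile(r'"(/[-\w/._#?]+?)"')

js_source = '{ templateUrl: "/partials/contact.html", api: "/api/v1/users" }'  # made-up example
print(RELATIVE_URL_RE.findall(js_source))
# -> ['/partials/contact.html', '/api/v1/users']
```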
30 |
31 | ### Implemented classes:
32 | * ```ThoroughSpider```: the spider itself
33 | * ```DeDupePipeline```: simple de-duplicator so that email addresses are only printed out once even if they are discovered multiple times
34 | * ```SubdomainBlockerMiddleware```: drops requests for any subdomain on the exclusion list so the crawl skips them
35 | * ```EmailAddressItem```: holds the email addresses as scrapy.Items. Allows the scrapy framework to output items into a number of
36 | convenient formats (csv, json, etc.) without additional dev work on my end
37 |
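The middleware and pipeline are registered with Scrapy in ```scraper/settings.py```:

```
# excerpt from scraper/settings.py
DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SubdomainBlockerMiddleware': 100,
}

ITEM_PIPELINES = {
    'scraper.pipelines.DeDupePipeline': 100,
}
```
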
38 | ### Required Modules
39 | This project depends on the following Python modules:
40 | * scrapy
41 | * urllib.parse (part of the Python 3 standard library, so no separate install is required)
42 |
43 | ### How to Run
44 | Since it is Scrapy-based, it must be invoked in the standard "scrapy way":
45 | ```
46 | scrapy crawl spider -a domain="your.domain.name" -o emails-found.csv
47 | ```
48 | Or with optional command line arguments like:
49 | ```
50 | scrapy crawl spider -a domain="your.domain.name" -a subdomain_exclusions="['blog', 'technology']" -a crawl_js=True -o emails-found.csv
51 | ```
52 |
53 |
54 | ## Testing
55 | I tested this mainly against my own ***incomplete*** blog www.quackquaponics.com and personal website www.tpetz.com, where I hid email addresses around the sites and also hid links to various JS files to test how well the parsing worked. There is also a simple unit test for the ThoroughSpider.parse method in test_spider.py. It uses a static file, html-test.html, which can be extended with various hard-to-find email addresses to see whether they are discovered. To run the unit test, do
56 | ```
57 | python3 test_spider.py
58 | ```
59 |
60 |
61 | ## Known Issues
62 | 1. The email regex used to find addresses is not RFC 822 compliant. That is easily fixed if needed, but in practice most real-world validators are not RFC 822 compliant either, and the regex used here is *much* simpler than a fully compliant one (see the short demonstration below).
63 |
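To make the gap concrete, here is a small self-contained check against the exact pattern the spider uses; the quoted-local-part address is valid under RFC 822 but is skipped entirely:

```
import re

# the pattern used in ThoroughSpider.parse
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

assert EMAIL_RE.findall('contact bob@bob.com for details') == ['bob@bob.com']
# RFC 822 permits quoted local parts, which this simpler pattern does not match at all
assert EMAIL_RE.findall('"quoted local"@example.com') == []
```
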
64 | ## Future Work
65 | * Add PDF-parsing support via a PDF -> text transform library to extract email addresses from PDFs (e.g. whitepapers, resumes, etc.); a rough sketch of the idea follows below
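
A rough sketch of how that could work, assuming a PDF-to-text dependency such as ```pdfminer.six``` were added; nothing below exists in the project yet and the helper name is made up:

```
# hypothetical future helper -- not part of the project; assumes pdfminer.six is installed
import re
from pdfminer.high_level import extract_text

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

def emails_from_pdf(path):
    # convert the PDF to plain text, then reuse the spider's email pattern
    return set(EMAIL_RE.findall(extract_text(path)))
```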
--------------------------------------------------------------------------------
/html-test.html:
--------------------------------------------------------------------------------
1 |
2 | bob@bob.com
3 |
6 |
--------------------------------------------------------------------------------
/scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apetz/email-scraper/e56741fa4dc865c71b561665fe556bc52554883b/scraper/__init__.py
--------------------------------------------------------------------------------
/scraper/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # http://doc.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class EmailAddressItem(scrapy.Item):
12 | email_address = scrapy.Field()
13 |
--------------------------------------------------------------------------------
/scraper/middlewares.py:
--------------------------------------------------------------------------------
1 | # very basic subdomain-blocking middleware that intercepts requests and drops them if the url is for
2 | # a subdomain that is on the blocked list. This is inspired by various blog.*.com subdomains which,
3 | # if crawled, result in an explosion of urls due to the many pages and embedded links
4 | # within the blogs
5 |
6 | import logging
7 | import re
8 | from scrapy.exceptions import IgnoreRequest
9 |
10 |
11 | class SubdomainBlockerMiddleware:
12 | name = "SubdomainBlockerMiddleware"
13 |
14 | def process_request(self, request, spider):
15 | # grab subdomain if it exists and check if it's on the exclusion list
16 |         subdomain = re.match(r'(?:https?://)?([-\w]+)\.', request.url).group(1)
17 | if subdomain in spider.subdomain_exclusions:
18 | logging.warning("Dropped request for excluded subdomain {}".format(subdomain))
19 | raise IgnoreRequest()
20 | else:
21 | return None
22 |
--------------------------------------------------------------------------------
/scraper/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 |
8 | # based on the Scrapy Tutorial on item pipelines
9 | from scrapy.exceptions import DropItem
10 |
11 |
12 | class DeDupePipeline(object):
13 | def __init__(self):
14 | self.email_addresses_seen = set()
15 |
16 | def process_item(self, item, spider):
17 | if item['email_address'] in self.email_addresses_seen:
18 | raise DropItem("Duplicate item found: %s" % item)
19 | else:
20 | self.email_addresses_seen.add(item['email_address'])
21 | return item
--------------------------------------------------------------------------------
/scraper/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Scrapy settings for scraper project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | # http://doc.scrapy.org/en/latest/topics/settings.html
9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
11 |
12 | BOT_NAME = 'scraper'
13 |
14 | SPIDER_MODULES = ['scraper.spiders']
15 | NEWSPIDER_MODULE = 'scraper.spiders'
16 |
17 |
18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 | #USER_AGENT = 'scraper (+http://www.yourdomain.com)'
20 |
21 | # Obey robots.txt rules
22 | ROBOTSTXT_OBEY = True
23 |
24 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 | #CONCURRENT_REQUESTS = 32
26 |
27 | # Configure a delay for requests for the same website (default: 0)
28 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
29 | # See also autothrottle settings and docs
30 | #DOWNLOAD_DELAY = 3
31 | # The download delay setting will honor only one of:
32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 | #CONCURRENT_REQUESTS_PER_IP = 16
34 |
35 | # Disable cookies (enabled by default)
36 | #COOKIES_ENABLED = False
37 |
38 | # Disable Telnet Console (enabled by default)
39 | #TELNETCONSOLE_ENABLED = False
40 |
41 | # Override the default request headers:
42 | #DEFAULT_REQUEST_HEADERS = {
43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
44 | # 'Accept-Language': 'en',
45 | #}
46 |
47 | # Enable or disable spider middlewares
48 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
49 | #SPIDER_MIDDLEWARES = {
50 | # 'scraper.middlewares.MyCustomSpiderMiddleware': 543,
51 | #}
52 |
53 | # Enable or disable downloader middlewares
54 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
55 | DOWNLOADER_MIDDLEWARES = {
56 | 'scraper.middlewares.SubdomainBlockerMiddleware': 100,
57 | }
58 |
59 | # Enable or disable extensions
60 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
61 | #EXTENSIONS = {
62 | # 'scrapy.extensions.telnet.TelnetConsole': None,
63 | #}
64 |
65 | # Configure item pipelines
66 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
67 | ITEM_PIPELINES = {
68 | 'scraper.pipelines.DeDupePipeline': 100,
69 | }
70 |
71 | # Enable and configure the AutoThrottle extension (disabled by default)
72 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
73 | #AUTOTHROTTLE_ENABLED = True
74 | # The initial download delay
75 | #AUTOTHROTTLE_START_DELAY = 5
76 | # The maximum download delay to be set in case of high latencies
77 | #AUTOTHROTTLE_MAX_DELAY = 60
78 | # The average number of requests Scrapy should be sending in parallel to
79 | # each remote server
80 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 | # Enable showing throttling stats for every response received:
82 | #AUTOTHROTTLE_DEBUG = False
83 |
84 | # Enable and configure HTTP caching (disabled by default)
85 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 | #HTTPCACHE_ENABLED = True
87 | #HTTPCACHE_EXPIRATION_SECS = 0
88 | #HTTPCACHE_DIR = 'httpcache'
89 | #HTTPCACHE_IGNORE_HTTP_CODES = []
90 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 |
--------------------------------------------------------------------------------
/scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------
/scraper/spiders/thorough_spider.py:
--------------------------------------------------------------------------------
1 | # implementation of the thorough spider
2 |
3 | import re
4 | from urllib.parse import urljoin, urlparse
5 | import scrapy
6 | from scrapy.linkextractors import IGNORED_EXTENSIONS
7 | from scraper.items import EmailAddressItem
8 |
9 | # scrapy.linkextractors has a good list of binary extensions, only slight tweaks needed
10 | IGNORED_EXTENSIONS.extend(['ico', 'tgz', 'gz', 'bz2'])
11 |
12 |
13 | def get_extension_ignore_url_params(url):
14 | path = urlparse(url).path # conveniently lops off all params leaving just the path
15 |     extension = re.search(r'\.([a-zA-Z0-9]+$)', path)
16 | if extension is not None:
17 | return extension.group(1)
18 | else:
19 | return "none" # don't want to return NoneType, it will break comparisons later
20 |
21 |
22 | class ThoroughSpider(scrapy.Spider):
23 | name = "spider"
24 |
25 |     def __init__(self, domain=None, subdomain_exclusions=None, crawl_js=False):
26 |         self.allowed_domains = [domain]
27 |         start_url = "http://" + domain
28 |         self.start_urls = [start_url]
29 |         # list arguments arrive from the command line as a string like "['blog', 'technology']",
30 |         # so pull the subdomain names back out before the middleware compares against them
31 |         if isinstance(subdomain_exclusions, str):
32 |             subdomain_exclusions = re.findall(r'[-\w]+', subdomain_exclusions)
33 |         self.subdomain_exclusions = subdomain_exclusions or []
34 |         # boolean command line parameters are not converted from strings automatically,
35 |         # so accept the common truthy spellings for crawl_js
36 |         self.crawl_js = str(crawl_js).lower() in ['true', 't', 'yes', 'y', '1']
37 |
38 | def parse(self, response):
39 | # print("Parsing ", response.url)
40 | all_urls = set()
41 |
42 |         # use xpath selectors to find all the links; this proved to be more effective than using
43 |         # the scrapy-provided LinkExtractor during testing
44 | selector = scrapy.Selector(response)
45 |
46 | # grab all hrefs from the page
47 | # print(selector.xpath('//a/@href').extract())
48 | all_urls.update(selector.xpath('//a/@href').extract())
49 |         # also grab all sources; this will yield a bunch of binary files which we will filter out
50 |         # below, but it has the useful property that it also grabs links to javascript files,
51 |         # which we need to scrape for urls to uncover js code that yields up urls when
52 |         # executed! An alternative here would be to drive the scraper via selenium to execute the js
53 |         # as we go, but this seems slightly simpler
54 | all_urls.update(selector.xpath('//@src').extract())
55 |
56 |         # custom regex that works on javascript files to extract relative urls hidden in quotes.
57 |         # This is a workaround for sites that need js executed in order to follow links -- aka
58 |         # single-page angularJS type designs whose clickable menu items are not rendered as
59 |         # <a> elements but rather as clickable <span> elements - e.g. jana.com
60 |         all_urls.update(selector.re(r'"(/[-\w/._#?]+?)"'))
61 |
62 |         for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'):
63 | item = EmailAddressItem()
64 | item['email_address'] = found_address
65 | yield item
66 |
67 | for url in all_urls:
68 | # ignore commonly ignored binary extensions - might want to put PDFs back in list and
69 | # parse with a pdf->txt extraction library to strip emails from whitepapers, resumes,
70 | # etc.
71 | extension = get_extension_ignore_url_params(url)
72 | if extension in IGNORED_EXTENSIONS:
73 | continue
74 | # convert all relative paths to absolute paths
75 |             if not url.startswith('http'):
76 |                 url = urljoin(response.url, url)
77 |             if extension.lower() != 'js' or self.crawl_js:
78 |                 yield scrapy.Request(url, callback=self.parse)
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html
5 |
6 | [settings]
7 | default = scraper.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = scraper
12 |
--------------------------------------------------------------------------------
/test_spider.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | from scrapy.http import Request, TextResponse
3 | from scraper.spiders import thorough_spider
4 |
5 |
6 | class SpiderTest(unittest.TestCase):
7 |
8 | def test_simple_parse(self):
9 | url = "http://test-url1.com"
10 | request = Request(url=url)
11 | file_content = open("html-test.html").read()
12 | response = TextResponse(url=url, request=request, body=file_content, encoding='utf-8')
13 | spider = thorough_spider.ThoroughSpider("test-url1.com")
14 | items = list(spider.parse(response))
15 | self.assertEqual(2, len(items), "Should have length = 2")
16 | self.assertEqual('bob@bob.com', (items[0]['email_address']), "should be bob@bob.com")
17 |
18 | if __name__ == '__main__':
19 | unittest.main()
--------------------------------------------------------------------------------