├── README.md
├── html-test.html
├── scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── thorough_spider.py
├── scrapy.cfg
└── test_spider.py
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# email-scraper
A general-purpose utility written in Python (v3.0+) for crawling websites to extract email addresses.

## Overview
I implemented this using the popular Python web-crawling framework **Scrapy**. I had never used it before, so this is probably not the most elegant implementation of a Scrapy-based email scraper (say that three times fast!). The project consists of a single spider, ***ThoroughSpider***, which takes 'domain' as an argument and begins crawling there. Two optional arguments add further tuning capability:
* subdomain_exclusions - optional list of subdomains to exclude from the crawl
* crawl_js - optional boolean [default=False]: whether or not to follow links to JavaScript files and search them for URLs as well

The **crawl_js** parameter is needed only when a "perfect storm" of conditions exists:

1. there is no sitemap.xml available

2. there are clickable menu items that are **not** contained in `<a>` tags

3. those menu items execute JavaScript code that loads pages, but the URLs are in the .js file itself

Normally such links would not be followed and scraped for further URLs. However, if a single-page AngularJS-based site with no sitemap is crawled and the menu links live in `<span>` elements rather than `<a>` tags, the pages will not be discovered unless the ng-app is crawled as well to extract the destinations of the menu items.

A possible workaround to parsing the js files would be to use Selenium WebDriver to automate the crawl. I decided against this because Selenium needs to know how to find the menu using a CSS selector, class name, etc., and these are specific to the site itself. A Selenium-based solution would not be as general-purpose, but for an AngularJS-based menu, something like the following would work in Selenium:
```
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://angularjs-site.com")
menuElement = driver.find_element_by_class_name("trigger")
menuElement.click()
driver.find_elements_by_xpath('//li')[3].click()  # for example, click the fourth item in the discovered menu list
```

For this solution I opted to design the crawler ***without*** Selenium, which means occasionally crawling JS files to root out further links.
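
As a rough illustration of the non-Selenium approach: the spider simply runs a regex over each response (including .js files when crawl_js is enabled) and treats anything that looks like a quoted relative path as a candidate URL. The snippet below uses the same pattern as thorough_spider.py; the `js_source` sample string is made up:
```
import re

# the same quoted-relative-path pattern the spider applies to responses (see thorough_spider.py)
relative_url_pattern = re.compile(r'"(\/[-\w\d\/\._#?]+?)"')

js_source = 'router.go("/about/team.html"); loadView("/contact#form");'
print(relative_url_pattern.findall(js_source))  # ['/about/team.html', '/contact#form']
```
Every path recovered this way is joined against the response URL with urljoin and fed back into the crawl, exactly like an ordinary href.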

### Implemented classes:
* ```ThoroughSpider```: the spider itself
* ```DeDupePipeline```: simple de-duplicator so that email addresses are only printed out once even if they are discovered multiple times
* ```SubdomainBlockerMiddleware```: blocks subdomains in case the crawl needs to exclude them
* ```EmailAddressItem```: holds the email addresses as scrapy.Items. This lets the Scrapy framework output items in a number of convenient formats (csv, json, etc.) without additional dev work on my end.

### Required Modules
This project depends on the following Python modules:
* scrapy (the only third-party dependency)
* urllib.parse (part of the Python 3 standard library; used for urljoin/urlparse)

### How to Run
Since it is Scrapy-based, it must be invoked via the standard "scrapy way":
```
scrapy crawl spider -a domain="your.domain.name" -o emails-found.csv
```
Or with the optional command line arguments:
```
scrapy crawl spider -a domain="your.domain.name" -a subdomain_exclusions="['blog', 'technology']" -a crawl_js=True -o emails-found.csv
```

## Testing
I tested this mainly against my own ***incomplete*** blog www.quackquaponics.com and personal website www.tpetz.com, where I hid email addresses around the sites and also hid links to various JS files to test how well the parsing worked. There is also a simple unit test for the ThoroughSpider.parse method in test_spider.py. It uses a static file, html-test.html, which can be extended with various hard-to-find email addresses to see whether they are discovered. To run the unit test, do
```
python3 test_spider.py
```

## Known Issues
1. The email regex used to find addresses is not RFC 822 compliant. That is easily fixed if needed, but in practice most real-world validators are not RFC 822 compliant either, and the regex I used is *much* simpler than a fully compliant one.

## Future Work
* Add PDF-parsing support via a PDF -> text transform library to extract email addresses from PDFs (e.g. whitepapers, resumes, etc.)
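
A minimal sketch of what that could look like, assuming the pdfminer.six package handles the PDF -> text step (illustrative only; the helper below is not part of the project):
```
import re
from pdfminer.high_level import extract_text  # assumes pdfminer.six is installed

# same (deliberately non-RFC-822) pattern the spider uses on HTML pages
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}')

def emails_from_pdf(path):
    # convert the PDF to plain text, then reuse the email regex on it
    return set(EMAIL_RE.findall(extract_text(path)))
```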
--------------------------------------------------------------------------------
/html-test.html:
--------------------------------------------------------------------------------
bob@bob.com
--------------------------------------------------------------------------------
/scraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apetz/email-scraper/e56741fa4dc865c71b561665fe556bc52554883b/scraper/__init__.py
--------------------------------------------------------------------------------
/scraper/items.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class EmailAddressItem(scrapy.Item):
    email_address = scrapy.Field()
--------------------------------------------------------------------------------
/scraper/middlewares.py:
--------------------------------------------------------------------------------
# Very basic subdomain-blocking middleware: intercepts requests and drops them if the url is for
# a subdomain that is on the blocked list. This is inspired by various blog.*.com subdomains which,
# if crawled, result in an explosion of urls that are hit due to the many pages and embedded links
# within the blogs.

import logging
import re

from scrapy.exceptions import IgnoreRequest


class SubdomainBlockerMiddleware:
    name = "SubdomainBlockerMiddleware"

    def process_request(self, request, spider):
        # grab the subdomain if it exists and check whether it is on the exclusion list
        match = re.match(r'(?:https?://)?([-\w]+)\.', request.url)
        if match is None:
            return None
        subdomain = match.group(1)
        if subdomain in spider.subdomain_exclusions:
            logging.warning("Dropped request for excluded subdomain {}".format(subdomain))
            raise IgnoreRequest()
        return None
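
# Illustrative behaviour of the subdomain check above (not part of the original module; the URLs
# below are made-up examples):
#   re.match(r'(?:https?://)?([-\w]+)\.', "http://blog.example.com/post").group(1)  -> "blog"
#   re.match(r'(?:https?://)?([-\w]+)\.', "https://www.example.com/").group(1)      -> "www"
# Note that for a bare second-level domain such as "http://example.com/" the captured label is
# "example", so the exclusion list should contain true subdomain labels only.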
--------------------------------------------------------------------------------
/scraper/pipelines.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# based on the Scrapy Tutorial on item pipelines
from scrapy.exceptions import DropItem


class DeDupePipeline(object):
    def __init__(self):
        self.email_addresses_seen = set()

    def process_item(self, item, spider):
        if item['email_address'] in self.email_addresses_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.email_addresses_seen.add(item['email_address'])
            return item
--------------------------------------------------------------------------------
/scraper/settings.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

# Scrapy settings for scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scraper (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scraper.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SubdomainBlockerMiddleware': 100,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scraper.pipelines.DeDupePipeline': 100,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
--------------------------------------------------------------------------------
/scraper/spiders/__init__.py:
--------------------------------------------------------------------------------
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
--------------------------------------------------------------------------------
/scraper/spiders/thorough_spider.py:
--------------------------------------------------------------------------------
# implementation of the thorough spider

import ast
import re
from urllib.parse import urljoin, urlparse

import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS

from scraper.items import EmailAddressItem

# scrapy.linkextractors has a good list of binary extensions, only slight tweaks needed
IGNORED_EXTENSIONS.extend(['ico', 'tgz', 'gz', 'bz2'])


def get_extension_ignore_url_params(url):
    path = urlparse(url).path  # conveniently lops off all params, leaving just the path
    extension = re.search(r'\.([a-zA-Z0-9]+)$', path)
    if extension is not None:
        return extension.group(1)
    return "none"  # don't want to return NoneType, it will break comparisons later
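
# Illustrative behaviour (the example URLs are made up):
#   get_extension_ignore_url_params("http://example.com/static/app.js?v=3")  -> "js"
#   get_extension_ignore_url_params("http://example.com/about")              -> "none"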


class ThoroughSpider(scrapy.Spider):
    name = "spider"

    def __init__(self, domain=None, subdomain_exclusions=None, crawl_js=False, **kwargs):
        super().__init__(**kwargs)
        self.allowed_domains = [domain]
        self.start_urls = ["http://" + domain]

        # command line arguments arrive as strings, so accept either a real list or a list literal
        if isinstance(subdomain_exclusions, str):
            subdomain_exclusions = ast.literal_eval(subdomain_exclusions)
        self.subdomain_exclusions = subdomain_exclusions or []

        # boolean command line parameters are not converted from strings automatically
        self.crawl_js = str(crawl_js).lower() in ['true', 't', 'yes', 'y', '1']

    def parse(self, response):
        # print("Parsing ", response.url)
        all_urls = set()

        # use xpath selectors to find all the links; this proved more effective than the
        # scrapy-provided LinkExtractor during testing
        selector = scrapy.Selector(response)

        # grab all hrefs from the page
        # print(selector.xpath('//a/@href').extract())
        all_urls.update(selector.xpath('//a/@href').extract())
        # also grab all sources. This yields a bunch of binary files which we filter out
        # below, but it has the useful property that it also grabs all javascript file links,
        # which we need to scrape for urls to uncover js code that yields up urls when
        # executed! An alternative here would be to drive the scraper via selenium to execute
        # the js as we go, but this seems slightly simpler
        all_urls.update(selector.xpath('//@src').extract())

        # custom regex that works on javascript files to extract relative urls hidden in quotes.
        # This is a workaround for sites that need js executed in order to follow links -- aka
        # single-page angularJS-type designs that have clickable menu items that are not rendered
        # into <a> elements but rather as clickable span elements - e.g. jana.com
        all_urls.update(selector.re(r'"(\/[-\w\d\/\._#?]+?)"'))

        for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}'):
            item = EmailAddressItem()
            item['email_address'] = found_address
            yield item

        for url in all_urls:
            # ignore commonly ignored binary extensions - might want to put PDFs back in the list
            # and parse with a pdf->txt extraction library to strip emails from whitepapers,
            # resumes, etc.
            extension = get_extension_ignore_url_params(url)
            if extension in IGNORED_EXTENSIONS:
                continue
            # convert all relative paths to absolute paths
            if not url.startswith('http'):
                url = urljoin(response.url, url)
            if extension.lower() != 'js' or self.crawl_js:
                yield scrapy.Request(url, callback=self.parse)
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
--------------------------------------------------------------------------------
/test_spider.py:
--------------------------------------------------------------------------------
import unittest
from scrapy.http import Request, TextResponse
from scraper.spiders import thorough_spider


class SpiderTest(unittest.TestCase):

    def test_simple_parse(self):
        url = "http://test-url1.com"
        request = Request(url=url)
        with open("html-test.html") as test_file:
            file_content = test_file.read()
        response = TextResponse(url=url, request=request, body=file_content, encoding='utf-8')
        spider = thorough_spider.ThoroughSpider("test-url1.com")
        items = list(spider.parse(response))
        self.assertEqual(2, len(items), "Should have length = 2")
        self.assertEqual('bob@bob.com', items[0]['email_address'], "should be bob@bob.com")


if __name__ == '__main__':
    unittest.main()
--------------------------------------------------------------------------------