├── img
│   └── logo.png
├── .gitignore
├── LICENSE.txt
├── setup.py
├── README.md
└── scrapy_wayback_machine
    └── __init__.py

/img/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sangaline/scrapy-wayback-machine/HEAD/img/logo.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.egg-info
.env
__pycache__
*.pyc
website
dist
build
upload.sh
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
ISC License

Copyright (c) 2017, Evan Sangaline

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import find_packages, setup

description = ('A Scrapy middleware for scraping '
               'Wayback Machine snapshots from archive.org.')
long_description = description + \
    (' For further details, '
     'please see the code repository on github: '
     'https://github.com/sangaline/scrapy-wayback-machine')


setup(
    name='scrapy-wayback-machine',
    version='1.0.3',
    author='Evan Sangaline',
    author_email='evan@intoli.com',
    description=description,
    license='ISC',
    keywords='archive.org scrapy scraper waybackmachine middleware',
    url="https://github.com/sangaline/scrapy-wayback-machine",
    packages=find_packages(),
    long_description=long_description,
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Framework :: Scrapy',
        'Topic :: Utilities',
        'License :: OSI Approved :: ISC License (ISCL)',
    ],
    install_requires=[
        'cryptography',
        'scrapy',
        'twisted',
    ]
)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![The Scrapy Wayback Machine Logo](img/logo.png)

# Scrapy Wayback Machine Middleware

This project provides a [Scrapy](https://scrapy.org) middleware for scraping archived snapshots of webpages as they appear on [archive.org](http://archive.org)'s [Wayback Machine](https://archive.org/web/).
This can be useful if you're trying to scrape a site that has anti-scraping measures that make direct scraping impossible or prohibitively slow.
It's also useful if you want to scrape a website as it appeared at some point in the past, or to scrape information that changes over time.

If you're just interested in mirroring page content, or would like to parse the HTML content in a language other than Python, then you should check out [the Wayback Machine Scraper](https://github.com/sangaline/wayback-machine-scraper).
It's a command-line utility that uses the middleware provided here to crawl through historical snapshots of a website and save them to disk.
It's highly configurable in terms of what it scrapes, but it only saves the unparsed content of the pages on the site.
This may or may not suit your needs.

If you're already using [Scrapy](https://scrapy.org), or are interested in parsing the data that is crawled, then this `WaybackMachineMiddleware` is probably what you want.
This middleware handles all of the tricky parts and passes normal `response` objects to your [Scrapy](https://scrapy.org) spiders with archive timestamp information attached.
The middleware is very unobtrusive and should work seamlessly with existing [Scrapy](https://scrapy.org) middlewares, extensions, and spiders.

## Installation

The package can be installed using `pip`.

```bash
pip install scrapy-wayback-machine
```

## Usage

To enable the middleware, you simply have to add

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
}

WAYBACK_MACHINE_TIME_RANGE = (start_time, end_time)
```

to your [Scrapy](https://scrapy.org) settings.
The start and end times can be specified as `datetime.datetime` objects, Unix timestamps, `YYYYmmdd` timestamps, or `YYYYmmddHHMMSS` timestamps.
The type will be inferred automatically from the content, and the specified times will limit the range of snapshots that are crawled.
You can also pass a single time if you would like to scrape pages as they appeared at that time.

After configuration, responses will be passed to your spiders as they normally would.
Both `response.url` and all links within `response.body` point to the unarchived content, so your parsing code should work the same regardless of whether or not the middleware is enabled.
If you need to access either the time of the snapshot or the [archive.org](http://archive.org) URL for a response, then this information is easily available as metadata attached to the response.
Namely, `response.meta['wayback_machine_time']` contains a `datetime.datetime` corresponding to the time of the snapshot, and `response.meta['wayback_machine_url']` contains the actual URL that was requested.
Unless you're scraping a single point in time, you will almost certainly want to include the timestamp in the items that your spiders produce to differentiate items scraped from the same URL (see the short spider sketch in the Examples section below).

### Examples

[The Wayback Machine Scraper](https://github.com/sangaline/wayback-machine-scraper) command-line utility is a good example of how to use the middleware.
The necessary settings are defined in [\_\_main\_\_.py](https://github.com/sangaline/wayback_machine_scraper/scraper/__main__.py) and the handling of responses is done in [mirror_spider.py](https://github.com/sangaline/wayback_machine_scraper/scraper/mirror_spider.py).
The `MirrorSpider` class simply uses the `response.meta['wayback_machine_time']` information attached to each response to construct the snapshot filenames and is otherwise a fairly generic spider.
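
If you just want a minimal starting point instead, here is a short spider sketch (not part of this package) showing roughly how the attached metadata might be used, assuming the `DOWNLOADER_MIDDLEWARES` and `WAYBACK_MACHINE_TIME_RANGE` settings shown above are configured; the spider name, start URL, and item fields are placeholders.

```python
import scrapy


class SnapshotSpider(scrapy.Spider):
    # placeholder spider name and target URL; the middleware intercepts this
    # request and schedules one request per matching Wayback Machine snapshot
    name = 'snapshots'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {
            # response.url points at the original, unarchived URL
            'url': response.url,
            # the actual archive.org URL that was fetched
            'archive_url': response.meta['wayback_machine_url'],
            # snapshot time, so items from the same URL stay distinguishable
            'timestamp': response.meta['wayback_machine_time'].isoformat(),
        }
```

In a real project the settings would live in the project's `settings.py` or the spider's `custom_settings`.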

There's also an article [Internet Archeology: Scraping time series data from Archive.org](http://sangaline.com/post/wayback-machine-scraper/) that discusses the development of the middleware and includes an example of scraping time series data from [Reddit](https://reddit.com).
--------------------------------------------------------------------------------
/scrapy_wayback_machine/__init__.py:
--------------------------------------------------------------------------------
import os
import json
from datetime import datetime, timezone
try:
    from urllib.request import pathname2url
except ImportError:
    from urllib import pathname2url

from scrapy import Request
from scrapy.http import Response
from scrapy.exceptions import NotConfigured, IgnoreRequest

class UnhandledIgnoreRequest(IgnoreRequest):
    pass

class WaybackMachineMiddleware:
    cdx_url_template = ('https://web.archive.org/cdx/search/cdx?url={url}'
                        '&output=json&fl=timestamp,original,statuscode,digest')
    snapshot_url_template = 'https://web.archive.org/web/{timestamp}id_/{original}'
    robots_txt = 'https://web.archive.org/robots.txt'
    timestamp_format = '%Y%m%d%H%M%S'

    def __init__(self, crawler):
        self.crawler = crawler

        # read the settings
        time_range = crawler.settings.get('WAYBACK_MACHINE_TIME_RANGE')
        if not time_range:
            raise NotConfigured
        self.set_time_range(time_range)

    def set_time_range(self, time_range):
        # allow a single time to be passed in place of a range
        if type(time_range) not in [tuple, list]:
            time_range = (time_range, time_range)

        # translate the times to unix timestamps
        def parse_time(time):
            # datetime.datetime objects can be converted directly
            if isinstance(time, datetime):
                if time.tzinfo is None:
                    time = time.replace(tzinfo=timezone.utc)
                return time.timestamp()
            if type(time) in [int, float, str]:
                time = int(time)
                # realistic timestamp range
                if 10**8 < time < 10**13:
                    return time
            # otherwise archive.org timestamp format (possibly truncated)
            time_string = str(time)[::-1].zfill(14)[::-1]
            time = datetime.strptime(time_string, self.timestamp_format)
            time = time.replace(tzinfo=timezone.utc)
            return time.timestamp()

        self.time_range = [parse_time(time) for time in time_range]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # ignore robots.txt requests
        if request.url == self.robots_txt:
            return

        # let Wayback Machine requests through
        if request.meta.get('wayback_machine_url'):
            return
        if request.meta.get('wayback_machine_cdx_request'):
            return

        # otherwise request a CDX listing of available snapshots
        return self.build_cdx_request(request)

    def process_response(self, request, response, spider):
        meta = request.meta

        # parse CDX requests and schedule future snapshot requests
        if meta.get('wayback_machine_cdx_request'):
            snapshot_requests = self.build_snapshot_requests(response, meta)

            # treat empty listings as 404s
            if len(snapshot_requests) < 1:
                return Response(meta['wayback_machine_original_request'].url, status=404)

            # schedule all of the snapshots
            for snapshot_request in snapshot_requests:
                self.crawler.engine.schedule(snapshot_request, spider)

            # abort this request
            raise UnhandledIgnoreRequest

        # clean up snapshot responses
        if meta.get('wayback_machine_url'):
            return response.replace(url=meta['wayback_machine_original_request'].url)

        return response

    def build_cdx_request(self, request):
        if os.name == 'nt':
            cdx_url = self.cdx_url_template.format(url=pathname2url(request.url.split('://')[1]))
        else:
            cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
        cdx_request = Request(cdx_url)
        cdx_request.meta['wayback_machine_original_request'] = request
        cdx_request.meta['wayback_machine_cdx_request'] = True
        return cdx_request

    def build_snapshot_requests(self, response, meta):
        assert meta.get('wayback_machine_cdx_request'), 'Not a CDX request meta.'

        # parse the CDX snapshot data
        try:
            data = json.loads(response.text)
        except json.decoder.JSONDecodeError:
            # forbidden by robots.txt
            data = []
        if len(data) < 2:
            return []
        keys, rows = data[0], data[1:]

        def build_dict(row):
            new_dict = {}
            for i, key in enumerate(keys):
                if key == 'timestamp':
                    try:
                        time = datetime.strptime(row[i], self.timestamp_format)
                        new_dict['datetime'] = time.replace(tzinfo=timezone.utc)
                    except ValueError:
                        # this means an error in their date string (it happens)
                        new_dict['datetime'] = None
                new_dict[key] = row[i]
            return new_dict

        snapshots = list(map(build_dict, rows))
        del rows

        snapshot_requests = []
        for snapshot in self.filter_snapshots(snapshots):
            # update the url to point to the snapshot
            url = self.snapshot_url_template.format(**snapshot)
            original_request = meta['wayback_machine_original_request']
            snapshot_request = original_request.replace(url=url)

            # attach extension-specific metadata to the request
            snapshot_request.meta.update({
                'wayback_machine_original_request': original_request,
                'wayback_machine_url': snapshot_request.url,
                'wayback_machine_time': snapshot['datetime'],
            })

            snapshot_requests.append(snapshot_request)

        return snapshot_requests

    def filter_snapshots(self, snapshots):
        filtered_snapshots = []
        initial_snapshot = None
        last_digest = None
        for i, snapshot in enumerate(snapshots):
            # ignore entries with invalid timestamps
            if not snapshot['datetime']:
                continue
            timestamp = snapshot['datetime'].timestamp()

            # ignore bot detections (e.g. status="-")
            if len(snapshot['statuscode']) != 3:
                continue

            # also don't download redirects (because the redirected URLs are also present)
            if snapshot['statuscode'][0] == '3':
                continue

            # include the snapshot that was active when we first enter the range
            if len(filtered_snapshots) == 0:
                if timestamp > self.time_range[0]:
                    if initial_snapshot:
                        filtered_snapshots.append(initial_snapshot)
                        last_digest = initial_snapshot['digest']
                else:
                    initial_snapshot = snapshot

            # ignore snapshots before the range
            if timestamp < self.time_range[0]:
                continue

            # ignore the rest, which are past the specified time range
            if timestamp > self.time_range[1]:
                break

            # don't download unchanged snapshots
            if last_digest == snapshot['digest']:
                continue
            last_digest = snapshot['digest']

            filtered_snapshots.append(snapshot)

        return filtered_snapshots
--------------------------------------------------------------------------------