├── img
│   └── logo.png
├── .gitignore
├── LICENSE.txt
├── setup.py
├── README.md
└── scrapy_wayback_machine
    └── __init__.py

/img/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sangaline/scrapy-wayback-machine/HEAD/img/logo.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.egg-info
.env
__pycache__
*.pyc
website
dist
build
upload.sh
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
ISC License

Copyright (c) 2017, Evan Sangaline

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import find_packages, setup

description = ('A Scrapy middleware for scraping '
               'Wayback Machine snapshots from archive.org.')
long_description = description + \
    (' For further details, '
     'please see the code repository on github: '
     'https://github.com/sangaline/scrapy-wayback-machine')


setup(
    name='scrapy-wayback-machine',
    version='1.0.3',
    author='Evan Sangaline',
    author_email='evan@intoli.com',
    description=description,
    license='ISC',
    keywords='archive.org scrapy scraper waybackmachine middleware',
    url="https://github.com/sangaline/scrapy-wayback-machine",
    packages=find_packages(),
    long_description=long_description,
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Framework :: Scrapy',
        'Topic :: Utilities',
        'License :: OSI Approved :: ISC License (ISCL)',
    ],
    install_requires=[
        'cryptography',
        'scrapy',
        'twisted',
    ]
)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![The Scrapy Wayback Machine Logo](img/logo.png)

# Scrapy Wayback Machine Middleware

This project provides a [Scrapy](https://scrapy.org) middleware for scraping archived snapshots of webpages as they appear on [archive.org](http://archive.org)'s [Wayback Machine](https://archive.org/web/).
This can be useful if you're trying to scrape a site that has anti-scraping measures that make direct scraping impossible or prohibitively slow.
It's also useful if you want to scrape a website as it appeared at some point in the past, or to scrape information that changes over time.

If you're just interested in mirroring page content, or would like to parse the HTML content in a language other than Python, then you should check out [the Wayback Machine Scraper](https://github.com/sangaline/wayback-machine-scraper).
It's a command-line utility that uses the middleware provided here to crawl through historical snapshots of a website and save them to disk.
It's highly configurable in terms of what it scrapes, but it only saves the unparsed content of the pages on the site.
This may or may not suit your needs.

If you're already using [Scrapy](https://scrapy.org), or are interested in parsing the data that is crawled, then this `WaybackMachineMiddleware` is probably what you want.
This middleware handles all of the tricky parts and passes normal `response` objects to your [Scrapy](https://scrapy.org) spiders with archive timestamp information attached.
The middleware is very unobtrusive and should work seamlessly with existing [Scrapy](https://scrapy.org) middlewares, extensions, and spiders.

## Installation

The package can be installed using `pip`.

```bash
pip install scrapy-wayback-machine
```

## Usage

To enable the middleware, you simply have to add

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_wayback_machine.WaybackMachineMiddleware': 5,
}

WAYBACK_MACHINE_TIME_RANGE = (start_time, end_time)
```

to your [Scrapy](https://scrapy.org) settings.
The start and end times can be specified as `datetime.datetime` objects, Unix timestamps, `YYYYmmdd` timestamps, or `YYYYmmddHHMMSS` timestamps.
The type will be inferred automatically from the content, and the specified times will limit the range of snapshots that are crawled.
You can also pass a single time if you would like to scrape pages as they appeared at that time.

After configuration, responses will be passed to your spiders as they normally would.
Both `response.url` and all links within `response.body` point to the unarchived content, so your parsing code should work the same regardless of whether or not the middleware is enabled.
If you need to access either the time of the snapshot or the [archive.org](http://archive.org) URL for a response, then this information is easily available as metadata attached to the response.
Namely, `response.meta['wayback_machine_time']` contains a `datetime.datetime` corresponding to the time of the snapshot, and `response.meta['wayback_machine_url']` contains the actual URL that was requested.
Unless you're scraping a single point in time, you will almost certainly want to include the timestamp in the items that your spiders produce to differentiate items scraped from the same URL (see the short spider sketch in the Examples section below).

### Examples

[The Wayback Machine Scraper](https://github.com/sangaline/wayback-machine-scraper) command-line utility is a good example of how to use the middleware.
The necessary settings are defined in [\_\_main\_\_.py](https://github.com/sangaline/wayback_machine_scraper/scraper/__main__.py) and the handling of responses is done in [mirror_spider.py](https://github.com/sangaline/wayback_machine_scraper/scraper/mirror_spider.py).
The `MirrorSpider` class simply uses the `response.meta['wayback_machine_time']` information attached to each response to construct the snapshot filenames and is otherwise a fairly generic spider.
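
If you just want a minimal starting point instead, here is a short spider sketch (not part of this package) showing roughly how the attached metadata might be used, assuming the `DOWNLOADER_MIDDLEWARES` and `WAYBACK_MACHINE_TIME_RANGE` settings shown above are configured; the spider name, start URL, and item fields are placeholders.

```python
import scrapy


class SnapshotSpider(scrapy.Spider):
    # placeholder spider name and target URL; the middleware intercepts this
    # request and schedules one request per matching Wayback Machine snapshot
    name = 'snapshots'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {
            # response.url points at the original, unarchived URL
            'url': response.url,
            # the actual archive.org URL that was fetched
            'archive_url': response.meta['wayback_machine_url'],
            # snapshot time, so items from the same URL stay distinguishable
            'timestamp': response.meta['wayback_machine_time'].isoformat(),
        }
```

In a real project the settings would live in the project's `settings.py` or the spider's `custom_settings`.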

There's also an article [Internet Archeology: Scraping time series data from Archive.org](http://sangaline.com/post/wayback-machine-scraper/) that discusses the development of the middleware and includes an example of scraping time series data from [Reddit](https://reddit.com).
--------------------------------------------------------------------------------
/scrapy_wayback_machine/__init__.py:
--------------------------------------------------------------------------------
import os
import json
from datetime import datetime, timezone
try:
    from urllib.request import pathname2url
except ImportError:
    from urllib import pathname2url

from scrapy import Request
from scrapy.http import Response
from scrapy.exceptions import NotConfigured, IgnoreRequest

class UnhandledIgnoreRequest(IgnoreRequest):
    pass

class WaybackMachineMiddleware:
    cdx_url_template = ('https://web.archive.org/cdx/search/cdx?url={url}'
                        '&output=json&fl=timestamp,original,statuscode,digest')
    snapshot_url_template = 'https://web.archive.org/web/{timestamp}id_/{original}'
    robots_txt = 'https://web.archive.org/robots.txt'
    timestamp_format = '%Y%m%d%H%M%S'

    def __init__(self, crawler):
        self.crawler = crawler

        # read the settings
        time_range = crawler.settings.get('WAYBACK_MACHINE_TIME_RANGE')
        if not time_range:
            raise NotConfigured
        self.set_time_range(time_range)

    def set_time_range(self, time_range):
        # allow a single time to be passed in place of a range
        if type(time_range) not in [tuple, list]:
            time_range = (time_range, time_range)

        # translate the times to unix timestamps
        def parse_time(time):
            # datetime.datetime objects can be converted directly
            if isinstance(time, datetime):
                if time.tzinfo is None:
                    time = time.replace(tzinfo=timezone.utc)
                return time.timestamp()
            if type(time) in [int, float, str]:
                time = int(time)
                # realistic timestamp range
                if 10**8 < time < 10**13:
                    return time
            # otherwise archive.org timestamp format (possibly truncated)
            time_string = str(time)[::-1].zfill(14)[::-1]
            time = datetime.strptime(time_string, self.timestamp_format)
            time = time.replace(tzinfo=timezone.utc)
            return time.timestamp()

        self.time_range = [parse_time(time) for time in time_range]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # ignore robots.txt requests
        if request.url == self.robots_txt:
            return

        # let Wayback Machine requests through
        if request.meta.get('wayback_machine_url'):
            return
        if request.meta.get('wayback_machine_cdx_request'):
            return

        # otherwise request a CDX listing of available snapshots
        return self.build_cdx_request(request)

    def process_response(self, request, response, spider):
        meta = request.meta

        # parse CDX requests and schedule future snapshot requests
        if meta.get('wayback_machine_cdx_request'):
            snapshot_requests = self.build_snapshot_requests(response, meta)

            # treat empty listings as 404s
            if len(snapshot_requests) < 1:
                return Response(meta['wayback_machine_original_request'].url, status=404)

            # schedule all of the snapshots
            for snapshot_request in snapshot_requests:
                self.crawler.engine.schedule(snapshot_request, spider)

            # abort this request
            raise UnhandledIgnoreRequest

        # clean up snapshot responses
        if meta.get('wayback_machine_url'):
            return response.replace(url=meta['wayback_machine_original_request'].url)

        return response

    def build_cdx_request(self, request):
        if os.name == 'nt':
            cdx_url = self.cdx_url_template.format(url=pathname2url(request.url.split('://')[1]))
        else:
            cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))
        cdx_request = Request(cdx_url)
        cdx_request.meta['wayback_machine_original_request'] = request
        cdx_request.meta['wayback_machine_cdx_request'] = True
        return cdx_request

    def build_snapshot_requests(self, response, meta):
        assert meta.get('wayback_machine_cdx_request'), 'Not a CDX request meta.'

        # parse the CDX snapshot data
        try:
            data = json.loads(response.text)
        except json.decoder.JSONDecodeError:
            # forbidden by robots.txt
            data = []
        if len(data) < 2:
            return []
        keys, rows = data[0], data[1:]

        def build_dict(row):
            new_dict = {}
            for i, key in enumerate(keys):
                if key == 'timestamp':
                    try:
                        time = datetime.strptime(row[i], self.timestamp_format)
                        new_dict['datetime'] = time.replace(tzinfo=timezone.utc)
                    except ValueError:
                        # this means an error in their date string (it happens)
                        new_dict['datetime'] = None
                new_dict[key] = row[i]
            return new_dict

        snapshots = list(map(build_dict, rows))
        del rows

        snapshot_requests = []
        for snapshot in self.filter_snapshots(snapshots):
            # update the url to point to the snapshot
            url = self.snapshot_url_template.format(**snapshot)
            original_request = meta['wayback_machine_original_request']
            snapshot_request = original_request.replace(url=url)

            # attach extension-specific metadata to the request
            snapshot_request.meta.update({
                'wayback_machine_original_request': original_request,
                'wayback_machine_url': snapshot_request.url,
                'wayback_machine_time': snapshot['datetime'],
            })

            snapshot_requests.append(snapshot_request)

        return snapshot_requests

    def filter_snapshots(self, snapshots):
        filtered_snapshots = []
        initial_snapshot = None
        last_digest = None
        for i, snapshot in enumerate(snapshots):
            # ignore entries with invalid timestamps
            if not snapshot['datetime']:
                continue
            timestamp = snapshot['datetime'].timestamp()

            # ignore bot detections (e.g. status="-")
            if len(snapshot['statuscode']) != 3:
                continue

            # also don't download redirects (because the redirected URLs are also present)
            if snapshot['statuscode'][0] == '3':
                continue

            # include the snapshot that was active when we first enter the range
            if len(filtered_snapshots) == 0:
                if timestamp > self.time_range[0]:
                    if initial_snapshot:
                        filtered_snapshots.append(initial_snapshot)
                        last_digest = initial_snapshot['digest']
                else:
                    initial_snapshot = snapshot

            # ignore snapshots before the range
            if timestamp < self.time_range[0]:
                continue

            # ignore the rest, which are past the specified time range
            if timestamp > self.time_range[1]:
                break

            # don't download unchanged snapshots
            if last_digest == snapshot['digest']:
                continue
            last_digest = snapshot['digest']

            filtered_snapshots.append(snapshot)

        return filtered_snapshots
--------------------------------------------------------------------------------