├── CODEOWNERS
├── INSTALL
├── LICENSE
├── MANIFEST.in
├── README.md
├── apps
│   ├── urls.txt
│   ├── wayback_crawler.py
│   └── wayback_retriever.py
├── libwayback
│   ├── __init__.py
│   ├── crawler.py
│   └── retriever.py
└── setup.py
--------------------------------------------------------------------------------
/CODEOWNERS:
--------------------------------------------------------------------------------
* @caesar0301
--------------------------------------------------------------------------------
/INSTALL:
--------------------------------------------------------------------------------
Thanks for downloading libwayback.

To install it, make sure you have Python 2.6 or greater installed. Then run
this command from the command prompt:

    python setup.py install

If you're upgrading from a previous version, you need to remove it first.

AS AN ALTERNATIVE, you can just copy the entire "libwayback" directory to Python's
site-packages directory, which is located wherever your Python installation
lives. Some places you might check are:

    /usr/lib/python2.7/site-packages (Unix, Python 2.7)
    /usr/lib/python2.6/site-packages (Unix, Python 2.6)
    C:\PYTHON\site-packages (Windows)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
#
# Copyright (C) 2013 Xiaming Chen - chenxm35@gmail.com
# All rights reserved.
#
# This file is part of project libwayback.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Neither the name of the copyright holder nor the names of the contributors
# may be used to endorse or promote products derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
include README.md
include LICENSE
include MANIFEST.in
include setup.py
recursive-include apps *.txt
recursive-include apps *.py
recursive-include libwayback *.py
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Archive_crawler
============

![wayback machine](http://archive.org/images/wayback_logo-sm.gif)


A library for parsing the Wayback Machine of the [Internet Archive](http://www.archive.org) to retrieve the historical content of web pages, for research purposes only.

Only the original HTML content of each web page is downloaded, without the embedded web objects.

By xiamingc, SJTU - chenxm35@gmail.com


Requirements
------------

[Python 2.6+ (<3)](http://www.python.org/)

[lxml 2.3+](http://lxml.de/)

[html5lib 0.95+](https://github.com/html5lib)


Programs
------------

`wayback_crawler` -- extracts the URLs of historical snapshots of websites from the Internet Archive for later content download.

`wayback_retriever` -- downloads the page content for the URLs output by `wayback_crawler`.

`libwayback` -- the underlying library supporting the crawler and retriever programs.



Usage of wayback_crawler
------------

If you have Python and the required packages installed, you can run it as a Python script:

    python wayback_crawler.py [-l log_level] urlfile



Usage of wayback_retriever
------------

`wayback_retriever` works on the output of `wayback_crawler`. Given a specific file, you can run the retriever like:

    python wayback_retriever.py [-y year_scale] [-l log_level] urlfile

where the input is an individual file output by the crawler.

The downloaded pages are saved in the folder `retriever_results` under the current working directory.



Usage of libwayback
------------

This library provides basic functions for crawling the Internet Archive. It has a simple structure:

    libwayback
    |____WaybackCrawler
    |____WaybackRetriever

If you want to use libwayback in your project, it is easy to integrate:

    from libwayback import WaybackCrawler, WaybackRetriever

    crawler = WaybackCrawler("www.sjtu.edu.cn")
    crawler.parse(live=False)

    # The `results` attribute of the crawler instance is a dict with
    # a "year" string as the key and a list of page addresses as the value.

    ret = crawler.results

    # Based on the results of the crawler, i.e. specific page addresses, you
    # can use the retriever to download and save them in your file system:

    retriever = WaybackRetriever()

    for year in ret:
        for url in ret[year]:
            retriever.save_page(url, "saved_file")

NOTE:

* The `live` option of `parse()` controls whether the live (original) version of a page is parsed.
  For more about the difference between the modified and original versions, please refer to:
  http://faq.web.archive.org/page-without-wayback-code/ .
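* As a rough illustration of the difference (reusing the sample URL from the
  retriever's error messages), a modified-version address such as

        http://web.archive.org/web/19990117032727/google.com/

  becomes the following original-version address after `convert_live_url()`
  inserts `id_` behind the timestamp:

        http://web.archive.org/web/19990117032727id_/google.com/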


About logs
------------

libwayback:

    ERROR: "Invalid timestamp of wayback url: %s"
    Meaning: the regular expression cannot extract the year number from the historical URL.
    Solution: check the URL manually to find the cause of the error.
    Frequency: ~ 0%


wayback_retriever:

    ERROR: "Failed to extract the time string. The URL must be in a format like: http://web.archive.org/web/19990117032727/google.com/"
    Meaning: the regular expression cannot extract the year number from the historical URL.
    Solution: check the URL manually to find the cause of the error.
    Frequency: ~ 0%

    ERROR: "Open page error: %s: %s"
    Meaning: urllib2 cannot open this URL. Multiple reasons may lead to the failure: the Wayback server is down,
    the connection is blocked by a third party, or something else.
    Solution: check the URL manually, or rerun the program at another time.
    Frequency: ~ 14%

    ERROR: "Read content error: %s: %s"
    Meaning: reading from the file object returned by urllib2.urlopen() failed, usually because the content arrived incomplete.
    Solution: check the URL manually.
    Frequency: ~ 0%

    ERROR: "Save redirected page error: [{0}]{1}: {2}"
    Meaning: saving the redirected page found in the first download failed.
    Solution: check the URL manually.
    Frequency: ~ 0.1%

    ERROR: "Fail to extract timestamp: %s"
    Meaning: the regular expression cannot match the exact year, month, day, hour, minute, and second numbers in the URL.
    This is a strict matching action.
    Solution: check the URL manually.
    Frequency: ~ 0%
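
The two apps above configure this logging for you. If you embed `libwayback`
directly, a minimal sketch for capturing its log output yourself (the handler
and file name here are only examples) is:

    import logging

    # libwayback emits records through the "libwayback" logger hierarchy,
    # so attaching a handler there captures all of the errors listed above.
    logger = logging.getLogger("libwayback")
    logger.setLevel(logging.ERROR)
    logger.addHandler(logging.FileHandler("wayback_errors.txt"))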
--------------------------------------------------------------------------------
/apps/urls.txt:
--------------------------------------------------------------------------------
sjtu.edu.cn
--------------------------------------------------------------------------------
/apps/wayback_crawler.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Main crawler for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import argparse, logging, os, re, sys

from libwayback import WaybackCrawler


# global logger settings
root_logger = logging.getLogger("libwayback")  # use the default 'libwayback' logger to activate the logging functions of libwayback
root_logger.setLevel(logging.ERROR)
handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_crawler_log.txt"))
console = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
root_logger.addHandler(handler)
root_logger.addHandler(console)


def _sanitize_name(fname):
    # Replace characters that are unsafe in file names with underscores.
    return re.sub(r'[\.\/\\\?#&]', "_", fname.strip('\r \n'))

def _abspath(where):
    abswhere = os.path.abspath(where)
    if not os.path.exists(abswhere): os.makedirs(abswhere)
    return abswhere

def dump_results(url, results, where='.'):
    # Write all historical URLs of `url` (grouped by year in `results`)
    # into a single text file, one URL per line.
    fn = os.path.join(_abspath(where), _sanitize_name(url) + '.txt')
    with open(fn, 'w') as f:
        for key in results:
            for line in results[key]:
                f.write(line + '\n')
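
# For example (hypothetical call), dump_results("www.sjtu.edu.cn", crawler.results)
# sanitizes the URL with _sanitize_name, turning each "." into "_", and writes
# every crawled snapshot address into ./www_sjtu_edu_cn.txt.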

__about__ = \
"""
Archive_crawler - Version 0.1
Program to parse historical urls of web page from archive.org.
Copyright (c) 2012 xiamingc, SJTU - chenxm35@gmail.com
"""


def runMain():
    print(__about__)

    # Input arguments
    parser = argparse.ArgumentParser(description="Parse the historical version URLs of a web page in Wayback (http://wayback.archive.org/).")
    parser.add_argument('-l', dest="loglevel", default="INFO", type=str, help="Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL")
    parser.add_argument("urllist", type=str, help="A plain file containing the addresses of web pages.")
    args = parser.parse_args()
    urllist = args.urllist
    loglevel = args.loglevel

    if not urllist: parser.print_help(); sys.exit(-1)

    # Logging
    numeric_level = getattr(logging, loglevel.upper(), None)
    if not isinstance(numeric_level, int):
        print('Invalid log level: %s' % loglevel)
        parser.print_help(); sys.exit(-1)

    # set logger level
    root_logger.setLevel(numeric_level)

    logger = logging.getLogger("libwayback.app_crawler")

    for url in open(urllist):
        url = url.rstrip('\r\n')
        if url == '': continue
        # Construct the crawler before the try block so the `finally` clause
        # never references an unbound (or stale) `crawler` if construction fails.
        crawler = WaybackCrawler(url)
        try:
            logger.info('Start parsing: %s' % url)
            crawler.parse()
            logger.info('Finish parsing: %s' % url)
        finally:
            if crawler.results:
                logger.info('Dump records of %s to file %s' % (url, url + '.txt'))
                dump_results(url, crawler.results)


if __name__ == '__main__':
    runMain()

# EOF
--------------------------------------------------------------------------------
/apps/wayback_retriever.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Subtools for Archive_crawler program.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import argparse
import logging
import os
import datetime
import sys
import time

from libwayback import WaybackRetriever

# global logger settings
root_logger = logging.getLogger("libwayback")  # use the default 'libwayback' logger to activate the logging functions of libwayback
root_logger.setLevel(logging.ERROR)
handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_retriever_log.txt"))
console = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
root_logger.addHandler(handler)
root_logger.addHandler(console)

logger = logging.getLogger("libwayback.app_retriever")


__about__ = """
Program to retrieve the HTML web pages of the URLs generated by wayback_crawler.py.
"""


def _genoutdir(where='.'):
    outputdir = os.path.join(where, "retriever_results")
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)
    return outputdir

def _genoutname(outputfolder, inputfilename, timestr):
    """ Generate an output file name like "[inputfilename]_[urltimestamp].txt"
    """
    name = inputfilename.rsplit('.', 1)[0]
    newdir = os.path.join(outputfolder, name)
    if not os.path.exists(newdir):
        os.makedirs(newdir)
    outputfilename = "%s_%s.txt" % (name, timestr)
    return os.path.join(newdir, outputfilename)


def retriever_smart(inputfile, years=None, days=None):
    """ Download the snapshots listed in `inputfile`, keeping at most the
    earliest valid page per day, optionally restricted to `years`.
    """
    logger.info("Start downloading URLs of %s" % inputfile)

    retriever = WaybackRetriever()

    all_urls = []
    for line in open(inputfile, 'rb'):
        line = line.rstrip('\r\n')
        if line == '': continue
        timestamp = retriever.extract_timestamp(line)
        if timestamp == None:
            logger.error("Fail to extract timestamp: %s" % line)
            continue
        all_urls.append((timestamp, line))
    all_urls.sort(key=lambda x: x[0])

    ## Process the time-scale limitations
    if years != None:
        left_urls = [url for url in all_urls if url[0].year in years]
    else:
        left_urls = all_urls

    inputfilename = os.path.split(inputfile)[1]
    resultdir = _genoutdir()  # output lies in the same folder as this program
    aday = []
    k = 1  ## url counter
    j = 0  ## day counter
    n = 0  ## valid day counter
    while k <= len(left_urls):
        url = left_urls[k-1]
        # Group by the full calendar date, not just the day-of-month, so that
        # snapshots from different months never fall into the same group.
        if len(aday) == 0 or url[0].date() == aday[0][0].date():
            aday.append(url)
        if k == len(left_urls) or left_urls[k][0].date() != aday[0][0].date():
            ## process the day to fetch the earliest valid web page
            print("Parsing the day: %s/%s/%s" % (aday[0][0].month, aday[0][0].day, aday[0][0].year))
            j += 1
            dl = len(aday)  ## total url counter for a day
            i = 1  ## url counter for a day
            while i <= dl:
                time.sleep(0.5)
                outputfile = _genoutname(resultdir, inputfilename, retriever.extract_time_string(aday[i-1][1]))
                status = retriever.save_page(aday[i-1][1], outputfile)
                if status == None:
                    i += 1; continue
                else:
                    n += 1; break
            # start next day
            aday = []
        k += 1
    logger.info("Finish downloading.")
    logger.info("File %s: %d/%d valid days processed" % (inputfile, n, j))
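
# A sketch of how the -y year-scale option of runMain() below expands
# (the values are only examples):
#   -y 1999           -> years = [1999]
#   -y 1999-2003      -> years = [1999, 2000, 2001, 2002, 2003]
#   -y 1999,2003-2005 -> years = [1999, 2003, 2004, 2005]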
def runMain():
    parser = argparse.ArgumentParser(description=__about__)
    parser.add_argument("-y", dest='yearscale', type=str, help="Year scale to retrieve, e.g. '1999', '1999-2003' or '1999,2003' (without quotation marks)")
    parser.add_argument("-l", dest="loglevel", default="DEBUG", type=str, help="Log level: DEBUG(default), INFO, WARNING, ERROR, CRITICAL")
    parser.add_argument("urlfile", type=str, help="File containing wayback URLs output by the crawler.")
    args = parser.parse_args()
    loglevel = args.loglevel
    inputfile = args.urlfile
    yearstr = args.yearscale

    if yearstr != None:
        years = []
        for i in yearstr.split(','):
            points = [int(j) for j in i.split('-')]
            if len(points) == 1:
                years.append(points[0])
            elif len(points) > 2:
                print("Wrong year scale. -h for help")
                sys.exit(-1)
            else:
                if points[0] <= points[1]:
                    years += range(points[0], points[1]+1)
                else:
                    years += range(points[1], points[0]+1)
    else:
        years = None

    ## logging
    numeric_level = getattr(logging, loglevel.upper(), None)
    if not isinstance(numeric_level, int):
        # Report the bad level and exit instead of raising, matching the crawler app.
        print('Invalid log level: %s' % loglevel)
        parser.print_help(); sys.exit(-1)

    root_logger.setLevel(numeric_level)

    logger.info("{0}: processing {1}".format(str(datetime.datetime.now()), inputfile))
    retriever_smart(inputfile, years)


if __name__ == '__main__':
    runMain()
--------------------------------------------------------------------------------
/libwayback/__init__.py:
--------------------------------------------------------------------------------
__version__ = '1.0'

from crawler import WaybackCrawler
from retriever import WaybackRetriever
--------------------------------------------------------------------------------
/libwayback/crawler.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Library for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import urllib2
import re
import datetime
import logging
import time
import random

import html5lib
from html5lib import treebuilders


module_logger = logging.getLogger("libwayback.libcrawler")


def _valid_XML_char_ordinal(i):
    ## As per the XML specification, valid chars must be in the range of
    ## Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    ## [Ref] http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python
    return (  # conditions ordered by presumed frequency
        0x20 <= i <= 0xD7FF
        or i in (0x9, 0xA, 0xD)
        or 0xE000 <= i <= 0xFFFD
        or 0x10000 <= i <= 0x10FFFF
    )


class WaybackCrawler(object):

    def __init__(self, url):
        self.server = 'http://wayback.archive.org'
        self.prefix = 'http://wayback.archive.org/web'
        self.url = url
        self.results = {}  # year: [historical_page_urls]


    def get_wayback_page(self, site_url):
        return self.prefix + '/*/' + site_url


    def convert_live_url(self, url):
        """
        Insert 'id_' after the timestamp so the original page content is returned.
        For more information, please refer to "How can I view a page without the Wayback code in it?" at
        http://faq.web.archive.org/page-without-wayback-code/
        """
        pattern = re.compile(r"\/([1-2]\d{3})\d*")
        mres = re.search(pattern, url)
        return url[0:mres.end()] + 'id_' + url[mres.end():]


    def extract_year(self, wayback_url):
        pattern = re.compile(r"\/([1-2]\d{3})\d*")
        mres = re.search(pattern, wayback_url)
        if mres == None:
            return None
        return int(mres.group(1))


    def open_url(self, url):
        ret = None
        try:
            fh = urllib2.urlopen(url)
            if fh.geturl() != url: module_logger.info("Redirected to: %s" % fh.geturl())
            ret = fh.read()
        except urllib2.URLError, reason:
            module_logger.error("%s: %s" % (url, reason))
        return ret


    def _parse_wayback_page(self, page_year):
        """
        Parse all recorded web page URLs for a specific year.
        """
        his_urls = []
        wholepage = self.open_url(page_year)
        # Always return a (year, urls) tuple so callers can unpack it safely.
        if wholepage == None: return (None, his_urls)

        parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))

        try:
            html_doc = parser.parse(wholepage)
        except ValueError:
            # Strip characters that are invalid in XML before reparsing.
            wholepage_clean = ''.join(c for c in wholepage if _valid_XML_char_ordinal(ord(c)))
            html_doc = parser.parse(wholepage_clean)

        # Walk the snapshot calendar: each day's pop-up list items link to the
        # individual archived copies of the page.
        body = html_doc.find("./{*}body")
        position_div = body.find("./{*}div[@id='position']")
        wayback_cal = position_div.find("./{*}div[@id='wbCalendar']")
        calOver = wayback_cal.find("./{*}div[@id='calOver']")
        for month in calOver.findall("./{*}div[@class='month']"):
            for day in month.findall(".//{*}td"):
                day_div = day.find("./{*}div[@class='date tooltip']")
                if day_div != None:
                    for snapshot in day_div.findall("./{*}div[@class='pop']/{*}ul/{*}li"):
                        his_urls.append(snapshot[0].get('href'))

        year = self.extract_year(his_urls[0]) if len(his_urls) > 0 else None

        return (year, his_urls)
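
    # The per-year query URL built by parse() below looks like (year and
    # site are only examples):
    #   http://wayback.archive.org/web/20050601000000*/www.sjtu.edu.cn
    # The trailing wildcard makes the calendar page list the snapshots of
    # that year, which _parse_wayback_page() then walks.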
    def parse(self, live=False):
        """
        Parse the historical urls of a web page over years.
        We first determine the year scale that has valid snapshots.
        @Return: the `results` dict of historical urls keyed by year, or None
        """
        self._parse_called = True

        wayback_page_whole = self.open_url(self.get_wayback_page(self.url))
        if wayback_page_whole == None: return None

        parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
        html_doc = parser.parse(wayback_page_whole)

        position_div = html_doc.find("./{*}body/{*}div[@id='position']")
        sketchinfo = position_div.find("./{*}div[@id='wbSearch']/{*}div[@id='form']/{*}div[@id='wbMeta']/{*}p[@class='wbThis']")
        first_url = sketchinfo.getchildren()[-1].attrib['href']
        first_year = self.extract_year(first_url)

        for year in range(first_year, datetime.datetime.now().year + 1):
            # Be polite to the host server
            time.sleep(random.randint(1, 3))

            # Note: the timestamp in the search url indicates the time scale of the query,
            # e.g., the wildcard * matches all of the items in a specific year.
            # If only * is given, the results of the latest year are returned.
            # Wrong results came back when the month and day numbers were small, like 0101,
            # so a mid-year timestamp is used to match widely.
            wayback_page_year = "%s/%d0601000000*/%s" % (self.prefix, year, self.url)
            page_year, his_urls = self._parse_wayback_page(wayback_page_year)

            # Exclude duplicated items that don't match the year:
            # by default the results of the latest year are returned
            # if some year hasn't been crawled.
            if page_year == None or page_year != year: continue
            module_logger.debug("%s: %d pages found for year %d" % (self.url, len(his_urls), page_year))

            for url in his_urls:
                try:
                    page_year = self.extract_year(url)
                except Exception:
                    module_logger.error("Invalid timestamp of wayback url: %s" % url)
                    continue
                if year == page_year:
                    if live: self.add_item(year, self.convert_live_url(url))
                    else: self.add_item(year, url)

        return self.results


    def add_item(self, year, item):
        # Relative addresses are made absolute against the wayback server.
        if item[:4] != 'http': item = (self.server + item)
        self[str(year)].append(item)


    def __getitem__(self, year):
        try:
            self.results[str(year)]
        except KeyError:
            self.results[str(year)] = []
        return self.results[str(year)]


    def __setitem__(self, year, item):
        self.results[str(year)] = item


    def __len__(self):
        cnt = 0
        for key in self.results:
            cnt += len(self.results[key])
        return cnt



if __name__ == '__main__':
    crawler = WaybackCrawler("www.sjtu.edu.cn")
    crawler.parse(live=False)


# EOF
--------------------------------------------------------------------------------
/libwayback/retriever.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Library for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""

__author__ = 'chenxm'


import argparse
import httplib
import urllib2
import re
import datetime
import math
import unittest
import logging
import time
import os
import os.path
import random
import sys

import html5lib
from html5lib import treebuilders


logger = logging.getLogger("libwayback.waybackretriever")


def _patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner
# Patch HTTPResponse.read to avoid the IncompleteRead exception
httplib.HTTPResponse.read = _patch_http_response_read(httplib.HTTPResponse.read)


class WaybackRetriever(object):

    def __init__(self):
        pass


    def extract_time_string(self, url):
        p = re.compile(r"\/([1-2]\d{3}\d*)")
        pm = re.search(p, url)
        if pm != None:
            return pm.group(1)
        else:
            # Explicit Error Report
            logger.error("Failed to extract the time string. The URL must be in a format like: http://web.archive.org/web/19990117032727/google.com/")
            return None


    def extract_timestamp(self, url):
        """ Parse the (possibly truncated) wayback timestamp of a URL
        into a datetime object.
        """
        p = re.compile(r"\/([1-2]\d{3}\d*)")
        pm = re.search(p, url)
        if pm == None: return None

        timestr = pm.group(1)
        members = ["0"] * 6  ## [year, month, day, hour, minute, second]
        members_limit = [2999, 12, 31, 23, 59, 59]

        i = 0; j = 0
        leap_year = False
        while 0 <= i <= len(timestr) - 1:
            if j == 0:
                # year
                members[j] = timestr[:4]
                i += 4; j += 1
                # Gregorian leap-year rule. (The original read `members[j]`
                # after incrementing j, testing the wrong field.)
                y = int(members[0])
                leap_year = (y % 4 == 0 and y % 100 != 0) or (y % 400 == 0)
                continue
            elif j == 2:
                # day
                if int(members[j-1]) == 2: maxv = 29 if leap_year else 28
                elif int(members[j-1]) in [1, 3, 5, 7, 8, 10, 12]: maxv = 31
                else: maxv = 30
            else:
                # month, hour, minute, second
                maxv = members_limit[j]

            # Greedily take two digits when they form a valid value, else one.
            if 0 <= int(timestr[i:i+2]) <= maxv: offset = 2
            else: offset = 1

            members[j] = timestr[i:i+offset]
            i += offset; j += 1
        # Month and day default to 1 when the timestamp is truncated,
        # since datetime() rejects zero values for them.
        return datetime.datetime(year=int(members[0]), month=max(int(members[1]), 1),
                                 day=max(int(members[2]), 1), hour=int(members[3]),
                                 minute=int(members[4]), second=int(members[5]))
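
    # Illustration of the two extractors above on the sample URL from the
    # error message:
    #   extract_time_string("http://web.archive.org/web/19990117032727/google.com/")
    #       -> "19990117032727"
    #   extract_timestamp("http://web.archive.org/web/19990117032727/google.com/")
    #       -> datetime.datetime(1999, 1, 17, 3, 27, 27)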
    def save_page(self, url, savefile):
        """
        Return the absolute path of savefile on success;
        otherwise, return None.
        """
        url = url.rstrip('\r \n')
        try:
            f = urllib2.urlopen(url)
        except urllib2.URLError, reason:
            ## Multiple reasons may lead to the failure:
            ## the server is down,
            ## the wayback didn't archive this page,
            ## the connection is blocked by a third party.
            logger.error("Open page error: %s: %s" % (reason, url))
            return None
        except:
            logger.error("Open page error: null: %s" % url)
            return None

        try:
            fc = f.read()
        except httplib.IncompleteRead:
            logger.error("Read content error: uncompleted content: %s" % url)
            return None
        except:
            logger.error("Read content error: null: %s" % url)
            return None

        ## check content
        RE_WRONG_CONTENT = r'(Got an HTTP \d* response at crawl time)'

        pattern = re.compile(RE_WRONG_CONTENT)
        mp = re.search(pattern, fc)
        if mp != None:
            # The archive returned an error page; follow the redirect link
            # it contains and save that page instead.
            logger.warning("%s: %s" % (mp.group(1), url))
            parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
            html_doc = parser.parse(fc)
            html_body = html_doc.find("./{*}body")
            div_error = html_body.find(".//{*}div[@id='error']")
            redir_a = div_error.find("./{*}p[@class='impatient']/{*}a")
            redir_url = redir_a.get('href')
            try:
                return self.save_page(redir_url, savefile)
            except EnvironmentError, detail:
                # IOError/OSError carry the errno/strerror the log format expects.
                logger.error("Save redirected page error: [{0}]{1}: {2}".format(detail.errno, detail.strerror, url))
                return None
            except:
                logger.error("Save redirected page error: null: %s" % url)
                return None
        else:
            with open(savefile, 'wb') as out:
                out.write(fc)
            return os.path.abspath(savefile)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
import os
import sys
from setuptools import setup
from libwayback import __version__

setup(
    name = "libwayback",
    version = __version__,
    url = 'https://github.com/caesar0301/libwayback',
    author = 'Jamin Chen',
    author_email = 'chenxm35@gmail.com',
    description = 'A library to parse the Wayback Machine of archive.org to get historical views of web pages.',
    long_description = '''A library to parse the Wayback Machine of archive.org to get historical views of web pages. It is a useful tool for research on the evolution of web pages, page structure analysis, among other interesting topics.''',
    license = "LICENSE",
    packages = ['libwayback'],
    classifiers = [
        'Development Status :: 4 - Beta',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: Freely Distributable',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.6',
        'Programming Language :: Python :: 2.7',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
)
--------------------------------------------------------------------------------