├── CODEOWNERS
├── INSTALL
├── LICENSE
├── MANIFEST.in
├── README.md
├── apps
│   ├── urls.txt
│   ├── wayback_crawler.py
│   └── wayback_retriever.py
├── libwayback
│   ├── __init__.py
│   ├── crawler.py
│   └── retriever.py
└── setup.py
--------------------------------------------------------------------------------
/CODEOWNERS:
--------------------------------------------------------------------------------
* @caesar0301
--------------------------------------------------------------------------------
/INSTALL:
--------------------------------------------------------------------------------
Thanks for downloading libwayback.

To install it, make sure you have Python 2.6 or greater installed. Then run
this command from the command prompt:

    python setup.py install

If you're upgrading from a previous version, you need to remove it first.

AS AN ALTERNATIVE, you can just copy the entire "libwayback" directory to Python's
site-packages directory, which is located wherever your Python installation
lives. Some places you might check are:

    /usr/lib/python2.7/site-packages (Unix, Python 2.7)
    /usr/lib/python2.6/site-packages (Unix, Python 2.6)
    C:\PYTHON\site-packages (Windows)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
#
# Copyright (C) 2013 Xiaming Chen - chenxm35@gmail.com
# All rights reserved.
#
# This file is part of project libwayback.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Neither the name of the copyright holder nor the names of the contributors
# may be used to endorse or promote products derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
include README.md
include LICENSE
include MANIFEST.in
include setup.py
recursive-include apps *.txt
recursive-include apps *.py
recursive-include libwayback *.py
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Archive_crawler
============

![wayback machine](http://archive.org/images/wayback_logo-sm.gif)


A library for parsing the Wayback Machine of the [Internet Archive](http://www.archive.org) to retrieve the historical content of web pages, for research purposes only.

Only the original HTML content of each web page is downloaded, without the embedded web objects.

By xiamingc, SJTU - chenxm35@gmail.com


Requirements
------------

[Python 2.6+ (<3)](http://www.python.org/)

[lxml 2.3+](http://lxml.de/)

[html5lib 0.95+](https://github.com/html5lib)


Programs
------------

`wayback_crawler` -- extracts the URLs of historical snapshots of websites from the Internet Archive for later content download.

`wayback_retriever` -- downloads the page content for the URLs output by `wayback_crawler`.

`libwayback` -- the underlying library supporting the crawler and retriever programs.



Usage of wayback_crawler
------------

If you have Python and the required packages installed, you can run it as a Python script:

    python wayback_crawler.py [-l log_level] urlfile



Usage of wayback_retriever
------------

`wayback_retriever` works on the output of `wayback_crawler`. Given a specific file, you can run the retriever like:

    python wayback_retriever.py [-y year_scale] [-l log_level] urlfile

where the input is an individual file output by the crawler.

The downloaded pages are saved in the folder `retriever_results` under the current working directory.



Usage of libwayback
------------

This library provides basic functions for crawling the Internet Archive. It has a simple structure:

    libwayback
    |____WaybackCrawler
    |____WaybackRetriever

If you want to use libwayback in your project, it is easy to integrate:

    from libwayback import WaybackCrawler, WaybackRetriever

    crawler = WaybackCrawler("www.sjtu.edu.cn")
    crawler.parse(live=False)

    # The `results` attribute of the crawler instance is a dict with
    # a "year" string as the key and a list of page addresses as the value.

    ret = crawler.results

    # Based on the results of the crawler, i.e. specific page addresses, you
    # can use the retriever to download and save them in your file system:

    retriever = WaybackRetriever()

    for year in ret:
        for url in ret[year]:
            retriever.save_page(url, "saved_file")

NOTE:

* The `live` option of `parse()` controls whether the live (original) version of a page is parsed.
  For more about the difference between the modified and original versions, please refer to:
  http://faq.web.archive.org/page-without-wayback-code/ .
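* As a rough illustration of the difference (reusing the sample URL from the
  retriever's error messages), a modified-version address such as

        http://web.archive.org/web/19990117032727/google.com/

  becomes the following original-version address after `convert_live_url()`
  inserts `id_` behind the timestamp:

        http://web.archive.org/web/19990117032727id_/google.com/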


About logs
------------

libwayback:

    ERROR: "Invalid timestamp of wayback url: %s"
    Meaning: the regular expression cannot extract the year number from the historical URL.
    Solution: check the URL manually to find the cause of the error.
    Frequency: ~ 0%


wayback_retriever:

    ERROR: "Failed to extract the time string. The URL must be in a format like: http://web.archive.org/web/19990117032727/google.com/"
    Meaning: the regular expression cannot extract the year number from the historical URL.
    Solution: check the URL manually to find the cause of the error.
    Frequency: ~ 0%

    ERROR: "Open page error: %s: %s"
    Meaning: urllib2 cannot open this URL. Multiple reasons may lead to the failure: the Wayback server is down,
    the connection is blocked by a third party, or something else.
    Solution: check the URL manually, or rerun the program at another time.
    Frequency: ~ 14%

    ERROR: "Read content error: %s: %s"
    Meaning: reading from the file object returned by urllib2.urlopen() failed, usually because the content arrived incomplete.
    Solution: check the URL manually.
    Frequency: ~ 0%

    ERROR: "Save redirected page error: [{0}]{1}: {2}"
    Meaning: saving the redirected page found in the first download failed.
    Solution: check the URL manually.
    Frequency: ~ 0.1%

    ERROR: "Fail to extract timestamp: %s"
    Meaning: the regular expression cannot match the exact year, month, day, hour, minute, and second numbers in the URL.
    This is a strict matching action.
    Solution: check the URL manually.
    Frequency: ~ 0%
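
The two apps above configure this logging for you. If you embed `libwayback`
directly, a minimal sketch for capturing its log output yourself (the handler
and file name here are only examples) is:

    import logging

    # libwayback emits records through the "libwayback" logger hierarchy,
    # so attaching a handler there captures all of the errors listed above.
    logger = logging.getLogger("libwayback")
    logger.setLevel(logging.ERROR)
    logger.addHandler(logging.FileHandler("wayback_errors.txt"))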
--------------------------------------------------------------------------------
/apps/urls.txt:
--------------------------------------------------------------------------------
sjtu.edu.cn
--------------------------------------------------------------------------------
/apps/wayback_crawler.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Main crawler for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import argparse, logging, os, re, sys

from libwayback import WaybackCrawler


# global logger settings
root_logger = logging.getLogger("libwayback")  # use the default 'libwayback' logger to activate the logging functions of libwayback
root_logger.setLevel(logging.ERROR)
handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_crawler_log.txt"))
console = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
root_logger.addHandler(handler)
root_logger.addHandler(console)


def _sanitize_name(fname):
    # Replace characters that are unsafe in file names with underscores.
    return re.sub(r'[\.\/\\\?#&]', "_", fname.strip('\r \n'))

def _abspath(where):
    abswhere = os.path.abspath(where)
    if not os.path.exists(abswhere): os.makedirs(abswhere)
    return abswhere

def dump_results(url, results, where='.'):
    # Write all historical URLs of `url` (grouped by year in `results`)
    # into a single text file, one URL per line.
    fn = os.path.join(_abspath(where), _sanitize_name(url) + '.txt')
    with open(fn, 'w') as f:
        for key in results:
            for line in results[key]:
                f.write(line + '\n')
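
# For example (hypothetical call), dump_results("www.sjtu.edu.cn", crawler.results)
# sanitizes the URL with _sanitize_name, turning each "." into "_", and writes
# every crawled snapshot address into ./www_sjtu_edu_cn.txt.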

__about__ = \
"""
Archive_crawler - Version 0.1
Program to parse historical urls of web page from archive.org.
Copyright (c) 2012 xiamingc, SJTU - chenxm35@gmail.com
"""


def runMain():
    print(__about__)

    # Input arguments
    parser = argparse.ArgumentParser(description="Parse the historical version URLs of a web page in Wayback (http://wayback.archive.org/).")
    parser.add_argument('-l', dest="loglevel", default="INFO", type=str, help="Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL")
    parser.add_argument("urllist", type=str, help="A plain file containing the addresses of web pages.")
    args = parser.parse_args()
    urllist = args.urllist
    loglevel = args.loglevel

    if not urllist: parser.print_help(); sys.exit(-1)

    # Logging
    numeric_level = getattr(logging, loglevel.upper(), None)
    if not isinstance(numeric_level, int):
        print('Invalid log level: %s' % loglevel)
        parser.print_help(); sys.exit(-1)

    # set logger level
    root_logger.setLevel(numeric_level)

    logger = logging.getLogger("libwayback.app_crawler")

    for url in open(urllist):
        url = url.rstrip('\r\n')
        if url == '': continue
        # Construct the crawler before the try block so the `finally` clause
        # never references an unbound (or stale) `crawler` if construction fails.
        crawler = WaybackCrawler(url)
        try:
            logger.info('Start parsing: %s' % url)
            crawler.parse()
            logger.info('Finish parsing: %s' % url)
        finally:
            if crawler.results:
                logger.info('Dump records of %s to file %s' % (url, url + '.txt'))
                dump_results(url, crawler.results)


if __name__ == '__main__':
    runMain()

# EOF
--------------------------------------------------------------------------------
/apps/wayback_retriever.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Subtools for Archive_crawler program.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import argparse
import logging
import os
import datetime
import sys
import time

from libwayback import WaybackRetriever

# global logger settings
root_logger = logging.getLogger("libwayback")  # use the default 'libwayback' logger to activate the logging functions of libwayback
root_logger.setLevel(logging.ERROR)
handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_retriever_log.txt"))
console = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
root_logger.addHandler(handler)
root_logger.addHandler(console)

logger = logging.getLogger("libwayback.app_retriever")


__about__ = """
Program to retrieve the HTML web pages of the URLs generated by wayback_crawler.py.
"""


def _genoutdir(where='.'):
    outputdir = os.path.join(where, "retriever_results")
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)
    return outputdir

def _genoutname(outputfolder, inputfilename, timestr):
    """ Generate an output file name like "[inputfilename]_[urltimestamp].txt"
    """
    name = inputfilename.rsplit('.', 1)[0]
    newdir = os.path.join(outputfolder, name)
    if not os.path.exists(newdir):
        os.makedirs(newdir)
    outputfilename = "%s_%s.txt" % (name, timestr)
    return os.path.join(newdir, outputfilename)


def retriever_smart(inputfile, years=None, days=None):
    """ Download the snapshots listed in `inputfile`, keeping at most the
    earliest valid page per day, optionally restricted to `years`.
    """
    logger.info("Start downloading URLs of %s" % inputfile)

    retriever = WaybackRetriever()

    all_urls = []
    for line in open(inputfile, 'rb'):
        line = line.rstrip('\r\n')
        if line == '': continue
        timestamp = retriever.extract_timestamp(line)
        if timestamp == None:
            logger.error("Fail to extract timestamp: %s" % line)
            continue
        all_urls.append((timestamp, line))
    all_urls.sort(key=lambda x: x[0])

    ## Process the time-scale limitations
    if years != None:
        left_urls = [url for url in all_urls if url[0].year in years]
    else:
        left_urls = all_urls

    inputfilename = os.path.split(inputfile)[1]
    resultdir = _genoutdir()  # output lies in the same folder as this program
    aday = []
    k = 1  ## url counter
    j = 0  ## day counter
    n = 0  ## valid day counter
    while k <= len(left_urls):
        url = left_urls[k-1]
        # Group by the full calendar date, not just the day-of-month, so that
        # snapshots from different months never fall into the same group.
        if len(aday) == 0 or url[0].date() == aday[0][0].date():
            aday.append(url)
        if k == len(left_urls) or left_urls[k][0].date() != aday[0][0].date():
            ## process the day to fetch the earliest valid web page
            print("Parsing the day: %s/%s/%s" % (aday[0][0].month, aday[0][0].day, aday[0][0].year))
            j += 1
            dl = len(aday)  ## total url counter for a day
            i = 1  ## url counter for a day
            while i <= dl:
                time.sleep(0.5)
                outputfile = _genoutname(resultdir, inputfilename, retriever.extract_time_string(aday[i-1][1]))
                status = retriever.save_page(aday[i-1][1], outputfile)
                if status == None:
                    i += 1; continue
                else:
                    n += 1; break
            # start next day
            aday = []
        k += 1
    logger.info("Finish downloading.")
    logger.info("File %s: %d/%d valid days processed" % (inputfile, n, j))
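
# A sketch of how the -y year-scale option of runMain() below expands
# (the values are only examples):
#   -y 1999           -> years = [1999]
#   -y 1999-2003      -> years = [1999, 2000, 2001, 2002, 2003]
#   -y 1999,2003-2005 -> years = [1999, 2003, 2004, 2005]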
def runMain():
    parser = argparse.ArgumentParser(description=__about__)
    parser.add_argument("-y", dest='yearscale', type=str, help="Year scale to retrieve, e.g. '1999', '1999-2003' or '1999,2003' (without quotation marks)")
    parser.add_argument("-l", dest="loglevel", default="DEBUG", type=str, help="Log level: DEBUG(default), INFO, WARNING, ERROR, CRITICAL")
    parser.add_argument("urlfile", type=str, help="File containing wayback URLs output by the crawler.")
    args = parser.parse_args()
    loglevel = args.loglevel
    inputfile = args.urlfile
    yearstr = args.yearscale

    if yearstr != None:
        years = []
        for i in yearstr.split(','):
            points = [int(j) for j in i.split('-')]
            if len(points) == 1:
                years.append(points[0])
            elif len(points) > 2:
                print("Wrong year scale. -h for help")
                sys.exit(-1)
            else:
                if points[0] <= points[1]:
                    years += range(points[0], points[1]+1)
                else:
                    years += range(points[1], points[0]+1)
    else:
        years = None

    ## logging
    numeric_level = getattr(logging, loglevel.upper(), None)
    if not isinstance(numeric_level, int):
        # Report the bad level and exit instead of raising, matching the crawler app.
        print('Invalid log level: %s' % loglevel)
        parser.print_help(); sys.exit(-1)

    root_logger.setLevel(numeric_level)

    logger.info("{0}: processing {1}".format(str(datetime.datetime.now()), inputfile))
    retriever_smart(inputfile, years)


if __name__ == '__main__':
    runMain()
--------------------------------------------------------------------------------
/libwayback/__init__.py:
--------------------------------------------------------------------------------
__version__ = '1.0'

from crawler import WaybackCrawler
from retriever import WaybackRetriever
--------------------------------------------------------------------------------
/libwayback/crawler.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Library for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
import urllib2
import re
import datetime
import logging
import time
import random

import html5lib
from html5lib import treebuilders


module_logger = logging.getLogger("libwayback.libcrawler")


def _valid_XML_char_ordinal(i):
    ## As per the XML specification, valid chars must be in the range of
    ## Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    ## [Ref] http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python
    return (  # conditions ordered by presumed frequency
        0x20 <= i <= 0xD7FF
        or i in (0x9, 0xA, 0xD)
        or 0xE000 <= i <= 0xFFFD
        or 0x10000 <= i <= 0x10FFFF
    )


class WaybackCrawler(object):

    def __init__(self, url):
        self.server = 'http://wayback.archive.org'
        self.prefix = 'http://wayback.archive.org/web'
        self.url = url
        self.results = {}  # year: [historical_page_urls]


    def get_wayback_page(self, site_url):
        return self.prefix + '/*/' + site_url


    def convert_live_url(self, url):
        """
        Insert 'id_' after the timestamp so the original page content is returned.
        For more information, please refer to "How can I view a page without the Wayback code in it?" at
        http://faq.web.archive.org/page-without-wayback-code/
        """
        pattern = re.compile(r"\/([1-2]\d{3})\d*")
        mres = re.search(pattern, url)
        return url[0:mres.end()] + 'id_' + url[mres.end():]


    def extract_year(self, wayback_url):
        pattern = re.compile(r"\/([1-2]\d{3})\d*")
        mres = re.search(pattern, wayback_url)
        if mres == None:
            return None
        return int(mres.group(1))


    def open_url(self, url):
        ret = None
        try:
            fh = urllib2.urlopen(url)
            if fh.geturl() != url: module_logger.info("Redirected to: %s" % fh.geturl())
            ret = fh.read()
        except urllib2.URLError, reason:
            module_logger.error("%s: %s" % (url, reason))
        return ret


    def _parse_wayback_page(self, page_year):
        """
        Parse all recorded web page URLs for a specific year.
        """
        his_urls = []
        wholepage = self.open_url(page_year)
        # Always return a (year, urls) tuple so callers can unpack it safely.
        if wholepage == None: return (None, his_urls)

        parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))

        try:
            html_doc = parser.parse(wholepage)
        except ValueError:
            # Strip characters that are invalid in XML before reparsing.
            wholepage_clean = ''.join(c for c in wholepage if _valid_XML_char_ordinal(ord(c)))
            html_doc = parser.parse(wholepage_clean)

        # Walk the snapshot calendar: each day's pop-up list items link to the
        # individual archived copies of the page.
        body = html_doc.find("./{*}body")
        position_div = body.find("./{*}div[@id='position']")
        wayback_cal = position_div.find("./{*}div[@id='wbCalendar']")
        calOver = wayback_cal.find("./{*}div[@id='calOver']")
        for month in calOver.findall("./{*}div[@class='month']"):
            for day in month.findall(".//{*}td"):
                day_div = day.find("./{*}div[@class='date tooltip']")
                if day_div != None:
                    for snapshot in day_div.findall("./{*}div[@class='pop']/{*}ul/{*}li"):
                        his_urls.append(snapshot[0].get('href'))

        year = self.extract_year(his_urls[0]) if len(his_urls) > 0 else None

        return (year, his_urls)
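
    # The per-year query URL built by parse() below looks like (year and
    # site are only examples):
    #   http://wayback.archive.org/web/20050601000000*/www.sjtu.edu.cn
    # The trailing wildcard makes the calendar page list the snapshots of
    # that year, which _parse_wayback_page() then walks.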
    def parse(self, live=False):
        """
        Parse the historical urls of a web page over years.
        We first determine the year scale that has valid snapshots.
        @Return: the `results` dict of historical urls keyed by year, or None
        """
        self._parse_called = True

        wayback_page_whole = self.open_url(self.get_wayback_page(self.url))
        if wayback_page_whole == None: return None

        parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
        html_doc = parser.parse(wayback_page_whole)

        position_div = html_doc.find("./{*}body/{*}div[@id='position']")
        sketchinfo = position_div.find("./{*}div[@id='wbSearch']/{*}div[@id='form']/{*}div[@id='wbMeta']/{*}p[@class='wbThis']")
        first_url = sketchinfo.getchildren()[-1].attrib['href']
        first_year = self.extract_year(first_url)

        for year in range(first_year, datetime.datetime.now().year + 1):
            # Be polite to the host server
            time.sleep(random.randint(1, 3))

            # Note: the timestamp in the search url indicates the time scale of the query,
            # e.g., the wildcard * matches all of the items in a specific year.
            # If only * is given, the results of the latest year are returned.
            # Wrong results came back when the month and day numbers were small, like 0101,
            # so a mid-year timestamp is used to match widely.
            wayback_page_year = "%s/%d0601000000*/%s" % (self.prefix, year, self.url)
            page_year, his_urls = self._parse_wayback_page(wayback_page_year)

            # Exclude duplicated items that don't match the year:
            # by default the results of the latest year are returned
            # if some year hasn't been crawled.
            if page_year == None or page_year != year: continue
            module_logger.debug("%s: %d pages found for year %d" % (self.url, len(his_urls), page_year))

            for url in his_urls:
                try:
                    page_year = self.extract_year(url)
                except Exception:
                    module_logger.error("Invalid timestamp of wayback url: %s" % url)
                    continue
                if year == page_year:
                    if live: self.add_item(year, self.convert_live_url(url))
                    else: self.add_item(year, url)

        return self.results


    def add_item(self, year, item):
        # Relative addresses are made absolute against the wayback server.
        if item[:4] != 'http': item = (self.server + item)
        self[str(year)].append(item)


    def __getitem__(self, year):
        try:
            self.results[str(year)]
        except KeyError:
            self.results[str(year)] = []
        return self.results[str(year)]


    def __setitem__(self, year, item):
        self.results[str(year)] = item


    def __len__(self):
        cnt = 0
        for key in self.results:
            cnt += len(self.results[key])
        return cnt



if __name__ == '__main__':
    crawler = WaybackCrawler("www.sjtu.edu.cn")
    crawler.parse(live=False)


# EOF
--------------------------------------------------------------------------------
/libwayback/retriever.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
"""
Library for program Archive_crawler.
Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""

__author__ = 'chenxm'


import argparse
import httplib
import urllib2
import re
import datetime
import math
import unittest
import logging
import time
import os
import os.path
import random
import sys

import html5lib
from html5lib import treebuilders


logger = logging.getLogger("libwayback.waybackretriever")


def _patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner
# Patch HTTPResponse.read to avoid the IncompleteRead exception
httplib.HTTPResponse.read = _patch_http_response_read(httplib.HTTPResponse.read)


class WaybackRetriever(object):

    def __init__(self):
        pass


    def extract_time_string(self, url):
        p = re.compile(r"\/([1-2]\d{3}\d*)")
        pm = re.search(p, url)
        if pm != None:
            return pm.group(1)
        else:
            # Explicit Error Report
            logger.error("Failed to extract the time string. The URL must be in a format like: http://web.archive.org/web/19990117032727/google.com/")
            return None


    def extract_timestamp(self, url):
        """ Parse the (possibly truncated) wayback timestamp of a URL
        into a datetime object.
        """
        p = re.compile(r"\/([1-2]\d{3}\d*)")
        pm = re.search(p, url)
        if pm == None: return None

        timestr = pm.group(1)
        members = ["0"] * 6  ## [year, month, day, hour, minute, second]
        members_limit = [2999, 12, 31, 23, 59, 59]

        i = 0; j = 0
        leap_year = False
        while 0 <= i <= len(timestr) - 1:
            if j == 0:
                # year
                members[j] = timestr[:4]
                i += 4; j += 1
                # Gregorian leap-year rule. (The original read `members[j]`
                # after incrementing j, testing the wrong field.)
                y = int(members[0])
                leap_year = (y % 4 == 0 and y % 100 != 0) or (y % 400 == 0)
                continue
            elif j == 2:
                # day
                if int(members[j-1]) == 2: maxv = 29 if leap_year else 28
                elif int(members[j-1]) in [1, 3, 5, 7, 8, 10, 12]: maxv = 31
                else: maxv = 30
            else:
                # month, hour, minute, second
                maxv = members_limit[j]

            # Greedily take two digits when they form a valid value, else one.
            if 0 <= int(timestr[i:i+2]) <= maxv: offset = 2
            else: offset = 1

            members[j] = timestr[i:i+offset]
            i += offset; j += 1
        # Month and day default to 1 when the timestamp is truncated,
        # since datetime() rejects zero values for them.
        return datetime.datetime(year=int(members[0]), month=max(int(members[1]), 1),
                                 day=max(int(members[2]), 1), hour=int(members[3]),
                                 minute=int(members[4]), second=int(members[5]))
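
    # Illustration of the two extractors above on the sample URL from the
    # error message:
    #   extract_time_string("http://web.archive.org/web/19990117032727/google.com/")
    #       -> "19990117032727"
    #   extract_timestamp("http://web.archive.org/web/19990117032727/google.com/")
    #       -> datetime.datetime(1999, 1, 17, 3, 27, 27)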
    def save_page(self, url, savefile):
        """
        Return the absolute path of savefile on success;
        otherwise, return None.
        """
        url = url.rstrip('\r \n')
        try:
            f = urllib2.urlopen(url)
        except urllib2.URLError, reason:
            ## Multiple reasons may lead to the failure:
            ## the server is down,
            ## the wayback didn't archive this page,
            ## the connection is blocked by a third party.
            logger.error("Open page error: %s: %s" % (reason, url))
            return None
        except:
            logger.error("Open page error: null: %s" % url)
            return None

        try:
            fc = f.read()
        except httplib.IncompleteRead:
            logger.error("Read content error: uncompleted content: %s" % url)
            return None
        except:
            logger.error("Read content error: null: %s" % url)
            return None

        ## check content
        RE_WRONG_CONTENT = r'(Got an HTTP \d* response at crawl time)'

        pattern = re.compile(RE_WRONG_CONTENT)
        mp = re.search(pattern, fc)
        if mp != None:
            # The archive returned an error page; follow the redirect link
            # it contains and save that page instead.
            logger.warning("%s: %s" % (mp.group(1), url))
            parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
            html_doc = parser.parse(fc)
            html_body = html_doc.find("./{*}body")
            div_error = html_body.find(".//{*}div[@id='error']")
            redir_a = div_error.find("./{*}p[@class='impatient']/{*}a")
            redir_url = redir_a.get('href')
            try:
                return self.save_page(redir_url, savefile)
            except EnvironmentError, detail:
                # IOError/OSError carry the errno/strerror the log format expects.
                logger.error("Save redirected page error: [{0}]{1}: {2}".format(detail.errno, detail.strerror, url))
                return None
            except:
                logger.error("Save redirected page error: null: %s" % url)
                return None
        else:
            with open(savefile, 'wb') as out:
                out.write(fc)
            return os.path.abspath(savefile)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
import os
import sys
from setuptools import setup
from libwayback import __version__

setup(
    name = "libwayback",
    version = __version__,
    url = 'https://github.com/caesar0301/libwayback',
    author = 'Jamin Chen',
    author_email = 'chenxm35@gmail.com',
    description = 'A library to parse the Wayback Machine of archive.org to get historical views of web pages.',
    long_description = '''A library to parse the Wayback Machine of archive.org to get historical views of web pages. It is a useful tool for research on the evolution of web pages, page structure analysis, among other interesting topics.''',
    license = "LICENSE",
    packages = ['libwayback'],
    classifiers = [
        'Development Status :: 4 - Beta',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: Freely Distributable',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.6',
        'Programming Language :: Python :: 2.7',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
)
--------------------------------------------------------------------------------