├── CODEOWNERS
├── INSTALL
├── LICENSE
├── MANIFEST.in
├── README.md
├── apps
│   ├── urls.txt
│   ├── wayback_crawler.py
│   └── wayback_retriever.py
├── libwayback
│   ├── __init__.py
│   ├── crawler.py
│   └── retriever.py
└── setup.py
/CODEOWNERS:
--------------------------------------------------------------------------------
1 | * @caesar0301
2 |
--------------------------------------------------------------------------------
/INSTALL:
--------------------------------------------------------------------------------
1 | Thanks for downloading libwayback.
2 |
3 | To install it, make sure you have Python 2.6 or 2.7 (not 3.x) installed. Then run
4 | this command from the command prompt:
5 |
6 | python setup.py install
7 |
8 | If you're upgrading from a previous version, you need to remove it first.
9 |
10 | AS AN ALTERNATIVE, you can just copy the entire "libwayback" directory to Python's
11 | site-packages directory, which is located wherever your Python installation
12 | lives. Some places you might check are:
13 |
14 | /usr/lib/python2.7/site-packages (Unix, Python 2.7)
15 | /usr/lib/python2.6/site-packages (Unix, Python 2.6)
16 | C:\PYTHON\site-packages (Windows)
17 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright (C) 2013 Xiaming Chen - chenxm35@gmail.com
3 | # All rights reserved.
4 | #
5 | # This file is part of project libwayback.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU General Public License
18 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
19 | #
20 | # Neither the name of the copyright holder nor the names of the contributors
21 | # may be used to endorse or promote products derived from this software without
22 | # specific prior written permission.
23 | #
24 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
25 | # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
26 | # WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
27 | # DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
28 | # ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
29 | # (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
30 | # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
31 | # ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
32 | # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
33 | # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
34 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | include LICENSE
3 | include MANIFEST.in
4 | include setup.py
5 | recursive-include apps *.txt
6 | recursive-include apps *.py
7 | recursive-include libwayback *.py
8 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Archive_crawler
2 | ============
3 |
4 | 
5 |
6 |
7 | A library for parsing the Wayback Machine of the [Internet Archive](http://www.archive.org) to get the historical content of web pages, for research purposes only.
8 |
9 | Only the original HTML content of each web page is downloaded, without the embedded web objects.
10 |
11 | By xiamingc, SJTU - chenxm35@gmail.com
12 |
13 |
14 | Requirements
15 | ------------
16 |
17 | [Python 2.6+ (<3)](http://www.python.org/)
18 |
19 | [lxml 2.3+](http://lxml.de/)
20 |
21 | [html5lib 0.95+](https://github.com/html5lib)
22 |
23 |
24 | Programs
25 | ------------
26 |
27 | `wayback_crawler` -- extracts the historical URLs of websites from the Internet Archive for later content download.
28 |
29 | `wayback_retriever` -- downloads the page content from the URLs output by `wayback_crawler`.
30 |
31 | `libwayback` -- the underlying library supporting the crawler and retriever programs.
32 |
33 |
34 |
35 | Usage of wayback_crawler
36 | ------------
37 |
38 | If you have Python and the required packages installed, you can run it as a Python script:
39 |
40 | python wayback_crawler.py [-l log_level] urlfile
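
For example, assuming an input file `urls.txt` containing one page address per line (see `apps/urls.txt`):

    python wayback_crawler.py -l INFO urls.txt

The crawler dumps the parsed URLs of each input address into a `<sanitized_url>.txt` file in the current working directory.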
41 |
42 |
43 |
44 | Usage of wayback_retriever
45 | ------------
46 |
47 | The `wayback_retriever` works on the output of `wayback_crawler`. Given a specific file, you can run the retriever like:
48 |
49 | python wayback_retriever.py [-y year_scale] [-l log_level] urlfile
50 |
51 | where the input is an individual file output by the crawler.
52 |
53 | The downloaded pages are saved in the folder `retriever_results` in the current working directory.
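
For example, to retrieve only the snapshots of the years 1999 through 2003 (the input file name below is just an illustration of a crawler output file):

    python wayback_retriever.py -y 1999-2003 www_sjtu_edu_cn.txt

Each page is saved as `retriever_results/<name>/<name>_<timestamp>.txt`, where `<name>` is the input file name without its extension.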
54 |
55 |
56 |
57 | Usage of libwayback
58 | ------------
59 |
60 | This library provides basic functions for crawling the Internet Archive. It has a simple structure:
61 |
62 | libwayback
63 | |____WaybackCrawler
64 | |____WaybackRetriever
65 |
66 | If you want to use libwayback in your project, it's easy to integrate:
67 |
68 | from libwayback import WaybackCrawler, WaybackRetriever
69 |
70 | crawler = WaybackCrawler("www.sjtu.edu.cn")
71 | crawler.parse(live=False)
72 |
73 | # The `results` attribute of the crawler instance is a dict with a year
74 | # string (e.g. "2005") as the key and a list of page addresses as the value.
75 |
76 | ret = crawler.results
77 |
78 | # Based on the results of the crawler, i.e. specific page addresses, you can
79 | # use the retriever to download and save them to your file system:
80 |
81 | retriever = WaybackRetriever()
82 |
83 | for year in ret:
84 |     for url in ret[year]:
85 |         retriever.save_page(url, "saved_file")
86 |
87 | NOTE:
88 |
89 | * The `live` option of `parse()` makes the crawler collect the original ("live") version of a page.
90 | For more about the difference between the modified and original versions, please refer to:
91 | http://faq.web.archive.org/page-without-wayback-code/ .
92 |
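Concretely, `convert_live_url` inserts the `id_` flag right after the timestamp of a snapshot
address, which asks the Wayback server for the unmodified capture. The URLs below are only
illustrative:

    http://web.archive.org/web/19990117032727/http://google.com/
    http://web.archive.org/web/19990117032727id_/http://google.com/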
93 |
94 |
95 | About logs
96 | ------------
97 |
98 | libwayback:
99 |
100 | ERROR: "Invalid timestamp of wayback url: %s"
101 | Meaning: the regular expression can't match the year number in the historical URL.
102 | Solution: check the URL manually to find the cause.
103 | Frequency: ~ 0%
104 |
105 |
106 | wayback_retriever:
107 |
108 | ERROR: "Failed to extract the time string. URL example: http://web.archive.org/web/19990117032727/google.com/"
109 | Meaning: the regular expression can't match the timestamp in the historical URL.
110 | Solution: check the URL manually to find the cause.
111 | Frequency: ~ 0%
112 |
113 | ERROR: "Open page error: %s: %s"
114 | Meaning: urllib2 can't open the URL. Multiple reasons may lead to this failure: the wayback server is down, the connection
115 | is blocked by a third party, or something else.
116 | Solution: check the URL manually or rerun the program later (a small retry sketch is given at the end of this section).
117 | Frequency: ~ 14%
118 |
119 | ERROR: "Read content error: %s: %s"
120 | Meaning: error when reading the file object returned by urllib2.urlopen(), e.g. due to incomplete content.
121 | Solution: check the URL manually.
122 | Frequency: ~ 0%
123 |
124 | ERROR: "Save redirected page error: [{0}]{1}: {2}"
125 | Meaning: failed to save the redirected page pointed to by the first download.
126 | Solution: check the URL manually.
127 | Frequency: ~ 0.1%
128 |
129 | ERROR: "Fail to extract timestamp: %s"
130 | Meaning: the regular expression can't extract the exact year, month, day, hour, minute, and second numbers from the URL.
131 | This is a strict match.
132 | Solution: check the URL manually.
133 | Frequency: ~ 0%
134 |
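For the transient "Open page error" failures above, wrapping `save_page` in a small retry loop
can help. This is a minimal sketch (not part of the library), assuming `retriever` is a
`WaybackRetriever` instance:

    import time

    def save_page_with_retry(retriever, url, savefile, attempts=3, delay=5):
        # save_page returns None on failure, so retry a few times with a growing pause
        for i in range(attempts):
            path = retriever.save_page(url, savefile)
            if path is not None:
                return path
            time.sleep(delay * (i + 1))
        return None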
135 |
136 |
137 |
138 |
--------------------------------------------------------------------------------
/apps/urls.txt:
--------------------------------------------------------------------------------
1 | sjtu.edu.cn
--------------------------------------------------------------------------------
/apps/wayback_crawler.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | """
3 | Main crawler for program Archive_crawler.
4 | Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU General Public License for more details.
15 |
16 | You should have received a copy of the GNU General Public License
17 | along with this program. If not, see <http://www.gnu.org/licenses/>.
18 | """
19 | import argparse, logging, os, re, sys
20 |
21 | from libwayback import WaybackCrawler
22 |
23 |
24 | # global logger settings
25 | root_logger = logging.getLogger("libwayback") # use the default 'libwayback' logger to activate the logging functions of libwayback
26 | root_logger.setLevel(logging.ERROR)
27 | handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_crawler_log.txt"))
28 | console = logging.StreamHandler()
29 | handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
30 | root_logger.addHandler(handler)
31 | root_logger.addHandler(console)
32 |
33 |
34 | def _sanitize_name(fname):
35 | return re.sub(r'[\.\/\\\?#&]', "_", fname.strip('\r \n'))
36 |
37 | def _abspath(where):
38 | abswhere = os.path.abspath(where)
39 | if not os.path.exists(abswhere): os.makedirs(abswhere)
40 | return abswhere
41 |
42 | def dump_results(url, results, where = '.'):
43 | fn = os.path.join(_abspath(where), _sanitize_name(url)+'.txt')
44 | with open(fn, 'w') as f:
45 | for key in results:
46 | for line in results[key]:
47 | f.write(line+'\n')
48 |
49 |
50 | __about__ = \
51 | """
52 | Archive_crawler - Version 0.1
53 | Program to parse historical urls of web page from archive.org.
54 | Copyright (c) 2012 xiamingc, SJTU - chenxm35@gmail.com
55 | """
56 |
57 |
58 | def runMain():
59 | print (__about__)
60 |
61 | # Input arguments
62 |     parser = argparse.ArgumentParser(description="Parse the historical version URLs of a web page in Wayback (http://wayback.archive.org/).")
63 | parser.add_argument('-l', dest="loglevel", default="INFO", type=str, help="Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL")
64 | parser.add_argument("urllist", type=str, help="A plain file containing the addresses of web pages.")
65 | args = parser.parse_args()
66 | urllist = args.urllist
67 | loglevel = args.loglevel
68 |
69 | if not urllist: parser.print_help(); sys.exit(-1)
70 |
71 | # Logging
72 | numeric_level = getattr(logging, loglevel.upper(), None)
73 | if not isinstance(numeric_level, int):
74 | print('Invalid log level: %s' % loglevel)
75 |         parser.print_help(); sys.exit(-1)
76 |
77 | # set logger level
78 | root_logger.setLevel(numeric_level)
79 |
80 | logger = logging.getLogger("libwayback.app_crawler")
81 |
82 | for url in open(urllist):
83 | url = url.rstrip('\r\n')
84 | if url == '': continue
85 |         crawler = WaybackCrawler(url)
86 |         try:
87 |             logger.info('Start parsing: %s' % url)
88 |             crawler.parse()
89 |             logger.info('Finish parsing: %s' % url)
90 |         finally:
91 |             if crawler.results:
92 |                 logger.info('Dump records of %s to file %s' % (url, url+'.txt'))
93 |                 dump_results(url, crawler.results)
94 |
95 |
96 | if __name__ == '__main__':
97 | runMain()
98 |
99 | # EOF
--------------------------------------------------------------------------------
/apps/wayback_retriever.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | """
3 | Subtools for Archive_crawler program.
4 | Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU General Public License for more details.
15 |
16 | You should have received a copy of the GNU General Public License
17 | along with this program. If not, see <http://www.gnu.org/licenses/>.
18 | """
19 | import argparse
20 | import logging
21 | import os
22 | import datetime
23 | import sys
24 | import time
25 |
26 | from libwayback import WaybackRetriever
27 |
28 | # global logger settings
29 | root_logger = logging.getLogger("libwayback") # use the default 'libwayback' logger to activate the logging functions of libwayback
30 | root_logger.setLevel(logging.ERROR)
31 | handler = logging.FileHandler(os.path.join(os.path.dirname(__file__), "app_retriever_log.txt"))
32 | console = logging.StreamHandler()
33 | handler.setFormatter(logging.Formatter("%(asctime)-15s %(module)s/%(lineno)d %(levelname)s: %(message)s"))
34 | root_logger.addHandler(handler)
35 | root_logger.addHandler(console)
36 |
37 | logger = logging.getLogger("libwayback.app_retriever")
38 |
39 |
40 | __about__ = """
41 | Program to retrieve the HTML web pages of the URLs generated by wayback_crawler.py.
42 | """
43 |
44 |
45 | def _genoutdir(where = '.'):
46 | outputdir = os.path.join(where, "retriever_results")
47 | if not os.path.exists(outputdir):
48 | os.makedirs(outputdir)
49 | return outputdir
50 |
51 | def _genoutname(outputfolder, inputfilename, timestr):
52 | """ Generate output file name like "[inputfilename]_[urltimestamp].txt"
53 | """
54 | name = inputfilename.rsplit('.', 1)[0]
55 | newdir = os.path.join(outputfolder, name)
56 | if not os.path.exists(newdir):
57 | os.makedirs(newdir)
58 | outputfilename = "%s_%s.txt" % (name, timestr)
59 | return os.path.join(newdir, outputfilename)
60 |
61 |
62 | def retriever_smart(inputfile, years = None, days = None):
63 | logger.info("Start downloading URLs of %s" % inputfile)
64 |
65 | retriever = WaybackRetriever()
66 |
67 | all_urls = []
68 | for line in open(inputfile, 'rb'):
69 |         line = line.rstrip('\r\n')
70 | if line =='': continue
71 | timestamp = retriever.extract_timestamp(line)
72 | if timestamp == None:
73 | logger.error("Fail to extract timestamp: %s" % line)
74 | continue
75 | all_urls.append((timestamp, line))
76 |     all_urls.sort(key=lambda x: x[0])
77 |
78 | ## Process the time-scale limitations
79 | if years != None:
80 | left_urls = [url for url in all_urls if url[0].year in years ]
81 | else:
82 | left_urls = all_urls
83 |
84 | inputfilename = os.path.split(inputfile)[1]
85 | resultdir = _genoutdir() # output lies in the same folder with this program
86 | aday = []
87 | k = 1 ## url counter
88 | j = 0 ## day counter
89 | n = 0 ## valid day counter
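    ## Group the time-sorted URLs by day: for each day, try its snapshots in
    ## chronological order and keep the first one that downloads successfully.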
90 | while k <= len(left_urls):
91 | url = left_urls[k-1]
92 | if len(aday) == 0 or url[0].day == aday[0][0].day:
93 | aday.append(url)
94 | if k == len(left_urls) or left_urls[k][0].day != aday[0][0].day:
95 |             ## process the day to fetch the earliest valid web page
96 | print("Parsing the day: %s/%s/%s" % (aday[0][0].month, aday[0][0].day, aday[0][0].year))
97 | j += 1
98 | dl = len(aday) ## total url counter for a day
99 | i = 1 ## url counter for a day
100 | while i <= dl:
101 | time.sleep(0.5)
102 | outputfile = _genoutname(resultdir, inputfilename, retriever.extract_time_string(aday[i-1][1]))
103 | status = retriever.save_page(aday[i-1][1], outputfile)
104 | if status == None:
105 | i += 1; continue
106 | else:
107 | n += 1; break
108 | # start next day
109 | aday = []
110 | k += 1
111 | logger.info("Finish downloading.")
112 | logger.info("File %s: %d/%d valid days processed" % (inputfile, n, j))
113 |
114 |
115 | def runMain():
116 | parser = argparse.ArgumentParser(description = __about__)
117 | parser.add_argument("-y", dest='yearscale', type=str, help="Year scale to retrieve, e.g. '1999', '1999-2003' or '1999,2003' (without quotation marks)")
118 | parser.add_argument("-l", dest="loglevel", default="DEBUG", type=str, help="Log level: DEBUG(default), INFO, WARNING, ERROR, CRITICAL")
119 | parser.add_argument("urlfile", type=str, help="File containing wayback URLs output by the crawler.")
120 | args = parser.parse_args()
121 | loglevel = args.loglevel
122 | inputfile = args.urlfile
123 | yearstr = args.yearscale
124 |
125 | if yearstr != None:
126 | years = []
127 | for i in yearstr.split(','):
128 | points = [int(j) for j in i.split('-')]
129 | if len(points) == 1:
130 | years.append(int(i))
131 | elif len(points) > 2:
132 | print("Wrong year scale. -h for help")
133 | exit(-1)
134 | else:
135 | if points[0] <= points[1]:
136 | years += range(points[0], points[1]+1)
137 | else:
138 | years += range(points[1], points[0]+1)
139 | else:
140 | years = None
141 |
142 | ## logging
143 | numeric_level = getattr(logging, loglevel.upper(), None)
144 | if not isinstance(numeric_level, int):
145 |         print('Invalid log level: %s' % loglevel)
146 | parser.print_help(); sys.exit(-1)
147 |
148 | root_logger.setLevel(numeric_level)
149 |
150 | logger.info("{0}: processing {1}".format(str(datetime.datetime.now()), inputfile))
151 | retriever_smart(inputfile, years)
152 |
153 |
154 | if __name__ == '__main__':
155 | runMain()
--------------------------------------------------------------------------------
/libwayback/__init__.py:
--------------------------------------------------------------------------------
1 | __version__ = '1.0'
2 |
3 | from crawler import WaybackCrawler
4 | from retriever import WaybackRetriever
--------------------------------------------------------------------------------
/libwayback/crawler.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | """
3 | Library for program Archive_crawler.
4 | Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU General Public License for more details.
15 |
16 | You should have received a copy of the GNU General Public License
17 | along with this program. If not, see <http://www.gnu.org/licenses/>.
18 | """
19 | import urllib2
20 | import re
21 | import datetime
22 | import logging
23 | import time
24 | import random
25 | import html5lib
26 | from html5lib import treebuilders
27 |
28 |
29 | module_logger = logging.getLogger("libwayback.libcrawler")
30 |
31 |
32 | def _valid_XML_char_ordinal(i):
33 | ## As for the XML specification, valid chars must be in the range of
34 | ## Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
35 | ## [Ref] http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python
36 | return (# conditions ordered by presumed frequency
37 | 0x20 <= i <= 0xD7FF
38 | or i in (0x9, 0xA, 0xD)
39 | or 0xE000 <= i <= 0xFFFD
40 | or 0x10000 <= i <= 0x10FFFF
41 | )
42 |
43 |
44 | class WaybackCrawler(object):
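    """Crawl wayback.archive.org for the historical snapshot URLs of a web page,
    grouped by year in `self.results`."""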
45 |
46 | def __init__(self, url):
47 | self.server = 'http://wayback.archive.org'
48 | self.prefix = 'http://wayback.archive.org/web'
49 | self.url = url
50 | self.results = {} # year: [historical_page_urls]
51 |
52 |
53 | def get_wayback_page(self, site_url):
54 | return self.prefix + '/*/' + site_url
55 |
56 |
57 | def convert_live_url(self, url):
58 | """
59 | For more information, please refer to "How can I view a page without the Wayback code in it?" at
60 | http://faq.web.archive.org/page-without-wayback-code/
61 | """
62 | pattern = re.compile(r"\/([1-2]\d{3})\d*")
63 | mres = re.search(pattern, url)
64 | return url[0:mres.end()] + 'id_' + url[mres.end():]
65 |
66 |
67 | def extract_year(self, wayback_url ):
68 | pattern = re.compile(r"\/([1-2]\d{3})\d*")
69 | mres = re.search(pattern, wayback_url)
70 | if mres == None:
71 | return None
72 | return int(mres.group(1))
73 |
74 |
75 | def open_url(self, url):
76 | ret = None
77 | try:
78 | fh = urllib2.urlopen(url)
79 | if fh.geturl() != url: module_logger.info("Redirected to: %s" % fh.geturl())
80 | ret = fh.read()
81 | except urllib2.URLError, reason:
82 | module_logger.error("%s: %s" % (url, reason))
83 | return ret
84 |
85 |
86 | def _parse_wayback_page(self, page_year):
87 | """
88 |         Parse all recorded web page URLs in a specific year.
89 | """
90 | his_urls = []
91 | wholepage = self.open_url(page_year)
92 | if wholepage == None: return his_urls
93 |
94 | parser = html5lib.HTMLParser(tree = treebuilders.getTreeBuilder("lxml"))
95 |
96 | try:
97 | html_doc = parser.parse(wholepage)
98 | except ValueError:
99 | wholepage_clean = ''.join(c for c in wholepage if _valid_XML_char_ordinal(ord(c)))
100 | html_doc = parser.parse(wholepage_clean)
101 |
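        ## Walk the Wayback calendar DOM: each month <div> holds day cells, and
        ## each archived day carries a popup list of snapshot links.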
102 | body = html_doc.find("./{*}body")
103 | position_div = body.find("./{*}div[@id='position']")
104 | wayback_cal = position_div.find("./{*}div[@id='wbCalendar']")
105 | calOver = wayback_cal.find("./{*}div[@id='calOver']")
106 | for month in calOver.findall("./{*}div[@class='month']"):
107 | for day in month.findall(".//{*}td"):
108 | day_div = day.find("./{*}div[@class='date tooltip']")
109 | if day_div != None:
110 | for snapshot in day_div.findall("./{*}div[@class='pop']/{*}ul/{*}li"):
111 | his_urls.append(snapshot[0].get('href'))
112 |
113 | year = self.extract_year(his_urls[0]) if len(his_urls) > 0 else None
114 |
115 | return (year, his_urls)
116 |
117 |
118 | def parse(self, live = False):
119 | """
120 | Parse historical urls for a web page over years.
121 | We first determine the year scale that has valid snapshots.
122 | @Return: list of historical urls or None
123 | """
124 | self._parse_called = True
125 |
126 | wayback_page_whole = self.open_url( self.get_wayback_page(self.url) )
127 | if wayback_page_whole == None: return None
128 |
129 | parser = html5lib.HTMLParser(tree = treebuilders.getTreeBuilder("lxml"))
130 | html_doc = parser.parse(wayback_page_whole)
131 |
132 | position_div = html_doc.find("./{*}body/{*}div[@id='position']")
133 | sketchinfo = position_div.find("./{*}div[@id='wbSearch']/{*}div[@id='form']/{*}div[@id='wbMeta']/{*}p[@class='wbThis']")
134 | first_url = sketchinfo.getchildren()[-1].attrib['href']
135 | first_year = self.extract_year(first_url)
136 |
137 | for year in range(first_year, datetime.datetime.now().year+1):
138 | # Be polite to the host server
139 | time.sleep(random.randint(1,3))
140 |
141 |             # Note: the timestamp in the search url indicates the time scale of the query:
142 |             # e.g., the wildcard * matches all of the items in a specific year.
143 |             # If only * is supplied, the results of the latest year are returned.
144 |             # I found that it returned wrong results if the month and day numbers are small like 0101,
145 |             # so a bigger number (0601) is used together with the wildcard.
146 | wayback_page_year = "%s/%d0601000000*/%s" % ( self.prefix, year, self.url )
147 | page_year, his_urls = self._parse_wayback_page(wayback_page_year)
148 |
149 | # To exclude duplicated items that don't match the year
150 | # By default the results of latest year are returned
151 | # if some year hasn't been crawled
152 | if page_year == None or page_year != year: continue
153 | module_logger.debug("%s: %d pages found for year %d" % (self.url, len(his_urls), page_year))
154 |
155 | for url in his_urls:
156 |                 page_year = self.extract_year(url)
157 |                 if page_year == None:
158 |                     # extract_year returns None when no year pattern matches
159 |                     module_logger.error("Invalid timestamp of wayback url: %s" % url)
160 |                     continue
161 |                 if year == page_year:
162 | if live: self.add_item(year, self.convert_live_url(url))
163 | else: self.add_item(year, url)
164 |
165 | return self.results
166 |
167 |
168 | def add_item(self, year, item):
169 | if item[:4] != 'http': item = (self.server + item)
170 | self[str(year)].append(item)
171 |
172 |
173 | def __getitem__(self, year):
174 | try:
175 | self.results[str(year)]
176 | except KeyError:
177 | self.results[str(year)] = []
178 | return self.results[str(year)]
179 |
180 |
181 | def __setitem__(self, year, item):
182 | self.results[str(year)] = item
183 |
184 |
185 | def __len__(self):
186 | cnt = 0
187 |         for key in self.results:
188 |             cnt += len(self.results[key])
189 | return cnt
190 |
191 |
192 |
193 | if __name__ == '__main__':
194 | crawler = WaybackCrawler("www.sjtu.edu.cn")
195 | crawler.parse(live=False)
196 |
197 |
198 | # EOF
--------------------------------------------------------------------------------
/libwayback/retriever.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | """
3 | Library for program Archive_crawler.
4 | Copyright (C) 2012 xiamingc, SJTU - chenxm35@gmail.com
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU General Public License for more details.
15 |
16 | You should have received a copy of the GNU General Public License
17 | along with this program. If not, see <http://www.gnu.org/licenses/>.
18 | """
19 |
20 | __author__ = 'chenxm'
21 |
22 |
23 | import argparse
24 | import httplib
25 | import urllib2
26 | import re
27 | import datetime
28 | import math
29 | import unittest
30 | import logging
31 | import time
32 | import os
33 | import os.path
34 | import random
35 | import sys
36 | import html5lib
37 | from html5lib import treebuilders
38 |
39 |
40 | logger = logging.getLogger("libwayback.waybackretriever")
41 |
42 |
43 | def _patch_http_response_read(func):
44 | def inner(*args):
45 | try:
46 | return func(*args)
47 | except httplib.IncompleteRead, e:
48 | return e.partial
49 | return inner
50 | # Patch HTTPResponse.read to avoid IncompleteRead exception
51 | httplib.HTTPResponse.read = _patch_http_response_read(httplib.HTTPResponse.read)
52 |
53 |
54 | class WaybackRetriever(object):
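    """Download and save archived page content from wayback snapshot URLs."""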
55 |
56 | def __init__(self):
57 | pass
58 |
59 |
60 | def extract_time_string(self, url):
61 | p = re.compile(r"\/([1-2]\d{3}\d*)")
62 | pm = re.search(p, url)
63 | if pm != None:
64 | return pm.group(1)
65 | else:
66 | # Explicit Error Report
67 |             logger.error("Failed to extract the time string. URL example: http://web.archive.org/web/19990117032727/google.com/")
68 | return None
69 |
70 |
71 | def extract_timestamp(self, url):
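        """Parse the numeric wayback timestamp in `url` into a datetime object.

        Digits are consumed greedily: four for the year, then one or two digits
        per field depending on whether two digits form a valid value.
        """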
72 | p = re.compile(r"\/([1-2]\d{3}\d*)")
73 | pm = re.search(p, url)
74 | if pm == None: return None
75 |
76 | timestr = pm.group(1)
77 | members = ["0"]*6 ## [year, month, day, hour, minute, second]
78 | members_limit = [2999, 12, 31, 23, 59, 59]
79 |
80 | i = 0; j = 0
81 | leap_year = False
82 | while 0<= i <= len(timestr)-1:
83 |             if j == 0:
84 |                 # year: always the first four digits
85 |                 members[j] = timestr[:4]
86 |                 i += 4; j += 1
87 |                 leap_year = (int(members[0]) % 4 == 0)  # simple divisible-by-4 rule is adequate for wayback-era years
88 |                 continue
89 | elif j == 2:
90 | # day
91 |                 if int(members[j-1]) == 2: maxv = 29 if leap_year else 28
92 | elif int(members[j-1]) in [1,3,5,7,8,10,12]: maxv = 31
93 | else: maxv = 30
94 | else:
95 | # month, hour, minute, second
96 | maxv = members_limit[j]
97 |
98 |             if 0 <= int(timestr[i:i+2]) <= maxv: offset = 2  # two digits form a valid value for this field
99 |             else: offset = 1  # otherwise this field was written with a single digit
100 |
101 | members[j] = timestr[i:i+offset]
102 | i += offset; j += 1
103 | return datetime.datetime(year=int(members[0]), month=int(members[1]), \
104 | day=int(members[2]), hour=int(members[3]), \
105 | minute=int(members[4]), second=int(members[5]))
106 |
107 | def save_page(self, url, savefile):
108 | """
109 | Return the abstract path of savefile if succeed;
110 | Otherwise, return None.
111 | """
112 | url = url.rstrip('\r \n')
113 | try:
114 | f = urllib2.urlopen(url)
115 | except urllib2.URLError, reason:
116 | ## Multiple reasons may lead to the failure:
117 | ## Server is down,
118 | ## The wayback didn't archive this page
119 | ## Connection is block by the third person
120 | logger.error("Open page error: %s: %s" % (reason, url))
121 | return None
122 | except:
123 | logger.error("Open page error: null: %s" % url)
124 | return None
125 |
126 | try:
127 | fc = f.read()
128 | except httplib.IncompleteRead:
129 |             logger.error("Read content error: incomplete content: %s" % url)
130 | return None
131 | except:
132 | logger.error("Read content error: null: %s" % url)
133 | return None
134 |
135 | ## check content
136 | RE_WRONG_CONTENT = r'(Got an HTTP \d* response at crawl time)'
137 |
138 | pattern = re.compile(RE_WRONG_CONTENT)
139 | mp = re.search(pattern, fc)
140 | if mp != None:
141 | logger.warning("%s: %s" % (mp.group(1), url))
142 | parser = html5lib.HTMLParser(tree = treebuilders.getTreeBuilder("lxml"))
143 | html_doc = parser.parse(fc)
144 | html_body = html_doc.find("./{*}body")
145 | div_error = html_body.find(".//{*}div[@id='error']")
146 | redir_a = div_error.find("./{*}p[@class='impatient']/{*}a")
147 | redir_url = redir_a.get('href')
148 | try:
149 | return self.save_page(redir_url, savefile)
150 |             except IOError, detail:
151 |                 logger.error("Save redirected page error: [{0}]{1}: {2}".format(detail.errno, detail.strerror, url))
152 | return None
153 | except:
154 | logger.error("Save redirected page error: null: %s" % url)
155 | return None
156 |         else:
157 |             with open(savefile, 'wb') as out: out.write(fc)
158 |             return os.path.abspath(savefile)
159 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | from setuptools import setup
4 | from libwayback import __version__
5 |
6 | setup(
7 | name = "libwayback",
8 | version = __version__,
9 | url = 'https://github.com/caesar0301/libwayback',
10 | author = 'Jamin Chen',
11 | author_email = 'chenxm35@gmail.com',
12 |     description = 'A library to parse the Wayback Machine of archive.org to get historical views of web pages.',
13 |     long_description='''A library to parse the Wayback Machine of archive.org to get historical views of web pages. It is a useful tool for research on the evolution of web pages, page structure analysis, and other interesting topics.''',
14 |     license = "GPLv3",
15 | packages = ['libwayback'],
16 | classifiers = [
17 | 'Development Status :: 4 - Beta',
18 | 'Environment :: Console',
19 | 'Intended Audience :: Developers',
20 | 'License :: Freely Distributable',
21 | 'Operating System :: OS Independent',
22 | 'Programming Language :: Python',
23 | 'Programming Language :: Python :: 2.6',
24 | 'Programming Language :: Python :: 2.7',
25 | 'Topic :: Software Development :: Libraries :: Python Modules',
26 | ],
27 | )
28 |
--------------------------------------------------------------------------------