├── .dockerignore ├── .gitignore ├── Dockerfile ├── LICENSE ├── README.rst ├── archivenow ├── __init__.py ├── archivenow.py ├── handlers │ ├── cc_handler.py │ ├── ia_handler.py │ ├── is_handler.py │ ├── mg_handler.py │ └── warc_handler.py ├── static │ └── ajax-loader.gif └── templates │ ├── api.txt │ └── index.html ├── docs └── archivetoday_selenium.mp4 ├── requirements.txt └── setup.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .git 2 | .gitignore 3 | LICENSE 4 | Dockerfile 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | archivenow.egg-info/ 3 | build/ 4 | dist/ 5 | __pycache__ 6 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | ARG PYTAG=latest 2 | FROM python:${PYTAG} 3 | LABEL maintainer "Mohamed Aturban " 4 | 5 | WORKDIR /app 6 | COPY requirements.txt ./ 7 | RUN pip install --no-cache-dir -r requirements.txt 8 | COPY . 
./ 9 | RUN chmod a+x ./archivenow/archivenow.py 10 | 11 | ENTRYPOINT ["./archivenow/archivenow.py"] 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 ODU Web Science / Digital Libraries Research Group 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Archive Now (archivenow) 2 | ============================= 3 | A Tool To Push Web Resources Into Web Archives 4 | ---------------------------------------------- 5 | 6 | Archive Now (**archivenow**) currently is configured to push resources into four public web archives. 
You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the "handlers" folder. 7 | 8 | Update January 2021 9 | ~~~~~~~~~~~~~~~~~~~ 10 | Originally, **archivenow** was configured to push to six different public web archives. The two removed web archives are `WebCite `_ and `archive.st `_. WebCite was removed from **archivenow** because it no longer accepts archiving requests. Archive.st was removed from **archivenow** because pushing to it triggers a CAPTCHA. In addition to removing those two archives, the method for pushing to `archive.today `_ and `megalodon.jp `_ from **archivenow** has been updated. Pushing to `archive.today `_ and `megalodon.jp `_ now goes through `Selenium `_. 11 | 12 | As explained below, this library can be used through: 13 | 14 | - Command Line Interface (CLI) 15 | 16 | - A Web Service 17 | 18 | - A Docker Container 19 | 20 | - Python 21 | 22 | 23 | Installing 24 | ---------- 25 | The latest release of **archivenow** can be installed using pip: 26 | 27 | .. code-block:: bash 28 | 29 | $ pip install archivenow 30 | 31 | The latest development version, containing changes not yet released, can be installed from source: 32 | 33 | .. code-block:: bash 34 | 35 | $ git clone git@github.com:oduwsdl/archivenow.git 36 | $ cd archivenow 37 | $ pip install -r requirements.txt 38 | $ pip install ./ 39 | 40 | To push to `archive.today `_ and `megalodon.jp `_, **archivenow** uses `Selenium `_, which is already listed in requirements.txt. Selenium additionally needs a driver to interface with the chosen browser. We recommend using **archivenow** with `Firefox `_ and its corresponding `GeckoDriver `_. 41 | 42 | You can download the latest versions of `Firefox `_ and the `GeckoDriver `_ to use with **archivenow**. 43 | 44 | After installing the driver, you can push to `archive.today `_ and `megalodon.jp `_ from **archivenow**. 
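Before pushing to archive.today or megalodon.jp, it can help to confirm that the browser and its driver are actually reachable. The following is a minimal sketch, not part of **archivenow** itself: the helper name `missing_selenium_prerequisites` is hypothetical, and it assumes both executables are installed under the names `firefox` and `geckodriver` and are on the `PATH` (on some systems Firefox lives elsewhere, so a missing `firefox` is only a hint).

```python
import shutil


def missing_selenium_prerequisites(which=shutil.which):
    """Return the required executables that cannot be found on PATH.

    The executable names are assumptions; adjust them for your platform.
    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    required = ("firefox", "geckodriver")
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    missing = missing_selenium_prerequisites()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Firefox and GeckoDriver were both found on PATH.")
```

If the list is empty, Selenium should be able to start a headless Firefox session; otherwise install the missing piece or add its directory to `PATH` first.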
45 | 46 | CLI USAGE 47 | --------- 48 | The available options can be listed by passing the `-h` or `--help` flag: 49 | 50 | .. code-block:: bash 51 | 52 | $ archivenow -h 53 | usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]] 54 | [--is] [--ia] [--warc [WARC]] [-v] [--all] 55 | [--server] [--host [HOST]] [--agent [AGENT]] 56 | [--port [PORT]] 57 | [URI] 58 | 59 | positional arguments: 60 | URI URI of a web resource 61 | 62 | optional arguments: 63 | -h, --help show this help message and exit 64 | --mg Use Megalodon.jp 65 | --cc Use The Perma.cc Archive 66 | --cc_api_key [CC_API_KEY] 67 | An API KEY is required by The Perma.cc Archive 68 | --is Use The Archive.is 69 | --ia Use The Internet Archive 70 | --warc [WARC] Generate WARC file 71 | -v, --version Report the version of archivenow 72 | --all Use all possible archives 73 | --server Run archiveNow as a Web Service 74 | --host [HOST] A server address 75 | --agent [AGENT] Use "wget" or "squidwarc" for WARC generation 76 | --port [PORT] A port number to run a Web Service 77 | 78 | Examples 79 | -------- 80 | 81 | 82 | Example 1 83 | ~~~~~~~~~ 84 | 85 | To save the web page (www.foxnews.com) in the Internet Archive: 86 | 87 | .. code-block:: bash 88 | 89 | $ archivenow --ia www.foxnews.com 90 | https://web.archive.org/web/20170209135625/http://www.foxnews.com 91 | 92 | Example 2 93 | ~~~~~~~~~ 94 | 95 | If no optional arguments are provided, the web page (e.g., www.foxnews.com) is saved in the Internet Archive by default: 96 | 97 | .. code-block:: bash 98 | 99 | $ archivenow www.foxnews.com 100 | https://web.archive.org/web/20170215164835/http://www.foxnews.com 101 | 102 | Example 3 103 | ~~~~~~~~~ 104 | 105 | To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is: 106 | 107 | .. 
code-block:: bash 108 | 109 | $ archivenow --ia --is www.foxnews.com 110 | https://web.archive.org/web/20170209140345/http://www.foxnews.com 111 | http://archive.is/fPVyc 112 | 113 | 114 | Example 4 115 | ~~~~~~~~~ 116 | 117 | To save the web page (https://nypost.com/) in all configured web archives and, in addition, create a WARC file locally: 118 | 119 | .. code-block:: bash 120 | 121 | $ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key 122 | http://archive.is/dcnan 123 | https://perma.cc/53CC-5ST8 124 | https://web.archive.org/web/20181002081445/https://nypost.com/ 125 | https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/ 126 | https_nypost.com__96ec2300.warc 127 | 128 | Example 5 129 | ~~~~~~~~~ 130 | 131 | To download the web page (https://nypost.com/) and create a WARC file: 132 | 133 | .. code-block:: bash 134 | 135 | $ archivenow --warc=mypage --agent=wget https://nypost.com/ 136 | mypage.warc 137 | 138 | Server 139 | ------ 140 | 141 | You can run **archivenow** as a web service. The server address and the port number can be set with --host and --port (e.g., --host localhost --port 12345): 142 | 143 | .. code-block:: bash 144 | 145 | $ archivenow --server 146 | 147 | Running on http://0.0.0.0:12345/ (Press CTRL+C to quit) 148 | 149 | 150 | Example 6 151 | ~~~~~~~~~ 152 | 153 | To save the web page (www.foxnews.com) in The Internet Archive through the web service: 154 | 155 | .. 
code-block:: bash 156 | 157 | $ curl -i http://0.0.0.0:12345/ia/www.foxnews.com 158 | 159 | HTTP/1.0 200 OK 160 | Content-Type: application/json 161 | Content-Length: 95 162 | Server: Werkzeug/0.11.15 Python/2.7.10 163 | Date: Tue, 02 Oct 2018 08:20:18 GMT 164 | 165 | { 166 | "results": [ 167 | "https://web.archive.org/web/20181002082007/http://www.foxnews.com" 168 | ] 169 | } 170 | 171 | Example 7 172 | ~~~~~~~~~ 173 | 174 | To save the web page (www.foxnews.com) in all configured archives through the web service: 175 | 176 | .. code-block:: bash 177 | 178 | $ curl -i http://0.0.0.0:12345/all/www.foxnews.com 179 | 180 | HTTP/1.0 200 OK 181 | Content-Type: application/json 182 | Content-Length: 385 183 | Server: Werkzeug/0.11.15 Python/2.7.10 184 | Date: Tue, 02 Oct 2018 08:23:53 GMT 185 | 186 | { 187 | "results": [ 188 | "Error (The Perma.cc Archive): An API Key is required ", 189 | "http://archive.is/ukads", 190 | "https://web.archive.org/web/20181002082007/http://www.foxnews.com", 191 | "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ", 192 | "http://www.webcitation.org/72rbKsX8B" 193 | ] 194 | } 195 | 196 | Example 8 197 | ~~~~~~~~~ 198 | 199 | Because Perma.cc requires an API key, pass it in the cc_api_key query parameter: 200 | 201 | .. code-block:: bash 202 | 203 | $ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key 204 | 205 | Or, to use only Perma.cc: 206 | 207 | .. code-block:: bash 208 | 209 | $ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key 210 | 211 | Running as a Docker Container 212 | ----------------------------- 213 | 214 | .. code-block:: bash 215 | 216 | $ docker image pull oduwsdl/archivenow 217 | 218 | Different ways to run archivenow: 219 | 220 | .. code-block:: bash 221 | 222 | $ docker container run -it --rm oduwsdl/archivenow -h 223 | 224 | Accessible at 127.0.0.1:12345: 225 | 226 | .. 
code-block:: bash 227 | 228 | $ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0 229 | 230 | Accessible at 127.0.0.1:22222: 231 | 232 | .. code-block:: bash 233 | 234 | $ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0 235 | 236 | .. image:: http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif 237 | :width: 10pt 238 | 239 | 240 | To save the web page (http://www.cnn.com) in The Internet Archive: 241 | 242 | .. code-block:: bash 243 | 244 | $ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com 245 | 246 | 247 | Python Usage 248 | ------------ 249 | 250 | .. code-block:: python 251 | 252 | >>> from archivenow import archivenow 253 | 254 | Example 9 255 | ~~~~~~~~~~ 256 | 257 | To save the web page (www.foxnews.com) in all configured archives: 258 | 259 | .. code-block:: python 260 | 261 | >>> archivenow.push("www.foxnews.com","all") 262 | ['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required'] 263 | 264 | Example 10 265 | ~~~~~~~~~~ 266 | 267 | To save the web page (www.foxnews.com) in Perma.cc: 268 | 269 | .. code-block:: python 270 | 271 | >>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"}) 272 | ['https://perma.cc/8YYC-C7RM'] 273 | 274 | Example 11 275 | ~~~~~~~~~~ 276 | 277 | To start the server from Python, call start(). The host and port number can be passed as arguments (e.g., start(port=1111, host='localhost')): 278 | 279 | .. code-block:: python 280 | 281 | >>> archivenow.start() 282 | 283 | 2017-02-09 15:02:37 284 | Running on http://127.0.0.1:12345 285 | (Press CTRL+C to quit) 286 | 287 | 288 | Configuring a new archive or removing an existing one 289 | ----------------------------------------------------- 290 | Additional archives may be added by creating a handler file in the "handlers" directory. 
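As a rough sketch of what such a handler file could contain, the class below mirrors the attributes (`enabled`, `name`, `api_required`) and the `push` signature used by the handlers bundled in this repository. Everything specific to the archive is hypothetical: the "My Archive" name, the `https://myarchive.example/save/` endpoint, and the assumption that the archive answers with a `Location` header are placeholders for illustration only.

```python
class MA_handler(object):
    """Hypothetical handler for an imaginary archive called "My Archive"."""

    def __init__(self):
        self.enabled = True        # set to False to disable this archive
        self.name = 'My Archive'   # shown in help text and error messages
        self.api_required = False  # True makes the CLI add a --ma_api_key option

    def push(self, uri_org, p_args=None, session=None):
        # Hypothetical save endpoint; a real archive defines its own.
        save_uri = 'https://myarchive.example/save/' + uri_org
        if session is None:
            # Third-party dependency, imported lazily so that a
            # caller-supplied session is enough to use this class.
            import requests
            session = requests.Session()
        try:
            r = session.get(save_uri, timeout=120, allow_redirects=True)
            r.raise_for_status()
            # Assumption: the archive reports the new copy in a Location header.
            if 'Location' in r.headers:
                return r.headers['Location']
            return 'Error (' + self.name + '): No Location header in the response'
        except Exception as e:
            return 'Error (' + self.name + '): ' + str(e)
```

Like the bundled handlers, this sketch returns either the URI of the archived copy or an error string rather than raising, so a failure in one archive does not interrupt pushes to the others.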
291 | 292 | For example, to add a new archive named "My Archive", create a file "ma_handler.py" and store it in the folder "handlers". The prefix "ma" becomes the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python code, write: 293 | 294 | 295 | .. code-block:: python 296 | 297 | archivenow.push("www.cnn.com","ma") 298 | 299 | 300 | In the file "ma_handler.py", the name of the class must be "MA_handler". This class must define a "push" method that takes the URI to archive; **archivenow** also passes it a dictionary of extra arguments (such as API keys) and a "session" keyword argument, and it reads the "enabled", "name", and "api_required" attributes of the handler. See the existing `handler files`_ for examples of how to organize a newly configured archive handler. 301 | 302 | An archive can be removed in any of the following ways: 303 | 304 | - Removing the archive handler file from the folder "handlers" 305 | 306 | - Renaming the archive handler file to another name that does not end with "_handler.py" 307 | 308 | - Setting the variable "enabled" to "False" inside the handler file 309 | 310 | 311 | Notes 312 | ----- 313 | The Internet Archive (IA) enforces a gap of at least two minutes between captures of the "same" resource. 314 | 315 | For example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (*C*) of this URI. IA will then return *C* for all requests to the archive for this URI received until 10:02pm. Archive.is applies the same policy with a gap of five minutes. 316 | 317 | .. _handler files: https://github.com/oduwsdl/archivenow/tree/master/archivenow/handlers 318 | 319 | 320 | Citing Project 321 | -------------- 322 | 323 | .. code-block:: latex 324 | 325 | @INPROCEEDINGS{archivenow-jcdl2018, 326 | AUTHOR = {Mohamed Aturban and 327 | Mat Kelly and 328 | Sawood Alam and 329 | John A. Berlin and 330 | Michael L. Nelson and 331 | Michele C. 
Weigle}, 332 | TITLE = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation}, 333 | BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries}, 334 | SERIES = {{JCDL} '18}, 335 | PAGES = {321--322}, 336 | MONTH = {June}, 337 | YEAR = {2018}, 338 | ADDRESS = {Fort Worth, Texas, USA}, 339 | URL = {https://doi.org/10.1145/3197026.3203880}, 340 | DOI = {10.1145/3197026.3203880} 341 | } 342 | -------------------------------------------------------------------------------- /archivenow/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '2020.7.18.12.19.44' -------------------------------------------------------------------------------- /archivenow/archivenow.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | import re 4 | import sys 5 | import uuid 6 | import glob 7 | import json 8 | import importlib 9 | import argparse 10 | import string 11 | import requests 12 | from threading import Thread 13 | from flask import request, Flask, jsonify, render_template 14 | from pathlib import Path 15 | 16 | #from __init__ import __version__ as archiveNowVersion 17 | 18 | archiveNowVersion = '2020.7.18.12.19.44' 19 | 20 | # archive handlers path 21 | PATH = Path(os.path.dirname(os.path.abspath(__file__))) 22 | PATH_HANDLER = PATH / 'handlers' 23 | 24 | # for the web app 25 | app = Flask(__name__) 26 | 27 | # create handlers for enabled archives 28 | global handlers 29 | handlers = {} 30 | 31 | # default values for the server host and port 32 | SERVER_IP = '0.0.0.0' 33 | SERVER_PORT = 12345 34 | 35 | 36 | def bad_request(error=None): 37 | message = { 38 | 'status': 400, 39 | 'message': 'Error in processing the request', 40 | } 41 | resp = jsonify(message) 42 | resp.status_code = 400 43 | return resp 44 | 45 | 46 | # def getServer_IP_PORT(): 47 | # u = str(SERVER_IP) 48 | # if str(SERVER_PORT) != '80': 49 | # u = u + ":" + 
str(SERVER_PORT) 50 | # if 'http' != u[0:4]: 51 | # u = 'http://' + u 52 | # return u 53 | 54 | 55 | def listArchives_server(handlers): 56 | uri_args = '' 57 | if 'cc' in handlers: 58 | if handlers['cc'].enabled and handlers['cc'].api_required: 59 | uri_args = '?cc_api_key={Your-Perma.cc-API-Key}' 60 | li = {"archives": [{ # getServer_IP_PORT() + 61 | "id": "all", "GET":'/all/' + '{URI}'+uri_args, 62 | "archive-name": "All enabled archives"}]} 63 | for handler in handlers: 64 | if handlers[handler].enabled: 65 | uri_args2 = '' 66 | if handler == 'cc': 67 | uri_args2 = uri_args 68 | li["archives"].append({ #getServer_IP_PORT() + 69 | "id": handler, "archive-name": handlers[handler].name, 70 | "GET": '/' + handler + '/' + '{URI}'+uri_args2}) 71 | return li 72 | 73 | 74 | @app.route('/', defaults={'path': ''}, methods=['GET']) 75 | @app.route('/<path:path>', methods=['GET']) 76 | def pushit(path): 77 | # no path; return the landing page (the list of available archives is served at /api) 78 | if path == '': 79 | #resp = jsonify(listArchives_server(handlers)) 80 | #resp.status_code = 200 81 | return render_template('index.html') 82 | #return resp 83 | # get request with path 84 | elif (path == 'api'): 85 | resp = jsonify(listArchives_server(handlers)) 86 | resp.status_code = 200 87 | return resp 88 | elif (path == "ajax-loader.gif"): 89 | return app.send_static_file('ajax-loader.gif') # the image lives in static/, not templates/ 90 | else: 91 | try: 92 | # get the args passed to push function like API KEY if provided 93 | PUSH_ARGS = {} 94 | for k in request.args.keys(): 95 | PUSH_ARGS[k] = request.args[k] 96 | 97 | s = str(path).split('/', 1) 98 | arc_id = s[0] 99 | URI = request.url.split('/', 4)[4] # include query params, too 100 | 101 | if 'herokuapp.com' in request.host: 102 | PUSH_ARGS['from_heroku'] = True 103 | 104 | # To push into archives 105 | resp = {"results": push(URI, arc_id, PUSH_ARGS)} 106 | if len(resp["results"]) == 0: 107 | return bad_request() 108 | else: 109 | # what to return 110 | resp = jsonify(resp) 111 | resp.status_code = 200 112 | 113 | 
return resp 114 | except Exception as e: 115 | pass 116 | return bad_request() 117 | 118 | res_uris = {} 119 | 120 | 121 | def push_proxy(hdlr, URIproxy, p_args_proxy, res_uris_idx, session=requests.Session()): 122 | global res_uris 123 | try: 124 | res = hdlr.push( URIproxy , p_args_proxy, session=session) 125 | print ( res ) 126 | res_uris[res_uris_idx].append(res) 127 | except: 128 | pass 129 | 130 | def push(URI, arc_id, p_args={}, session=requests.Session()): 131 | global handlers 132 | global res_uris 133 | try: 134 | # push to all possible archives 135 | res_uris_idx = str(uuid.uuid4()) 136 | res_uris[res_uris_idx] = [] 137 | ### if arc_id == 'all': 138 | ### for handler in handlers: 139 | ### if (handlers[handler].api_required): 140 | # pass args like key API 141 | ### res.append(handlers[handler].push(str(URI), p_args)) 142 | ### else: 143 | ### res.append(handlers[handler].push(str(URI))) 144 | ### else: 145 | # push to the chosen archives 146 | 147 | threads = [] 148 | 149 | for handler in handlers: 150 | if (arc_id == handler) or (arc_id == 'all'): 151 | ### if (arc_id == handler): ### and (handlers[handler].api_required): 152 | #res.append(handlers[handler].push(str(URI), p_args)) 153 | #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx) 154 | threads.append( 155 | Thread( 156 | target=push_proxy, 157 | args=(handlers[handler], str(URI), p_args, res_uris_idx, ), 158 | kwargs={'session': session})) 159 | ### elif (arc_id == handler): 160 | ### res.append(handlers[handler].push(str(URI))) 161 | 162 | for th in threads: 163 | th.start() 164 | for th in threads: 165 | th.join() 166 | 167 | res = res_uris[res_uris_idx] 168 | del res_uris[res_uris_idx] 169 | return res 170 | except: 171 | del res_uris[res_uris_idx] 172 | pass 173 | return ["bad request"] 174 | 175 | 176 | def start(port=SERVER_PORT, host=SERVER_IP): 177 | global SERVER_PORT 178 | global SERVER_IP 179 | SERVER_PORT = port 180 | SERVER_IP = host 181 | app.run( 182 | host=host, 183 
| port=port, 184 | threaded=True, 185 | debug=True, 186 | use_reloader=False) 187 | 188 | 189 | def load_handlers(): 190 | global handlers 191 | handlers = {} 192 | # add the path of the handlers to the system so they can be imported 193 | sys.path.append(str(PATH_HANDLER)) 194 | 195 | # create a list of handlers. 196 | for file in PATH_HANDLER.glob('*_handler.py'): 197 | name = file.stem 198 | prefix = name.replace('_handler', '') 199 | mod = importlib.import_module(name) 200 | mod_class = getattr(mod, prefix.upper() + '_handler') 201 | # finally an object is created 202 | handlers[prefix] = mod_class() 203 | # exclude all disabled archives 204 | 205 | for handler in list(handlers): # handlers.keys(): 206 | if not handlers[handler].enabled: 207 | del handlers[handler] 208 | 209 | 210 | def args_parser(): 211 | global SERVER_PORT 212 | global SERVER_IP 213 | # parsing arguments 214 | 215 | class MyParser(argparse.ArgumentParser): 216 | 217 | def error(self, message): 218 | sys.stderr.write('error: %s\n' % message) 219 | self.print_help() 220 | sys.exit(2) 221 | 222 | def printm(self): 223 | sys.stderr.write('') 224 | self.print_help() 225 | sys.exit(2) 226 | 227 | parser = MyParser() 228 | 229 | # arc_handler = 0 230 | for handler in handlers: 231 | # add archives identifiers to the list of options 232 | # arc_handler += 1 233 | if handler == 'warc': 234 | parser.add_argument('--' + handler, nargs='?', 235 | help=handlers[handler].name) 236 | else: 237 | parser.add_argument('--' + handler, action='store_true', default=False, 238 | help='Use ' + handlers[handler].name) 239 | if (handlers[handler].api_required): 240 | parser.add_argument( 241 | '--' + 242 | handler + 243 | '_api_key', 244 | nargs='?', 245 | help='An API KEY is required by ' + 246 | handlers[handler].name) 247 | 248 | parser.add_argument( 249 | '-v', 250 | '--version', 251 | help='Report the version of archivenow', 252 | action='version', 253 | version='ArchiveNow ' + 254 | archiveNowVersion) 255 | 
256 | if len(handlers) > 0: 257 | parser.add_argument('--all', action='store_true', default=False, 258 | help='Use all possible archives ') 259 | 260 | parser.add_argument('--server', action='store_true', default=False, 261 | help='Run archiveNow as a Web Service ') 262 | 263 | parser.add_argument('URI', nargs='?', help='URI of a web resource') 264 | 265 | parser.add_argument('--host', nargs='?', help='A server address') 266 | 267 | if 'warc' in handlers.keys(): 268 | parser.add_argument('--agent', nargs='?', help='Use "wget" or "squidwarc" for WARC generation') 269 | 270 | parser.add_argument( 271 | '--port', 272 | nargs='?', 273 | help='A port number to run a Web Service') 274 | 275 | args = parser.parse_args() 276 | else: 277 | print ('\n Error: No enabled archive handler found\n') 278 | sys.exit(0) 279 | 280 | arc_opt = 0 281 | # start the server 282 | if getattr(args, 'server'): 283 | if getattr(args, 'port'): 284 | SERVER_PORT = int(args.port) 285 | if getattr(args, 'host'): 286 | SERVER_IP = str(args.host) 287 | 288 | start(port=SERVER_PORT, host=SERVER_IP) 289 | 290 | else: 291 | if not getattr(args, 'URI'): 292 | print (parser.error('too few arguments')) 293 | res = [] 294 | 295 | # get the args passed to push function like API KEY if provided 296 | PUSH_ARGS = {} 297 | for handler in handlers: 298 | if (handlers[handler].api_required): 299 | if getattr(args, handler + '_api_key'): 300 | PUSH_ARGS[ 301 | handler + 302 | '_api_key'] = getattr( 303 | args, 304 | handler + 305 | '_api_key') 306 | else: 307 | if getattr(args, handler): 308 | print ( 309 | parser.error( 310 | 'An API Key is required by ' + 311 | handlers[handler].name)) 312 | orginal_warc_value = getattr(args, 'warc') 313 | if handler == 'warc': 314 | PUSH_ARGS['warc'] = getattr(args, 'warc') 315 | if PUSH_ARGS['warc'] == None: 316 | valid_chars = "-_.()/ %s%s" % (string.ascii_letters, string.digits) 317 | PUSH_ARGS['warc'] = ''.join(c for c in str(args.URI).strip() if c in valid_chars) 318 | 
PUSH_ARGS['warc'] = PUSH_ARGS['warc'].replace(' ','_').replace('/','_').replace('__','_') # I don't like spaces in filenames. 319 | PUSH_ARGS['warc'] = PUSH_ARGS['warc']+'_'+str(uuid.uuid4())[:8] 320 | if PUSH_ARGS['warc'][-1] == '_': 321 | PUSH_ARGS['warc'] = PUSH_ARGS['warc'][:-1] 322 | agent = 'wget' 323 | tmp_agent = getattr(args, 'agent') 324 | if tmp_agent == 'squidwarc': 325 | agent = tmp_agent 326 | PUSH_ARGS['agent'] = agent 327 | 328 | # sys.exit(0) 329 | 330 | # push to all possible archives 331 | if getattr(args, 'all'): 332 | arc_opt = 1 333 | res = push(str(args.URI).strip(), 'all', PUSH_ARGS) 334 | else: 335 | # push to the chosen archives 336 | for handler in handlers: 337 | if getattr(args, handler): 338 | arc_opt += 1 339 | for i in push(str(args.URI).strip(), handler, PUSH_ARGS): 340 | res.append(i) 341 | # push to the default archive 342 | if (len(handlers) > 0) and (arc_opt == 0): 343 | # set the default; it is 'ia' by default or the first archive in the 344 | # list if not found 345 | if 'ia' in handlers: 346 | res = push(str(args.URI).strip(), 'ia', PUSH_ARGS) 347 | else: 348 | res = push(str(args.URI).strip(), 349 | list(handlers)[0], PUSH_ARGS) # dict views are not indexable in Python 3 350 | # print (parser.printm()) 351 | # else: 352 | # for rs in res: 353 | # print (rs) 354 | 355 | load_handlers() 356 | 357 | if __name__ == '__main__': 358 | args_parser() 359 | -------------------------------------------------------------------------------- /archivenow/handlers/cc_handler.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | class CC_handler(object): 5 | 6 | def __init__(self): 7 | self.enabled = True 8 | self.name = 'The Perma.cc Archive' 9 | self.api_required = True 10 | 11 | def push(self, uri_org, p_args=[], session=requests.Session()): 12 | msg = '' 13 | try: 14 | 15 | APIKEY = p_args['cc_api_key'] 16 | 17 | r = session.post('https://api.perma.cc/v1/archives/?api_key='+APIKEY, timeout=120, 18 | 
data=json.dumps({"url":uri_org}), 19 | headers={'Content-type': 'application/json'}, 20 | allow_redirects=True) 21 | r.raise_for_status() 22 | 23 | if 'Location' in r.headers: 24 | return 'https://perma.cc/'+r.headers['Location'].rsplit('/',1)[1] 25 | else: 26 | for r2 in r.history: 27 | if 'Location' in r2.headers: 28 | return 'https://perma.cc/'+r2.headers['Location'].rsplit('/',1)[1] 29 | entity_json = r.json() 30 | if 'guid' in entity_json: 31 | return str('https://perma.cc/'+entity_json['guid']) 32 | msg = "Error ("+self.name+ "): No HTTP Location header is returned in the response" 33 | except Exception as e: 34 | if (msg == '') and ('_api_key' in str(e)): 35 | msg = "Error (" + self.name+ "): " + 'An API Key is required ' 36 | elif (msg == ''): 37 | msg = "Error (" + self.name+ "): " + str(e) 38 | pass; 39 | return msg 40 | -------------------------------------------------------------------------------- /archivenow/handlers/ia_handler.py: -------------------------------------------------------------------------------- 1 | import requests 2 | 3 | class IA_handler(object): 4 | 5 | def __init__(self): 6 | self.enabled = True 7 | self.name = 'The Internet Archive' 8 | self.api_required = False 9 | 10 | def push(self, uri_org, p_args=[], session=requests.Session()): 11 | msg = '' 12 | try: 13 | uri = 'https://web.archive.org/save/' + uri_org 14 | archiveTodayUserAgent = { 15 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36" 16 | } 17 | # push into the archive 18 | # r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent) 19 | 20 | if ('user-agent' in session.headers) and (not session.headers['User-Agent'].lower().startswith('python-requests/')): 21 | r = session.get(uri, timeout=120, allow_redirects=True) 22 | else: 23 | r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent) 24 | 25 | r.raise_for_status() 26 | # 
extract the link to the archived copy 27 | if (r != None): 28 | if "Location" in r.headers: 29 | return r.headers["Location"] 30 | elif "Content-Location" in r.headers: 31 | if (r.headers["Content-Location"]).startswith("/web/"): 32 | return "https://web.archive.org"+r.headers["Content-Location"] 33 | else: 34 | try: 35 | uri_from_content = "https://web.archive.org" + r.text.split('var redirUrl = "',1)[1].split('"',1)[0] 36 | except: 37 | uri_from_content = r.headers["Content-Location"] 38 | #pass; 39 | return uri_from_content 40 | else: 41 | for r2 in r.history: 42 | if 'Location' in r2.headers: 43 | return r.url 44 | #return r2.headers['Location'] 45 | if 'Content-Location' in r2.headers: 46 | return r.url 47 | #return r2.headers['Content-Location'] 48 | msg = "("+self.name+ "): No HTTP Location/Content-Location header is returned in the response" 49 | except Exception as e: 50 | if msg == '': 51 | msg = "Error (" + self.name+ "): " + str(e) 52 | pass 53 | return msg 54 | -------------------------------------------------------------------------------- /archivenow/handlers/is_handler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import requests 3 | import sys 4 | from selenium.webdriver.firefox.options import Options 5 | from selenium import webdriver 6 | from selenium.webdriver.common.keys import Keys 7 | from selenium.webdriver.support.ui import WebDriverWait 8 | from selenium.webdriver.support import expected_conditions as EC 9 | from selenium.webdriver.common.by import By 10 | from selenium.common.exceptions import TimeoutException 11 | 12 | 13 | class IS_handler(object): 14 | 15 | def __init__(self): 16 | self.enabled = True 17 | self.name = 'The Archive.is' 18 | self.api_required = False 19 | 20 | def push(self, uri_org, p_args=[], session=requests.Session()): 21 | 22 | msg = "" 23 | 24 | try: 25 | 26 | options = Options() 27 | options.headless = True # Run in background 28 | driver = 
webdriver.Firefox(options = options) 29 | driver.get("https://archive.is") 30 | 31 | elem = driver.find_element_by_id("url") # Find the form to place a URL to be archived 32 | 33 | elem.send_keys(uri_org) # Place the URL in the input box 34 | 35 | saveButton = driver.find_element_by_xpath("/html/body/center/div/form[1]/div[3]/input") # Find the submit button 36 | 37 | saveButton.click() # Click the submit button 38 | 39 | # After clicking submit, there may be an additional page that pops up and asks if you are sure you want 40 | # to archive that page since it was archived X amount of time ago. We need to wait for that page to 41 | # load and click submit again. 42 | delay = 30 # seconds 43 | try: 44 | nextSaveButton = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "/html/body/center/div[4]/center/div/div[2]/div/form/div/input"))) 45 | nextSaveButton.click() 46 | 47 | except TimeoutException: 48 | pass 49 | 50 | # The page takes a while to archive, so keep checking if the loading page is still displayed. 51 | loading = True 52 | while loading: 53 | 54 | if not 'wip' in driver.current_url and not 'submit' in driver.current_url: 55 | loading = False 56 | 57 | # After the loading screen is gone and the page is archived, the current URL 58 | # will be the URL to the archived page. 59 | msg = driver.current_url; 60 | 61 | driver.quit() 62 | 63 | except: 64 | 65 | ''' 66 | exc_type, exc_obj, exc_tb = sys.exc_info() 67 | fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1] 68 | print((fname, exc_tb.tb_lineno, sys.exc_info() )) 69 | ''' 70 | 71 | msg = "Unable to complete request." 
72 | 73 | return msg 74 | -------------------------------------------------------------------------------- /archivenow/handlers/mg_handler.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import os 3 | import requests 4 | import sys 5 | from selenium.webdriver.firefox.options import Options 6 | from selenium import webdriver 7 | from selenium.webdriver.common.keys import Keys 8 | from selenium.webdriver.support.ui import WebDriverWait 9 | from selenium.webdriver.support import expected_conditions as EC 10 | from selenium.webdriver.common.by import By 11 | from selenium.common.exceptions import TimeoutException 12 | 13 | class MG_handler(object): 14 | 15 | def __init__(self): 16 | self.enabled = True 17 | self.name = 'Megalodon.jp' 18 | self.api_required = False 19 | 20 | def push(self, uri_org, p_args=[], session=requests.Session()): 21 | 22 | msg = "" 23 | 24 | options = Options() 25 | options.headless = True # Run in background 26 | driver = webdriver.Firefox(options = options) 27 | driver.get("https://megalodon.jp/?url=" + uri_org) 28 | 29 | try: 30 | addButton = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[8]/form/div[1]/input[2]") 31 | 32 | addButton.click() # Click the add button 33 | except : 34 | print("Unable to archive this page at this time.") 35 | raise 36 | 37 | 38 | stillOnPage = True 39 | while stillOnPage: 40 | try: 41 | button = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[1]/div/h3") 42 | 43 | except: 44 | stillOnPage = False 45 | 46 | try: 47 | error = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[3]/div/a/h3") 48 | msg = "We apologize for the inconvenience. Currently, acquisitions that are considered \"robots\" in the acquisition of certain conditions are prohibited." 49 | raise 50 | sys.exit() 51 | 52 | except: 53 | pass 54 | 55 | # The page takes a while to archive, so keep checking if the loading page is still displayed. 
        loading = True
        while loading:
            try:
                loadingPage = driver.find_element_by_xpath("/html/body/div[2]/div/div[1]/a/img")
                loading = False

            except:
                loading = True

        # After the loading screen is gone and the page is archived, the current URL
        # will be the URL to the archived page.
        if msg == "":
            print(driver.current_url)

        return msg
--------------------------------------------------------------------------------
/archivenow/handlers/warc_handler.py:
--------------------------------------------------------------------------------
import requests
import os.path
import distutils.spawn

class WARC_handler(object):

    def __init__(self):
        self.enabled = True
        self.name = 'Generate WARC file'
        self.api_required = False

    def push(self, uri_org, p_args=[], session=requests.Session()):
        msg = ''
        if p_args['agent'] == 'squidwarc':
            # squidwarc
            #if not distutils.spawn.find_executable("squidwarc"):
            #    return 'wget is not installed!'
            os.system('python ~/squidwarc_one_page/generte_warcs.py 9222 "'+uri_org+'" '+p_args['warc']+'.warc &> /dev/null')
            if os.path.exists(p_args['warc']):
                return p_args['warc']
            elif os.path.exists(p_args['warc']+'.warc'):
                return p_args['warc']+'.warc'
            else:
                return 'squidwarc failed to generate the WARC file'

        else:
            if not distutils.spawn.find_executable("wget"):
                return 'wget is not installed!'
            # wget
            os.system('wget -E -H -k -p -q --delete-after --no-warc-compression --warc-file="'+p_args['warc']+'" "'+uri_org+'"')
            if os.path.exists(p_args['warc']):
                return p_args['warc']
            elif os.path.exists(p_args['warc']+'.warc'):
                return p_args['warc']+'.warc'
            else:
                return 'wget failed to generate the WARC file'
--------------------------------------------------------------------------------
/archivenow/static/ajax-loader.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oduwsdl/archivenow/dbc688f4f2384139f6eb1ebf7ced2319b49ecc73/archivenow/static/ajax-loader.gif
--------------------------------------------------------------------------------
/archivenow/templates/api.txt:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/archivenow/templates/index.html:
--------------------------------------------------------------------------------

Preserve a web page in web archives

Select archives:

Internet Archive
Archive.is
Megalodon.jp
Perma.cc

Archive | Link to the archived page
--------------------------------------------------------------------------------
/docs/archivetoday_selenium.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oduwsdl/archivenow/dbc688f4f2384139f6eb1ebf7ced2319b49ecc73/docs/archivetoday_selenium.mp4
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
flask
requests
pathlib
selenium
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from setuptools import setup, find_packages
from archivenow import __version__

long_description = open('README.rst').read()
desc = """A Python library to push web resources into public web archives"""


setup(
    name='archivenow',
    version=__version__,
    description=desc,
    long_description=long_description,
    author='Mohamed Aturban',
    author_email='maturban@cs.odu.edu',
    url='https://github.com/maturban/archivenow',
    packages=find_packages(),
    license="MIT",
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'License :: OSI Approved :: MIT License'
    ],
    install_requires=[
        'flask',
        'requests'
    ],
    package_data={
        'archivenow': [
            'handlers/*.*',
            'templates/*.*',
            'static/*.*'
        ]
    },
    entry_points='''
        [console_scripts]
        archivenow=archivenow.archivenow:args_parser
    '''
)
--------------------------------------------------------------------------------
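Note on the wait loops in is_handler.py and mg_handler.py: both poll the browser state in a tight loop with no pause and no upper time bound, so a stalled archive page keeps the handler spinning. The following is a minimal sketch of a bounded polling helper, not code from the repository. The names `wait_for_final_url`, `get_url`, `timeout`, `interval`, and `pending_markers` are hypothetical; `get_url` stands in for reading `driver.current_url`, and the default markers mirror the `'wip'` and `'submit'` checks in is_handler.py.

```python
import time


def wait_for_final_url(get_url, timeout=300.0, interval=1.0,
                       pending_markers=("wip", "submit")):
    """Poll get_url() until the URL contains none of pending_markers.

    Returns the final URL, or None if `timeout` seconds pass first.
    All names here are illustrative assumptions, not part of archivenow.
    """
    deadline = time.monotonic() + timeout
    while True:
        url = get_url()
        # The archive is considered done once no pending marker is in the URL.
        if not any(marker in url for marker in pending_markers):
            return url
        if time.monotonic() >= deadline:
            return None
        # Sleep between checks instead of spinning on the browser.
        time.sleep(interval)


# Usage with a stand-in for driver.current_url
urls = iter([
    "https://archive.is/submit/",
    "https://archive.is/wip/abc12",
    "https://archive.is/abc12",
])
print(wait_for_final_url(lambda: next(urls), timeout=5.0, interval=0.01))
# prints https://archive.is/abc12
```

With a real driver, the call might look like `wait_for_final_url(lambda: driver.current_url)`, and a `None` result could be mapped to the handler's existing "Unable to complete request." message.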