├── .dockerignore
├── .gitignore
├── Dockerfile
├── LICENSE
├── README.rst
├── archivenow
    ├── __init__.py
    ├── archivenow.py
    ├── handlers
    │   ├── cc_handler.py
    │   ├── ia_handler.py
    │   ├── is_handler.py
    │   ├── mg_handler.py
    │   └── warc_handler.py
    ├── static
    │   └── ajax-loader.gif
    └── templates
    │   ├── api.txt
    │   └── index.html
├── docs
    └── archivetoday_selenium.mp4
├── requirements.txt
└── setup.py


/.dockerignore:
--------------------------------------------------------------------------------
1 | .git
2 | .gitignore
3 | LICENSE
4 | Dockerfile
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | archivenow.egg-info/
3 | build/
4 | dist/
5 | __pycache__
6 | 


--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
 1 | ARG PYTAG=latest
 2 | FROM python:${PYTAG}
 3 | LABEL maintainer="Mohamed Aturban <mohsci1@yahoo.com>"
 4 | 
 5 | WORKDIR /app
 6 | COPY requirements.txt ./
 7 | RUN pip install --no-cache-dir -r requirements.txt
 8 | COPY . ./
 9 | RUN chmod a+x ./archivenow/archivenow.py
10 | 
11 | ENTRYPOINT ["./archivenow/archivenow.py"]
12 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2017 ODU Web Science / Digital Libraries Research Group
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
  1 | Archive Now (archivenow)
  2 | =============================
  3 | A Tool To Push Web Resources Into Web Archives
  4 | ----------------------------------------------
  5 | 
  6 | Archive Now (**archivenow**) is currently configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the folder "handlers".
  7 | 
  8 | Update January 2021
  9 | ~~~~~~~~~~~~~~~~~~~
 10 | Originally, **archivenow** was configured to push to 6 different public web archives. The two removed web archives are `WebCite <https://www.webcitation.org/>`_ and `archive.st <http://archive.st/>`_. WebCite was removed from **archivenow** as they are no longer accepting archiving requests. Archive.st was removed from **archivenow** due to encountering a Captcha when attempting to push to the archive. In addition to removing those 2 archives, the method for pushing to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_ from **archivenow** has been updated. In order to push to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_, `Selenium <https://selenium-python.readthedocs.io/>`_ is used.
 11 | 
 12 | As explained below, this library can be used through:
 13 | 
 14 | - Command Line Interface (CLI)
 15 | 
 16 | - A Web Service
 17 | 
 18 | - A Docker Container
 19 | 
 20 | - Python
 21 | 
 22 | 
 23 | Installing
 24 | ----------
 25 | The latest release of **archivenow** can be installed using pip:
 26 | 
 27 | .. code-block:: bash
 28 | 
 29 |       $ pip install archivenow
 30 | 
 31 | The latest development version containing changes not yet released can be installed from source:
 32 | 
 33 | .. code-block:: bash
 34 |       
 35 |       $ git clone git@github.com:oduwsdl/archivenow.git
 36 |       $ cd archivenow
 37 |       $ pip install -r requirements.txt
 38 |       $ pip install ./
 39 |       
 40 | In order to push to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_, **archivenow** must use `Selenium <https://selenium-python.readthedocs.io/>`_, which has already been added to the requirements.txt. However, Selenium additionally needs a driver to interface with the chosen browser. It is recommended to use Selenium and **archivenow** with `Firefox <https://www.mozilla.org/en-US/firefox/releases/>`_ and Firefox's corresponding `GeckoDriver <https://github.com/mozilla/geckodriver/releases>`_.
 41 | 
 42 | You can download the latest versions of `Firefox <https://www.mozilla.org/en-US/firefox/releases/>`_ and the `GeckoDriver <https://github.com/mozilla/geckodriver/releases>`_ to use with **archivenow**.
 43 | 
 44 | After installing the driver, you can push to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_ from **archivenow**.
 45 | 
 46 | CLI USAGE 
 47 | ---------
 48 | Usage information for **archivenow** can be displayed by providing the `-h` or `--help` flag, as shown below.
 49 | 
 50 | .. code-block:: bash
 51 | 
 52 |       $ archivenow -h
 53 |       usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]]
 54 |                            [--is] [--ia] [--warc [WARC]] [-v] [--all]
 55 |                            [--server] [--host [HOST]] [--agent [AGENT]]
 56 |                            [--port [PORT]]
 57 |                            [URI]
 58 | 
 59 |       positional arguments:
 60 |         URI                   URI of a web resource
 61 | 
 62 |       optional arguments:
 63 |         -h, --help            show this help message and exit
 64 |         --mg                  Use Megalodon.jp
 65 |         --cc                  Use The Perma.cc Archive
 66 |         --cc_api_key [CC_API_KEY]
 67 |                               An API KEY is required by The Perma.cc Archive
 68 |         --is                  Use The Archive.is
 69 |         --ia                  Use The Internet Archive
 70 |         --warc [WARC]         Generate WARC file
 71 |         -v, --version         Report the version of archivenow
 72 |         --all                 Use all possible archives
 73 |         --server              Run archiveNow as a Web Service
 74 |         --host [HOST]         A server address
 75 |         --agent [AGENT]       Use "wget" or "squidwarc" for WARC generation
 76 |         --port [PORT]         A port number to run a Web Service
 77 | 
 78 | Examples
 79 | --------
 80 | 
 81 | 
 82 | Example 1
 83 | ~~~~~~~~~
 84 | 
 85 | To save the web page (www.foxnews.com) in the Internet Archive:
 86 | 
 87 | .. code-block:: bash
 88 | 
 89 |       $ archivenow --ia www.foxnews.com
 90 |       https://web.archive.org/web/20170209135625/http://www.foxnews.com
 91 | 
 92 | Example 2
 93 | ~~~~~~~~~
 94 | 
 95 | By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:
 96 | 
 97 | .. code-block:: bash
 98 | 
 99 |       $ archivenow www.foxnews.com
100 |       https://web.archive.org/web/20170215164835/http://www.foxnews.com
101 | 
102 | Example 3
103 | ~~~~~~~~~
104 | 
105 | To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:
106 | 
107 | .. code-block:: bash
108 |       
109 |       $ archivenow --ia --is www.foxnews.com
110 |       https://web.archive.org/web/20170209140345/http://www.foxnews.com
111 |       http://archive.is/fPVyc
112 | 
113 | 
114 | Example 4
115 | ~~~~~~~~~
116 | 
117 | To save the web page (https://nypost.com/) in all configured web archives. In addition to preserving the page in all configured archives, this command will also locally create a WARC file:
118 | 
119 | .. code-block:: bash
120 |       
121 |       $ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key
122 |       http://archive.is/dcnan
123 |       https://perma.cc/53CC-5ST8
124 |       https://web.archive.org/web/20181002081445/https://nypost.com/
125 |       https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/
126 |       https_nypost.com__96ec2300.warc
127 | 
128 | Example 5
129 | ~~~~~~~~~
130 | 
131 | To download the web page (https://nypost.com/) and create a WARC file:
132 | 
133 | .. code-block:: bash
134 |       
135 |       $ archivenow --warc=mypage --agent=wget https://nypost.com/
136 |       mypage.warc
137 |       
138 | Server
139 | ------
140 | 
141 | You can run **archivenow** as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 12345).
142 | 
143 | .. code-block:: bash
144 |       
145 |       $ archivenow --server
146 |       
147 |       Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)
148 | 
149 | 
150 | Example 6
151 | ~~~~~~~~~
152 | 
153 | To save the web page (www.foxnews.com) in The Internet Archive through the web service:
154 | 
155 | .. code-block:: bash
156 | 
157 |       $ curl -i http://0.0.0.0:12345/ia/www.foxnews.com
158 |       
159 |           HTTP/1.0 200 OK
160 |           Content-Type: application/json
161 |           Content-Length: 95
162 |           Server: Werkzeug/0.11.15 Python/2.7.10
163 |           Date: Tue, 02 Oct 2018 08:20:18 GMT
164 | 
165 |           {
166 |             "results": [
167 |               "https://web.archive.org/web/20181002082007/http://www.foxnews.com"
168 |             ]
169 |           }
170 |       
171 | Example 7
172 | ~~~~~~~~~
173 | 
174 | To save the web page (www.foxnews.com) in all configured archives through the web service:
175 | 
176 | .. code-block:: bash
177 |       
178 |       $ curl -i http://0.0.0.0:12345/all/www.foxnews.com
179 | 
180 |           HTTP/1.0 200 OK
181 |           Content-Type: application/json
182 |           Content-Length: 385
183 |           Server: Werkzeug/0.11.15 Python/2.7.10
184 |           Date: Tue, 02 Oct 2018 08:23:53 GMT
185 | 
186 |           {
187 |             "results": [
188 |               "Error (The Perma.cc Archive): An API Key is required ", 
189 |               "http://archive.is/ukads", 
190 |               "https://web.archive.org/web/20181002082007/http://www.foxnews.com", 
191 |               "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ", 
192 |               "http://www.webcitation.org/72rbKsX8B"
193 |             ]
194 |           }
195 | 
196 | Example 8
197 | ~~~~~~~~~
198 | 
199 | Because an API Key is required by Perma.cc, the HTTP request should be as follows:
200 |         
201 | .. code-block:: bash
202 |       
203 |       $ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key
204 | 
205 | Or use only Perma.cc:
206 | 
207 | .. code-block:: bash
208 | 
209 |       $ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key
210 | 
211 | Running as a Docker Container
212 | -----------------------------
213 | 
214 | .. code-block:: bash
215 | 
216 |     $ docker image pull oduwsdl/archivenow
217 | 
218 | There are different ways to run **archivenow**:
219 | 
220 | .. code-block:: bash
221 | 
222 |     $ docker container run -it --rm oduwsdl/archivenow -h
223 | 
224 | Accessible at 127.0.0.1:12345:
225 | 
226 | .. code-block:: bash
227 | 
228 |     $ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0
229 | 
230 | Accessible at 127.0.0.1:22222:
231 | 
232 | .. code-block:: bash
233 | 
234 |     $ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0
235 | 
236 | .. image:: http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif
237 |    :width: 10pt
238 | 
239 | 
240 | To save the web page (http://www.cnn.com) in The Internet Archive:
241 | 
242 | .. code-block:: bash
243 | 
244 |     $ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com
245 |     
246 | 
247 | Python Usage
248 | ------------
249 | 
250 | .. code-block:: bash
251 |    
252 |     >>> from archivenow import archivenow
253 | 
254 | Example 9
255 | ~~~~~~~~~~
256 | 
257 | To save the web page (www.foxnews.com) in all configured archives:
258 | 
259 | .. code-block:: bash
260 | 
261 |       >>> archivenow.push("www.foxnews.com","all")
262 |       ['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']
263 | 
264 | Example 10
265 | ~~~~~~~~~~
266 | 
267 | To save the web page (www.foxnews.com) in The Perma.cc:
268 | 
269 | .. code-block:: bash
270 | 
271 |       >>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})
272 |       ['https://perma.cc/8YYC-C7RM']
273 |       
274 | Example 11
275 | ~~~~~~~~~~
276 | 
277 | To start the server from Python do the following. The server/port number can be passed (e.g., start(port=1111, host='localhost')):
278 | 
279 | .. code-block:: bash
280 | 
281 |       >>> archivenow.start()
282 |       
283 |           2017-02-09 15:02:37
284 |           Running on http://127.0.0.1:12345
285 |           (Press CTRL+C to quit)
286 | 
287 | 
288 | Configuring a new archive or removing an existing one
289 | -----------------------------------------------------
290 | Additional archives may be added by creating a handler file in the "handlers" directory.
291 | 
292 | For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write:
293 | 
294 | 
295 | .. code-block:: python
296 | 
297 |       archivenow.push("www.cnn.com","ma")
298 |       
299 | 
300 | In the file "ma_handler.py", the name of the class must be "MA_handler". This class must define at least a method called "push" that takes the URI to archive as its argument (the bundled handlers also accept optional ``p_args`` and ``session`` parameters). See the existing `handler files`_ for examples of how to organize a newly configured archive handler.
301 | 
302 | Removing an archive can be done by one of the following options:
303 | 
304 | - Removing the archive handler file from the folder "handlers"
305 | 
306 | - Renaming the archive handler file to another name that does not end with "_handler.py"
307 | 
308 | - Setting the variable "enabled" to "False" inside the handler file
309 | 
310 | 
311 | Notes
312 | -----
313 | The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the "same" resource. 
314 | 
315 | For example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (*C*) of this URI. IA will then return *C* for all requests to the archive for this URI received until 10:02pm. Using this same submission procedure for Archive.is requires a time gap of five minutes.  
316 | 
317 | .. _handler files: https://github.com/oduwsdl/archivenow/tree/master/archivenow/handlers
318 | 
319 | 
320 | Citing Project
321 | --------------
322 | 
323 | .. code-block:: latex
324 | 
325 |       @INPROCEEDINGS{archivenow-jcdl2018,
326 |         AUTHOR    = {Mohamed Aturban and
327 |                      Mat Kelly and
328 |                      Sawood Alam and
329 |                      John A. Berlin and
330 |                      Michael L. Nelson and
331 |                      Michele C. Weigle},
332 |         TITLE     = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},
333 |         BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},
334 |         SERIES    = {{JCDL} '18},
335 |         PAGES     = {321--322},
336 |         MONTH     = {June},
337 |         YEAR      = {2018},
338 |         ADDRESS   = {Fort Worth, Texas, USA},
339 |         URL       = {https://doi.org/10.1145/3197026.3203880},
340 |         DOI       = {10.1145/3197026.3203880}
341 |       }
342 | 
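The handler convention described in the README section on configuring a new archive can be sketched in Python as follows. This is a minimal illustration and not part of the repository: the class name `MA_handler` and the archive name come from the README's own example, the attributes `enabled`, `name`, and `api_required` mirror those of the bundled handlers, and the endpoint `https://myarchive.example/save/` is a hypothetical placeholder.

```python
class MA_handler(object):
    """Sketch of a handler that would live in handlers/ma_handler.py."""

    def __init__(self):
        self.enabled = True          # set to False to disable the archive
        self.name = 'My Archive'     # human-readable name used in messages
        self.api_required = False    # True would add an --ma_api_key option

    def push(self, uri_org, p_args=None, session=None):
        try:
            if session is None:
                # imported lazily so the sketch loads without requests
                import requests
                session = requests.Session()
            # hypothetical save endpoint; a real archive defines its own
            r = session.get('https://myarchive.example/save/' + uri_org,
                            timeout=120)
            r.raise_for_status()
            return r.url
        except Exception as e:
            # like the bundled handlers, report failures as a string
            return "Error (" + self.name + "): " + str(e)
```

Returning an error string instead of raising keeps the result list uniform, which is how the bundled handlers report failures.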


--------------------------------------------------------------------------------
/archivenow/__init__.py:
--------------------------------------------------------------------------------
1 | __version__ = '2020.7.18.12.19.44'


--------------------------------------------------------------------------------
/archivenow/archivenow.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import os
  3 | import re
  4 | import sys
  5 | import uuid
  6 | import glob
  7 | import json
  8 | import importlib
  9 | import argparse
 10 | import string
 11 | import requests
 12 | from threading import Thread
 13 | from flask import request, Flask, jsonify, render_template
 14 | from pathlib import Path
 15 | 
 16 | #from __init__ import __version__ as archiveNowVersion
 17 | 
 18 | archiveNowVersion = '2020.7.18.12.19.44'
 19 | 
 20 | # archive handlers path
 21 | PATH = Path(os.path.dirname(os.path.abspath(__file__)))
 22 | PATH_HANDLER = PATH / 'handlers'
 23 | 
 24 | # for the web app
 25 | app = Flask(__name__)
 26 | 
 27 | # create handlers for enabled archives
 28 | global handlers
 29 | handlers = {}
 30 | 
 31 | # default value for server/port
 32 | SERVER_IP = '0.0.0.0'
 33 | SERVER_PORT = 12345
 34 | 
 35 | 
 36 | def bad_request(error=None):
 37 |     message = {
 38 |         'status': 400,
 39 |         'message': 'Error in processing the request',
 40 |     }
 41 |     resp = jsonify(message)
 42 |     resp.status_code = 400
 43 |     return resp
 44 | 
 45 | 
 46 | # def getServer_IP_PORT():
 47 | #     u = str(SERVER_IP)
 48 | #     if str(SERVER_PORT) != '80':
 49 | #         u = u + ":" + str(SERVER_PORT)
 50 | #     if 'http' != u[0:4]:
 51 | #         u = 'http://' + u
 52 | #     return u
 53 | 
 54 | 
 55 | def listArchives_server(handlers):
 56 |     uri_args = ''
 57 |     if 'cc' in handlers:
 58 |         if handlers['cc'].enabled and handlers['cc'].api_required:
 59 |             uri_args = '?cc_api_key={Your-Perma.cc-API-Key}'
 60 |     li = {"archives": [{  # getServer_IP_PORT() + 
 61 |         "id": "all", "GET":'/all/' + '{URI}'+uri_args,
 62 |         "archive-name": "All enabled archives"}]}
 63 |     for handler in handlers:
 64 |         if handlers[handler].enabled:
 65 |             uri_args2 = ''
 66 |             if handler == 'cc':
 67 |                 uri_args2 = uri_args
 68 |             li["archives"].append({ #getServer_IP_PORT() +
 69 |                 "id": handler, "archive-name": handlers[handler].name,
 70 |                 "GET":  '/' + handler + '/' + '{URI}'+uri_args2})
 71 |     return li
 72 | 
 73 | 
 74 | @app.route('/', defaults={'path': ''}, methods=['GET'])
 75 | @app.route('/<path:path>', methods=['GET'])
 76 | def pushit(path):
 77 |     # no path; return a list of available archives
 78 |     if path == '':
 79 |         #resp = jsonify(listArchives_server(handlers))
 80 |         #resp.status_code = 200
 81 |         return render_template('index.html')
 82 |         #return resp
 83 |     # get request with path
 84 |     elif (path == 'api'):
 85 |         resp = jsonify(listArchives_server(handlers))
 86 |         resp.status_code = 200
 87 |         return resp
 88 |     elif (path == "ajax-loader.gif"):
 89 |         return render_template('ajax-loader.gif')
 90 |     else:
 91 |         try:
 92 |             # get the args passed to push function like API KEY if provided
 93 |             PUSH_ARGS = {}
 94 |             for k in request.args.keys():
 95 |                 PUSH_ARGS[k] = request.args[k]
 96 | 
 97 |             s = str(path).split('/', 1)
 98 |             arc_id = s[0]
 99 |             URI = request.url.split('/', 4)[4] # include query params, too
100 | 
101 |             if 'herokuapp.com' in request.host:
102 |                 PUSH_ARGS['from_heroku'] = True
103 | 
104 |             # To push into archives
105 |             resp = {"results": push(URI, arc_id, PUSH_ARGS)}
106 |             if len(resp["results"]) == 0:
107 |                 return bad_request()
108 |             else:
109 |                 # what to return
110 |                 resp = jsonify(resp)
111 |                 resp.status_code = 200
112 | 
113 |                 return resp
114 |         except Exception as e:
115 |             pass
116 |         return bad_request()
117 | 
118 | res_uris = {}
119 | 
120 | 
121 | def push_proxy(hdlr, URIproxy, p_args_proxy, res_uris_idx, session=requests.Session()):
122 |     global res_uris
123 |     try:
124 |         res = hdlr.push( URIproxy , p_args_proxy, session=session)
125 |         print ( res )
126 |         res_uris[res_uris_idx].append(res)
127 |     except:
128 |         pass
129 | 
130 | def push(URI, arc_id, p_args={}, session=requests.Session()):
131 |     global handlers
132 |     global res_uris
133 |     try:
134 |         # push to all possible archives
135 |         res_uris_idx = str(uuid.uuid4())
136 |         res_uris[res_uris_idx] = []
137 |         ### if arc_id == 'all':
138 |             ### for handler in handlers:
139 |                 ### if (handlers[handler].api_required):
140 |                     # pass args like key API
141 |                     ### res.append(handlers[handler].push(str(URI), p_args))
142 |                 ### else:
143 |                     ### res.append(handlers[handler].push(str(URI)))
144 |         ### else:
145 |             # push to the chosen archives
146 | 
147 |         threads = []
148 | 
149 |         for handler in handlers:
150 |             if (arc_id == handler) or (arc_id == 'all'):
151 |             ### if (arc_id == handler): ### and (handlers[handler].api_required):
152 |                 #res.append(handlers[handler].push(str(URI), p_args))
153 |                 #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
154 |                 threads.append(
155 |                     Thread(
156 |                         target=push_proxy, 
157 |                         args=(handlers[handler], str(URI), p_args, res_uris_idx, ), 
158 |                         kwargs={'session': session}))
159 |                 ### elif (arc_id == handler):
160 |                     ### res.append(handlers[handler].push(str(URI)))
161 | 
162 |         for th in threads:
163 |             th.start()
164 |         for th in threads:
165 |             th.join()
166 | 
167 |         res = res_uris[res_uris_idx]
168 |         del res_uris[res_uris_idx]
169 |         return res
170 |     except:
171 |         del res_uris[res_uris_idx]
172 |         pass
173 |     return ["bad request"]
174 | 
175 | 
176 | def start(port=SERVER_PORT, host=SERVER_IP):
177 |     global SERVER_PORT
178 |     global SERVER_IP
179 |     SERVER_PORT = port
180 |     SERVER_IP = host
181 |     app.run(
182 |         host=host,
183 |         port=port,
184 |         threaded=True,
185 |         debug=True,
186 |         use_reloader=False)
187 | 
188 | 
189 | def load_handlers():
190 |     global handlers
191 |     handlers = {}
192 |     # add the path of the handlers to the system so they can be imported
193 |     sys.path.append(str(PATH_HANDLER))
194 | 
195 |     # create a list of handlers.
196 |     for file in PATH_HANDLER.glob('*_handler.py'):
197 |         name = file.stem
198 |         prefix = name.replace('_handler', '')
199 |         mod = importlib.import_module(name)
200 |         mod_class = getattr(mod, prefix.upper() + '_handler')
201 |         # finally an object is created
202 |         handlers[prefix] = mod_class()
203 |     # exclude all disabled archives
204 | 
205 |     for handler in list(handlers): # handlers.keys():
206 |         if not handlers[handler].enabled:
207 |             del handlers[handler]
208 | 
209 | 
210 | def args_parser():
211 |     global SERVER_PORT
212 |     global SERVER_IP
213 |     # parsing arguments
214 | 
215 |     class MyParser(argparse.ArgumentParser):
216 | 
217 |         def error(self, message):
218 |             sys.stderr.write('error: %s\n' % message)
219 |             self.print_help()
220 |             sys.exit(2)
221 | 
222 |         def printm(self):
223 |             sys.stderr.write('')
224 |             self.print_help()
225 |             sys.exit(2)
226 | 
227 |     parser = MyParser()
228 | 
229 |     # arc_handler = 0
230 |     for handler in handlers:
231 |         # add archives identifiers to the list of options
232 |         # arc_handler += 1
233 |         if handler == 'warc':
234 |             parser.add_argument('--' + handler, nargs='?', 
235 |                             help=handlers[handler].name)
236 |         else:
237 |             parser.add_argument('--' + handler, action='store_true', default=False,
238 |                             help='Use ' + handlers[handler].name)
239 |         if (handlers[handler].api_required):
240 |             parser.add_argument(
241 |                 '--' +
242 |                 handler +
243 |                 '_api_key',
244 |                 nargs='?',
245 |                 help='An API KEY is required by ' +
246 |                 handlers[handler].name)
247 | 
248 |     parser.add_argument(
249 |         '-v',
250 |         '--version',
251 |         help='Report the version of archivenow',
252 |         action='version',
253 |         version='ArchiveNow ' +
254 |         archiveNowVersion)
255 | 
256 |     if len(handlers) > 0:
257 |         parser.add_argument('--all', action='store_true', default=False,
258 |                             help='Use all possible archives ')
259 | 
260 |         parser.add_argument('--server', action='store_true', default=False,
261 |                             help='Run archiveNow as a Web Service ')
262 | 
263 |         parser.add_argument('URI', nargs='?', help='URI of a web resource')
264 | 
265 |         parser.add_argument('--host', nargs='?', help='A server address')
266 | 
267 |         if 'warc' in handlers.keys():
268 |             parser.add_argument('--agent', nargs='?', help='Use "wget" or "squidwarc" for WARC generation')
269 | 
270 |         parser.add_argument(
271 |             '--port',
272 |             nargs='?',
273 |             help='A port number to run a Web Service')
274 | 
275 |         args = parser.parse_args()
276 |     else:
277 |         print ('\n Error: No enabled archive handler found\n')
278 |         sys.exit(0)
279 | 
280 |     arc_opt = 0
281 |     # start the server
282 |     if getattr(args, 'server'):
283 |         if getattr(args, 'port'):
284 |             SERVER_PORT = int(args.port)
285 |         if getattr(args, 'host'):
286 |             SERVER_IP = str(args.host)
287 | 
288 |         start(port=SERVER_PORT, host=SERVER_IP)
289 | 
290 |     else:
291 |         if not getattr(args, 'URI'):
292 |             print (parser.error('too few arguments'))
293 |         res = []
294 | 
295 |         # get the args passed to push function like API KEY if provided
296 |         PUSH_ARGS = {}
297 |         for handler in handlers:
298 |             if (handlers[handler].api_required):
299 |                 if getattr(args, handler + '_api_key'):
300 |                     PUSH_ARGS[
301 |                         handler +
302 |                         '_api_key'] = getattr(
303 |                         args,
304 |                         handler +
305 |                         '_api_key')
306 |                 else:
307 |                     if getattr(args, handler):
308 |                         print (
309 |                             parser.error(
310 |                                 'An API Key is required by ' +
311 |                                 handlers[handler].name))
312 |             original_warc_value = getattr(args, 'warc', None)
313 |             if handler == 'warc':
314 |                 PUSH_ARGS['warc'] = getattr(args, 'warc')
315 |                 if PUSH_ARGS['warc'] == None:
316 |                     valid_chars = "-_.()/ %s%s" % (string.ascii_letters, string.digits)
317 |                     PUSH_ARGS['warc'] = ''.join(c for c in str(args.URI).strip() if c in valid_chars)
318 |                     PUSH_ARGS['warc'] = PUSH_ARGS['warc'].replace(' ','_').replace('/','_').replace('__','_') # I don't like spaces in filenames.
319 |                     PUSH_ARGS['warc'] = PUSH_ARGS['warc']+'_'+str(uuid.uuid4())[:8]
320 |                 if PUSH_ARGS['warc'][-1] == '_':
321 |                     PUSH_ARGS['warc'] = PUSH_ARGS['warc'][:-1]
322 |                 agent = 'wget'
323 |                 tmp_agent = getattr(args, 'agent')
324 |                 if tmp_agent == 'squidwarc':
325 |                     agent = tmp_agent
326 |                 PUSH_ARGS['agent'] = agent
327 | 
328 |         # sys.exit(0)
329 | 
330 |         # push to all possible archives
331 |         if getattr(args, 'all'):
332 |             arc_opt = 1
333 |             res = push(str(args.URI).strip(), 'all', PUSH_ARGS)
334 |         else:
335 |             # push to the chosen archives
336 |             for handler in handlers:
337 |                 if getattr(args, handler):
338 |                     arc_opt += 1
339 |                     for i in push(str(args.URI).strip(), handler, PUSH_ARGS):
340 |                         res.append(i)
341 |             # push to the default archive
342 |             if (len(handlers) > 0) and (arc_opt == 0):
343 |                 # set the default; it is 'ia' by default or the first archive in the
344 |                 # list if not found
345 |                 if 'ia' in handlers:
346 |                     res = push(str(args.URI).strip(), 'ia', PUSH_ARGS)
347 |                 else:
348 |                     res = push(str(args.URI).strip(),
349 |                                list(handlers.keys())[0], PUSH_ARGS)
350 |                 # print (parser.printm())
351 |             # else:
352 |         # for rs in res:
353 |         #     print (rs)
354 | 
355 | load_handlers()
356 | 
357 | if __name__ == '__main__':
358 |     args_parser()
359 | 
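The `push()` function in archivenow.py starts one thread per selected handler and gathers the results in a shared list. That pattern can be sketched on its own as below. The names `fan_out` and `EchoHandler` are hypothetical and exist only for this illustration, and the stand-in handlers perform no network I/O.

```python
from threading import Thread


def fan_out(handlers, uri, arc_id):
    """Call push() on every handler matching arc_id, one thread each."""
    results = []

    def worker(handler):
        try:
            # list.append is thread-safe in CPython, so no lock is used
            results.append(handler.push(uri))
        except Exception:
            # a failing handler is skipped, as push_proxy() does
            pass

    threads = [Thread(target=worker, args=(h,))
               for key, h in handlers.items()
               if arc_id in (key, 'all')]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


class EchoHandler(object):
    """Stand-in handler that returns a label instead of archiving."""

    def __init__(self, label):
        self.label = label

    def push(self, uri):
        return self.label + ':' + uri


if __name__ == '__main__':
    demo = {'ia': EchoHandler('ia'), 'is': EchoHandler('is')}
    # completion order is not guaranteed, so sort before printing
    print(sorted(fan_out(demo, 'www.example.com', 'all')))
```

Because the threads finish in arbitrary order, callers should not rely on the position of any archive's result in the returned list.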


--------------------------------------------------------------------------------
/archivenow/handlers/cc_handler.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | import json
 3 | 
 4 | class CC_handler(object):
 5 | 
 6 |     def __init__(self):
 7 |         self.enabled = True
 8 |         self.name = 'The Perma.cc Archive'
 9 |         self.api_required = True
10 | 
11 |     def push(self, uri_org, p_args={}, session=requests.Session()):
12 |         msg = ''
13 |         try:
14 | 
15 |             APIKEY = p_args['cc_api_key']
16 | 
17 |             r = session.post('https://api.perma.cc/v1/archives/?api_key='+APIKEY, timeout=120,
18 |                                                            data=json.dumps({"url":uri_org}),
19 |                                                            headers={'Content-type': 'application/json'},
20 |                                                            allow_redirects=True)       
21 |             r.raise_for_status()
22 | 
23 |             if 'Location' in r.headers:
24 |                 return 'https://perma.cc/'+r.headers['Location'].rsplit('/',1)[1]
25 |             else:
26 |                 for r2 in r.history:
27 |                     if 'Location' in r2.headers:
28 |                         return 'https://perma.cc/'+r2.headers['Location'].rsplit('/',1)[1]
29 |             entity_json = r.json()
30 |             if 'guid' in entity_json:
31 |                 return str('https://perma.cc/'+entity_json['guid'])
32 |             msg = "Error (" + self.name + "): No HTTP Location header is returned in the response"
33 |         except Exception as e:
34 |             # a missing 'cc_api_key' entry raises KeyError, whose text contains '_api_key'
35 |             if (msg == '') and ('_api_key' in str(e)):
36 |                 msg = "Error (" + self.name + "): An API Key is required"
37 |             elif (msg == ''):
38 |                 msg = "Error (" + self.name + "): " + str(e)
39 |         return msg
40 | 
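The lookup order used by `CC_handler.push` (final response headers, then redirect history, then the JSON body) can be expressed as a pure function that is easy to test without network access. This is a sketch under assumptions: `extract_perma_link` is a hypothetical helper, the header values in the usage note are made up, and plain dicts are case-sensitive, unlike the header objects returned by `requests`.

```python
def extract_perma_link(headers, history_headers, body):
    """Return the public Perma.cc link, or None if no GUID can be found.

    headers         -- headers of the final response (dict-like)
    history_headers -- headers of each redirect response, in order
    body            -- decoded JSON body of the final response (dict)
    """
    for candidate in [headers] + list(history_headers):
        location = candidate.get('Location')
        if location:
            # The GUID is taken to be the last path segment of the header value.
            guid = location.rstrip('/').rsplit('/', 1)[-1]
            if guid:
                return 'https://perma.cc/' + guid
    # Fall back to the 'guid' field of the JSON body, as the handler does.
    guid = body.get('guid')
    if guid:
        return 'https://perma.cc/' + guid
    return None
```

For example, `extract_perma_link({}, [], {'guid': 'QX65-CFDD'})` yields the same link form that the handler builds from `entity_json['guid']`.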


--------------------------------------------------------------------------------
/archivenow/handlers/ia_handler.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | 
 3 | class IA_handler(object):
 4 | 
 5 |     def __init__(self):
 6 |         self.enabled = True
 7 |         self.name = 'The Internet Archive'
 8 |         self.api_required = False
 9 | 
10 |     def push(self, uri_org, p_args=[], session=requests.Session()):
11 |         msg = ''
12 |         try:
13 |             uri = 'https://web.archive.org/save/' + uri_org
14 |             browserUserAgent = {
15 |                 "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
16 |             }
17 |             # push into the archive; send the browser-like User-Agent only when the
18 |             # session still carries the default python-requests one
19 | 
20 |             if ('user-agent' in session.headers) and (not session.headers['User-Agent'].lower().startswith('python-requests/')):
21 |                 r = session.get(uri, timeout=120, allow_redirects=True)
22 |             else:
23 |                 r = session.get(uri, timeout=120, allow_redirects=True, headers=browserUserAgent)
24 | 
25 |             r.raise_for_status()
26 |             # extract the link to the archived copy
27 |             if r is not None:
28 |                 if "Location" in r.headers:
29 |                     return r.headers["Location"]
30 |                 elif "Content-Location" in r.headers:
31 |                     if (r.headers["Content-Location"]).startswith("/web/"):
32 |                         return "https://web.archive.org"+r.headers["Content-Location"]
33 |                     else:
34 |                         try:
35 |                             uri_from_content = "https://web.archive.org" + r.text.split('var redirUrl = "',1)[1].split('"',1)[0]
36 |                         except Exception:
37 |                             # fall back to the raw header value if the page has no redirUrl
38 |                             uri_from_content = r.headers["Content-Location"]
39 |                         return uri_from_content
40 |                 else:
41 |                     for r2 in r.history:
42 |                         if 'Location' in r2.headers:
43 |                             return r.url
44 |                             #return r2.headers['Location']
45 |                         if 'Content-Location' in r2.headers:
46 |                             return r.url
47 |                             #return r2.headers['Content-Location']
48 |             msg = "Error (" + self.name + "): No HTTP Location/Content-Location header is returned in the response"
49 |         except Exception as e:
50 |             if msg == '':
51 |                 msg = "Error (" + self.name+ "): " + str(e)
52 |             pass
53 |         return msg
54 | 
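The body-parsing fallback in `IA_handler.push` relies on chained `split` calls and a bare `except`. A regular expression makes the "not found" case explicit. The sketch below assumes only what the handler itself assumes, namely that the page contains a `var redirUrl = "..."` assignment; `memento_from_body` and `REDIR_URL_RE` are hypothetical names.

```python
import re

# Matches the JavaScript assignment that the handler searches for in r.text.
REDIR_URL_RE = re.compile(r'var redirUrl = "([^"]*)"')


def memento_from_body(text, base='https://web.archive.org'):
    """Return the archived-copy URL found in the page body, or None."""
    match = REDIR_URL_RE.search(text)
    if match is None:
        return None
    # The captured value is a path such as /web/<timestamp>/<uri>.
    return base + match.group(1)
```

A caller can then fall back to the `Content-Location` header only when the function returns `None`.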


--------------------------------------------------------------------------------
/archivenow/handlers/is_handler.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import requests
 3 | import sys
 4 | from selenium.webdriver.firefox.options import Options
 5 | from selenium import webdriver
 6 | from selenium.webdriver.common.keys import Keys
 7 | from selenium.webdriver.support.ui import WebDriverWait
 8 | from selenium.webdriver.support import expected_conditions as EC
 9 | from selenium.webdriver.common.by import By
10 | from selenium.common.exceptions import TimeoutException
11 | 
12 | 
13 | class IS_handler(object):
14 | 
15 |     def __init__(self):
16 |         self.enabled = True
17 |         self.name = 'The Archive.is'
18 |         self.api_required = False 
19 | 
20 |     def push(self, uri_org, p_args=[], session=requests.Session()):
21 | 
22 |         msg = ""
23 | 
24 |         try:
25 | 
26 |             options = Options()
27 |             options.add_argument("-headless") # Run in background
28 |             driver = webdriver.Firefox(options = options)
29 |             driver.get("https://archive.is")
30 | 
31 |             elem = driver.find_element(By.ID, "url") # Find the input box for the URL to be archived
32 | 
33 |             elem.send_keys(uri_org) # Place the URL in the input box
34 | 
35 |             saveButton = driver.find_element(By.XPATH, "/html/body/center/div/form[1]/div[3]/input") # Find the submit button
36 | 
37 |             saveButton.click() # Click the submit button
38 | 
39 |             # After clicking submit, there may be an additional page that pops up and asks if you are sure you want
40 |             # to archive that page since it was archived X amount of time ago. We need to wait for that page to 
41 |             # load and click submit again.
42 |             delay = 30 # seconds
43 |             try:
44 |                 nextSaveButton = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "/html/body/center/div[4]/center/div/div[2]/div/form/div/input")))
45 |                 nextSaveButton.click()
46 | 
47 |             except TimeoutException:
48 |                 pass
49 | 
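50 |             # The page takes a while to archive, so poll until the loading page is gone.
51 |             # WebDriverWait sleeps between checks instead of busy-waiting; the 300-second
52 |             # ceiling is an assumed upper bound and raises TimeoutException if exceeded.
53 |             WebDriverWait(driver, 300).until(
54 |                 lambda d: 'wip' not in d.current_url and 'submit' not in d.current_url)
55 | 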
56 | 
57 |             # After the loading screen is gone and the page is archived, the current URL
58 |             # will be the URL to the archived page.
59 |             msg = driver.current_url
60 | 
61 |             driver.quit()
62 | 
63 |         except:
64 | 
65 |             '''
66 |             exc_type, exc_obj, exc_tb = sys.exc_info()
67 |             fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
68 |             print((fname, exc_tb.tb_lineno, sys.exc_info() ))
69 |             '''
70 | 
71 |             msg = "Unable to complete request."
72 | 
73 |         return msg
74 | 
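The "keep checking until the loading page is gone" step in `IS_handler.push` is a polling loop, and polling loops are safer with a deadline and a pause between checks. The following is a minimal sketch independent of Selenium; `wait_until` and `is_archive_done` are hypothetical helpers, and the 300-second timeout and 0.5-second interval are assumed values, not settings taken from archivenow.

```python
import time


def is_archive_done(current_url):
    """Mirror the handler's test: neither 'wip' nor 'submit' appears in the URL."""
    return 'wip' not in current_url and 'submit' not in current_url


def wait_until(predicate, timeout=300.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call predicate() until it returns True or the timeout elapses.

    Returns True on success and False on timeout. clock and sleep are
    injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a live driver, the call would look like `wait_until(lambda: is_archive_done(driver.current_url))`, after which the caller checks the boolean result before reading the final URL.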


--------------------------------------------------------------------------------
/archivenow/handlers/mg_handler.py:
--------------------------------------------------------------------------------
 1 | # encoding: utf-8
 2 | import os
 3 | import requests
 4 | import sys
 5 | from selenium.webdriver.firefox.options import Options
 6 | from selenium import webdriver
 7 | from selenium.webdriver.common.keys import Keys
 8 | from selenium.webdriver.support.ui import WebDriverWait
 9 | from selenium.webdriver.support import expected_conditions as EC
10 | from selenium.webdriver.common.by import By
11 | from selenium.common.exceptions import TimeoutException
12 | 
13 | class MG_handler(object):
14 | 
15 |     def __init__(self):
16 |         self.enabled = True
17 |         self.name = 'Megalodon.jp'
18 |         self.api_required = False
19 | 
20 |     def push(self, uri_org, p_args=[], session=requests.Session()):
21 | 
22 |         msg = ""
23 | 
24 |         options = Options()
25 |         options.add_argument("-headless") # Run in background
26 |         driver = webdriver.Firefox(options = options)
27 |         driver.get("https://megalodon.jp/?url=" + uri_org)
28 | 
29 |         try:
30 |             addButton = driver.find_element(By.XPATH, "/html/body/div[2]/div[2]/div[8]/form/div[1]/input[2]")
31 | 
32 |             addButton.click() # Click the add button
33 |         except :
34 |             print("Unable to archive this page at this time.")
35 |             raise
36 | 
37 | 
38 |         stillOnPage = True
39 |         while stillOnPage:
40 |             try:
41 |                 button = driver.find_element(By.XPATH, "/html/body/div[2]/div[2]/div[1]/div/h3")
42 | 
43 |             except Exception:
44 |                 stillOnPage = False
45 | 
46 |             try:
47 |                 error = driver.find_element(By.XPATH, "/html/body/div[2]/div[2]/div[3]/div/a/h3")
48 |                 msg = "We apologize for the inconvenience. Currently, acquisitions that are considered \"robots\" in the acquisition of certain conditions are prohibited."
49 |                 # The archive refused the request: stop polling and return the message.
50 |                 break
51 | 
52 |             except Exception:
53 |                 pass
54 | 
55 |         # The page takes a while to archive, so keep checking if the loading page is still displayed.
56 |         loading = (msg == "") # Skip the wait if the archive already reported an error
57 |         while loading:
58 |             try:
59 |                 loadingPage = driver.find_element(By.XPATH, "/html/body/div[2]/div/div[1]/a/img")
60 |                 loading = False
61 | 
62 |             except Exception:
63 |                 loading = True
64 | 
65 |         # After the loading screen is gone and the page is archived, the current URL
66 |         # will be the URL to the archived page.
67 |         if msg == "":
68 |             msg = driver.current_url
69 |         driver.quit()
70 |         return msg
71 |         


--------------------------------------------------------------------------------
/archivenow/handlers/warc_handler.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | import os.path
 3 | import distutils.spawn
 4 | 
 5 | class WARC_handler(object):
 6 | 
 7 |     def __init__(self):
 8 |         self.enabled = True
 9 |         self.name = 'Generate WARC file'
10 |         self.api_required = False
11 | 
12 |     def push(self, uri_org, p_args=[], session=requests.Session()):
13 |         msg = ''
14 |         if p_args['agent'] == 'squidwarc':
15 |             # squidwarc
16 |             #if not distutils.spawn.find_executable("squidwarc"):
17 |             #    return 'wget is not installed!'
18 |             os.system('python ~/squidwarc_one_page/generte_warcs.py 9222 "'+uri_org+'" '+p_args['warc']+'.warc  &> /dev/null')
19 |             if os.path.exists(p_args['warc']):
20 |                 return p_args['warc']
21 |             elif os.path.exists(p_args['warc']+'.warc'):
22 |                 return p_args['warc']+'.warc'
23 |             else:
24 |                 return 'squidwarc failed to generate the WARC file'
25 | 
26 |         else:
27 |             if not distutils.spawn.find_executable("wget"):
28 |                 return 'wget is not installed!'
29 |             # wget 
30 |             os.system('wget -E -H -k -p -q --delete-after --no-warc-compression --warc-file="'+p_args['warc']+'" "'+uri_org+'"')
31 |             if os.path.exists(p_args['warc']):
32 |                 return p_args['warc']
33 |             elif os.path.exists(p_args['warc']+'.warc'):
34 |                 return p_args['warc']+'.warc'
35 |             else:
36 |                 return 'wget failed to generate the WARC file'
37 | 
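`WARC_handler.push` builds a shell command by string concatenation, so a URI containing quotes or shell metacharacters can alter the command. One alternative is to pass an argument list to `subprocess`, which bypasses the shell. The sketch below reuses the wget flags from the handler; `build_wget_command` and `generate_warc` are hypothetical names, and `shutil.which` assumes Python 3.3 or later.

```python
import os.path
import shutil
import subprocess


def build_wget_command(uri, warc_name):
    """Return the wget argument list that writes <warc_name>.warc."""
    return ['wget', '-E', '-H', '-k', '-p', '-q', '--delete-after',
            '--no-warc-compression', '--warc-file=' + warc_name, uri]


def generate_warc(uri, warc_name):
    """Run wget and return the WARC path, or an error message on failure."""
    if shutil.which('wget') is None:
        return 'wget is not installed!'
    # No shell is involved, so quotes or semicolons in uri stay literal.
    subprocess.call(build_wget_command(uri, warc_name))
    for path in (warc_name, warc_name + '.warc'):
        if os.path.exists(path):
            return path
    return 'wget failed to generate the WARC file'
```

Keeping the command construction in its own function makes it possible to check the arguments without running wget.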


--------------------------------------------------------------------------------
/archivenow/static/ajax-loader.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oduwsdl/archivenow/dbc688f4f2384139f6eb1ebf7ced2319b49ecc73/archivenow/static/ajax-loader.gif


--------------------------------------------------------------------------------
/archivenow/templates/api.txt:
--------------------------------------------------------------------------------
 1 | <!--   <h4 id="archivenow_api">Archive Now API</h4>
 2 |   <h5 id="archivenow_api1">To push a web page into a particular web archive, use the following URL:</h5>
 3 |   <pre>
 4 |     http://{server}:{port}/{archive-id}/{URI}
 5 |   </pre>
 6 |   <h5 id="archivenow_api2">Archive identifier (use "all" for all archives):</h5>
 7 | <table style="width:30%">
 8 |   <tr>
 9 |     <th>Archive</th>
10 |     <th>Identifier</th>
11 |   </tr>
12 |   <tr>
13 |     <td>Internet Archive</td>
14 |     <td>ia</td>
15 |   </tr>
16 |   <tr>
17 |     <td>Archive.is</td>
18 |     <td>is</td>
19 |   </tr>
20 |   <tr>
21 |     <td>Perma.cc</td>
22 |     <td>cc</td>
23 |   </tr>
24 | </table> -->
25 | <!--   <h5 id="archivenow_api3">Example: capture http://www.example.com with the Internet Archive:</h5>
26 | <pre>
27 | curl  -i http://127.0.0.1:12345/ia/http://www.example.com
28 | 
29 | HTTP/1.0 200 OK
30 | Content-Type: application/json
31 | Content-Length: 95
32 | Server: Werkzeug/0.11.15 Python/2.7.10
33 | Date: Fri, 10 Nov 2017 22:36:26 GMT
34 | 
35 | {
36 |   "results": [
37 |     "https://web.archive.org/web/20171110223626/http://www.example.com"
38 |   ]
39 | }
40 | </pre>
41 |   <h5 id="archivenow_api4">Example: capture http://www.example.com in all four archives (Perma.cc requires an API key):</h5>
42 | <pre>
43 | curl -i 127.0.0.1:12345/all/http://www.example.com?cc_api_key=8r820...
44 | 
45 | HTTP/1.0 200 OK
46 | Content-Type: application/json
47 | Content-Length: 207
48 | Server: Werkzeug/0.11.15 Python/2.7.10
49 | Date: Fri, 10 Nov 2017 22:42:08 GMT
50 | 
51 | {
52 |   "results": [
53 |     "https://perma.cc/QX65-CFDD", 
54 |     "https://web.archive.org/web/20171110223626/http://www.example.com", 
55 |     "http://archive.is/ff17A", 
56 |     "http://www.webcitation.org/6uschXwlI"
57 |   ]
58 | }
59 | </pre> -->
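The URL pattern documented in this template, `http://{server}:{port}/{archive-id}/{URI}`, can be assembled programmatically. This is a small sketch: `build_push_url` is a hypothetical helper, and it simply appends `?cc_api_key=` the way the curl example does, so how the server treats a target URI that already has its own query string is not covered here.

```python
def build_push_url(server, port, archive_id, uri, cc_api_key=None):
    """Return the request URL for pushing uri into the given archive."""
    url = 'http://{0}:{1}/{2}/{3}'.format(server, port, archive_id, uri)
    if cc_api_key:
        # Perma.cc needs an API key; it is passed as a query parameter.
        url += '?cc_api_key=' + cc_api_key
    return url
```

The resulting string can be handed to any HTTP client; the response is expected to be JSON with a `results` list, as in the examples in this template.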


--------------------------------------------------------------------------------
/archivenow/templates/index.html:
--------------------------------------------------------------------------------
  1 | <html>
  2 | <head>
  3 | <style>
  4 | 
  5 | .reveal-if-active {
  6 |   opacity: 0;
  7 |   max-height: 0;
  8 |   overflow: hidden;
  9 |   font-size: 14px;
 10 |   -webkit-transform: scale(0.8);
 11 |           transform: scale(0.8);
 12 |   -webkit-transition: 0.5s;
 13 |   transition: 0.5s;
 14 | }
 15 | .reveal-if-active label {
 16 |   margin: 0 0 3px 22px;
 17 |   display: block;
 18 |   font-size: smaller;
 19 | }
 20 | .reveal-if-active input[type=text] {
 21 |   width: 300px;
 22 | }
 23 | input[id="choice-archive4"]:checked ~ .reveal-if-active {
 24 |   opacity: 1;
 25 |   max-height: 120px;
 26 |   padding: 0px 0px;
 27 |   -webkit-transform: scale(1);
 28 |           transform: scale(1);
 29 |   overflow: visible;
 30 | }
 31 | 
 32 | table {
 33 |     margin: 14px auto;
 34 |     opacity: 0;
 35 | }
 36 | 
 37 | table, th, td {
 38 |     border-collapse: collapse;
 39 | }
 40 | th, td {
 41 |     padding: 1px;
 42 |     text-align: left;
 43 |     font-family: "My Custom Font", Verdana, Tahoma;
 44 |     font-size: 12px;
 45 | }
 46 | 
 47 | tr{
 48 |     border-bottom: 1px solid #ccc;
 49 |     border-top: 1px solid #ccc;
 50 | }
 51 | #title {
 52 |   display: block;
 53 |   text-align: center;
 54 |   padding: 22px 0 0 0
 55 | }
 56 | 
 57 | .url{
 58 |   display: block;
 59 |   text-align: center;
 60 | }
 61 | 
 62 | #text_url{
 63 |   width:333px;
 64 |   font-size: 12.5px;
 65 | }
 66 | 
 67 | #select_label{
 68 |     padding: 0px 270px 0 0;
 69 |     text-align: center;
 70 |     margin-bottom: 8px;
 71 | }
 72 | 
 73 | #choices{
 74 |       text-align: center;
 75 |       padding: 0px 20px 0px 0px;
 76 |       margin-left: 133px;
 77 | }
 78 | 
 79 | #choices2{
 80 |       text-align: left;
 81 |       display: inline-block;
 82 | 
 83 | }
 84 | 
 85 | #perma_cc_api{
 86 |     margin: -2px 93px 0px 21px;
 87 | }
 88 | 
 89 | #submitdiv{
 90 |     text-align: center;
 91 |     padding: 20px 242px 0 0;
 92 |     margin: 0 0 0 38px;
 93 | }
 94 | 
 95 | input[type=submit] {
 96 |     width: 5em;
 97 |     height: 2em;
 98 |     font-size: 12px;
 99 |     background-color: gainsboro;
100 |     margin: 0px 0px 0px 13px;
101 | }
102 | 
103 | #errors{
104 |     font-size: smaller;
105 |     color: brown;
106 |     padding: 6px 0px 3px 104px;
107 | }
108 | 
109 | .img1{
110 | 
111 |   width: 13px;
112 |   opacity: 0;
113 | }
114 | 
115 | .img2{
116 | 
117 |   width: 13px;
118 |   opacity: 0;
119 | }
120 | 
121 | .img3{
122 | 
123 |   width: 13px;
124 |   opacity: 0;
125 | }
126 | 
127 | .img5{
128 | 
129 |   width: 13px;
130 |   opacity: 0;
131 | }
132 | 
133 | .img6{
134 | 
135 |   width: 13px;
136 |   opacity: 0;
137 | }
138 | 
139 | .img4{
140 | 
141 |   width: 13px;
142 |   opacity: 0;
143 | }
144 | 
145 | #apilink{
146 |   font-size: smaller;
147 |   padding-top: 39px;
148 | }
149 | </style>
150 | </head>
151 | <body>
152 |       <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
153 |       <h3 id="title"> Preserve a web page in web archives </h3>
154 |       <div class="url">
155 |         <label for="text_url" id="label_url">URL</label>
156 |         <input type="text" id="text_url" required>
157 |       </div>
158 |   
159 |   <div>
160 |     <p id="select_label">Select archives:</p>
161 |     <div id="choices">
162 |     <div id="choices2">
163 |     <input type="checkbox" id="choice-archive1" checked > Internet Archive <img src="{{ url_for('static', filename='ajax-loader.gif') }}" class="img1" id="img1"> <br>
164 |     <input type="checkbox" id="choice-archive2" checked > Archive.is       <img src="{{ url_for('static', filename='ajax-loader.gif') }}" class="img2" id="img2"> <br>
165 |     <input type="checkbox" id="choice-archive6" checked > Megalodon.jp <img src="{{ url_for('static', filename='ajax-loader.gif') }}" class="img6" id="img6"> <br>
166 |     <input type="checkbox" id="choice-archive4" > Perma.cc                 <img src="{{ url_for('static', filename='ajax-loader.gif') }}" class="img4" id="img4">
167 |     <div class="reveal-if-active">
168 |       <label for="perma_cc_api">Perma.cc requires <a href="https://perma.cc/settings/tools" target="_blank"> an API Key </a></label>
169 |       <input type="text" id="perma_cc_api">
170 |     </div>
171 |     </div>
172 |     </div>
173 |   </div>
174 |   <div id="submitdiv">
175 |     <input type="submit" value="Submit" onClick="push_archive();">
176 |     <input type="submit" value="Reset" onClick="reset();">
177 |     <div id ="errors"></div>
178 |   </div>
179 |     <table id="results" width="600">
180 |         <thead>
181 |         <tr>
182 |             <th scope="col" width="130">Archive</th>
183 |             <th scope="col" width="450">Link to the archived page</th>
184 |         </tr>
185 |         </thead>
186 |     </table>
187 | <div id="apilink"><a href="/api" target="_blank">Archive Now API</a></div>
188 | 
189 | <script type="text/javascript">
190 | 
191 |   document.getElementById('perma_cc_api').value = localStorage.getItem("permaccapikey");
192 | 
193 |   if (localStorage.getItem("check_archive_1") !== null){
194 |       if (localStorage.getItem("check_archive_1") == 'true'){
195 |         document.getElementById('choice-archive1').checked = true
196 |       }else{
197 |         document.getElementById('choice-archive1').checked = false
198 |       }
199 |   }
200 |   if (localStorage.getItem("check_archive_2") !== null){
201 |       if (localStorage.getItem("check_archive_2") == 'true'){
202 |         document.getElementById('choice-archive2').checked = true
203 |       }else{
204 |         document.getElementById('choice-archive2').checked = false
205 |       }
206 |   }
207 |   if (localStorage.getItem("check_archive_3") !== null && document.getElementById('choice-archive3')){
208 |       if (localStorage.getItem("check_archive_3") == 'true'){
209 |         document.getElementById('choice-archive3').checked = true
210 |       }else{
211 |         document.getElementById('choice-archive3').checked = false
212 |      }
213 |   }
214 |   if (localStorage.getItem("check_archive_5") !== null && document.getElementById('choice-archive5')){
215 |       if (localStorage.getItem("check_archive_5") == 'true'){
216 |         document.getElementById('choice-archive5').checked = true
217 |       }else{
218 |         document.getElementById('choice-archive5').checked = false
219 |      }
220 |   }
221 |   if (localStorage.getItem("check_archive_6") !== null){
222 |       if (localStorage.getItem("check_archive_6") == 'true'){
223 |         document.getElementById('choice-archive6').checked = true
224 |       }else{
225 |         document.getElementById('choice-archive6').checked = false
226 |      }
227 |   }
228 |   if (localStorage.getItem("check_archive_4") !== null){
229 |       if (localStorage.getItem("check_archive_4") == 'true'){
230 |         document.getElementById('choice-archive4').checked = true
231 |       }else{
232 |        document.getElementById('choice-archive4').checked = false
233 |       }
234 |   }
235 | 
236 |   function reset() {
237 | 
238 |       window.location.reload();
239 |   
240 |   }
241 | 
242 |   function push_archive() {
243 | 
244 |             document.getElementById('errors').innerHTML="";
245 |             localStorage.setItem("check_archive_1", false);
246 |             localStorage.setItem("check_archive_2", false);
247 |             localStorage.setItem("check_archive_3", false);
248 |             localStorage.setItem("check_archive_5", false);
249 |             localStorage.setItem("check_archive_6", false);
250 |             localStorage.setItem("check_archive_4", false);
251 | 
252 | 
253 |             var arr = []
254 | 
255 |             var table = document.getElementById('results');
256 |             for (var r = 1, n = table.rows.length; r < n; r++) {
257 |                     if(table.rows[r].cells[0].innerHTML.indexOf("https://archive.org") !== -1){
258 |                       arr.push("ia");
259 |                     }
260 |                     if(table.rows[r].cells[0].innerHTML.indexOf("https://archive.is") !== -1){
261 |                       arr.push("is");
262 |                     }
263 |                     if(table.rows[r].cells[0].innerHTML.indexOf("https://megalodon.jp") !== -1){
264 |                       arr.push("mg");
265 |                     }
266 |                     if(table.rows[r].cells[0].innerHTML.indexOf("https://www.webcitation.org") !== -1){
267 |                       arr.push("wc");
268 |                     }
269 |                     if(table.rows[r].cells[0].innerHTML.indexOf("https://perma.cc") !== -1){
270 |                       arr.push("cc");
271 |                     }
272 |             }
273 | 
274 |             function validateURL(textval) {
275 |                 var urlregex = /^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?$/i;
276 |                 return urlregex.test(textval);
277 |             }
278 |             if (validateURL(document.getElementById('text_url').value) == false){
279 |               document.getElementById('text_url').focus();
280 |               document.getElementById('errors').innerHTML="*Enter a correct URL*";
281 |               return;
282 |             }
283 | 
284 |             if (document.getElementById('choice-archive4').checked == true){ // perma.cc
285 |                 if(document.getElementById('perma_cc_api').value.trim() == ""){
286 |                   document.getElementById('perma_cc_api').focus();
287 |                   document.getElementById('errors').innerHTML="*Enter your Perma.cc API Key*";
288 |                   return;
289 |                 }
290 |             }
291 | 
292 |             var selected_archives = 0;
293 | 
294 |             if (document.getElementById('choice-archive1').checked == true){
295 |               
296 |                   selected_archives = selected_archives + 1;
297 | 
298 |                   if(arr.indexOf("ia") == -1){
299 |                       document.getElementById('img1').style.opacity = 1
300 |                       $.ajax({
301 |                           type: "GET",
302 |                           url: "ia/"+document.getElementById('text_url').value,
303 |                           success: function(json) {
304 |                               if (validateURL(json['results'][0]) == true){
305 |                                   var table=document.getElementById("results");
306 |                                   var row=table.insertRow(-1);
307 |                                   var cell1=row.insertCell(0);
308 |                                   var cell2=row.insertCell(1);
309 |                                   cell1.innerHTML='<a href="https://archive.org" target="_blank"> Internet Archive </a>'
310 |                                   cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
311 |                                   document.getElementById('results').style.opacity = 1
312 |                                   document.getElementById('img1').style.opacity = 0
313 |                               }
314 |                           },
315 |                           complete: function(){
316 |                             document.getElementById('img1').style.opacity = 0
317 |                           }
318 |                       });
319 |                  }
320 |                  localStorage.setItem("check_archive_1", true);
321 |             }
322 |             if (document.getElementById('choice-archive2').checked == true){
323 |               
324 |                   selected_archives = selected_archives + 1;
325 | 
326 |                   if(arr.indexOf("is") == -1){
327 |                       document.getElementById('img2').style.opacity = 1
328 |                       $.ajax({
329 |                           type: "GET",
330 |                           url: "is/"+document.getElementById('text_url').value,
331 |                           success: function(json) {
332 |                               if (validateURL(json['results'][0]) == true){
333 |                                   var table=document.getElementById("results");
334 |                                   var row=table.insertRow(-1);
335 |                                   var cell1=row.insertCell(0);
336 |                                   var cell2=row.insertCell(1);
337 |                                   cell1.innerHTML='<a href="https://archive.is" target="_blank"> Archive.is </a>'
338 |                                   cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
339 |                                   document.getElementById('results').style.opacity = 1
340 |                                   document.getElementById('img2').style.opacity = 0
341 |                               }
342 |                           },
343 |                           complete: function(){
344 |                             document.getElementById('img2').style.opacity = 0
345 |                           }
346 |                       });
347 |                   }
348 |               localStorage.setItem("check_archive_2", true);
349 |             }
350 |             if (document.getElementById('choice-archive6').checked == true){
351 |                   
352 |                   selected_archives = selected_archives + 1;
353 |                   
354 |                   if(arr.indexOf("mg") == -1){
355 |                       document.getElementById('img6').style.opacity = 1
356 |                       $.ajax({
357 |                           type: "GET",
358 |                           url: "mg/"+document.getElementById('text_url').value,
359 |                           success: function(json) {
360 |                               if (validateURL(json['results'][0]) == true){
361 |                                   var table=document.getElementById("results");
362 |                                   var row=table.insertRow(-1);
363 |                                   var cell1=row.insertCell(0);
364 |                                   var cell2=row.insertCell(1);
365 |                                   cell1.innerHTML='<a href="https://megalodon.jp" target="_blank"> Megalodon.jp </a>'
366 |                                   cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
367 |                                   document.getElementById('results').style.opacity = 1
368 |                                   document.getElementById('img6').style.opacity = 0
369 |                               }
370 |                           },
371 |                           complete: function(){
372 |                             document.getElementById('img6').style.opacity = 0
373 |                           }
374 |                       });
375 |                   }
376 |                   localStorage.setItem("check_archive_6", true);
377 |             }
378 |             if (document.getElementById('choice-archive4').checked == true){
379 | 
380 |                   selected_archives = selected_archives + 1;
381 | 
382 |                   if(arr.indexOf("cc") == -1){
383 |                       document.getElementById('img4').style.opacity = 1
384 |                       $.ajax({
385 |                           type: "GET",
386 |                           url: "cc/"+document.getElementById('text_url').value+'?cc_api_key='+document.getElementById('perma_cc_api').value,
387 |                           success: function(json) {
388 |                               if (validateURL(json['results'][0]) == true){
389 |                                   var table=document.getElementById("results");
390 |                                   var row=table.insertRow(-1);
391 |                                   var cell1=row.insertCell(0);
392 |                                   var cell2=row.insertCell(1);
393 |                                   cell1.innerHTML='<a href="https://perma.cc" target="_blank"> Perma.cc </a>'
394 |                                   cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
395 |                                   document.getElementById('results').style.opacity = 1
396 |                                   document.getElementById('img4').style.opacity = 0
397 |                               }
398 |                           },
399 |                           complete: function(){
400 |                             document.getElementById('img4').style.opacity = 0
401 |                           }
402 |                       });
403 |                   }
404 |                   localStorage.setItem("permaccapikey", document.getElementById('perma_cc_api').value);
405 |                   localStorage.setItem("check_archive_4", true);
406 |             }
407 | 
408 |             if (selected_archives == 0){
409 |                 document.getElementById('errors').innerHTML="*Select at least one archive*";
410 |                 return;
411 |             }
412 |     }
413 | </script>
414 | </body>
415 | </html>
416 | 


--------------------------------------------------------------------------------
/docs/archivetoday_selenium.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oduwsdl/archivenow/dbc688f4f2384139f6eb1ebf7ced2319b49ecc73/docs/archivetoday_selenium.mp4


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | requests
3 | pathlib; python_version < "3.4"
4 | selenium


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | from setuptools import setup, find_packages
 4 | from archivenow import __version__
 5 | 
 6 | with open('README.rst') as readme:  # close the file handle once it is read
 7 |     long_description = readme.read()
 8 | desc = """A Python library to push web resources into public web archives"""
 9 | 
10 | setup(
11 |     name='archivenow',
12 |     version=__version__,
13 |     description=desc,
14 |     long_description=long_description,
15 |     author='Mohamed Aturban',
16 |     author_email='maturban@cs.odu.edu',
17 |     url='https://github.com/maturban/archivenow',
18 |     packages=find_packages(),
19 |     license="MIT",
20 |     classifiers=[
21 |         'Development Status :: 5 - Production/Stable',
22 |         'Programming Language :: Python',
23 |         'Programming Language :: Python :: 2.7',
24 |         'Programming Language :: Python :: 3',
25 |         'Programming Language :: Python :: 3.4',
26 |         'Programming Language :: Python :: 3.5',
27 |         'Programming Language :: Python :: 3.6',
28 |         'License :: OSI Approved :: MIT License'
29 |     ],
30 |     install_requires=[
31 |         'flask',
32 |         'requests'
33 |     ],
34 |     package_data={
35 |         'archivenow': [
36 |             'handlers/*.*',
37 |             'templates/*.*',
38 |             'static/*.*'
39 |           ]
40 |     },
41 |     entry_points='''
42 |         [console_scripts]
43 |         archivenow=archivenow.archivenow:args_parser
44 |     '''
45 | )
46 | 


--------------------------------------------------------------------------------