17 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Nick Sweeting
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/bin/export-browser-history:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | REPO_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )"; cd .. && pwd )"
4 |
5 | if [[ "$1" == "--chrome" ]]; then
6 | # Google Chrome / Chromium
7 | default=$(ls ~/Library/Application\ Support/Google/Chrome/Default/History)
8 | if [[ -e "$2" ]]; then
9 | cp "$2" "$REPO_DIR/output/sources/chrome_history.db.tmp"
10 | else
11 | echo "Defaulting to history db: $default"
12 | echo "Optionally specify the path to a different sqlite history database as the 2nd argument."
13 | cp "$default" "$REPO_DIR/output/sources/chrome_history.db.tmp"
14 | fi
15 | sqlite3 "$REPO_DIR/output/sources/chrome_history.db.tmp" "SELECT \"[\" || group_concat(json_object('timestamp', last_visit_time, 'description', title, 'href', url)) || \"]\" FROM urls;" > "$REPO_DIR/output/sources/chrome_history.json"
16 | rm "$REPO_DIR"/output/sources/chrome_history.db.*
17 | echo "Chrome history exported to:"
18 | echo " output/sources/chrome_history.json"
19 | fi
20 |
21 | if [[ "$1" == "--firefox" ]]; then
22 | # Firefox
23 | default=$(ls ~/Library/Application\ Support/Firefox/Profiles/*.default/places.sqlite)
24 | if [[ -e "$2" ]]; then
25 | cp "$2" "$REPO_DIR/output/sources/firefox_history.db.tmp"
26 | else
27 | echo "Defaulting to history db: $default"
28 | echo "Optionally specify the path to a different sqlite history database as the 2nd argument."
29 | cp "$default" "$REPO_DIR/output/sources/firefox_history.db.tmp"
30 | fi
31 | sqlite3 "$REPO_DIR/output/sources/firefox_history.db.tmp" "SELECT \"[\" || group_concat(json_object('timestamp', last_visit_date, 'description', title, 'href', url)) || \"]\" FROM moz_places;" > "$REPO_DIR/output/sources/firefox_history.json"
32 | rm "$REPO_DIR"/output/sources/firefox_history.db.*
33 | echo "Firefox history exported to:"
34 | echo " output/sources/firefox_history.json"
35 | fi
36 |
--------------------------------------------------------------------------------
/archiver/templates/link_index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | $title
5 |
6 |
7 |
8 |
9 |
35 |
--------------------------------------------------------------------------------
/archiver/archive_methods.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 |
4 | from functools import wraps
5 | from collections import defaultdict
6 | from datetime import datetime
7 | from subprocess import run, PIPE, DEVNULL
8 |
9 | from peekable import Peekable
10 |
11 | from index import wget_output_path, parse_json_link_index, write_link_index
12 | from links import links_after_timestamp
13 | from config import (
14 | CHROME_BINARY,
15 | FETCH_WGET,
16 | FETCH_WGET_REQUISITES,
17 | FETCH_PDF,
18 | FETCH_SCREENSHOT,
19 | FETCH_DOM,
20 | RESOLUTION,
21 | CHECK_SSL_VALIDITY,
22 | SUBMIT_ARCHIVE_DOT_ORG,
23 | FETCH_AUDIO,
24 | FETCH_VIDEO,
25 | FETCH_FAVICON,
26 | WGET_USER_AGENT,
27 | CHROME_USER_DATA_DIR,
28 | TIMEOUT,
29 | ANSI,
30 | ARCHIVE_DIR,
31 | )
32 | from util import (
33 | check_dependencies,
34 | progress,
35 | chmod_file,
36 | pretty_path,
37 | )
38 |
39 |
40 | _RESULTS_TOTALS = { # globals are bad, mmkay
41 | 'skipped': 0,
42 |     'succeeded': 0,
43 | 'failed': 0,
44 | }
45 |
46 | def archive_links(archive_path, links, source=None, resume=None):
47 | check_dependencies()
48 |
49 | to_archive = Peekable(links_after_timestamp(links, resume))
50 | idx, link = 0, to_archive.peek(0)
51 |
52 | try:
53 | for idx, link in enumerate(to_archive):
54 | link_dir = os.path.join(ARCHIVE_DIR, link['timestamp'])
55 | archive_link(link_dir, link)
56 |
57 | except (KeyboardInterrupt, SystemExit, Exception) as e:
58 | print('{lightyellow}[X] [{now}] Downloading paused on link {timestamp} ({idx}/{total}){reset}'.format(
59 | **ANSI,
60 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
61 | idx=idx+1,
62 | timestamp=link['timestamp'],
63 | total=len(links),
64 | ))
65 | print(' Continue where you left off by running:')
66 | print(' {} {}'.format(
67 | pretty_path(sys.argv[0]),
68 | link['timestamp'],
69 | ))
70 | if not isinstance(e, KeyboardInterrupt):
71 | raise e
72 | raise SystemExit(1)
73 |
74 |
75 | def archive_link(link_dir, link, overwrite=True):
76 | """download the DOM, PDF, and a screenshot into a folder named after the link's timestamp"""
77 |
78 | update_existing = os.path.exists(link_dir)
79 | if update_existing:
80 | link = {
81 | **parse_json_link_index(link_dir),
82 | **link,
83 | }
84 | else:
85 | os.makedirs(link_dir)
86 |
87 | log_link_archive(link_dir, link, update_existing)
88 |
89 | if FETCH_WGET:
90 | link = fetch_wget(link_dir, link, overwrite=overwrite)
91 |
92 | if FETCH_PDF:
93 | link = fetch_pdf(link_dir, link, overwrite=overwrite)
94 |
95 | if FETCH_SCREENSHOT:
96 | link = fetch_screenshot(link_dir, link, overwrite=overwrite)
97 |
98 | if FETCH_DOM:
99 | link = fetch_dom(link_dir, link, overwrite=overwrite)
100 |
101 | if SUBMIT_ARCHIVE_DOT_ORG:
102 | link = archive_dot_org(link_dir, link, overwrite=overwrite)
103 |
104 | # if FETCH_AUDIO:
105 | # link = fetch_audio(link_dir, link, overwrite=overwrite)
106 |
107 | # if FETCH_VIDEO:
108 | # link = fetch_video(link_dir, link, overwrite=overwrite)
109 |
110 | if FETCH_FAVICON:
111 | link = fetch_favicon(link_dir, link, overwrite=overwrite)
112 |
113 | write_link_index(link_dir, link)
114 | # print()
115 |
116 | return link
117 |
118 | def log_link_archive(link_dir, link, update_existing):
119 | print('[{symbol_color}{symbol}{reset}] [{now}] "{title}"\n {blue}{url}{reset}'.format(
120 | symbol='*' if update_existing else '+',
121 | symbol_color=ANSI['black' if update_existing else 'green'],
122 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
123 | **link,
124 | **ANSI,
125 | ))
126 |
127 | print(' > {}{}'.format(pretty_path(link_dir), '' if update_existing else ' (new)'))
128 | if link['type']:
129 | print(' i {}'.format(link['type']))
130 |
131 |
132 |
133 | def attach_result_to_link(method):
134 | """
135 | Instead of returning a result={output:'...', status:'success'} object,
136 |     attach that result to the link's history & latest fields, then return
137 | the updated link object.
138 | """
139 | def decorator(fetch_func):
140 | @wraps(fetch_func)
141 | def timed_fetch_func(link_dir, link, overwrite=False, **kwargs):
142 | # initialize methods and history json field on link
143 | link['latest'] = link.get('latest') or {}
144 | link['latest'][method] = link['latest'].get(method) or None
145 | link['history'] = link.get('history') or {}
146 | link['history'][method] = link['history'].get(method) or []
147 |
148 | start_ts = datetime.now().timestamp()
149 |
150 |             # if a valid method output is already present, don't run the fetch function
151 | if link['latest'][method] and not overwrite:
152 | print(' √ {}'.format(method))
153 | result = None
154 | else:
155 | print(' > {}'.format(method))
156 | result = fetch_func(link_dir, link, **kwargs)
157 |
158 | end_ts = datetime.now().timestamp()
159 | duration = str(end_ts * 1000 - start_ts * 1000).split('.')[0]
160 |
161 | # append a history item recording fail/success
162 | history_entry = {
163 | 'timestamp': str(start_ts).split('.')[0],
164 | }
165 | if result is None:
166 | history_entry['status'] = 'skipped'
167 | elif isinstance(result.get('output'), Exception):
168 | history_entry['status'] = 'failed'
169 | history_entry['duration'] = duration
170 | history_entry.update(result or {})
171 | link['history'][method].append(history_entry)
172 | else:
173 |                 history_entry['status'] = 'succeeded'
174 | history_entry['duration'] = duration
175 | history_entry.update(result or {})
176 | link['history'][method].append(history_entry)
177 | link['latest'][method] = result['output']
178 |
179 | _RESULTS_TOTALS[history_entry['status']] += 1
180 |
181 | return link
182 | return timed_fetch_func
183 | return decorator
184 |
185 |
186 | @attach_result_to_link('wget')
187 | def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT):
188 | """download full site using wget"""
189 |
190 | domain_dir = os.path.join(link_dir, link['domain'])
191 | existing_file = wget_output_path(link)
192 | if os.path.exists(domain_dir) and existing_file:
193 | return {'output': existing_file, 'status': 'skipped'}
194 |
195 | CMD = [
196 | # WGET CLI Docs: https://www.gnu.org/software/wget/manual/wget.html
197 | *'wget -N -E -np -x -H -k -K -S --restrict-file-names=unix'.split(' '),
198 | *(('-p',) if FETCH_WGET_REQUISITES else ()),
199 | *(('--user-agent="{}"'.format(WGET_USER_AGENT),) if WGET_USER_AGENT else ()),
200 | *((() if CHECK_SSL_VALIDITY else ('--no-check-certificate',))),
201 | link['url'],
202 | ]
203 | end = progress(timeout, prefix=' ')
204 | try:
205 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # index.html
206 | end()
207 | output = wget_output_path(link, look_in=domain_dir)
208 |
209 | # Check for common failure cases
210 | if result.returncode > 0:
211 | print(' got wget response code {}:'.format(result.returncode))
212 | if result.returncode != 8:
213 | print('\n'.join(' ' + line for line in (result.stderr or result.stdout).decode().rsplit('\n', 10)[-10:] if line.strip()))
214 | if b'403: Forbidden' in result.stderr:
215 | raise Exception('403 Forbidden (try changing WGET_USER_AGENT)')
216 | if b'404: Not Found' in result.stderr:
217 | raise Exception('404 Not Found')
218 | if b'ERROR 500: Internal Server Error' in result.stderr:
219 | raise Exception('500 Internal Server Error')
220 | if result.returncode == 4:
221 | raise Exception('Failed wget download')
222 | except Exception as e:
223 | end()
224 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
225 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
226 | output = e
227 |
228 | return {
229 | 'cmd': CMD,
230 | 'output': output,
231 | }
232 |
233 |
234 | @attach_result_to_link('pdf')
235 | def fetch_pdf(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR):
236 | """print PDF of site to file using chrome --headless"""
237 |
238 | if link['type'] in ('PDF', 'image'):
239 | return {'output': wget_output_path(link)}
240 |
241 | if os.path.exists(os.path.join(link_dir, 'output.pdf')):
242 | return {'output': 'output.pdf', 'status': 'skipped'}
243 |
244 | CMD = [
245 | *chrome_headless(user_data_dir=user_data_dir),
246 | '--print-to-pdf',
247 | link['url']
248 | ]
249 | end = progress(timeout, prefix=' ')
250 | try:
251 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.pdf
252 | end()
253 | if result.returncode:
254 | print(' ', (result.stderr or result.stdout).decode())
255 | raise Exception('Failed to print PDF')
256 | chmod_file('output.pdf', cwd=link_dir)
257 | output = 'output.pdf'
258 | except Exception as e:
259 | end()
260 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
261 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
262 | output = e
263 |
264 | return {
265 | 'cmd': CMD,
266 | 'output': output,
267 | }
268 |
269 | @attach_result_to_link('screenshot')
270 | def fetch_screenshot(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR, resolution=RESOLUTION):
271 | """take screenshot of site using chrome --headless"""
272 |
273 | if link['type'] in ('PDF', 'image'):
274 | return {'output': wget_output_path(link)}
275 |
276 | if os.path.exists(os.path.join(link_dir, 'screenshot.png')):
277 | return {'output': 'screenshot.png', 'status': 'skipped'}
278 |
279 | CMD = [
280 | *chrome_headless(user_data_dir=user_data_dir),
281 | '--screenshot',
282 | '--window-size={}'.format(resolution),
283 | '--hide-scrollbars',
284 | # '--full-page', # TODO: make this actually work using ./bin/screenshot fullPage: true
285 | link['url'],
286 | ]
287 | end = progress(timeout, prefix=' ')
288 | try:
289 |         result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1)  # screenshot.png
290 | end()
291 | if result.returncode:
292 | print(' ', (result.stderr or result.stdout).decode())
293 | raise Exception('Failed to take screenshot')
294 | chmod_file('screenshot.png', cwd=link_dir)
295 | output = 'screenshot.png'
296 | except Exception as e:
297 | end()
298 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
299 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
300 | output = e
301 |
302 | return {
303 | 'cmd': CMD,
304 | 'output': output,
305 | }
306 |
307 | @attach_result_to_link('dom')
308 | def fetch_dom(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR):
309 |     """print HTML of site to file using chrome --dump-dom"""
310 |
311 | if link['type'] in ('PDF', 'image'):
312 | return {'output': wget_output_path(link)}
313 |
314 | output_path = os.path.join(link_dir, 'output.html')
315 |
316 | if os.path.exists(output_path):
317 | return {'output': 'output.html', 'status': 'skipped'}
318 |
319 | CMD = [
320 | *chrome_headless(user_data_dir=user_data_dir),
321 | '--dump-dom',
322 | link['url']
323 | ]
324 | end = progress(timeout, prefix=' ')
325 | try:
326 | with open(output_path, 'w+') as f:
327 | result = run(CMD, stdout=f, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.html
328 | end()
329 | if result.returncode:
330 | print(' ', (result.stderr).decode())
331 | raise Exception('Failed to fetch DOM')
332 | chmod_file('output.html', cwd=link_dir)
333 | output = 'output.html'
334 | except Exception as e:
335 | end()
336 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
337 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
338 | output = e
339 |
340 | return {
341 | 'cmd': CMD,
342 | 'output': output,
343 | }
344 |
345 | @attach_result_to_link('archive_org')
346 | def archive_dot_org(link_dir, link, timeout=TIMEOUT):
347 | """submit site to archive.org for archiving via their service, save returned archive url"""
348 |
349 | path = os.path.join(link_dir, 'archive.org.txt')
350 | if os.path.exists(path):
351 | archive_org_url = open(path, 'r').read().strip()
352 | return {'output': archive_org_url, 'status': 'skipped'}
353 |
354 | submit_url = 'https://web.archive.org/save/{}'.format(link['url'].split('?', 1)[0])
355 |
356 | success = False
357 | CMD = ['curl', '-I', submit_url]
358 | end = progress(timeout, prefix=' ')
359 | try:
360 | result = run(CMD, stdout=PIPE, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # archive.org.txt
361 | end()
362 |
363 | # Parse archive.org response headers
364 | headers = defaultdict(list)
365 |
366 | # lowercase all the header names and store in dict
367 | for header in result.stdout.splitlines():
368 | if b':' not in header or not header.strip():
369 | continue
370 | name, val = header.decode().split(':', 1)
371 | headers[name.lower().strip()].append(val.strip())
372 |
373 | # Get successful archive url in "content-location" header or any errors
374 | content_location = headers['content-location']
375 | errors = headers['x-archive-wayback-runtime-error']
376 |
377 | if content_location:
378 | saved_url = 'https://web.archive.org{}'.format(content_location[0])
379 | success = True
380 | elif len(errors) == 1 and 'RobotAccessControlException' in errors[0]:
381 | output = submit_url
382 | # raise Exception('Archive.org denied by {}/robots.txt'.format(link['domain']))
383 | elif errors:
384 | raise Exception(', '.join(errors))
385 | else:
386 | raise Exception('Failed to find "content-location" URL header in Archive.org response.')
387 | except Exception as e:
388 | end()
389 | print(' Visit url to see output:', ' '.join(CMD))
390 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
391 | output = e
392 |
393 | if success:
394 | with open(os.path.join(link_dir, 'archive.org.txt'), 'w', encoding='utf-8') as f:
395 | f.write(saved_url)
396 | chmod_file('archive.org.txt', cwd=link_dir)
397 | output = saved_url
398 |
399 | return {
400 | 'cmd': CMD,
401 | 'output': output,
402 | }
403 |
404 | @attach_result_to_link('favicon')
405 | def fetch_favicon(link_dir, link, timeout=TIMEOUT):
406 | """download site favicon from google's favicon api"""
407 |
408 | if os.path.exists(os.path.join(link_dir, 'favicon.ico')):
409 | return {'output': 'favicon.ico', 'status': 'skipped'}
410 |
411 | CMD = ['curl', 'https://www.google.com/s2/favicons?domain={domain}'.format(**link)]
412 | fout = open('{}/favicon.ico'.format(link_dir), 'w')
413 | end = progress(timeout, prefix=' ')
414 | try:
415 | run(CMD, stdout=fout, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # favicon.ico
416 | fout.close()
417 | end()
418 | chmod_file('favicon.ico', cwd=link_dir)
419 | output = 'favicon.ico'
420 | except Exception as e:
421 | fout.close()
422 | end()
423 | print(' Run to see full output:', ' '.join(CMD))
424 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
425 | output = e
426 |
427 | return {
428 | 'cmd': CMD,
429 | 'output': output,
430 | }
431 |
432 | # @attach_result_to_link('audio')
433 | # def fetch_audio(link_dir, link, timeout=TIMEOUT):
434 | # """Download audio rip using youtube-dl"""
435 |
436 | # if link['type'] not in ('soundcloud',)\
437 | # and 'audio' not in link['tags']:
438 | # return
439 |
440 | # path = os.path.join(link_dir, 'audio')
441 |
442 | # if not os.path.exists(path) or overwrite:
443 | # print(' - Downloading audio')
444 | # CMD = [
445 | # "youtube-dl -x --audio-format mp3 --audio-quality 0 -o '%(title)s.%(ext)s'",
446 | # link['url'],
447 | # ]
448 | # end = progress(timeout, prefix=' ')
449 | # try:
450 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # audio/audio.mp3
451 | # end()
452 | # if result.returncode:
453 | # print(' ', result.stderr.decode())
454 | # raise Exception('Failed to download audio')
455 | # chmod_file('audio.mp3', cwd=link_dir)
456 | # return 'audio.mp3'
457 | # except Exception as e:
458 | # end()
459 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
460 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
461 | # raise
462 | # else:
463 | # print(' √ Skipping audio download')
464 |
465 | # @attach_result_to_link('video')
466 | # def fetch_video(link_dir, link, timeout=TIMEOUT):
467 | # """Download video rip using youtube-dl"""
468 |
469 | # if link['type'] not in ('youtube', 'youku', 'vimeo')\
470 | # and 'video' not in link['tags']:
471 | # return
472 |
473 | # path = os.path.join(link_dir, 'video')
474 |
475 | # if not os.path.exists(path) or overwrite:
476 | # print(' - Downloading video')
477 | # CMD = [
478 | # "youtube-dl -x --video-format mp4 --audio-quality 0 -o '%(title)s.%(ext)s'",
479 | # link['url'],
480 | # ]
481 | # end = progress(timeout, prefix=' ')
482 | # try:
483 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # video/movie.mp4
484 | # end()
485 | # if result.returncode:
486 | # print(' ', result.stderr.decode())
487 | # raise Exception('Failed to download video')
488 | # chmod_file('video.mp4', cwd=link_dir)
489 | # return 'video.mp4'
490 | # except Exception as e:
491 | # end()
492 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
493 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
494 | # raise
495 | # else:
496 | # print(' √ Skipping video download')
497 |
498 |
499 | def chrome_headless(binary=CHROME_BINARY, user_data_dir=CHROME_USER_DATA_DIR):
500 | args = [binary, '--headless'] # '--disable-gpu'
501 | default_profile = os.path.expanduser('~/Library/Application Support/Google/Chrome/Default')
502 | if user_data_dir:
503 | args.append('--user-data-dir={}'.format(user_data_dir))
504 | elif os.path.exists(default_profile):
505 | args.append('--user-data-dir={}'.format(default_profile))
506 | return args
507 |
--------------------------------------------------------------------------------
/archiver/util.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import sys
4 | import time
5 | import json
6 | import requests
7 |
8 | from datetime import datetime
9 | from subprocess import run, PIPE, DEVNULL
10 | from multiprocessing import Process
11 | from urllib.parse import quote
12 |
13 | from config import (
14 | IS_TTY,
15 | OUTPUT_PERMISSIONS,
16 | REPO_DIR,
17 | SOURCES_DIR,
18 | OUTPUT_DIR,
19 | ARCHIVE_DIR,
20 | TIMEOUT,
21 | TERM_WIDTH,
22 | SHOW_PROGRESS,
23 | ANSI,
24 | CHROME_BINARY,
25 | FETCH_WGET,
26 | FETCH_PDF,
27 | FETCH_SCREENSHOT,
28 | FETCH_DOM,
29 | FETCH_FAVICON,
30 | FETCH_AUDIO,
31 | FETCH_VIDEO,
32 | SUBMIT_ARCHIVE_DOT_ORG,
33 | )
34 |
35 | # URL helpers
36 | without_scheme = lambda url: url.replace('http://', '').replace('https://', '').replace('ftp://', '')
37 | without_query = lambda url: url.split('?', 1)[0]
38 | without_hash = lambda url: url.split('#', 1)[0]
39 | without_path = lambda url: url.split('/', 1)[0]
40 | domain = lambda url: without_hash(without_query(without_path(without_scheme(url))))
41 | base_url = lambda url: without_scheme(url) # uniq base url used to dedupe links
42 |
43 | short_ts = lambda ts: ts.split('.')[0]
44 |
45 |
46 | def check_dependencies():
47 | """Check that all necessary dependencies are installed, and have valid versions"""
48 |
49 | python_vers = float('{}.{}'.format(sys.version_info.major, sys.version_info.minor))
50 | if python_vers < 3.5:
51 | print('{}[X] Python version is not new enough: {} (>3.5 is required){}'.format(ANSI['red'], python_vers, ANSI['reset']))
52 | print(' See https://github.com/pirate/bookmark-archiver#troubleshooting for help upgrading your Python installation.')
53 | raise SystemExit(1)
54 |
55 | if FETCH_PDF or FETCH_SCREENSHOT or FETCH_DOM:
56 | if run(['which', CHROME_BINARY], stdout=DEVNULL).returncode:
57 | print('{}[X] Missing dependency: {}{}'.format(ANSI['red'], CHROME_BINARY, ANSI['reset']))
58 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format(CHROME_BINARY))
59 | print(' See https://github.com/pirate/bookmark-archiver for help.')
60 | raise SystemExit(1)
61 |
62 | # parse chrome --version e.g. Google Chrome 61.0.3114.0 canary / Chromium 59.0.3029.110 built on Ubuntu, running on Ubuntu 16.04
63 | try:
64 | result = run([CHROME_BINARY, '--version'], stdout=PIPE)
65 | version_str = result.stdout.decode('utf-8')
66 | version_lines = re.sub("(Google Chrome|Chromium) (\\d+?)\\.(\\d+?)\\.(\\d+?).*?$", "\\2", version_str).split('\n')
67 | version = [l for l in version_lines if l.isdigit()][-1]
68 | if int(version) < 59:
69 | print(version_lines)
70 | print('{red}[X] Chrome version must be 59 or greater for headless PDF, screenshot, and DOM saving{reset}'.format(**ANSI))
71 | print(' See https://github.com/pirate/bookmark-archiver for help.')
72 | raise SystemExit(1)
73 | except (IndexError, TypeError, OSError):
74 | print('{red}[X] Failed to parse Chrome version, is it installed properly?{reset}'.format(**ANSI))
75 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format(CHROME_BINARY))
76 | print(' See https://github.com/pirate/bookmark-archiver for help.')
77 | raise SystemExit(1)
78 |
79 | if FETCH_WGET:
80 | if run(['which', 'wget'], stdout=DEVNULL).returncode or run(['wget', '--version'], stdout=DEVNULL).returncode:
81 | print('{red}[X] Missing dependency: wget{reset}'.format(**ANSI))
82 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('wget'))
83 | print(' See https://github.com/pirate/bookmark-archiver for help.')
84 | raise SystemExit(1)
85 |
86 | if FETCH_FAVICON or SUBMIT_ARCHIVE_DOT_ORG:
87 | if run(['which', 'curl'], stdout=DEVNULL).returncode or run(['curl', '--version'], stdout=DEVNULL).returncode:
88 | print('{red}[X] Missing dependency: curl{reset}'.format(**ANSI))
89 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('curl'))
90 | print(' See https://github.com/pirate/bookmark-archiver for help.')
91 | raise SystemExit(1)
92 |
93 | if FETCH_AUDIO or FETCH_VIDEO:
94 | if run(['which', 'youtube-dl'], stdout=DEVNULL).returncode or run(['youtube-dl', '--version'], stdout=DEVNULL).returncode:
95 | print('{red}[X] Missing dependency: youtube-dl{reset}'.format(**ANSI))
96 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('youtube-dl'))
97 | print(' See https://github.com/pirate/bookmark-archiver for help.')
98 | raise SystemExit(1)
99 |
100 |
101 | def chmod_file(path, cwd='.', permissions=OUTPUT_PERMISSIONS, timeout=30):
102 |     """chmod -R <permissions> <cwd>/<path>"""
103 |
104 | if not os.path.exists(os.path.join(cwd, path)):
105 | raise Exception('Failed to chmod: {} does not exist (did the previous step fail?)'.format(path))
106 |
107 | chmod_result = run(['chmod', '-R', permissions, path], cwd=cwd, stdout=DEVNULL, stderr=PIPE, timeout=timeout)
108 | if chmod_result.returncode == 1:
109 | print(' ', chmod_result.stderr.decode())
110 | raise Exception('Failed to chmod {}/{}'.format(cwd, path))
111 |
112 |
113 | def progress(seconds=TIMEOUT, prefix=''):
114 | """Show a (subprocess-controlled) progress bar with a timeout,
115 | returns end() function to instantly finish the progress
116 | """
117 |
118 | if not SHOW_PROGRESS:
119 | return lambda: None
120 |
121 | chunk = '█' if sys.stdout.encoding == 'UTF-8' else '#'
122 | chunks = TERM_WIDTH - len(prefix) - 20 # number of progress chunks to show (aka max bar width)
123 |
124 | def progress_bar(seconds=seconds, prefix=prefix):
125 | """show timer in the form of progress bar, with percentage and seconds remaining"""
126 | try:
127 | for s in range(seconds * chunks):
128 | progress = s / chunks / seconds * 100
129 | bar_width = round(progress/(100/chunks))
130 |
131 | # ████████████████████ 0.9% (1/60sec)
132 | sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)'.format(
133 | prefix,
134 | ANSI['green'],
135 | (chunk * bar_width).ljust(chunks),
136 | ANSI['reset'],
137 | round(progress, 1),
138 | round(s/chunks),
139 | seconds,
140 | ))
141 | sys.stdout.flush()
142 | time.sleep(1 / chunks)
143 |
144 | # ██████████████████████████████████ 100.0% (60/60sec)
145 | sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)\n'.format(
146 | prefix,
147 | ANSI['red'],
148 | chunk * chunks,
149 | ANSI['reset'],
150 | 100.0,
151 | seconds,
152 | seconds,
153 | ))
154 | sys.stdout.flush()
155 | except KeyboardInterrupt:
156 | print()
157 | pass
158 |
159 | p = Process(target=progress_bar)
160 | p.start()
161 |
162 | def end():
163 | """immediately finish progress and clear the progressbar line"""
164 | p.terminate()
165 | sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH), ANSI['reset'])) # clear whole terminal line
166 | sys.stdout.flush()
167 |
168 | return end
169 |
170 | def pretty_path(path):
171 | """convert paths like .../bookmark-archiver/archiver/../output/abc into output/abc"""
172 | return path.replace(REPO_DIR + '/', '')
173 |
174 |
175 | def download_url(url):
176 | """download a given url's content into downloads/domain.txt"""
177 |
178 | if not os.path.exists(SOURCES_DIR):
179 | os.makedirs(SOURCES_DIR)
180 |
181 | ts = str(datetime.now().timestamp()).split('.', 1)[0]
182 |
183 | source_path = os.path.join(SOURCES_DIR, '{}-{}.txt'.format(domain(url), ts))
184 |
185 | print('[*] [{}] Downloading {} > {}'.format(
186 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
187 | url,
188 | pretty_path(source_path),
189 | ))
190 | end = progress(TIMEOUT, prefix=' ')
191 | try:
192 | downloaded_xml = requests.get(url).content.decode()
193 | end()
194 | except Exception as e:
195 | end()
196 | print('[!] Failed to download {}\n'.format(url))
197 | print(' ', e)
198 | raise SystemExit(1)
199 |
200 | with open(source_path, 'w', encoding='utf-8') as f:
201 | f.write(downloaded_xml)
202 |
203 | return source_path
204 |
205 | def str_between(string, start, end=None):
206 |     """(<abc>12345</def>, <abc>, </def>) -> 12345"""
207 |
208 | content = string.split(start, 1)[-1]
209 | if end is not None:
210 | content = content.rsplit(end, 1)[0]
211 |
212 | return content
213 |
214 | def get_link_type(link):
215 | """Certain types of links need to be handled specially, this figures out when that's the case"""
216 |
217 | if link['base_url'].endswith('.pdf'):
218 | return 'PDF'
219 |     elif link['base_url'].rsplit('.', 1)[-1] in ('pdf', 'png', 'jpg', 'jpeg', 'svg', 'bmp', 'gif', 'tiff', 'webp'):
220 | return 'image'
221 | elif 'wikipedia.org' in link['domain']:
222 | return 'wiki'
223 | elif 'youtube.com' in link['domain']:
224 | return 'youtube'
225 | elif 'soundcloud.com' in link['domain']:
226 | return 'soundcloud'
227 | elif 'youku.com' in link['domain']:
228 | return 'youku'
229 | elif 'vimeo.com' in link['domain']:
230 | return 'vimeo'
231 | return None
232 |
233 | def merge_links(a, b):
234 |     """deterministically merge two links, favoring longer field values over shorter,
235 | and "cleaner" values over worse ones.
236 | """
237 | longer = lambda key: a[key] if len(a[key]) > len(b[key]) else b[key]
238 | earlier = lambda key: a[key] if a[key] < b[key] else b[key]
239 |
240 | url = longer('url')
241 | longest_title = longer('title')
242 | cleanest_title = a['title'] if '://' not in a['title'] else b['title']
243 | link = {
244 | 'timestamp': earlier('timestamp'),
245 | 'url': url,
246 | 'domain': domain(url),
247 | 'base_url': base_url(url),
248 | 'tags': longer('tags'),
249 | 'title': longest_title if '://' not in longest_title else cleanest_title,
250 | 'sources': list(set(a.get('sources', []) + b.get('sources', []))),
251 | }
252 | link['type'] = get_link_type(link)
253 | return link
254 |
255 | def find_link(folder, links):
256 | """for a given archive folder, find the corresponding link object in links"""
257 | url = parse_url(folder)
258 | if url:
259 | for link in links:
260 | if (link['base_url'] in url) or (url in link['url']):
261 | return link
262 |
263 | timestamp = folder.split('.')[0]
264 | for link in links:
265 | if link['timestamp'].startswith(timestamp):
266 | if link['domain'] in os.listdir(os.path.join(ARCHIVE_DIR, folder)):
267 | return link # careful now, this isn't safe for most ppl
268 | if link['domain'] in parse_url(folder):
269 | return link
270 | return None
271 |
272 |
273 | def parse_url(folder):
274 | """for a given archive folder, figure out what url it's for"""
275 | link_json = os.path.join(ARCHIVE_DIR, folder, 'index.json')
276 | if os.path.exists(link_json):
277 | with open(link_json, 'r') as f:
278 | try:
279 | link_json = f.read().strip()
280 | if link_json:
281 | link = json.loads(link_json)
282 | return link['base_url']
283 | except ValueError:
284 | print('File contains invalid JSON: {}!'.format(link_json))
285 |
286 | archive_org_txt = os.path.join(ARCHIVE_DIR, folder, 'archive.org.txt')
287 | if os.path.exists(archive_org_txt):
288 | with open(archive_org_txt, 'r') as f:
289 | original_link = f.read().strip().split('/http', 1)[-1]
290 | with_scheme = 'http{}'.format(original_link)
291 | return with_scheme
292 |
293 | return ''
294 |
295 | def manually_merge_folders(source, target):
296 | """prompt for user input to resolve a conflict between two archive folders"""
297 |
298 | if not IS_TTY:
299 | return
300 |
301 | fname = lambda path: path.split('/')[-1]
302 |
303 | print(' {} and {} have conflicting files, which do you want to keep?'.format(fname(source), fname(target)))
304 | print(' - [enter]: do nothing (keep both)')
305 | print(' - a: prefer files from {}'.format(source))
306 | print(' - b: prefer files from {}'.format(target))
307 | print(' - q: quit and resolve the conflict manually')
308 | try:
309 | answer = input('> ').strip().lower()
310 | except KeyboardInterrupt:
311 | answer = 'q'
312 |
313 | assert answer in ('', 'a', 'b', 'q'), 'Invalid choice.'
314 |
315 | if answer == 'q':
316 | print('\nJust run Bookmark Archiver again to pick up where you left off.')
317 | raise SystemExit(0)
318 | elif answer == '':
319 | return
320 |
321 | files_in_source = set(os.listdir(source))
322 | files_in_target = set(os.listdir(target))
323 | for file in files_in_source:
324 | if file in files_in_target:
325 | to_delete = target if answer == 'a' else source
326 | run(['rm', '-Rf', os.path.join(to_delete, file)])
327 | run(['mv', os.path.join(source, file), os.path.join(target, file)])
328 |
329 | if not set(os.listdir(source)):
330 | run(['rm', '-Rf', source])
331 |
332 | def fix_folder_path(archive_path, link_folder, link):
333 | """given a folder, merge it to the canonical 'correct' path for the given link object"""
334 | source = os.path.join(archive_path, link_folder)
335 | target = os.path.join(archive_path, link['timestamp'])
336 |
337 | url_in_folder = parse_url(source)
338 | if not (url_in_folder in link['base_url']
339 | or link['base_url'] in url_in_folder):
340 | raise ValueError('The link does not match the url for this folder.')
341 |
342 | if not os.path.exists(target):
343 | # target doesn't exist so nothing needs merging, simply move A to B
344 | run(['mv', source, target])
345 | else:
346 | # target folder exists, check for conflicting files and attempt manual merge
347 | files_in_source = set(os.listdir(source))
348 | files_in_target = set(os.listdir(target))
349 | conflicting_files = files_in_source & files_in_target
350 |
351 | if not conflicting_files:
352 | for file in files_in_source:
353 | run(['mv', os.path.join(source, file), os.path.join(target, file)])
354 |
355 | if os.path.exists(source):
356 | files_in_source = set(os.listdir(source))
357 | if files_in_source:
358 | manually_merge_folders(source, target)
359 | else:
360 | run(['rm', '-R', source])
361 |
362 |
363 | def migrate_data():
364 | # migrate old folder to new OUTPUT folder
365 | old_dir = os.path.join(REPO_DIR, 'html')
366 | if os.path.exists(old_dir):
367 | print('[!] WARNING: Moved old output folder "html" to new location: {}'.format(OUTPUT_DIR))
368 | run(['mv', old_dir, OUTPUT_DIR], timeout=10)
369 |
370 |
371 | def cleanup_archive(archive_path, links):
372 | """move any incorrectly named folders to their canonical locations"""
373 |
374 | # for each folder that exists, see if we can match it up with a known good link
375 | # if we can, then merge the two folders (TODO: if not, move it to lost & found)
376 |
377 | unmatched = []
378 | bad_folders = []
379 |
380 | if not os.path.exists(archive_path):
381 | return
382 |
383 | for folder in os.listdir(archive_path):
384 | try:
385 | files = os.listdir(os.path.join(archive_path, folder))
386 | except NotADirectoryError:
387 | continue
388 |
389 | if files:
390 | link = find_link(folder, links)
391 | if link is None:
392 | unmatched.append(folder)
393 | continue
394 |
395 | if folder != link['timestamp']:
396 | bad_folders.append((folder, link))
397 | else:
398 | # delete empty folders
399 | run(['rm', '-R', os.path.join(archive_path, folder)])
400 |
401 | if bad_folders and IS_TTY and input('[!] Cleanup archive? y/[n]: ') == 'y':
402 | print('[!] Fixing {} improperly named folders in archive...'.format(len(bad_folders)))
403 | for folder, link in bad_folders:
404 | fix_folder_path(archive_path, folder, link)
405 | elif bad_folders:
406 | print('[!] Warning! {} folders need to be merged, fix by running bookmark archiver.'.format(len(bad_folders)))
407 |
408 | if unmatched:
409 |         print('[!] Warning! {} unrecognized folders in output/archive/'.format(len(unmatched)))
410 | print(' '+ '\n '.join(unmatched))
411 |
412 |
413 | def wget_output_path(link, look_in=None):
414 | """calculate the path to the wgetted .html file, since wget may
415 | adjust some paths to be different than the base_url path.
416 |
417 | See docs on wget --adjust-extension (-E)
418 | """
419 |
420 | # if we have it stored, always prefer the actual output path to computed one
421 | if link.get('latest', {}).get('wget'):
422 | return link['latest']['wget']
423 |
424 | urlencode = lambda s: quote(s, encoding='utf-8', errors='replace')
425 |
426 | if link['type'] in ('PDF', 'image'):
427 | return urlencode(link['base_url'])
428 |
429 |     # Since the wget algorithm for -E (appending .html) is incredibly complex,
430 | # instead of trying to emulate it here, we just look in the output folder
431 | # to see what html file wget actually created as the output
432 | wget_folder = link['base_url'].rsplit('/', 1)[0].split('/')
433 | look_in = os.path.join(ARCHIVE_DIR, link['timestamp'], *wget_folder)
434 |
435 | if look_in and os.path.exists(look_in):
436 | html_files = [
437 | f for f in os.listdir(look_in)
438 | if re.search(".+\\.[Hh][Tt][Mm][Ll]?$", f, re.I | re.M)
439 | ]
440 | if html_files:
441 | return urlencode(os.path.join(*wget_folder, html_files[0]))
442 |
443 | return None
444 |
445 | # If finding the actual output file didn't work, fall back to the buggy
446 | # implementation of the wget .html appending algorithm
447 | # split_url = link['url'].split('#', 1)
448 | # query = ('%3F' + link['url'].split('?', 1)[-1]) if '?' in link['url'] else ''
449 |
450 | # if re.search(".+\\.[Hh][Tt][Mm][Ll]?$", split_url[0], re.I | re.M):
451 | # # already ends in .html
452 | # return urlencode(link['base_url'])
453 | # else:
454 | # # .html needs to be appended
455 | # without_scheme = split_url[0].split('://', 1)[-1].split('?', 1)[0]
456 | # if without_scheme.endswith('/'):
457 | # if query:
458 | # return urlencode('#'.join([without_scheme + 'index.html' + query + '.html', *split_url[1:]]))
459 | # return urlencode('#'.join([without_scheme + 'index.html', *split_url[1:]]))
460 | # else:
461 | # if query:
462 | # return urlencode('#'.join([without_scheme + '/index.html' + query + '.html', *split_url[1:]]))
463 | # elif '/' in without_scheme:
464 | # return urlencode('#'.join([without_scheme + '.html', *split_url[1:]]))
465 | # return urlencode(link['base_url'] + '/index.html')
466 |
467 |
468 | def derived_link_info(link):
469 | """extend link info with the archive urls and other derived data"""
470 |
471 | link_info = {
472 | **link,
473 | 'date': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
474 | 'google_favicon_url': 'https://www.google.com/s2/favicons?domain={domain}'.format(**link),
475 | 'favicon_url': 'archive/{timestamp}/favicon.ico'.format(**link),
476 | 'files_url': 'archive/{timestamp}/index.html'.format(**link),
477 | 'archive_url': 'archive/{}/{}'.format(link['timestamp'], wget_output_path(link) or 'index.html'),
478 | 'pdf_link': 'archive/{timestamp}/output.pdf'.format(**link),
479 | 'screenshot_link': 'archive/{timestamp}/screenshot.png'.format(**link),
480 | 'dom_link': 'archive/{timestamp}/output.html'.format(**link),
481 | 'archive_org_url': 'https://web.archive.org/web/{base_url}'.format(**link),
482 | }
483 |
484 | # PDF and images are handled slightly differently
485 | # wget, screenshot, & pdf urls all point to the same file
486 | if link['type'] in ('PDF', 'image'):
487 | link_info.update({
488 | 'archive_url': 'archive/{timestamp}/{base_url}'.format(**link),
489 | 'pdf_link': 'archive/{timestamp}/{base_url}'.format(**link),
490 | 'screenshot_link': 'archive/{timestamp}/{base_url}'.format(**link),
491 | 'dom_link': 'archive/{timestamp}/{base_url}'.format(**link),
492 | 'title': '{title} ({type})'.format(**link),
493 | })
494 | return link_info
495 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Bookmark Archiver [](https://github.com/pirate/bookmark-archiver) [](https://twitter.com/thesquashSH)
2 |
3 | "Your own personal Way-Back Machine"
4 |
5 | ▶️ [Quickstart](#quickstart) | [Details](#details) | [Configuration](#configuration) | [Manual Setup](#manual-setup) | [Troubleshooting](#troubleshooting) | [Demo](https://archive.sweeting.me) | [Changelog](#changelog) | [Donate](https://github.com/pirate/bookmark-archiver/blob/master/DONATE.md)
6 |
7 | ---
8 |
9 | Save an archived copy of all websites you bookmark (the actual *content* of each site, not just the list of bookmarks).
10 |
11 | Can import links from:
12 |
13 | - Browser history & bookmarks (Chrome, Firefox, Safari, IE, Opera)
14 | - Pocket
15 | - Pinboard
16 | - RSS or plain text lists
17 | - Shaarli, Delicious, Instapaper, Reddit Saved Posts, Wallabag, Unmark.it, and more!
18 |
19 | For each site, it outputs (configurable):
20 |
21 | - Browsable static HTML archive (wget)
22 | - PDF (Chrome headless)
23 | - Screenshot (Chrome headless)
24 | - DOM dump (Chrome headless)
25 | - Favicon
26 | - Submits URL to archive.org
27 | - Index summary pages: index.html & index.json
28 |
29 | The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index.
30 | All the saved content is static and indexed with JSON files, so it lives forever & is easily parseable; it requires no always-running backend.
31 |
32 | [DEMO: archive.sweeting.me](https://archive.sweeting.me)
33 |
34 |
35 |
36 | ## Quickstart
37 |
38 | **1. Get your list of URLs:**
39 |
40 | Follow the links here to find instructions for exporting bookmarks from each service.
41 |
42 | - [Pocket](https://getpocket.com/export)
43 | - [Pinboard](https://pinboard.in/export/)
44 | - [Instapaper](https://www.instapaper.com/user/export)
45 | - [Reddit Saved Posts](https://github.com/csu/export-saved-reddit)
46 | - [Shaarli](http://shaarli.readthedocs.io/en/master/Backup,-restore,-import-and-export/#export-links-as)
47 | - [Unmark.it](http://help.unmark.it/import-export)
48 | - [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html)
49 | - [Chrome Bookmarks](https://support.google.com/chrome/answer/96816?hl=en)
50 | - [Firefox Bookmarks](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer)
51 | - [Safari Bookmarks](http://i.imgur.com/AtcvUZA.png)
52 | - [Opera Bookmarks](http://help.opera.com/Windows/12.10/en/importexport.html)
53 | - [Internet Explorer Bookmarks](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows)
54 | - Chrome History: `./bin/export-browser-history --chrome`
55 | - Firefox History: `./bin/export-browser-history --firefox`
56 | - Other File or URL: (e.g. RSS feed) pass as second argument in the next step
57 |
58 | (If any of these links are broken, please submit an issue and I'll fix it)
59 |
60 | **2. Create your archive:**
61 |
62 | ```bash
63 | git clone https://github.com/pirate/bookmark-archiver
64 | cd bookmark-archiver/
65 | ./setup # install all dependencies
66 |
67 | # add a list of links from a file
68 | ./archive ~/Downloads/bookmark_export.html # replace with the path to your export file or URL from step 1
69 |
70 | # OR add a list of links from remote URL
71 | ./archive "https://getpocket.com/users/yourusername/feed/all" # url to an RSS, html, or json links file
72 |
73 | # OR add all the links from your browser history
74 | ./bin/export-browser-history --chrome # works with --firefox as well, can take path to SQLite history db
75 | ./archive output/sources/chrome_history.json
76 |
77 | # OR just continue archiving the existing links in the index
78 | ./archive # at any point if you just want to continue archiving where you left off, without adding any new links
79 | ```
80 |
81 | **3. Done!**
82 |
83 | You can open `output/index.html` to view your archive. (favicons will appear next to each title once they have finished downloading)
84 |
85 | If you want to host your archive somewhere to share it with other people, see the [Publishing Your Archive](#publishing-your-archive) section below.
86 |
87 | **4. (Optional) Schedule it to run every day**
88 |
89 | You can import links from any local file path or feed url by changing the second argument to `archive.py`.
90 | Bookmark Archiver will ignore links that are imported multiple times; it will keep the earliest version that it's seen.
91 | This means you can add multiple cron jobs to pull links from several different feeds or files each day;
92 | it will keep the index up-to-date without duplicate links.
93 |
94 | This example archives a pocket RSS feed and an export file every 24 hours, and saves the output to a logfile.
95 | ```bash
96 | 0 0 * * * yourusername /opt/bookmark-archiver/archive https://getpocket.com/users/yourusername/feed/all > /var/log/bookmark_archiver_rss.log
97 | 0 0 * * * yourusername /opt/bookmark-archiver/archive /home/darth-vader/Desktop/bookmarks.html > /var/log/bookmark_archiver_firefox.log
98 | ```
99 | (Add the above lines to `/etc/crontab`)
100 |
101 | **Next Steps**
102 |
103 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
104 | If you'd like to customize options, see the [Configuration](#configuration) section.
105 |
106 | If you want something easier than running programs in the command-line, take a look at [Pocket Premium](https://getpocket.com/premium) (yay Mozilla!) and [Pinboard Pro](https://pinboard.in/upgrade/) (yay independent developer!). Both offer easy-to-use bookmark archiving with full-text-search and other features.
107 |
108 | ## Details
109 |
110 | `archive.py` is a script that takes a [Pocket-format](https://getpocket.com/export), [JSON-format](https://pinboard.in/export/), [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or RSS-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
111 |
112 | The archiver produces an output folder `output/` containing an `index.html`, `index.json`, and archived copies of all the sites,
113 | organized by timestamp bookmarked. It's powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good ol' `wget`.
114 |
115 | For each site it saves:
116 |
117 | - wget of site, e.g. `en.wikipedia.org/wiki/Example.html` with .html appended if not present
118 | - `output.pdf` Printed PDF of site using headless chrome
119 | - `screenshot.png` 1440x900 screenshot of site using headless chrome
120 | - `output.html` DOM Dump of the HTML after rendering using headless chrome
121 | - `archive.org.txt` A link to the saved site on archive.org
122 | - `audio/` and `video/` for sites like youtube, soundcloud, etc. (using youtube-dl) (WIP)
123 | - `code/` clone of any repository for github, bitbucket, or gitlab links (WIP)
124 | - `index.json` JSON index containing link info and archive details
125 | - `index.html` HTML index containing link info and archive details (optional fancy or simple index)
126 |
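On disk, a fully archived link ends up laid out roughly like the sketch below (illustrative only: the timestamp and wget path are examples, and the exact files depend on which fetch methods are enabled):

```
output/
├── index.html                   # top-level archive index
├── index.json
└── archive/
    └── 1493350273/              # one folder per link, named by its bookmark timestamp
        ├── index.html           # per-link index
        ├── index.json           # link info + archive method history
        ├── favicon.ico
        ├── output.pdf
        ├── screenshot.png
        ├── output.html
        ├── archive.org.txt
        └── en.wikipedia.org/wiki/Example.html   # wget clone of the page
```
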
127 | Wget doesn't work on sites you need to be logged into, but Chrome headless does; see the [Configuration](#configuration) section for `CHROME_USER_DATA_DIR`.
128 |
129 | **Large Exports & Estimated Runtime:**
130 |
131 | I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
132 | Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
133 |
134 | You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files:
135 | ```bash
136 | ./archive export.html 1498800000 & # second argument is timestamp to resume downloading from
137 | ./archive export.html 1498810000 &
138 | ./archive export.html 1498820000 &
139 | ./archive export.html 1498830000 &
140 | ```
141 | Users have reported successfully running it with 50k+ bookmarks (though it will take more RAM while running).
142 |
143 | ## Configuration
144 |
145 | You can tweak parameters via environment variables, or by editing `config.py` directly:
146 | ```bash
147 | env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./archive ~/Downloads/bookmarks_export.html
148 | ```
149 |
150 | **Shell Options:**
151 | - colorize console output: `USE_COLOR` value: [`True`]/`False`
152 | - show progress bar: `SHOW_PROGRESS` value: [`True`]/`False`
153 | - archive permissions: `OUTPUT_PERMISSIONS` values: [`755`]/`644`/`...`
154 |
155 | **Dependency Options:**
156 | - path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/google-chrome`/`...`
157 | - path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...`
158 |
159 | **Archive Options:**
160 | - maximum allowed download time per link: `TIMEOUT` values: [`60`]/`30`/`...`
161 | - archive methods (values: [`True`]/`False`):
162 | - fetch page with wget: `FETCH_WGET`
163 | - fetch images/css/js with wget: `FETCH_WGET_REQUISITES` (True is highly recommended)
164 | - print page as PDF: `FETCH_PDF`
165 | - fetch a screenshot of the page: `FETCH_SCREENSHOT`
166 | - fetch a DOM dump of the page: `FETCH_DOM`
167 | - fetch a favicon for the page: `FETCH_FAVICON`
168 | - submit the page to archive.org: `SUBMIT_ARCHIVE_DOT_ORG`
169 | - screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...`
170 | - user agent: `WGET_USER_AGENT` values: [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/`...`
171 | - chrome profile: `CHROME_USER_DATA_DIR` values: [`~/Library/Application\ Support/Google/Chrome/Default`]/`/tmp/chrome-profile`/`...`
172 | To capture sites that require a user to be logged in, you must specify a path to a chrome profile (which loads the cookies needed for the user to be logged in). If you don't have an existing chrome profile, create one with `chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make Bookmark Archiver use that profile (see the example below).
173 |
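A quick sketch of that login-capture workflow (the `/tmp/chrome-profile` path is just an example):

```bash
# 1. Create a throwaway Chrome profile and log into the sites you want captured
chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile

# 2. Point Bookmark Archiver at that profile when archiving
env CHROME_USER_DATA_DIR=/tmp/chrome-profile ./archive ~/Downloads/bookmarks_export.html
```
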
174 | (See defaults & more at the top of `config.py`)
175 |
176 | To tweak the look and feel of the generated HTML index, just edit the HTML files in `archiver/templates/`.
177 |
178 | The chrome/chromium dependency is _optional_ and only required for screenshots, PDF, and DOM dump output; it can be safely ignored if those three methods are disabled.
179 |
180 | ## Publishing Your Archive
181 |
182 | The archive produced by `./archive` is suitable for serving on any provider that can host static html (e.g. github pages!).
183 |
184 | You can also serve it from a home server or VPS by uploading the `output` folder to your web directory, e.g. `/var/www/bookmark-archiver`, and configuring your webserver.
185 |
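For example, one simple way to push the archive up to such a server (the hostname and destination path here are placeholders; adjust them to your setup):

```bash
rsync -av output/ you@your-server.example.com:/var/www/bookmark-archiver/
```
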
186 | Here's a sample nginx configuration that works to serve archive folders:
187 |
188 | ```nginx
189 | location / {
190 | alias /path/to/bookmark-archiver/output/;
191 | index index.html;
192 | autoindex on; # see directory listing upon clicking "The Files" links
193 | try_files $uri $uri/ =404;
194 | }
195 | ```
196 |
197 | Make sure you're not running any content as CGI or PHP; you only want to serve static files!
198 |
199 | URLs look like: `https://archive.example.com/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html`
200 |
201 | **Security WARNING & Content Disclaimer**
202 |
203 | Re-hosting other people's content has security implications for any other sites sharing your hosting domain. Make sure you understand
204 | the dangers of hosting unknown archived CSS & JS files [on your shared domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy).
205 | Due to the security risk of serving some malicious JS you archived by accident, it's best to put this on a domain or subdomain
206 | of its own to keep cookies separate and slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery) and other nastiness.
207 |
208 | You may also want to blacklist your archive in /robots.txt if you don't want to be publicly associated with all the links you archive via search engine results.
209 |
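For example, a minimal robots.txt that asks crawlers to skip everything (assuming your archive is served from the root of its own domain, as in the nginx config above, it will end up served at `/robots.txt`):

```bash
cat > output/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF
```
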
210 | Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons;
211 | it's up to you to host responsibly and respond to takedown requests appropriately.
212 |
213 | Please modify the `FOOTER_INFO` config variable to add your contact info to the footer of your index.
214 |
215 | ## Info & Motivation
216 |
217 | This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
218 | I got tired of sites I saved going offline or changing their URLs, so I started
219 | archiving a copy of them locally now, similar to The Way-Back Machine provided
220 | by [archive.org](https://archive.org). Self hosting your own archive allows you to save
221 | PDFs & Screenshots of dynamic sites in addition to static html, something archive.org doesn't do.
222 |
223 | Now I can rest soundly knowing important articles and resources I like won't disappear off the internet.
224 |
225 | My published archive as an example: [archive.sweeting.me](https://archive.sweeting.me).
226 |
227 | ## Manual Setup
228 |
229 | If you don't like running random setup scripts off the internet (:+1:), you can follow these manual setup instructions.
230 |
231 | **1. Install dependencies:** `chromium >= 59`, `wget >= 1.16`, `python3 >= 3.5`  (`google-chrome >= v59` works fine as well)
232 |
233 | If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions).
234 |
235 | ```bash
236 | # On Mac:
237 | brew cask install chromium # If you already have Google Chrome/Chromium in /Applications/, skip this command
238 | brew install wget python3
239 |
240 | echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser # see instructions for google-chrome below
241 | chmod +x /usr/local/bin/chromium-browser
242 | ```
243 |
244 | ```bash
245 | # On Ubuntu/Debian:
246 | apt install chromium-browser python3 wget
247 | ```
248 |
249 | ```bash
250 | # Check that everything worked:
251 | chromium-browser --version && which wget && which python3 && which curl && echo "[√] All dependencies installed."
252 | ```
253 |
254 | **2. Get your bookmark export file:**
255 |
256 | Follow the instruction links above in the "Quickstart" section to download your bookmarks export file.
257 |
258 | **3. Run the archive script:**
259 |
260 | 1. Clone this repo `git clone https://github.com/pirate/bookmark-archiver`
261 | 2. `cd bookmark-archiver/`
262 | 3. `./archive ~/Downloads/bookmarks_export.html`
263 |
264 | You may optionally specify a second argument, e.g. `./archive export.html 153242424324`, to resume the archive update at a specific timestamp.
265 |
266 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
267 |
268 | ### Google Chrome Instructions:
269 |
270 | I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google.
271 | Chromium may have some issues rendering some sites though, so you're welcome to try Google Chrome instead.
272 | It's also easier to use Google Chrome if you already have it installed, rather than downloading Chromium separately.
273 |
274 | 1. Install & link google-chrome
275 | ```bash
276 | # On Mac:
277 | # If you already have Google Chrome in /Applications/, skip this brew command
278 | brew cask install google-chrome
279 | brew install wget python3
280 |
281 | echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/google-chrome
282 | chmod +x /usr/local/bin/google-chrome
283 | ```
284 |
285 | ```bash
286 | # On Linux:
287 | wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
288 | sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
289 | sudo apt update; sudo apt install google-chrome-beta python3 wget
290 | ```
291 |
292 | 2. Set the environment variable `CHROME_BINARY` to `google-chrome` before running:
293 |
294 | ```bash
295 | env CHROME_BINARY=google-chrome ./archive ~/Downloads/bookmarks_export.html
296 | ```
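    |
    | To avoid typing the variable every time, you can optionally export it from your shell profile (a sketch; assumes bash, use `~/.zshrc` for zsh):
    |
    | ```bash
    | echo 'export CHROME_BINARY=google-chrome' >> ~/.bashrc
    | ```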
297 | If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below.
298 |
299 | ## Troubleshooting
300 |
301 | ### Dependencies
302 |
303 | **Python:**
304 |
305 | On some Linux distributions the python3 package might not be recent enough (`>= 3.5` is required).
306 | If that's the case for you, install a more recent version manually, e.g. from a PPA:
307 | ```bash
308 | add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
309 | ```
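    |
    | To confirm which version you're running:
    |
    | ```bash
    | python3 --version   # should report 3.5 or higher
    | ```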
310 | If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
311 |
312 | **Chromium/Google Chrome:**
313 |
314 | `archive.py` depends on being able to access a `chromium-browser`/`google-chrome` executable. The executable used
315 | defaults to `chromium-browser` but can be manually specified with the environment variable `CHROME_BINARY`:
316 |
317 | ```bash
318 | env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive ~/Downloads/bookmarks_export.html
319 | ```
320 |
321 | 1. Test to make sure you have Chrome on your `$PATH` with:
322 |
323 | ```bash
324 | which chromium-browser || which google-chrome
325 | ```
326 | If no executable is displayed, follow the setup instructions to install and link one of them.
327 |
328 | 2. If a path is displayed, the next step is to check that it's runnable:
329 |
330 | ```bash
331 | chromium-browser --version || google-chrome --version
332 | ```
333 | If no version is displayed, try the setup instructions again, or confirm that you have permission to execute the Chrome binary.
334 |
335 | 3. If a version is displayed and it's `<59`, upgrade it:
336 |
337 | ```bash
338 | apt upgrade chromium-browser -y
339 | # OR
340 | brew cask upgrade chromium
341 | ```
342 |
343 | 4. If a version is displayed and it's `>=59`, make sure `archive.py` is running the right one:
344 |
345 | ```bash
346 | env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive bookmarks_export.html # replace the path with the one you got from step 1
347 | ```
348 |
349 |
350 | **Wget & Curl:**
351 |
352 | If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
353 | See the "Manual Setup" instructions for more details.
354 |
355 | If wget times out or randomly fails to download some sites that you have confirmed are online,
356 | upgrade wget to the most recent version with `brew upgrade wget` or `apt upgrade wget`. There is
357 | a bug in versions `<=1.19.1_1` that causes wget to fail on perfectly valid sites.
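    |
    | To check which versions you currently have:
    |
    | ```bash
    | wget --version | head -n1    # should be >= 1.16, and newer than 1.19.1_1 to avoid the bug above
    | curl --version | head -n1
    | ```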
358 |
359 | ### Archiving
360 |
361 | **No links parsed from export file:**
362 |
363 | Please open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of where you got the export, and
364 | preferably your export file attached (you can redact the links). We'll fix the parser to support your format.
365 |
366 | **Lots of skipped sites:**
367 |
368 | If you've already run the archiver once, it won't re-download those sites on subsequent runs; it will only download new links.
369 | If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct.
370 | You can check the `archive.py` output or `index.html` to see what links it's downloading.
371 |
372 | If you're still having issues, try deleting or moving the `output/archive` folder (back it up first!) and running `./archive` again.
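    |
    | For example (the paths below are the defaults used elsewhere in this README; adjust if yours differ):
    |
    | ```bash
    | mv output/archive output/archive.bak          # move the existing archive out of the way as a backup
    | ./archive ~/Downloads/bookmarks_export.html   # re-run, which re-downloads everything into a fresh archive
    | ```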
373 |
374 | **Lots of errors:**
375 |
376 | Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally.
377 | Open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of the errors if you're still having problems.
378 |
379 | **Lots of broken links from the index:**
380 |
381 | Not all sites can be effectively archived with every method; that's why it's best to use a combination of `wget`, PDFs, and screenshots.
382 | If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/bookmark-archiver/issues)
383 | with some of the URLs that failed to be archived and I'll investigate.
384 |
385 | ### Hosting the Archive
386 |
387 | If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL.
388 | If you don't, google around; there are plenty of tutorials to help you get that set up. Open an [issue](https://github.com/pirate/bookmark-archiver/issues)
389 | if you have a problem with a particular nginx config.
390 |
391 | ## Roadmap
392 |
393 | If you feel like contributing a PR, some of these tasks are pretty easy. Feel free to open an issue if you need help getting started in any way!
394 |
395 | - download closed-captions text from youtube videos
396 | - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
397 | - auto-tagging based on important extracted words
398 | - audio & video archiving with `youtube-dl`
399 | - full-text indexing with elasticsearch/elasticlunr/ag
400 | - video closed-caption downloading for full-text indexing video content
401 | - automatic text summaries of articles using a summarization library
402 | - feature image extraction
403 | - http support (from my https-only domain)
404 | - try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)
405 | - live updating from pocket/pinboard
406 |
407 | It's possible to pull links via the Pocket API or public Pocket RSS feeds instead of downloading an HTML export.
408 | Once I write a script to do that, we can stick this in `cron` and have it auto-update on its own.
409 |
410 | For now, you just have to download `ril_export.html` and run `archive.py` each time it updates. The script
411 | runs quickly on subsequent runs because it only downloads new links that haven't been archived already.
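    |
    | In the meantime, a rough sketch of a crontab entry that re-runs the archiver nightly against a saved export (all paths are assumptions; the export file itself still has to be refreshed by hand for now):
    |
    | ```bash
    | # Run every night at 3:00 against the most recently downloaded export
    | 0 3 * * * cd /home/you/bookmark-archiver && ./archive /home/you/Downloads/ril_export.html
    | ```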
412 |
413 | ## Links
414 |
415 | **Similar Projects:**
416 | - [Memex by Worldbrain.io](https://github.com/WorldBrain/Memex) a browser extension that saves all your history and does full-text search
417 | - [Hypothes.is](https://web.hypothes.is/) a web/pdf/ebook annotation tool that also archives content
418 | - [Perkeep](https://perkeep.org/) "Perkeep lets you permanently keep your stuff, for life."
419 | - [Fetching.io](http://fetching.io/) A personal search engine/archiver that lets you search through all archived websites that you've bookmarked
420 | - [Shaarchiver](https://github.com/nodiscc/shaarchiver) very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
421 | - [Webrecorder.io](https://webrecorder.io/) Save full browsing sessions and archive all the content
422 | - [Wallabag](https://wallabag.org) Save articles you read locally or on your phone
423 |
424 | **Discussions:**
425 | - [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
426 | - [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
427 | - [Reddit r/datahoarder Discussion #1](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
428 | - [Reddit r/datahoarder Discussion #2](https://www.reddit.com/r/DataHoarder/comments/6kepv6/bookmarkarchiver_now_supports_archiving_all_major/)
429 |
430 |
431 | **Tools/Other:**
432 | - https://github.com/ikreymer/webarchiveplayer#auto-load-warcs
433 | - [Sheetsee-Pocket](http://jlord.us/sheetsee-pocket/) project that provides a pretty auto-updating index of your Pocket links (without archiving them)
434 | - [Pocket -> IFTTT -> Dropbox](https://christopher.su/2013/saving-pocket-links-file-day-dropbox-ifttt-launchd/) Post by Christopher Su on his Pocket-saving IFTTT recipe
435 |
436 | ## Changelog
437 |
438 | - v0.1.0 released
439 | - support for browser history exporting added with `./bin/export-browser-history`
440 | - support for chrome `--dump-dom` to output full page HTML after JS executes
441 | - v0.0.3 released
442 | - support for chrome `--user-data-dir` to archive sites that need logins
443 | - fancy individual html & json indexes for each link
444 | - smartly append new links to existing index instead of overwriting
445 | - v0.0.2 released
446 | - proper HTML templating instead of format strings (thanks to https://github.com/bardisty!)
447 | - refactored into separate files, wip audio & video archiving
448 | - v0.0.1 released
449 | - Index links now work without nginx url rewrites, archive can now be hosted on github pages
450 | - added setup.sh script & docstrings & help commands
451 | - made Chromium the default instead of Google Chrome (yay free software)
452 | - added [env-variable](https://github.com/pirate/bookmark-archiver/pull/25) configuration (thanks to https://github.com/hannah98!)
453 | - renamed from **Pocket Archive Stream** -> **Bookmark Archiver**
454 | - added [Netscape-format](https://github.com/pirate/bookmark-archiver/pull/20) export support (thanks to https://github.com/ilvar!)
455 | - added [Pinboard-format](https://github.com/pirate/bookmark-archiver/pull/7) export support (thanks to https://github.com/sconeyard!)
456 | - front-page of HN, oops! apparently I have users to support now :grin:?
457 | - added Pocket-format export support
458 | - v0.0.0 released: created Pocket Archive Stream 2017/05/05
459 |
460 | ## Donations
461 |
462 | This project could really flourish with some more engineering effort, but unless it can support
463 | me financially, I'm unlikely to be able to take it much further alone. It's already pretty
464 | functional and robust, but it deserves to be taken to the next level by a few more
465 | talented engineers. If you or your foundation wants to sponsor this project long-term, contact
466 | me at bookmark-archiver@sweeting.me.
467 |
468 | [Grants / Donations](https://github.com/pirate/bookmark-archiver/blob/master/donate.md)
469 |
--------------------------------------------------------------------------------