├── .gitignore
├── DONATE.md
├── LICENSE
├── README.md
├── archive
├── archiver
│   ├── __init__.py
│   ├── archive.py
│   ├── archive_methods.py
│   ├── config.py
│   ├── index.py
│   ├── links.py
│   ├── parse.py
│   ├── peekable.py
│   ├── templates
│   │   ├── index.html
│   │   ├── index_row.html
│   │   ├── link_index.html
│   │   ├── link_index_fancy.html
│   │   └── static
│   │       ├── archive.png
│   │       ├── external.png
│   │       └── spinner.gif
│   ├── tests
│   │   ├── firefox_export.html
│   │   ├── pinboard_export.json
│   │   ├── pocket_export.html
│   │   └── rss_export.xml
│   └── util.py
├── bin
│   ├── bookmark-archiver
│   ├── export-browser-history
│   └── setup-bookmark-archiver
└── setup
/.gitignore:
--------------------------------------------------------------------------------
1 | # OS cruft
2 | .DS_Store
3 | ._*
4 |
5 | # python
6 | __pycache__/
7 | archiver/venv
8 |
9 | # vim
10 | .swp*
11 |
12 | # output artifacts
13 | output/
14 |
--------------------------------------------------------------------------------
/DONATE.md:
--------------------------------------------------------------------------------
1 | # Donate
2 |
3 | Right now I'm working on this project in my spare time and accepting the occasional PR.
4 | If you want me to dedicate more time to it, donate to support development!
5 |
6 | - Ethereum: 0x5B0F85FFc44fD759C2d97f0BE4681279966f3832
7 | - Bitcoin: use https://shapeshift.io/ to convert BTC to ETH and send it to the address above
8 | - Paypal: https://www.paypal.me/NicholasSweeting/25
9 |
10 | The eventual goal is to support one or two developers full-time on this project via donations.
11 | With more engineering power this can become a distributed archive service with a nice UI,
12 | like the Wayback Machine, but hosted by everyone!
13 |
14 | If you have any questions or want to sponsor this project long-term, contact me at
15 | bookmark-archiver@sweeting.me
16 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Nick Sweeting
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Bookmark Archiver [](https://github.com/pirate/bookmark-archiver) [](https://twitter.com/thesquashSH)
2 |
3 | "Your own personal Way-Back Machine"
4 |
5 | ▶️ [Quickstart](#quickstart) | [Details](#details) | [Configuration](#configuration) | [Manual Setup](#manual-setup) | [Troubleshooting](#troubleshooting) | [Demo](https://archive.sweeting.me) | [Changelog](#changelog) | [Donate](https://github.com/pirate/bookmark-archiver/blob/master/DONATE.md)
6 |
7 | ---
8 |
9 | Save an archived copy of all websites you bookmark (the actual *content* of each site, not just the list of bookmarks).
10 |
11 | Can import links from:
12 |
13 | - Browser history & bookmarks (Chrome, Firefox, Safari, IE, Opera)
14 | - Pocket
15 | - Pinboard
16 | - RSS or plain text lists
17 | - Shaarli, Delicious, Instapaper, Reddit Saved Posts, Wallabag, Unmark.it, and more!
18 |
19 | For each site, it outputs (configurable):
20 |
21 | - Browsable static HTML archive (wget)
22 | - PDF (Chrome headless)
23 | - Screenshot (Chrome headless)
24 | - DOM dump (Chrome headless)
25 | - Favicon
26 | - Submission of the page URL to archive.org
27 | - Index summary pages: index.html & index.json
28 |
29 | The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index.
30 | All the saved content is static and indexed with JSON files, so it lives forever & is easily parseable; it requires no always-running backend.
31 |
32 | [DEMO: archive.sweeting.me](https://archive.sweeting.me)
33 |
34 |
35 |
36 | ## Quickstart
37 |
38 | **1. Get your list of URLs:**
39 |
40 | Follow the links here to find instructions for exporting bookmarks from each service.
41 |
42 | - [Pocket](https://getpocket.com/export)
43 | - [Pinboard](https://pinboard.in/export/)
44 | - [Instapaper](https://www.instapaper.com/user/export)
45 | - [Reddit Saved Posts](https://github.com/csu/export-saved-reddit)
46 | - [Shaarli](http://shaarli.readthedocs.io/en/master/Backup,-restore,-import-and-export/#export-links-as)
47 | - [Unmark.it](http://help.unmark.it/import-export)
48 | - [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html)
49 | - [Chrome Bookmarks](https://support.google.com/chrome/answer/96816?hl=en)
50 | - [Firefox Bookmarks](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer)
51 | - [Safari Bookmarks](http://i.imgur.com/AtcvUZA.png)
52 | - [Opera Bookmarks](http://help.opera.com/Windows/12.10/en/importexport.html)
53 | - [Internet Explorer Bookmarks](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows)
54 | - Chrome History: `./bin/export-browser-history --chrome`
55 | - Firefox History: `./bin/export-browser-history --firefox`
56 | - Other File or URL: (e.g. RSS feed) pass as second argument in the next step
57 |
58 | (If any of these links are broken, please submit an issue and I'll fix it)
59 |
60 | **2. Create your archive:**
61 |
62 | ```bash
63 | git clone https://github.com/pirate/bookmark-archiver
64 | cd bookmark-archiver/
65 | ./setup # install all dependencies
66 |
67 | # add a list of links from a file
68 | ./archive ~/Downloads/bookmark_export.html # replace with the path to your export file or URL from step 1
69 |
70 | # OR add a list of links from remote URL
71 | ./archive "https://getpocket.com/users/yourusername/feed/all" # url to an RSS, html, or json links file
72 |
73 | # OR add all the links from your browser history
74 | ./bin/export-browser-history --chrome # works with --firefox as well, can take path to SQLite history db
75 | ./archive output/sources/chrome_history.json
76 |
77 | # OR just continue archiving the existing links in the index
78 | ./archive # at any point if you just want to continue archiving where you left off, without adding any new links
79 | ```
80 |
81 | **3. Done!**
82 |
83 | You can open `output/index.html` to view your archive. (favicons will appear next to each title once they have finished downloading)
84 |
85 | If you want to host your archive somewhere to share it with other people, see the [Publishing Your Archive](#publishing-your-archive) section below.
86 |
87 | **4. (Optional) Schedule it to run every day**
88 |
89 | You can import links from any local file path or feed URL by changing the second argument to `archive.py`.
90 | Bookmark Archiver will ignore links that are imported multiple times; it will keep the earliest version that it's seen.
91 | This means you can add multiple cron jobs to pull links from several different feeds or files each day,
92 | and it will keep the index up-to-date without duplicate links.
93 |
94 | This example archives a pocket RSS feed and an export file every 24 hours, and saves the output to a logfile.
95 | ```bash
96 | 0 0 * * * yourusername /opt/bookmark-archiver/archive https://getpocket.com/users/yourusername/feed/all > /var/log/bookmark_archiver_rss.log
97 | 0 0 * * * yourusername /opt/bookmark-archiver/archive /home/darth-vader/Desktop/bookmarks.html > /var/log/bookmark_archiver_firefox.log
98 | ```
99 | (Add the above lines to `/etc/crontab`)
100 |
101 | **Next Steps**
102 |
103 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
104 | If you'd like to customize options, see the [Configuration](#configuration) section.
105 |
106 | If you want something easier than running programs in the command-line, take a look at [Pocket Premium](https://getpocket.com/premium) (yay Mozilla!) and [Pinboard Pro](https://pinboard.in/upgrade/) (yay independent developer!). Both offer easy-to-use bookmark archiving with full-text-search and other features.
107 |
108 | ## Details
109 |
110 | `archive.py` is a script that takes a [Pocket-format](https://getpocket.com/export), [JSON-format](https://pinboard.in/export/), [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or RSS-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
111 |
112 | The archiver produces an output folder `output/` containing an `index.html`, `index.json`, and archived copies of all the sites,
113 | organized by the timestamp they were bookmarked at. It's powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good ol' `wget`.
114 |
115 | For each site it saves:
116 |
117 | - wget of site, e.g. `en.wikipedia.org/wiki/Example.html` with .html appended if not present
118 | - `output.pdf` Printed PDF of site using headless chrome
119 | - `screenshot.png` 1440x900 screenshot of site using headless chrome
120 | - `output.html` DOM Dump of the HTML after rendering using headless chrome
121 | - `archive.org.txt` A link to the saved site on archive.org
122 | - `audio/` and `video/` for sites like youtube, soundcloud, etc. (using youtube-dl) (WIP)
123 | - `code/` clone of any repository for github, bitbucket, or gitlab links (WIP)
124 | - `index.json` JSON index containing link info and archive details
125 | - `index.html` HTML index containing link info and archive details (optional fancy or simple index)
126 |
127 | Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the [Configuration](#configuration) section for `CHROME_USER_DATA_DIR`.
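For reference, here's a sketch of how the output folder ends up organized for a single archived link (illustrative only; the exact files present depend on which fetch methods are enabled and on the site itself):

```
output/
├── index.html
├── index.json
├── sources/                   # local copies of the files/feeds you imported
└── archive/
    └── 1493350273/            # one folder per link, named by its bookmark timestamp
        ├── index.html
        ├── index.json
        ├── output.pdf
        ├── screenshot.png
        ├── output.html
        ├── archive.org.txt
        ├── favicon.ico
        └── en.wikipedia.org/  # wget clone of the site
```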
128 |
129 | **Large Exports & Estimated Runtime:**
130 |
131 | I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
132 | Those numbers are from running it single-threaded on my i5 machine with 50Mbps down. YMMV.
133 |
134 | You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files:
135 | ```bash
136 | ./archive export.html 1498800000 & # second argument is timestamp to resume downloading from
137 | ./archive export.html 1498810000 &
138 | ./archive export.html 1498820000 &
139 | ./archive export.html 1498830000 &
140 | ```
141 | Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
142 |
143 | ## Configuration
144 |
145 | You can tweak parameters via environment variables, or by editing `config.py` directly:
146 | ```bash
147 | env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./archive ~/Downloads/bookmarks_export.html
148 | ```
149 |
150 | **Shell Options:**
151 | - colorize console output: `USE_COLOR` value: [`True`]/`False`
152 | - show progress bar: `SHOW_PROGRESS` value: [`True`]/`False`
153 | - archive permissions: `OUTPUT_PERMISSIONS` values: [`755`]/`644`/`...`
154 |
155 | **Dependency Options:**
156 | - path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/google-chrome`/`...`
157 | - path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...`
158 |
159 | **Archive Options:**
160 | - maximum allowed download time per link: `TIMEOUT` values: [`60`]/`30`/`...`
161 | - archive methods (values: [`True`]/`False`):
162 | - fetch page with wget: `FETCH_WGET`
163 | - fetch images/css/js with wget: `FETCH_WGET_REQUISITES` (True is highly recommended)
164 | - print page as PDF: `FETCH_PDF`
165 | - fetch a screenshot of the page: `FETCH_SCREENSHOT`
166 | - fetch a DOM dump of the page: `FETCH_DOM`
167 | - fetch a favicon for the page: `FETCH_FAVICON`
168 | - submit the page to archive.org: `SUBMIT_ARCHIVE_DOT_ORG`
169 | - screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...`
170 | - user agent: `WGET_USER_AGENT` values: [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/`...`
171 | - chrome profile: `CHROME_USER_DATA_DIR` values: [`~/Library/Application\ Support/Google/Chrome/Default`]/`/tmp/chrome-profile`/`...`
172 | To capture sites that require a user to be logged in, you must specify a path to a chrome profile (which loads the cookies needed for the user to be logged in). If you don't have an existing chrome profile, create one with `chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make Bookmark Archiver use that profile.
173 |
174 | (See defaults & more at the top of `config.py`)
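For example, several of these options can be combined in a single run (an illustrative sketch, not a required invocation; adjust the values and paths for your own setup):

```bash
env FETCH_PDF=False \
    FETCH_SCREENSHOT=True \
    RESOLUTION=1024,768 \
    CHROME_USER_DATA_DIR=/tmp/chrome-profile \
    ./archive ~/Downloads/bookmarks_export.html
```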
175 |
176 | To tweak the look and feel of the output HTML index, just edit the HTML files in `archiver/templates/`.
177 |
178 | The chrome/chromium dependency is _optional_ and only required for screenshots, PDF, and DOM dump output; it can be safely ignored if those three methods are disabled.
179 |
180 | ## Publishing Your Archive
181 |
182 | The archive produced by `./archive` is suitable for serving on any provider that can host static HTML (e.g. GitHub Pages!).
183 |
184 | You can also serve it from a home server or VPS by uploading the `output` folder to your web directory, e.g. `/var/www/bookmark-archiver`, and configuring your webserver.
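For example, one way to copy the archive into that directory (a sketch; assumes `rsync` is available and `/var/www/bookmark-archiver` is the directory your webserver serves):

```bash
rsync -a output/ /var/www/bookmark-archiver/
```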
185 |
186 | Here's a sample nginx configuration that works to serve archive folders:
187 |
188 | ```nginx
189 | location / {
190 | alias /path/to/bookmark-archiver/output/;
191 | index index.html;
192 | autoindex on; # see directory listing upon clicking "The Files" links
193 | try_files $uri $uri/ =404;
194 | }
195 | ```
196 |
197 | Make sure you're not running any content as CGI or PHP; you only want to serve static files!
198 |
199 | URLs look like: `https://archive.example.com/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html`
200 |
201 | **Security WARNING & Content Disclaimer**
202 |
203 | Re-hosting other people's content has security implications for any other sites sharing your hosting domain. Make sure you understand
204 | the dangers of hosting unknown archived CSS & JS files [on your shared domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy).
205 | Due to the security risk of serving some malicious JS you archived by accident, it's best to put this on a domain or subdomain
206 | of its own to keep cookies separate and slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery) and other nastiness.
207 |
208 | You may also want to blacklist your archive in `/robots.txt` if you don't want to be publicly associated with all the links you archive via search engine results.
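For example, a minimal `robots.txt` served at the root of your archive's domain that asks crawlers not to index it:

```
User-agent: *
Disallow: /
```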
209 |
210 | Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons;
211 | it's up to you to host responsibly and respond to takedown requests appropriately.
212 |
213 | Please modify the `FOOTER_INFO` config variable to add your contact info to the footer of your index.
214 |
215 | ## Info & Motivation
216 |
217 | This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
218 | I got tired of sites I saved going offline or changing their URLs, so I started
219 | archiving a copy of them locally, similar to the Wayback Machine provided
220 | by [archive.org](https://archive.org). Self-hosting your own archive allows you to save
221 | PDFs & screenshots of dynamic sites in addition to static HTML, something archive.org doesn't do.
222 |
223 | Now I can rest soundly knowing important articles and resources I like won't disappear from the internet.
224 |
225 | My published archive as an example: [archive.sweeting.me](https://archive.sweeting.me).
226 |
227 | ## Manual Setup
228 |
229 | If you don't like running random setup scripts off the internet (:+1:), you can follow these manual setup instructions.
230 |
231 | **1. Install dependencies:** `chromium >= 59`, `wget >= 1.16`, `python3 >= 3.5` (`google-chrome >= v59` works fine as well)
232 |
233 | If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions).
234 |
235 | ```bash
236 | # On Mac:
237 | brew cask install chromium # If you already have Google Chrome/Chromium in /Applications/, skip this command
238 | brew install wget python3
239 |
240 | echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser # see instructions for google-chrome below
241 | chmod +x /usr/local/bin/chromium-browser
242 | ```
243 |
244 | ```bash
245 | # On Ubuntu/Debian:
246 | apt install chromium-browser python3 wget
247 | ```
248 |
249 | ```bash
250 | # Check that everything worked:
251 | chromium-browser --version && which wget && which python3 && which curl && echo "[√] All dependencies installed."
252 | ```
253 |
254 | **2. Get your bookmark export file:**
255 |
256 | Follow the instruction links above in the "Quickstart" section to download your bookmarks export file.
257 |
258 | **3. Run the archive script:**
259 |
260 | 1. Clone this repo: `git clone https://github.com/pirate/bookmark-archiver`
261 | 2. `cd bookmark-archiver/`
262 | 3. `./archive ~/Downloads/bookmarks_export.html`
263 |
264 | You may optionally specify a second argument, e.g. `./archive export.html 153242424324`, to resume the archive update at a specific timestamp.
265 |
266 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
267 |
268 | ### Google Chrome Instructions:
269 |
270 | I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google.
271 | Chromium may have issues rendering some sites though, so you're welcome to try Google Chrome instead.
272 | It's also easier to use Google Chrome if you already have it installed, rather than downloading Chromium separately.
273 |
274 | 1. Install & link google-chrome
275 | ```bash
276 | # On Mac:
277 | # If you already have Google Chrome in /Applications/, skip this brew command
278 | brew cask install google-chrome
279 | brew install wget python3
280 |
281 | echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/google-chrome
282 | chmod +x /usr/local/bin/google-chrome
283 | ```
284 |
285 | ```bash
286 | # On Linux:
287 | wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
288 | sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
289 | apt update; apt install google-chrome-beta python3 wget
290 | ```
291 |
292 | 2. Set the environment variable `CHROME_BINARY` to `google-chrome` before running:
293 |
294 | ```bash
295 | env CHROME_BINARY=google-chrome ./archive ~/Downloads/bookmarks_export.html
296 | ```
297 | If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below.
298 |
299 | ## Troubleshooting
300 |
301 | ### Dependencies
302 |
303 | **Python:**
304 |
305 | On some Linux distributions the python3 package might not be recent enough.
306 | If this is the case for you, resort to installing a recent enough version manually.
307 | ```bash
308 | add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
309 | ```
310 | If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
311 |
312 | **Chromium/Google Chrome:**
313 |
314 | `archive.py` depends on being able to access a `chromium-browser`/`google-chrome` executable. The executable used
315 | defaults to `chromium-browser` but can be manually specified with the environment variable `CHROME_BINARY`:
316 |
317 | ```bash
318 | env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive ~/Downloads/bookmarks_export.html
319 | ```
320 |
321 | 1. Test to make sure you have Chrome on your `$PATH` with:
322 |
323 | ```bash
324 | which chromium-browser || which google-chrome
325 | ```
326 | If no executable is displayed, follow the setup instructions to install and link one of them.
327 |
328 | 2. If a path is displayed, the next step is to check that it's runnable:
329 |
330 | ```bash
331 | chromium-browser --version || google-chrome --version
332 | ```
333 | If no version is displayed, try the setup instructions again, or confirm that you have permission to access chrome.
334 |
335 | 3. If a version is displayed and it's `<59`, upgrade it:
336 |
337 | ```bash
338 | apt upgrade chromium-browser -y
339 | # OR
340 | brew cask upgrade chromium
341 | ```
342 |
343 | 4. If a version is displayed and it's `>=59`, make sure `archive.py` is running the right one:
344 |
345 | ```bash
346 | env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive bookmarks_export.html # replace the path with the one you got from step 1
347 | ```
348 |
349 |
350 | **Wget & Curl:**
351 |
352 | If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
353 | See the "Manual Setup" instructions for more details.
354 |
355 | If wget times out or randomly fails to download some sites that you have confirmed are online,
356 | upgrade wget to the most recent version with `brew upgrade wget` or `apt upgrade wget`. There is
357 | a bug in versions `<=1.19.1_1` that caused wget to fail for perfectly valid sites.
358 |
359 | ### Archiving
360 |
361 | **No links parsed from export file:**
362 |
363 | Please open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of where you got the export, and
364 | preferably your export file attached (you can redact the links). We'll fix the parser to support your format.
365 |
366 | **Lots of skipped sites:**
367 |
368 | If you've already run the archiver once, it won't re-download sites on subsequent runs; it will only download new links.
369 | If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct.
370 | You can check the `archive.py` output or `index.html` to see what links it's downloading.
371 |
372 | If you're still having issues, try deleting or moving the `output/archive` folder (back it up first!) and running `./archive` again.
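For example, a sketch of that backup-and-retry step:

```bash
mv output/archive output/archive.bak    # keep a backup of the existing archive folder
./archive                               # re-run: links still in index.json get re-archived
```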
373 |
374 | **Lots of errors:**
375 |
376 | Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally.
377 | Open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of the errors if you're still having problems.
378 |
379 | **Lots of broken links from the index:**
380 |
381 | Not all sites can be effectively archived with every method; that's why it's best to use a combination of `wget`, PDFs, and screenshots.
382 | If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/bookmark-archiver/issues)
383 | with some of the URLs that failed to be archived and I'll investigate.
384 |
385 | ### Hosting the Archive
386 |
387 | If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL.
388 | If you don't, google around; there are plenty of tutorials to help get that set up. Open an [issue](https://github.com/pirate/bookmark-archiver/issues)
389 | if you have problems with a particular nginx config.
390 |
391 | ## Roadmap
392 |
393 | If you feel like contributing a PR, some of these tasks are pretty easy. Feel free to open an issue if you need help getting started in any way!
394 |
395 | - download closed-captions text from youtube videos
396 | - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
397 | - auto-tagging based on important extracted words
398 | - audio & video archiving with `youtube-dl`
399 | - full-text indexing with elasticsearch/elasticlunr/ag
400 | - video closed-caption downloading for full-text indexing video content
401 | - automatic text summaries of articles using a summarization library
402 | - feature image extraction
403 | - http support (from my https-only domain)
404 | - try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)
405 | - live updating from pocket/pinboard
406 |
407 | It's possible to pull links via the Pocket API or public Pocket RSS feeds instead of downloading an HTML export.
408 | Once I write a script to do that, we can stick this in `cron` and have it auto-update on its own.
409 |
410 | For now you just have to download `ril_export.html` and run `archive.py` each time it updates. The script
411 | will run fast subsequent times because it only downloads new links that haven't been archived already.
412 |
413 | ## Links
414 |
415 | **Similar Projects:**
416 | - [Memex by Worldbrain.io](https://github.com/WorldBrain/Memex) a browser extension that saves all your history and does full-text search
417 | - [Hypothes.is](https://web.hypothes.is/) a web/pdf/ebook annotation tool that also archives content
418 | - [Perkeep](https://perkeep.org/) "Perkeep lets you permanently keep your stuff, for life."
419 | - [Fetching.io](http://fetching.io/) A personal search engine/archiver that lets you search through all archived websites that you've bookmarked
420 | - [Shaarchiver](https://github.com/nodiscc/shaarchiver) very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
421 | - [Webrecorder.io](https://webrecorder.io/) Save full browsing sessions and archive all the content
422 | - [Wallabag](https://wallabag.org) Save articles you read locally or on your phone
423 |
424 | **Discussions:**
425 | - [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
426 | - [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
427 | - [Reddit r/datahoarder Discussion #1](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
428 | - [Reddit r/datahoarder Discussion #2](https://www.reddit.com/r/DataHoarder/comments/6kepv6/bookmarkarchiver_now_supports_archiving_all_major/)
429 |
430 |
431 | **Tools/Other:**
432 | - https://github.com/ikreymer/webarchiveplayer#auto-load-warcs
433 | - [Sheetsee-Pocket](http://jlord.us/sheetsee-pocket/) project that provides a pretty auto-updating index of your Pocket links (without archiving them)
434 | - [Pocket -> IFTTT -> Dropbox](https://christopher.su/2013/saving-pocket-links-file-day-dropbox-ifttt-launchd/) Post by Christopher Su on his Pocket-saving IFTTT recipe
435 |
436 | ## Changelog
437 |
438 | - v0.1.0 released
439 | - support for browser history exporting added with `./bin/export-browser-history`
440 | - support for chrome `--dump-dom` to output full page HTML after JS executes
441 | - v0.0.3 released
442 | - support for chrome `--user-data-dir` to archive sites that need logins
443 | - fancy individual html & json indexes for each link
444 | - smartly append new links to existing index instead of overwriting
445 | - v0.0.2 released
446 | - proper HTML templating instead of format strings (thanks to https://github.com/bardisty!)
447 | - refactored into separate files, wip audio & video archiving
448 | - v0.0.1 released
449 | - Index links now work without nginx url rewrites, archive can now be hosted on github pages
450 | - added setup.sh script & docstrings & help commands
451 | - made Chromium the default instead of Google Chrome (yay free software)
452 | - added [env-variable](https://github.com/pirate/bookmark-archiver/pull/25) configuration (thanks to https://github.com/hannah98!)
453 | - renamed from **Pocket Archive Stream** -> **Bookmark Archiver**
454 | - added [Netscape-format](https://github.com/pirate/bookmark-archiver/pull/20) export support (thanks to https://github.com/ilvar!)
455 | - added [Pinboard-format](https://github.com/pirate/bookmark-archiver/pull/7) export support (thanks to https://github.com/sconeyard!)
456 | - front-page of HN, oops! apparently I have users to support now :grin:?
457 | - added Pocket-format export support
458 | - v0.0.0 released: created Pocket Archive Stream 2017/05/05
459 |
460 | ## Donations
461 |
462 | This project can really flourish with some more engineering effort, but unless it can support
463 | me financially I'm unlikely to be able to take it to the next level alone. It's already pretty
464 | functional and robust, but it really deserves the attention of a few more
465 | talented engineers. If you or your foundation wants to sponsor this project long-term, contact
466 | me at bookmark-archiver@sweeting.me.
467 |
468 | [Grants / Donations](https://github.com/pirate/bookmark-archiver/blob/master/DONATE.md)
469 |
--------------------------------------------------------------------------------
/archive:
--------------------------------------------------------------------------------
1 | bin/bookmark-archiver
--------------------------------------------------------------------------------
/archiver/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TarekJor/bookmark-archiver/80c02bed5548b3429128cc699c2d462f51cd2df2/archiver/__init__.py
--------------------------------------------------------------------------------
/archiver/archive.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # Bookmark Archiver
3 | # Nick Sweeting 2017 | MIT License
4 | # https://github.com/pirate/bookmark-archiver
5 |
6 | import os
7 | import sys
8 |
9 | from datetime import datetime
10 | from subprocess import run
11 |
12 | from parse import parse_links
13 | from links import validate_links
14 | from archive_methods import archive_links, _RESULTS_TOTALS
15 | from index import (
16 | write_links_index,
17 | write_link_index,
18 | parse_json_links_index,
19 | parse_json_link_index,
20 | )
21 | from config import (
22 | OUTPUT_PERMISSIONS,
23 | OUTPUT_DIR,
24 | ANSI,
25 | TIMEOUT,
26 | GIT_SHA,
27 | )
28 | from util import (
29 | download_url,
30 | progress,
31 | cleanup_archive,
32 | pretty_path,
33 | migrate_data,
34 | )
35 |
36 | __AUTHOR__ = 'Nick Sweeting '
37 | __VERSION__ = GIT_SHA
38 | __DESCRIPTION__ = 'Bookmark Archiver: Create a browsable html archive of a list of links.'
39 | __DOCUMENTATION__ = 'https://github.com/pirate/bookmark-archiver'
40 |
41 | def print_help():
42 | print(__DESCRIPTION__)
43 | print("Documentation: {}\n".format(__DOCUMENTATION__))
44 | print("Usage:")
45 | print(" ./bin/bookmark-archiver ~/Downloads/bookmarks_export.html\n")
46 |
47 |
48 | def merge_links(archive_path=OUTPUT_DIR, import_path=None):
49 | """get new links from file and optionally append them to links in existing archive"""
50 | all_links = []
51 | if import_path:
52 | # parse and validate the import file
53 | raw_links = parse_links(import_path)
54 | all_links = validate_links(raw_links)
55 |
56 | # merge existing links in archive_path and new links
57 | existing_links = []
58 | if archive_path:
59 | existing_links = parse_json_links_index(archive_path)
60 | all_links = validate_links(existing_links + all_links)
61 |
62 | num_new_links = len(all_links) - len(existing_links)
63 | if num_new_links:
64 | print('[{green}+{reset}] [{}] Adding {} new links from {} to {}/index.json'.format(
65 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
66 | num_new_links,
67 | pretty_path(import_path),
68 | pretty_path(archive_path),
69 | **ANSI,
70 | ))
71 | # else:
72 | # print('[*] [{}] No new links added to {}/index.json{}'.format(
73 | # datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
74 | # archive_path,
75 | # ' from {}'.format(import_path) if import_path else '',
76 | # **ANSI,
77 | # ))
78 |
79 | return all_links
80 |
81 | def update_archive(archive_path, links, source=None, resume=None, append=True):
82 | """update or create index.html+json given a path to an export file containing new links"""
83 |
84 | start_ts = datetime.now().timestamp()
85 |
86 | if resume:
87 | print('{green}[▶] [{}] Resuming archive downloading from {}...{reset}'.format(
88 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
89 | resume,
90 | **ANSI,
91 | ))
92 | else:
93 | print('{green}[▶] [{}] Updating files for {} links in archive...{reset}'.format(
94 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
95 | len(links),
96 | **ANSI,
97 | ))
98 |
99 | # loop over links and archive them
100 | archive_links(archive_path, links, source=source, resume=resume)
101 |
102 | # print timing information & summary
103 | end_ts = datetime.now().timestamp()
104 | seconds = end_ts - start_ts
105 | if seconds > 60:
106 | duration = '{0:.2f} min'.format(seconds / 60, 2)
107 | else:
108 | duration = '{0:.2f} sec'.format(seconds, 2)
109 |
110 | print('{}[√] [{}] Update of {} links complete ({}){}'.format(
111 | ANSI['green'],
112 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
113 | len(links),
114 | duration,
115 | ANSI['reset'],
116 | ))
117 | print(' - {} entries skipped'.format(_RESULTS_TOTALS['skipped']))
118 | print(' - {} entries updated'.format(_RESULTS_TOTALS['succeded']))
119 | print(' - {} errors'.format(_RESULTS_TOTALS['failed']))
120 |
121 |
122 | if __name__ == '__main__':
123 | argc = len(sys.argv)
124 |
125 | if set(sys.argv).intersection(('-h', '--help', 'help')):
126 | print_help()
127 | raise SystemExit(0)
128 |
129 | migrate_data()
130 |
131 | source = sys.argv[1] if argc > 1 else None # path of links file to import
132 |     resume = sys.argv[2] if argc > 2 else None      # timestamp to resume downloading from
133 |
134 | if argc == 1:
135 | source, resume = None, None
136 | elif argc == 2:
137 | if all(d.isdigit() for d in sys.argv[1].split('.')):
138 | # argv[1] is a resume timestamp
139 | source, resume = None, sys.argv[1]
140 | else:
141 | # argv[1] is a path to a file to import
142 | source, resume = sys.argv[1].strip(), None
143 | elif argc == 3:
144 | source, resume = sys.argv[1].strip(), sys.argv[2]
145 | else:
146 | print_help()
147 | raise SystemExit(1)
148 |
149 | # See if archive folder already exists
150 | for out_dir in (OUTPUT_DIR, 'bookmarks', 'pocket', 'pinboard', 'html'):
151 | if os.path.exists(out_dir):
152 | break
153 | else:
154 | out_dir = OUTPUT_DIR
155 |
156 | # Step 0: Download url to local file (only happens if a URL is specified instead of local path)
157 | if source and any(source.startswith(s) for s in ('http://', 'https://', 'ftp://')):
158 | source = download_url(source)
159 |
160 | # Step 1: Parse the links and dedupe them with existing archive
161 | links = merge_links(archive_path=out_dir, import_path=source)
162 |
163 | # Step 2: Write new index
164 | write_links_index(out_dir=out_dir, links=links)
165 |
166 | # Step 3: Verify folder structure is 1:1 with index
167 | # cleanup_archive(out_dir, links)
168 |
169 | # Step 4: Run the archive methods for each link
170 | update_archive(out_dir, links, source=source, resume=resume, append=True)
171 |
--------------------------------------------------------------------------------
/archiver/archive_methods.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 |
4 | from functools import wraps
5 | from collections import defaultdict
6 | from datetime import datetime
7 | from subprocess import run, PIPE, DEVNULL
8 |
9 | from peekable import Peekable
10 |
11 | from index import wget_output_path, parse_json_link_index, write_link_index
12 | from links import links_after_timestamp
13 | from config import (
14 | CHROME_BINARY,
15 | FETCH_WGET,
16 | FETCH_WGET_REQUISITES,
17 | FETCH_PDF,
18 | FETCH_SCREENSHOT,
19 | FETCH_DOM,
20 | RESOLUTION,
21 | CHECK_SSL_VALIDITY,
22 | SUBMIT_ARCHIVE_DOT_ORG,
23 | FETCH_AUDIO,
24 | FETCH_VIDEO,
25 | FETCH_FAVICON,
26 | WGET_USER_AGENT,
27 | CHROME_USER_DATA_DIR,
28 | TIMEOUT,
29 | ANSI,
30 | ARCHIVE_DIR,
31 | )
32 | from util import (
33 | check_dependencies,
34 | progress,
35 | chmod_file,
36 | pretty_path,
37 | )
38 |
39 |
40 | _RESULTS_TOTALS = { # globals are bad, mmkay
41 | 'skipped': 0,
42 | 'succeded': 0,
43 | 'failed': 0,
44 | }
45 |
46 | def archive_links(archive_path, links, source=None, resume=None):
47 | check_dependencies()
48 |
49 | to_archive = Peekable(links_after_timestamp(links, resume))
50 | idx, link = 0, to_archive.peek(0)
51 |
52 | try:
53 | for idx, link in enumerate(to_archive):
54 | link_dir = os.path.join(ARCHIVE_DIR, link['timestamp'])
55 | archive_link(link_dir, link)
56 |
57 | except (KeyboardInterrupt, SystemExit, Exception) as e:
58 | print('{lightyellow}[X] [{now}] Downloading paused on link {timestamp} ({idx}/{total}){reset}'.format(
59 | **ANSI,
60 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
61 | idx=idx+1,
62 | timestamp=link['timestamp'],
63 | total=len(links),
64 | ))
65 | print(' Continue where you left off by running:')
66 | print(' {} {}'.format(
67 | pretty_path(sys.argv[0]),
68 | link['timestamp'],
69 | ))
70 | if not isinstance(e, KeyboardInterrupt):
71 | raise e
72 | raise SystemExit(1)
73 |
74 |
75 | def archive_link(link_dir, link, overwrite=True):
76 | """download the DOM, PDF, and a screenshot into a folder named after the link's timestamp"""
77 |
78 | update_existing = os.path.exists(link_dir)
79 | if update_existing:
80 | link = {
81 | **parse_json_link_index(link_dir),
82 | **link,
83 | }
84 | else:
85 | os.makedirs(link_dir)
86 |
87 | log_link_archive(link_dir, link, update_existing)
88 |
89 | if FETCH_WGET:
90 | link = fetch_wget(link_dir, link, overwrite=overwrite)
91 |
92 | if FETCH_PDF:
93 | link = fetch_pdf(link_dir, link, overwrite=overwrite)
94 |
95 | if FETCH_SCREENSHOT:
96 | link = fetch_screenshot(link_dir, link, overwrite=overwrite)
97 |
98 | if FETCH_DOM:
99 | link = fetch_dom(link_dir, link, overwrite=overwrite)
100 |
101 | if SUBMIT_ARCHIVE_DOT_ORG:
102 | link = archive_dot_org(link_dir, link, overwrite=overwrite)
103 |
104 | # if FETCH_AUDIO:
105 | # link = fetch_audio(link_dir, link, overwrite=overwrite)
106 |
107 | # if FETCH_VIDEO:
108 | # link = fetch_video(link_dir, link, overwrite=overwrite)
109 |
110 | if FETCH_FAVICON:
111 | link = fetch_favicon(link_dir, link, overwrite=overwrite)
112 |
113 | write_link_index(link_dir, link)
114 | # print()
115 |
116 | return link
117 |
118 | def log_link_archive(link_dir, link, update_existing):
119 | print('[{symbol_color}{symbol}{reset}] [{now}] "{title}"\n {blue}{url}{reset}'.format(
120 | symbol='*' if update_existing else '+',
121 | symbol_color=ANSI['black' if update_existing else 'green'],
122 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
123 | **link,
124 | **ANSI,
125 | ))
126 |
127 | print(' > {}{}'.format(pretty_path(link_dir), '' if update_existing else ' (new)'))
128 | if link['type']:
129 | print(' i {}'.format(link['type']))
130 |
131 |
132 |
133 | def attach_result_to_link(method):
134 | """
135 | Instead of returning a result={output:'...', status:'success'} object,
136 |     attach that result to the link's history & latest fields, then return
137 | the updated link object.
138 | """
139 | def decorator(fetch_func):
140 | @wraps(fetch_func)
141 | def timed_fetch_func(link_dir, link, overwrite=False, **kwargs):
142 | # initialize methods and history json field on link
143 | link['latest'] = link.get('latest') or {}
144 | link['latest'][method] = link['latest'].get(method) or None
145 | link['history'] = link.get('history') or {}
146 | link['history'][method] = link['history'].get(method) or []
147 |
148 | start_ts = datetime.now().timestamp()
149 |
150 | # if a valid method output is already present, dont run the fetch function
151 | if link['latest'][method] and not overwrite:
152 | print(' √ {}'.format(method))
153 | result = None
154 | else:
155 | print(' > {}'.format(method))
156 | result = fetch_func(link_dir, link, **kwargs)
157 |
158 | end_ts = datetime.now().timestamp()
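            # elapsed time for this method, recorded as whole milliseconds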
159 | duration = str(end_ts * 1000 - start_ts * 1000).split('.')[0]
160 |
161 | # append a history item recording fail/success
162 | history_entry = {
163 | 'timestamp': str(start_ts).split('.')[0],
164 | }
165 | if result is None:
166 | history_entry['status'] = 'skipped'
167 | elif isinstance(result.get('output'), Exception):
168 | history_entry['status'] = 'failed'
169 | history_entry['duration'] = duration
170 | history_entry.update(result or {})
171 | link['history'][method].append(history_entry)
172 | else:
173 | history_entry['status'] = 'succeded'
174 | history_entry['duration'] = duration
175 | history_entry.update(result or {})
176 | link['history'][method].append(history_entry)
177 | link['latest'][method] = result['output']
178 |
179 | _RESULTS_TOTALS[history_entry['status']] += 1
180 |
181 | return link
182 | return timed_fetch_func
183 | return decorator
184 |
185 |
186 | @attach_result_to_link('wget')
187 | def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT):
188 | """download full site using wget"""
189 |
190 | domain_dir = os.path.join(link_dir, link['domain'])
191 | existing_file = wget_output_path(link)
192 | if os.path.exists(domain_dir) and existing_file:
193 | return {'output': existing_file, 'status': 'skipped'}
194 |
195 | CMD = [
196 | # WGET CLI Docs: https://www.gnu.org/software/wget/manual/wget.html
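        # -N timestamping, -E adjust-extension (append .html), -np no-parent, -x force-directories,
        # -H span-hosts, -k convert-links, -K backup-converted (keep .orig), -S print server response headers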
197 | *'wget -N -E -np -x -H -k -K -S --restrict-file-names=unix'.split(' '),
198 | *(('-p',) if FETCH_WGET_REQUISITES else ()),
199 | *(('--user-agent="{}"'.format(WGET_USER_AGENT),) if WGET_USER_AGENT else ()),
200 | *((() if CHECK_SSL_VALIDITY else ('--no-check-certificate',))),
201 | link['url'],
202 | ]
203 | end = progress(timeout, prefix=' ')
204 | try:
205 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # index.html
206 | end()
207 | output = wget_output_path(link, look_in=domain_dir)
208 |
209 | # Check for common failure cases
210 | if result.returncode > 0:
211 | print(' got wget response code {}:'.format(result.returncode))
212 | if result.returncode != 8:
213 | print('\n'.join(' ' + line for line in (result.stderr or result.stdout).decode().rsplit('\n', 10)[-10:] if line.strip()))
214 | if b'403: Forbidden' in result.stderr:
215 | raise Exception('403 Forbidden (try changing WGET_USER_AGENT)')
216 | if b'404: Not Found' in result.stderr:
217 | raise Exception('404 Not Found')
218 | if b'ERROR 500: Internal Server Error' in result.stderr:
219 | raise Exception('500 Internal Server Error')
220 | if result.returncode == 4:
221 | raise Exception('Failed wget download')
222 | except Exception as e:
223 | end()
224 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
225 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
226 | output = e
227 |
228 | return {
229 | 'cmd': CMD,
230 | 'output': output,
231 | }
232 |
233 |
234 | @attach_result_to_link('pdf')
235 | def fetch_pdf(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR):
236 | """print PDF of site to file using chrome --headless"""
237 |
238 | if link['type'] in ('PDF', 'image'):
239 | return {'output': wget_output_path(link)}
240 |
241 | if os.path.exists(os.path.join(link_dir, 'output.pdf')):
242 | return {'output': 'output.pdf', 'status': 'skipped'}
243 |
244 | CMD = [
245 | *chrome_headless(user_data_dir=user_data_dir),
246 | '--print-to-pdf',
247 | link['url']
248 | ]
249 | end = progress(timeout, prefix=' ')
250 | try:
251 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.pdf
252 | end()
253 | if result.returncode:
254 | print(' ', (result.stderr or result.stdout).decode())
255 | raise Exception('Failed to print PDF')
256 | chmod_file('output.pdf', cwd=link_dir)
257 | output = 'output.pdf'
258 | except Exception as e:
259 | end()
260 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
261 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
262 | output = e
263 |
264 | return {
265 | 'cmd': CMD,
266 | 'output': output,
267 | }
268 |
269 | @attach_result_to_link('screenshot')
270 | def fetch_screenshot(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR, resolution=RESOLUTION):
271 | """take screenshot of site using chrome --headless"""
272 |
273 | if link['type'] in ('PDF', 'image'):
274 | return {'output': wget_output_path(link)}
275 |
276 | if os.path.exists(os.path.join(link_dir, 'screenshot.png')):
277 | return {'output': 'screenshot.png', 'status': 'skipped'}
278 |
279 | CMD = [
280 | *chrome_headless(user_data_dir=user_data_dir),
281 | '--screenshot',
282 | '--window-size={}'.format(resolution),
283 | '--hide-scrollbars',
284 | # '--full-page', # TODO: make this actually work using ./bin/screenshot fullPage: true
285 | link['url'],
286 | ]
287 | end = progress(timeout, prefix=' ')
288 | try:
289 |         result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1)  # screenshot.png
290 | end()
291 | if result.returncode:
292 | print(' ', (result.stderr or result.stdout).decode())
293 | raise Exception('Failed to take screenshot')
294 | chmod_file('screenshot.png', cwd=link_dir)
295 | output = 'screenshot.png'
296 | except Exception as e:
297 | end()
298 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
299 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
300 | output = e
301 |
302 | return {
303 | 'cmd': CMD,
304 | 'output': output,
305 | }
306 |
307 | @attach_result_to_link('dom')
308 | def fetch_dom(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR):
309 |     """print HTML of site to file using chrome --dump-dom"""
310 |
311 | if link['type'] in ('PDF', 'image'):
312 | return {'output': wget_output_path(link)}
313 |
314 | output_path = os.path.join(link_dir, 'output.html')
315 |
316 | if os.path.exists(output_path):
317 | return {'output': 'output.html', 'status': 'skipped'}
318 |
319 | CMD = [
320 | *chrome_headless(user_data_dir=user_data_dir),
321 | '--dump-dom',
322 | link['url']
323 | ]
324 | end = progress(timeout, prefix=' ')
325 | try:
326 | with open(output_path, 'w+') as f:
327 | result = run(CMD, stdout=f, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.html
328 | end()
329 | if result.returncode:
330 | print(' ', (result.stderr).decode())
331 | raise Exception('Failed to fetch DOM')
332 | chmod_file('output.html', cwd=link_dir)
333 | output = 'output.html'
334 | except Exception as e:
335 | end()
336 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
337 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
338 | output = e
339 |
340 | return {
341 | 'cmd': CMD,
342 | 'output': output,
343 | }
344 |
345 | @attach_result_to_link('archive_org')
346 | def archive_dot_org(link_dir, link, timeout=TIMEOUT):
347 | """submit site to archive.org for archiving via their service, save returned archive url"""
348 |
349 | path = os.path.join(link_dir, 'archive.org.txt')
350 | if os.path.exists(path):
351 | archive_org_url = open(path, 'r').read().strip()
352 | return {'output': archive_org_url, 'status': 'skipped'}
353 |
354 | submit_url = 'https://web.archive.org/save/{}'.format(link['url'].split('?', 1)[0])
355 |
356 | success = False
357 | CMD = ['curl', '-I', submit_url]
358 | end = progress(timeout, prefix=' ')
359 | try:
360 | result = run(CMD, stdout=PIPE, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # archive.org.txt
361 | end()
362 |
363 | # Parse archive.org response headers
364 | headers = defaultdict(list)
365 |
366 | # lowercase all the header names and store in dict
367 | for header in result.stdout.splitlines():
368 | if b':' not in header or not header.strip():
369 | continue
370 | name, val = header.decode().split(':', 1)
371 | headers[name.lower().strip()].append(val.strip())
372 |
373 | # Get successful archive url in "content-location" header or any errors
374 | content_location = headers['content-location']
375 | errors = headers['x-archive-wayback-runtime-error']
376 |
377 | if content_location:
378 | saved_url = 'https://web.archive.org{}'.format(content_location[0])
379 | success = True
380 | elif len(errors) == 1 and 'RobotAccessControlException' in errors[0]:
381 | output = submit_url
382 | # raise Exception('Archive.org denied by {}/robots.txt'.format(link['domain']))
383 | elif errors:
384 | raise Exception(', '.join(errors))
385 | else:
386 | raise Exception('Failed to find "content-location" URL header in Archive.org response.')
387 | except Exception as e:
388 | end()
389 | print(' Visit url to see output:', ' '.join(CMD))
390 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
391 | output = e
392 |
393 | if success:
394 | with open(os.path.join(link_dir, 'archive.org.txt'), 'w', encoding='utf-8') as f:
395 | f.write(saved_url)
396 | chmod_file('archive.org.txt', cwd=link_dir)
397 | output = saved_url
398 |
399 | return {
400 | 'cmd': CMD,
401 | 'output': output,
402 | }
403 |
404 | @attach_result_to_link('favicon')
405 | def fetch_favicon(link_dir, link, timeout=TIMEOUT):
406 | """download site favicon from google's favicon api"""
407 |
408 | if os.path.exists(os.path.join(link_dir, 'favicon.ico')):
409 | return {'output': 'favicon.ico', 'status': 'skipped'}
410 |
411 | CMD = ['curl', 'https://www.google.com/s2/favicons?domain={domain}'.format(**link)]
412 | fout = open('{}/favicon.ico'.format(link_dir), 'w')
413 | end = progress(timeout, prefix=' ')
414 | try:
415 | run(CMD, stdout=fout, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # favicon.ico
416 | fout.close()
417 | end()
418 | chmod_file('favicon.ico', cwd=link_dir)
419 | output = 'favicon.ico'
420 | except Exception as e:
421 | fout.close()
422 | end()
423 | print(' Run to see full output:', ' '.join(CMD))
424 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
425 | output = e
426 |
427 | return {
428 | 'cmd': CMD,
429 | 'output': output,
430 | }
431 |
432 | # @attach_result_to_link('audio')
433 | # def fetch_audio(link_dir, link, timeout=TIMEOUT):
434 | # """Download audio rip using youtube-dl"""
435 |
436 | # if link['type'] not in ('soundcloud',)\
437 | # and 'audio' not in link['tags']:
438 | # return
439 |
440 | # path = os.path.join(link_dir, 'audio')
441 |
442 | # if not os.path.exists(path) or overwrite:
443 | # print(' - Downloading audio')
444 | # CMD = [
445 | # "youtube-dl -x --audio-format mp3 --audio-quality 0 -o '%(title)s.%(ext)s'",
446 | # link['url'],
447 | # ]
448 | # end = progress(timeout, prefix=' ')
449 | # try:
450 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # audio/audio.mp3
451 | # end()
452 | # if result.returncode:
453 | # print(' ', result.stderr.decode())
454 | # raise Exception('Failed to download audio')
455 | # chmod_file('audio.mp3', cwd=link_dir)
456 | # return 'audio.mp3'
457 | # except Exception as e:
458 | # end()
459 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
460 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
461 | # raise
462 | # else:
463 | # print(' √ Skipping audio download')
464 |
465 | # @attach_result_to_link('video')
466 | # def fetch_video(link_dir, link, timeout=TIMEOUT):
467 | # """Download video rip using youtube-dl"""
468 |
469 | # if link['type'] not in ('youtube', 'youku', 'vimeo')\
470 | # and 'video' not in link['tags']:
471 | # return
472 |
473 | # path = os.path.join(link_dir, 'video')
474 |
475 | # if not os.path.exists(path) or overwrite:
476 | # print(' - Downloading video')
477 | # CMD = [
478 | # "youtube-dl -x --video-format mp4 --audio-quality 0 -o '%(title)s.%(ext)s'",
479 | # link['url'],
480 | # ]
481 | # end = progress(timeout, prefix=' ')
482 | # try:
483 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # video/movie.mp4
484 | # end()
485 | # if result.returncode:
486 | # print(' ', result.stderr.decode())
487 | # raise Exception('Failed to download video')
488 | # chmod_file('video.mp4', cwd=link_dir)
489 | # return 'video.mp4'
490 | # except Exception as e:
491 | # end()
492 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD)))
493 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset']))
494 | # raise
495 | # else:
496 | # print(' √ Skipping video download')
497 |
498 |
499 | def chrome_headless(binary=CHROME_BINARY, user_data_dir=CHROME_USER_DATA_DIR):
500 | args = [binary, '--headless'] # '--disable-gpu'
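    # default macOS Chrome profile path, used as a fallback below so existing cookies/logins are reused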
501 | default_profile = os.path.expanduser('~/Library/Application Support/Google/Chrome/Default')
502 | if user_data_dir:
503 | args.append('--user-data-dir={}'.format(user_data_dir))
504 | elif os.path.exists(default_profile):
505 | args.append('--user-data-dir={}'.format(default_profile))
506 | return args
507 |
--------------------------------------------------------------------------------
/archiver/config.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import shutil
4 |
5 | from subprocess import run, PIPE
6 |
7 | # ******************************************************************************
8 | # * TO SET YOUR CONFIGURATION, EDIT THE VALUES BELOW, or use the 'env' command *
9 | # * e.g. *
10 | # * env USE_COLOR=True CHROME_BINARY=google-chrome ./archive.py export.html *
11 | # ******************************************************************************
12 |
13 | IS_TTY = sys.stdout.isatty()
14 | USE_COLOR = os.getenv('USE_COLOR', str(IS_TTY) ).lower() == 'true'
15 | SHOW_PROGRESS = os.getenv('SHOW_PROGRESS', str(IS_TTY) ).lower() == 'true'
16 | FETCH_WGET = os.getenv('FETCH_WGET', 'True' ).lower() == 'true'
17 | FETCH_WGET_REQUISITES = os.getenv('FETCH_WGET_REQUISITES', 'True' ).lower() == 'true'
18 | FETCH_AUDIO = os.getenv('FETCH_AUDIO', 'False' ).lower() == 'true'
19 | FETCH_VIDEO = os.getenv('FETCH_VIDEO', 'False' ).lower() == 'true'
20 | FETCH_PDF = os.getenv('FETCH_PDF', 'True' ).lower() == 'true'
21 | FETCH_SCREENSHOT = os.getenv('FETCH_SCREENSHOT', 'True' ).lower() == 'true'
22 | FETCH_DOM = os.getenv('FETCH_DOM', 'True' ).lower() == 'true'
23 | FETCH_FAVICON = os.getenv('FETCH_FAVICON', 'True' ).lower() == 'true'
24 | SUBMIT_ARCHIVE_DOT_ORG = os.getenv('SUBMIT_ARCHIVE_DOT_ORG', 'True' ).lower() == 'true'
25 | RESOLUTION = os.getenv('RESOLUTION', '1440,1200' )
26 | CHECK_SSL_VALIDITY = os.getenv('CHECK_SSL_VALIDITY', 'True' ).lower() == 'true'
27 | OUTPUT_PERMISSIONS = os.getenv('OUTPUT_PERMISSIONS', '755' )
28 | CHROME_BINARY = os.getenv('CHROME_BINARY', 'chromium-browser' ) # change to google-chrome browser if using google-chrome
29 | WGET_BINARY = os.getenv('WGET_BINARY', 'wget' )
30 | WGET_USER_AGENT = os.getenv('WGET_USER_AGENT', 'Bookmark Archiver')
31 | CHROME_USER_DATA_DIR = os.getenv('CHROME_USER_DATA_DIR', None)
32 | TIMEOUT = int(os.getenv('TIMEOUT', '60'))
33 | FOOTER_INFO = os.getenv('FOOTER_INFO', 'Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.',)
34 |
35 | ### Paths
36 | REPO_DIR = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))
37 |
38 | OUTPUT_DIR = os.path.join(REPO_DIR, 'output')
39 | ARCHIVE_DIR = os.path.join(OUTPUT_DIR, 'archive')
40 | SOURCES_DIR = os.path.join(OUTPUT_DIR, 'sources')
41 |
42 | PYTHON_PATH = os.path.join(REPO_DIR, 'archiver')
43 | TEMPLATES_DIR = os.path.join(PYTHON_PATH, 'templates')
44 |
45 | # ******************************************************************************
46 | # ********************** Do not edit below this point **************************
47 | # ******************************************************************************
48 |
49 | ### Terminal Configuration
50 | TERM_WIDTH = shutil.get_terminal_size((100, 10)).columns
51 | ANSI = {
52 | 'reset': '\033[00;00m',
53 | 'lightblue': '\033[01;30m',
54 | 'lightyellow': '\033[01;33m',
55 | 'lightred': '\033[01;35m',
56 | 'red': '\033[01;31m',
57 | 'green': '\033[01;32m',
58 | 'blue': '\033[01;34m',
59 | 'white': '\033[01;37m',
60 | 'black': '\033[01;30m',
61 | }
62 | if not USE_COLOR:
63 | # don't show colors if USE_COLOR is False
64 | ANSI = {k: '' for k in ANSI.keys()}
65 |
66 | ### Confirm Environment Setup
67 | try:
68 | GIT_SHA = run(["git", "rev-list", "-1", "HEAD", "./"], stdout=PIPE, cwd=REPO_DIR).stdout.strip().decode()
69 | except Exception:
70 | GIT_SHA = 'unknown'
71 | print('[!] Warning: you need git installed for some archiving features to save correct version numbers!')
72 |
73 | if sys.stdout.encoding.upper() != 'UTF-8':
74 | print('[X] Your system is running python3 scripts with a bad locale setting: {} (it should be UTF-8).'.format(sys.stdout.encoding))
75 | print(' To fix it, add the line "export PYTHONIOENCODING=utf8" to your ~/.bashrc file (without quotes)')
76 | print('')
77 | print(' Confirm that it\'s fixed by opening a new shell and running:')
78 | print(' python3 -c "import sys; print(sys.stdout.encoding)" # should output UTF-8')
79 | print('')
80 | print(' Alternatively, run this script with:')
81 | print(' env PYTHONIOENCODING=utf8 ./archive.py export.html')
82 |
--------------------------------------------------------------------------------
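Note: all of the boolean options above are parsed with the same `.lower() == 'true'` pattern, so only the case-insensitive string "True" enables a feature; anything else, including "1" or "yes", disables it. A small standalone illustration of that behavior (FETCH_VIDEO used here purely as an example override):

    import os

    os.environ['FETCH_VIDEO'] = 'TRUE'
    print(os.getenv('FETCH_VIDEO', 'False').lower() == 'true')   # True

    os.environ['FETCH_VIDEO'] = '1'
    print(os.getenv('FETCH_VIDEO', 'False').lower() == 'true')   # False: only 'true' enables it
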
/archiver/index.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 |
4 | from datetime import datetime
5 | from string import Template
6 | from distutils.dir_util import copy_tree
7 |
8 | from config import (
9 | TEMPLATES_DIR,
10 | OUTPUT_PERMISSIONS,
11 | ANSI,
12 | GIT_SHA,
13 | FOOTER_INFO,
14 | )
15 | from util import (
16 | chmod_file,
17 | wget_output_path,
18 | derived_link_info,
19 | pretty_path,
20 | )
21 |
22 |
23 | ### Homepage index for all the links
24 |
25 | def write_links_index(out_dir, links):
26 | """create index.html file for a given list of links"""
27 |
28 | if not os.path.exists(out_dir):
29 | os.makedirs(out_dir)
30 |
31 | write_json_links_index(out_dir, links)
32 | write_html_links_index(out_dir, links)
33 |
34 | print('{green}[√] [{}] Updated main index files:{reset}'.format(
35 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
36 | **ANSI))
37 | print(' > {}/index.json'.format(pretty_path(out_dir)))
38 | print(' > {}/index.html'.format(pretty_path(out_dir)))
39 |
40 | def write_json_links_index(out_dir, links):
41 | """write the json link index to a given path"""
42 |
43 | path = os.path.join(out_dir, 'index.json')
44 |
45 | index_json = {
46 | 'info': 'Bookmark Archiver Index',
47 | 'help': 'https://github.com/pirate/bookmark-archiver',
48 | 'version': GIT_SHA,
49 | 'num_links': len(links),
50 | 'updated': str(datetime.now().timestamp()),
51 | 'links': links,
52 | }
53 |
54 | with open(path, 'w', encoding='utf-8') as f:
55 | json.dump(index_json, f, indent=4, default=str)
56 |
57 | chmod_file(path)
58 |
59 | def parse_json_links_index(out_dir):
60 | """load the index in a given directory and merge it with the given link"""
61 | index_path = os.path.join(out_dir, 'index.json')
62 | if os.path.exists(index_path):
63 | with open(index_path, 'r', encoding='utf-8') as f:
64 | return json.load(f)['links']
65 |
66 | return []
67 |
68 | def write_html_links_index(out_dir, links):
69 | """write the html link index to a given path"""
70 |
71 | path = os.path.join(out_dir, 'index.html')
72 |
73 | copy_tree(os.path.join(TEMPLATES_DIR, 'static'), os.path.join(out_dir, 'static'))
74 |
75 | with open(os.path.join(TEMPLATES_DIR, 'index.html'), 'r', encoding='utf-8') as f:
76 | index_html = f.read()
77 |
78 | with open(os.path.join(TEMPLATES_DIR, 'index_row.html'), 'r', encoding='utf-8') as f:
79 | link_row_html = f.read()
80 |
81 | link_rows = '\n'.join(
82 | Template(link_row_html).substitute(**derived_link_info(link))
83 | for link in links
84 | )
85 |
86 | template_vars = {
87 | 'num_links': len(links),
88 | 'date_updated': datetime.now().strftime('%Y-%m-%d'),
89 | 'time_updated': datetime.now().strftime('%Y-%m-%d %H:%M'),
90 | 'footer_info': FOOTER_INFO,
91 | 'git_sha': GIT_SHA,
92 | 'short_git_sha': GIT_SHA[:8],
93 | 'rows': link_rows,
94 | }
95 |
96 | with open(path, 'w', encoding='utf-8') as f:
97 | f.write(Template(index_html).substitute(**template_vars))
98 |
99 | chmod_file(path)
100 |
101 |
102 | ### Individual link index
103 |
104 | def write_link_index(out_dir, link):
105 | link['updated'] = str(datetime.now().timestamp())
106 | write_json_link_index(out_dir, link)
107 | write_html_link_index(out_dir, link)
108 |
109 | def write_json_link_index(out_dir, link):
110 | """write a json file with some info about the link"""
111 |
112 | path = os.path.join(out_dir, 'index.json')
113 |
114 | print(' √ index.json')
115 |
116 | with open(path, 'w', encoding='utf-8') as f:
117 | json.dump(link, f, indent=4, default=str)
118 |
119 | chmod_file(path)
120 |
121 | def parse_json_link_index(out_dir):
122 | """load the json link index from a given directory"""
123 | existing_index = os.path.join(out_dir, 'index.json')
124 | if os.path.exists(existing_index):
125 | with open(existing_index, 'r', encoding='utf-8') as f:
126 | return json.load(f)
127 | return {}
128 |
129 | def write_html_link_index(out_dir, link):
130 | with open(os.path.join(TEMPLATES_DIR, 'link_index_fancy.html'), 'r', encoding='utf-8') as f:
131 | link_html = f.read()
132 |
133 | path = os.path.join(out_dir, 'index.html')
134 |
135 | print(' √ index.html')
136 |
137 | with open(path, 'w', encoding='utf-8') as f:
138 | f.write(Template(link_html).substitute({
139 | **link,
140 | **link['latest'],
141 | 'type': link['type'] or 'website',
142 | 'tags': link['tags'] or 'untagged',
143 | 'bookmarked': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
144 | 'updated': datetime.fromtimestamp(float(link['updated'])).strftime('%Y-%m-%d %H:%M'),
145 | 'bookmarked_ts': link['timestamp'],
146 | 'updated_ts': link['updated'],
147 | 'archive_org': link['latest'].get('archive_org') or 'https://web.archive.org/save/{}'.format(link['url']),
148 | 'wget': link['latest'].get('wget') or wget_output_path(link),
149 | }))
150 |
151 | chmod_file(path)
152 |
--------------------------------------------------------------------------------
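For reference, write_json_links_index() above produces an index.json with the following top-level shape (the link entry below is illustrative placeholder data, not real output):

    example_index = {
        'info': 'Bookmark Archiver Index',
        'help': 'https://github.com/pirate/bookmark-archiver',
        'version': '<git sha from config.GIT_SHA>',
        'num_links': 1,
        'updated': '1515000000.0',
        'links': [
            {'url': 'https://example.com', 'timestamp': '1515000000', 'title': 'Example', 'latest': {}},
        ],
    }
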
/archiver/links.py:
--------------------------------------------------------------------------------
1 | """
2 | In Bookmark Archiver, a Link represents a single entry that we track in the
3 | json index. All links pass through all archiver functions and the latest,
4 | most up-to-date canonical output for each is stored in "latest".
5 |
6 |
7 | Link {
8 | timestamp: str, (how we uniquely id links) _ _ _ _ ___
9 | url: str, | \ / \ |\| ' |
10 | base_url: str, |_/ \_/ | | |
11 | domain: str, _ _ _ _ _ _
12 | tags: str, |_) /| |\| | / `
13 | type: str, | /"| | | | \_,
14 | title: str, ,-'"`-.
15 | sources: [str], /// / @ @ \ \\\\
16 | latest: { \ :=| ,._,. |=: /
17 | ..., || ,\ \_../ /. ||
18 | pdf: 'output.pdf', ||','`-._))'`.`||
19 | wget: 'example.com/1234/index.html' `-' (/ `-'
20 | },
21 | history: {
22 | ...
23 | pdf: [
24 | {timestamp: 15444234325, status: 'skipped', result='output.pdf'},
25 | ...
26 | ],
27 | wget: [
28 | {timestamp: 11534435345, status: 'succeeded', result='donuts.com/eat/them.html'}
29 | ]
30 | },
31 | }
32 |
33 | """
34 |
35 | import datetime
36 | from html import unescape
37 |
38 | from util import (
39 | domain,
40 | base_url,
41 | str_between,
42 | get_link_type,
43 | merge_links,
44 | wget_output_path,
45 | )
46 | from config import ANSI
47 |
48 |
49 | def validate_links(links):
50 | links = archivable_links(links) # remove chrome://, about:, mailto: etc.
51 | links = uniquefied_links(links) # merge/dedupe duplicate timestamps & urls
52 | links = sorted_links(links) # deterministically sort the links based on timestamp, url
53 |
54 | if not links:
55 | print('[X] No links found :(')
56 | raise SystemExit(1)
57 |
58 | for link in links:
59 | link['title'] = unescape(link['title'])
60 | link['latest'] = link.get('latest') or {}
61 |
62 | if not link['latest'].get('wget'):
63 | link['latest']['wget'] = wget_output_path(link)
64 |
65 | if not link['latest'].get('pdf'):
66 | link['latest']['pdf'] = None
67 |
68 | if not link['latest'].get('screenshot'):
69 | link['latest']['screenshot'] = None
70 |
71 | if not link['latest'].get('dom'):
72 | link['latest']['dom'] = None
73 |
74 | return list(links)
75 |
76 |
77 | def archivable_links(links):
78 | """remove chrome://, about:// or other schemed links that cant be archived"""
79 | return (
80 | link
81 | for link in links
82 | if any(link['url'].startswith(s) for s in ('http://', 'https://', 'ftp://'))
83 | )
84 |
85 | def uniquefied_links(sorted_links):
86 | """
87 | ensures that all links have unique timestamps, merging any links that share the same url
88 | """
89 |
90 | unique_urls = {}
91 |
92 | lower = lambda url: url.lower().strip()
93 | without_www = lambda url: url.replace('://www.', '://', 1)
94 | without_trailing_slash = lambda url: url[:-1] if url[-1] == '/' else url.replace('/?', '?')
95 |
96 | for link in sorted_links:
97 | fuzzy_url = without_www(without_trailing_slash(lower(link['url'])))
98 | if fuzzy_url in unique_urls:
99 | # merge with any other links that share the same url
100 | link = merge_links(unique_urls[fuzzy_url], link)
101 | unique_urls[fuzzy_url] = link
102 |
103 | unique_timestamps = {}
104 | for link in unique_urls.values():
105 | link['timestamp'] = lowest_uniq_timestamp(unique_timestamps, link['timestamp'])
106 | unique_timestamps[link['timestamp']] = link
107 |
108 | return unique_timestamps.values()
109 |
110 | def sorted_links(links):
111 | sort_func = lambda link: (link['timestamp'], link['url'])
112 | return sorted(links, key=sort_func, reverse=True)
113 |
114 | def links_after_timestamp(links, timestamp=None):
115 | if not timestamp:
116 | yield from links
117 | return
118 |
119 | for link in links:
120 | try:
121 | if float(link['timestamp']) <= float(timestamp):
122 | yield link
123 | except (ValueError, TypeError):
124 | print('Resume value and all timestamp values must be valid numbers.')
125 |
126 | def lowest_uniq_timestamp(used_timestamps, timestamp):
127 | """resolve duplicate timestamps by appending a decimal 1234, 1234 -> 1234.1, 1234.2"""
128 |
129 | timestamp = timestamp.split('.')[0]
130 | nonce = 0
131 |
132 | # first try 152323423 before 152323423.0
133 | if timestamp not in used_timestamps:
134 | return timestamp
135 |
136 | new_timestamp = '{}.{}'.format(timestamp, nonce)
137 | while new_timestamp in used_timestamps:
138 | nonce += 1
139 | new_timestamp = '{}.{}'.format(timestamp, nonce)
140 |
141 | return new_timestamp
142 |
--------------------------------------------------------------------------------
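To make the timestamp de-duplication above concrete, here is how lowest_uniq_timestamp() resolves collisions (link values elided; only the keys of the used_timestamps dict matter):

    used = {'1515000000': {}}
    lowest_uniq_timestamp(used, '1515000000')   # -> '1515000000.0'
    used['1515000000.0'] = {}
    lowest_uniq_timestamp(used, '1515000000')   # -> '1515000000.1'
    lowest_uniq_timestamp(used, '1515999999')   # -> '1515999999' (already unique, returned as-is)
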
/archiver/parse.py:
--------------------------------------------------------------------------------
1 | """
2 | Everything related to parsing links from bookmark services.
3 |
4 | For a list of supported services, see the README.md.
5 | For examples of supported export files, see the tests/ directory.
6 |
7 | Parsed link schema: {
8 | 'url': 'https://example.com/example/?abc=123&xyc=345#lmnop',
9 | 'domain': 'example.com',
10 | 'base_url': 'example.com/example/',
11 | 'timestamp': '15442123124234',
12 | 'tags': 'abc,def',
13 | 'title': 'Example.com Page Title',
14 | 'sources': ['ril_export.html', 'downloads/getpocket.com.txt'],
15 | }
16 | """
17 |
18 | import re
19 | import json
20 | import xml.etree.ElementTree as etree
21 |
22 | from datetime import datetime
23 |
24 | from util import (
25 | domain,
26 | base_url,
27 | str_between,
28 | get_link_type,
29 | )
30 |
31 |
32 | def get_parsers(file):
33 | """return all parsers that work on a given file, defaults to all of them"""
34 |
35 | return {
36 | 'pocket': parse_pocket_export,
37 | 'pinboard': parse_json_export,
38 | 'bookmarks': parse_bookmarks_export,
39 | 'rss': parse_rss_export,
40 | 'pinboard_rss': parse_pinboard_rss_feed,
41 | 'medium_rss': parse_medium_rss_feed,
42 | }
43 |
44 | def parse_links(path):
45 | """parse a list of links dictionaries from a bookmark export file"""
46 |
47 | links = []
48 | with open(path, 'r', encoding='utf-8') as file:
49 | for parser_func in get_parsers(file).values():
50 | # try each parser in turn until one of them returns links
51 | try:
52 | links += list(parser_func(file))
53 | if links:
54 | break
55 | except (ValueError, TypeError, IndexError, AttributeError, etree.ParseError):
56 | # parser not supported on this file
57 | pass
58 |
59 | return links
60 |
61 |
62 | def parse_pocket_export(html_file):
63 | """Parse Pocket-format bookmarks export files (produced by getpocket.com/export/)"""
64 |
65 | html_file.seek(0)
66 | pattern = re.compile("^\\s*