├── .gitignore ├── DONATE.md ├── LICENSE ├── README.md ├── archive ├── archiver ├── __init__.py ├── archive.py ├── archive_methods.py ├── config.py ├── index.py ├── links.py ├── parse.py ├── peekable.py ├── templates │ ├── index.html │ ├── index_row.html │ ├── link_index.html │ ├── link_index_fancy.html │ └── static │ │ ├── archive.png │ │ ├── external.png │ │ └── spinner.gif ├── tests │ ├── firefox_export.html │ ├── pinboard_export.json │ ├── pocket_export.html │ └── rss_export.xml └── util.py ├── bin ├── bookmark-archiver ├── export-browser-history └── setup-bookmark-archiver └── setup /.gitignore: -------------------------------------------------------------------------------- 1 | # OS cruft 2 | .DS_Store 3 | ._* 4 | 5 | # python 6 | __pycache__/ 7 | archiver/venv 8 | 9 | # vim 10 | .swp* 11 | 12 | # output artifacts 13 | output/ 14 | -------------------------------------------------------------------------------- /DONATE.md: -------------------------------------------------------------------------------- 1 | # Donate 2 | 3 | Right now I'm working on this project in my spare time and accepting the occasional PR. 4 | If you want me to dedicate more time to it, donate to support development! 5 | 6 | - Ethereum: 0x5B0F85FFc44fD759C2d97f0BE4681279966f3832 7 | - Bitcoin: https://shapeshift.io/ BTC -> to address above (ETH) 8 | - Paypal: https://www.paypal.me/NicholasSweeting/25 9 | 10 | The eventual goal is to support one or two developers full-time on this project via donations. 11 | With more engineering power this can become a distributed archive service with a nice UI, 12 | like the Way-Back Machine but hosted by everyone! 13 | 14 | If you have any questions or want to sponsor this project long-term, contact me at 15 | bookmark-archiver@sweeting.me 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Nick Sweeting 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE.
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Bookmark Archiver [![Github Stars](https://img.shields.io/github/stars/pirate/bookmark-archiver.svg)](https://github.com/pirate/bookmark-archiver) [![Twitter URL](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/thesquashSH) 2 | 3 | "Your own personal Way-Back Machine" 4 | 5 | ▶️ [Quickstart](#quickstart) | [Details](#details) | [Configuration](#configuration) | [Manual Setup](#manual-setup) | [Troubleshooting](#troubleshooting) | [Demo](https://archive.sweeting.me) | [Changelog](#changelog) | [Donate](https://github.com/pirate/bookmark-archiver/blob/master/DONATE.md) 6 | 7 | --- 8 | 9 | Save an archived copy of all websites you bookmark (the actual *content* of each site, not just the list of bookmarks). 10 | 11 | Can import links from: 12 | 13 | - Browser history & bookmarks (Chrome, Firefox, Safari, IE, Opera) 14 | - Pocket 15 | - Pinboard 16 | - RSS or plain text lists 17 | - Shaarli, Delicious, Instapaper, Reddit Saved Posts, Wallabag, Unmark.it, and more! 18 | 19 | For each site, it outputs (configurable): 20 | 21 | - Browsable static HTML archive (wget) 22 | - PDF (Chrome headless) 23 | - Screenshot (Chrome headless) 24 | - DOM dump (Chrome headless) 25 | - Favicon 26 | - Submits URL to archive.org 27 | - Index summary pages: index.html & index.json 28 | 29 | The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index. 30 | All the saved content is static and indexed with json files, so it lives forever & is easily parseable, it requires no always-running backend. 31 | 32 | [DEMO: archive.sweeting.me](https://archive.sweeting.me) 33 | 34 | Desktop ScreenshotMobile Screenshot
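Because everything is indexed with plain JSON, you can script against the archive directly. Here's a minimal sketch (assuming the default `output/index.json` produced by `./archive` and its standard top-level fields) that lists everything in your archive:

```python
import json

# read the main archive index written by ./archive
with open('output/index.json', 'r', encoding='utf-8') as f:
    index = json.load(f)

print('{} links, last updated {}'.format(index['num_links'], index['updated']))
for link in index['links']:
    print(link['timestamp'], link['url'])
```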
35 | 36 | ## Quickstart 37 | 38 | **1. Get your list of URLs:** 39 | 40 | Follow the links here to find instructions for exporting bookmarks from each service. 41 | 42 | - [Pocket](https://getpocket.com/export) 43 | - [Pinboard](https://pinboard.in/export/) 44 | - [Instapaper](https://www.instapaper.com/user/export) 45 | - [Reddit Saved Posts](https://github.com/csu/export-saved-reddit) 46 | - [Shaarli](http://shaarli.readthedocs.io/en/master/Backup,-restore,-import-and-export/#export-links-as) 47 | - [Unmark.it](http://help.unmark.it/import-export) 48 | - [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html) 49 | - [Chrome Bookmarks](https://support.google.com/chrome/answer/96816?hl=en) 50 | - [Firefox Bookmarks](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer) 51 | - [Safari Bookmarks](http://i.imgur.com/AtcvUZA.png) 52 | - [Opera Bookmarks](http://help.opera.com/Windows/12.10/en/importexport.html) 53 | - [Internet Explorer Bookmarks](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows) 54 | - Chrome History: `./bin/export-browser-history --chrome` 55 | - Firefox History: `./bin/export-browser-history --firefox` 56 | - Other File or URL: (e.g. RSS feed) pass as second argument in the next step 57 | 58 | (If any of these links are broken, please submit an issue and I'll fix it) 59 | 60 | **2. Create your archive:** 61 | 62 | ```bash 63 | git clone https://github.com/pirate/bookmark-archiver 64 | cd bookmark-archiver/ 65 | ./setup # install all dependencies 66 | 67 | # add a list of links from a file 68 | ./archive ~/Downloads/bookmark_export.html # replace with the path to your export file or URL from step 1 69 | 70 | # OR add a list of links from remote URL 71 | ./archive "https://getpocket.com/users/yourusername/feed/all" # url to an RSS, html, or json links file 72 | 73 | # OR add all the links from your browser history 74 | ./bin/export-browser-history --chrome # works with --firefox as well, can take path to SQLite history db 75 | ./archive output/sources/chrome_history.json 76 | 77 | # OR just continue archiving the existing links in the index 78 | ./archive # at any point if you just want to continue archiving where you left off, without adding any new links 79 | ``` 80 | 81 | **3. Done!** 82 | 83 | You can open `output/index.html` to view your archive. (favicons will appear next to each title once it has finished downloading) 84 | 85 | If you want to host your archive somewhere to share it with other people, see the [Publishing Your Archive](#publishing-your-archive) section below. 86 | 87 | **4. (Optional) Schedule it to run every day** 88 | 89 | You can import links from any local file path or feed url by changing the second argument to `archive.py`. 90 | Bookmark Archiver will ignore links that are imported multiple times, it will keep the earliest version that it's seen. 91 | This means you can add multiple cron jobs to pull links from several different feeds or files each day, 92 | it will keep the index up-to-date without duplicate links. 93 | 94 | This example archives a pocket RSS feed and an export file every 24 hours, and saves the output to a logfile. 
95 | ```bash 96 | 0 24 * * * yourusername /opt/bookmark-archiver/archive https://getpocket.com/users/yourusername/feed/all > /var/log/bookmark_archiver_rss.log 97 | 0 24 * * * yourusername /opt/bookmark-archiver/archive /home/darth-vader/Desktop/bookmarks.html > /var/log/bookmark_archiver_firefox.log 98 | ``` 99 | (Add the above lines to `/etc/crontab`) 100 | 101 | **Next Steps** 102 | 103 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom. 104 | If you'd like to customize options, see the [Configuration](#configuration) section. 105 | 106 | If you want something easier than running programs in the command-line, take a look at [Pocket Premium](https://getpocket.com/premium) (yay Mozilla!) and [Pinboard Pro](https://pinboard.in/upgrade/) (yay independent developer!). Both offer easy-to-use bookmark archiving with full-text-search and other features. 107 | 108 | ## Details 109 | 110 | `archive.py` is a script that takes a [Pocket-format](https://getpocket.com/export), [JSON-format](https://pinboard.in/export/), [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or RSS-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online. 111 | 112 | The archiver produces an output folder `output/` containing an `index.html`, `index.json`, and archived copies of all the sites, 113 | organized by timestamp bookmarked. It's Powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good 'ol `wget`. 114 | 115 | For each sites it saves: 116 | 117 | - wget of site, e.g. `en.wikipedia.org/wiki/Example.html` with .html appended if not present 118 | - `output.pdf` Printed PDF of site using headless chrome 119 | - `screenshot.png` 1440x900 screenshot of site using headless chrome 120 | - `output.html` DOM Dump of the HTML after rendering using headless chrome 121 | - `archive.org.txt` A link to the saved site on archive.org 122 | - `audio/` and `video/` for sites like youtube, soundcloud, etc. (using youtube-dl) (WIP) 123 | - `code/` clone of any repository for github, bitbucket, or gitlab links (WIP) 124 | - `index.json` JSON index containing link info and archive details 125 | - `index.html` HTML index containing link info and archive details (optional fancy or simple index) 126 | 127 | Wget doesn't work on sites you need to be logged into, but chrome headless does, see the [Configuration](#configuration)* section for `CHROME_USER_DATA_DIR`. 128 | 129 | **Large Exports & Estimated Runtime:** 130 | 131 | I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB. 132 | Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV. 133 | 134 | You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files: 135 | ```bash 136 | ./archive export.html 1498800000 & # second argument is timestamp to resume downloading from 137 | ./archive export.html 1498810000 & 138 | ./archive export.html 1498820000 & 139 | ./archive export.html 1498830000 & 140 | ``` 141 | Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running). 
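If you don't want to pick the resume timestamps by hand, here's a rough sketch (assuming you've already run `./archive` once so `output/index.json` exists) that picks evenly spaced timestamps from the index and prints one command per parallel worker, in the same style as the example above:

```python
import json

N_WORKERS = 4  # number of parallel ./archive processes to launch

# read the timestamps of all links already in the index
with open('output/index.json', 'r', encoding='utf-8') as f:
    links = json.load(f)['links']

timestamps = sorted((link['timestamp'] for link in links), key=float)
chunk = max(1, len(timestamps) // N_WORKERS)

# each worker resumes from a different point in the archive
for start in timestamps[::chunk][:N_WORKERS]:
    print('./archive export.html {} &'.format(start))
```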
142 | 143 | ## Configuration 144 | 145 | You can tweak parameters via environment variables, or by editing `config.py` directly: 146 | ```bash 147 | env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./archive ~/Downloads/bookmarks_export.html 148 | ``` 149 | 150 | **Shell Options:** 151 | - colorize console ouput: `USE_COLOR` value: [`True`]/`False` 152 | - show progress bar: `SHOW_PROGRESS` value: [`True`]/`False` 153 | - archive permissions: `OUTPUT_PERMISSIONS` values: [`755`]/`644`/`...` 154 | 155 | **Dependency Options:** 156 | - path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/google-chrome`/`...` 157 | - path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...` 158 | 159 | **Archive Options:** 160 | - maximum allowed download time per link: `TIMEOUT` values: [`60`]/`30`/`...` 161 | - archive methods (values: [`True`]/`False`): 162 | - fetch page with wget: `FETCH_WGET` 163 | - fetch images/css/js with wget: `FETCH_WGET_REQUISITES` (True is highly recommended) 164 | - print page as PDF: `FETCH_PDF` 165 | - fetch a screenshot of the page: `FETCH_SCREENSHOT` 166 | - fetch a DOM dump of the page: `FETCH_DOM` 167 | - fetch a favicon for the page: `FETCH_FAVICON` 168 | - submit the page to archive.org: `SUBMIT_ARCHIVE_DOT_ORG` 169 | - screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...` 170 | - user agent: `WGET_USER_AGENT` values: [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/`...` 171 | - chrome profile: `CHROME_USER_DATA_DIR` values: [`~/Library/Application\ Support/Google/Chrome/Default`]/`/tmp/chrome-profile`/`...` 172 | To capture sites that require a user to be logged in, you must specify a path to a chrome profile (which loads the cookies needed for the user to be logged in). If you don't have an existing chrome profile, create one with `chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make Bookmark Archiver use that profile. 173 | 174 | (See defaults & more at the top of `config.py`) 175 | 176 | To tweak the outputted html index file's look and feel, just edit the HTML files in `archiver/templates/`. 177 | 178 | The chrome/chromium dependency is _optional_ and only required for screenshots, PDF, and DOM dump output, it can be safely ignored if those three methods are disabled. 179 | 180 | ## Publishing Your Archive 181 | 182 | The archive produced by `./archive` is suitable for serving on any provider that can host static html (e.g. github pages!). 183 | 184 | You can also serve it from a home server or VPS by uploading the outputted `output` folder to your web directory, e.g. `/var/www/bookmark-archiver` and configuring your webserver. 185 | 186 | Here's a sample nginx configuration that works to serve archive folders: 187 | 188 | ```nginx 189 | location / { 190 | alias /path/to/bookmark-archiver/output/; 191 | index index.html; 192 | autoindex on; # see directory listing upon clicking "The Files" links 193 | try_files $uri $uri/ =404; 194 | } 195 | ``` 196 | 197 | Make sure you're not running any content as CGI or PHP, you only want to serve static files! 198 | 199 | Urls look like: `https://archive.example.com/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html` 200 | 201 | **Security WARNING & Content Disclaimer** 202 | 203 | Re-hosting other people's content has security implications for any other sites sharing your hosting domain. 
Make sure you understand 204 | the dangers of hosting unknown archived CSS & JS files [on your shared domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy). 205 | Due to the security risk of serving some malicious JS you archived by accident, it's best to put this on a domain or subdomain 206 | of its own to keep cookies separate and slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery) and other nastiness. 207 | 208 | You may also want to blacklist your archive in `/robots.txt` if you don't want to be publicly assosciated with all the links you archive via search engine results. 209 | 210 | Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons, 211 | it's up to you to host responsibly and respond to takedown requests appropriately. 212 | 213 | Please modify the `FOOTER_INFO` config variable to add your contact info to the footer of your index. 214 | 215 | ## Info & Motivation 216 | 217 | This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!). 218 | I got tired of sites I saved going offline or changing their URLS, so I started 219 | archiving a copy of them locally now, similar to The Way-Back Machine provided 220 | by [archive.org](https://archive.org). Self hosting your own archive allows you to save 221 | PDFs & Screenshots of dynamic sites in addition to static html, something archive.org doesn't do. 222 | 223 | Now I can rest soundly knowing important articles and resources I like wont dissapear off the internet. 224 | 225 | My published archive as an example: [archive.sweeting.me](https://archive.sweeting.me). 226 | 227 | ## Manual Setup 228 | 229 | If you don't like running random setup scripts off the internet (:+1:), you can follow these manual setup instructions. 230 | 231 | **1. Install dependencies:** `chromium >= 59`,` wget >= 1.16`, `python3 >= 3.5` (`google-chrome >= v59` works fine as well) 232 | 233 | If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions). 234 | 235 | ```bash 236 | # On Mac: 237 | brew cask install chromium # If you already have Google Chrome/Chromium in /Applications/, skip this command 238 | brew install wget python3 239 | 240 | echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser # see instructions for google-chrome below 241 | chmod +x /usr/local/bin/chromium-browser 242 | ``` 243 | 244 | ```bash 245 | # On Ubuntu/Debian: 246 | apt install chromium-browser python3 wget 247 | ``` 248 | 249 | ```bash 250 | # Check that everything worked: 251 | chromium-browser --version && which wget && which python3 && which curl && echo "[√] All dependencies installed." 252 | ``` 253 | 254 | **2. Get your bookmark export file:** 255 | 256 | Follow the instruction links above in the "Quickstart" section to download your bookmarks export file. 257 | 258 | **3. Run the archive script:** 259 | 260 | 1. Clone this repo `git clone https://github.com/pirate/bookmark-archiver` 261 | 3. `cd bookmark-archiver/` 262 | 4. `./archive ~/Downloads/bookmarks_export.html` 263 | 264 | You may optionally specify a second argument to `archive.py export.html 153242424324` to resume the archive update at a specific timestamp. 265 | 266 | If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom. 
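The archiver also runs its own dependency checks (via `util.check_dependencies`) before downloading anything, but if you'd like to verify your setup from Python first, a rough equivalent of the `[√] All dependencies installed` check above looks like this (an illustrative sketch, not the project's actual implementation):

```python
import shutil
from subprocess import run, PIPE

# check that the external tools ./archive shells out to are installed and runnable
for binary in ('chromium-browser', 'wget', 'curl', 'python3'):
    path = shutil.which(binary)
    if not path:
        print('[X] missing {} -- see the install commands above'.format(binary))
        continue
    version = run([binary, '--version'], stdout=PIPE).stdout.decode().split('\n')[0]
    print('[√] {} ({})'.format(path, version))
```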
267 | 268 | ### Google Chrome Instructions: 269 | 270 | I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google. 271 | Chromium may have some issues rendering some sites though, so you're welcome to try Google-chrome instead. 272 | It's also easier to use Google Chrome if you already have it installed, rather than downloading Chromium all over. 273 | 274 | 1. Install & link google-chrome 275 | ```bash 276 | # On Mac: 277 | # If you already have Google Chrome in /Applications/, skip this brew command 278 | brew cask install google-chrome 279 | brew install wget python3 280 | 281 | echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/google-chrome 282 | chmod +x /usr/local/bin/google-chrome 283 | ``` 284 | 285 | ```bash 286 | # On Linux: 287 | wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add - 288 | sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list' 289 | apt update; apt install google-chrome-beta python3 wget 290 | ``` 291 | 292 | 2. Set the environment variable `CHROME_BINARY` to `google-chrome` before running: 293 | 294 | ```bash 295 | env CHROME_BINARY=google-chrome ./archive ~/Downloads/bookmarks_export.html 296 | ``` 297 | If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below. 298 | 299 | ## Troubleshooting 300 | 301 | ### Dependencies 302 | 303 | **Python:** 304 | 305 | On some Linux distributions the python3 package might not be recent enough. 306 | If this is the case for you, resort to installing a recent enough version manually. 307 | ```bash 308 | add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6 309 | ``` 310 | If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start. 311 | 312 | **Chromium/Google Chrome:** 313 | 314 | `archive.py` depends on being able to access a `chromium-browser`/`google-chrome` executable. The executable used 315 | defaults to `chromium-browser` but can be manually specified with the environment variable `CHROME_BINARY`: 316 | 317 | ```bash 318 | env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive ~/Downloads/bookmarks_export.html 319 | ``` 320 | 321 | 1. Test to make sure you have Chrome on your `$PATH` with: 322 | 323 | ```bash 324 | which chromium-browser || which google-chrome 325 | ``` 326 | If no executable is displayed, follow the setup instructions to install and link one of them. 327 | 328 | 2. If a path is displayed, the next step is to check that it's runnable: 329 | 330 | ```bash 331 | chromium-browser --version || google-chrome --version 332 | ``` 333 | If no version is displayed, try the setup instructions again, or confirm that you have permission to access chrome. 334 | 335 | 3. If a version is displayed and it's `<59`, upgrade it: 336 | 337 | ```bash 338 | apt upgrade chromium-browser -y 339 | # OR 340 | brew cask upgrade chromium-browser 341 | ``` 342 | 343 | 4. 
If a version is displayed and it's `>=59`, make sure `archive.py` is running the right one: 344 | 345 | ```bash 346 | env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive bookmarks_export.html # replace the path with the one you got from step 1 347 | ``` 348 | 349 | 350 | **Wget & Curl:** 351 | 352 | If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice. 353 | See the "Manual Setup" instructions for more details. 354 | 355 | If wget times out or randomly fails to download some sites that you have confirmed are online, 356 | upgrade wget to the most recent version with `brew upgrade wget` or `apt upgrade wget`. There is 357 | a bug in versions `<=1.19.1_1` that caused wget to fail for perfectly valid sites. 358 | 359 | ### Archiving 360 | 361 | **No links parsed from export file:** 362 | 363 | Please open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of where you got the export, and 364 | preferrably your export file attached (you can redact the links). We'll fix the parser to support your format. 365 | 366 | **Lots of skipped sites:** 367 | 368 | If you ran the archiver once, it wont re-download sites subsequent times, it will only download new links. 369 | If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct. 370 | You can check the `archive.py` output or `index.html` to see what links it's downloading. 371 | 372 | If you're still having issues, try deleting or moving the `output/archive` folder (back it up first!) and running `./archive` again. 373 | 374 | **Lots of errors:** 375 | 376 | Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally. 377 | Open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of the errors if you're still having problems. 378 | 379 | **Lots of broken links from the index:** 380 | 381 | Not all sites can be effectively archived with each method, that's why it's best to use a combination of `wget`, PDFs, and screenshots. 382 | If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/bookmark-archiver/issues) 383 | with some of the URLs that failed to be archived and I'll investigate. 384 | 385 | ### Hosting the Archive 386 | 387 | If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL. 388 | If you don't, google around, there are plenty of tutorials to help get that set up. Open an [issue](https://github.com/pirate/bookmark-archiver/issues) 389 | if you have problem with a particular nginx config. 390 | 391 | ## Roadmap 392 | 393 | If you feel like contributing a PR, some of these tasks are pretty easy. Feel free to open an issue if you need help getting started in any way! 
394 | 395 | - download closed-captions text from youtube videos 396 | - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/) 397 | - auto-tagging based on important extracted words 398 | - audio & video archiving with `youtube-dl` 399 | - full-text indexing with elasticsearch/elasticlunr/ag 400 | - video closed-caption downloading for full-text indexing video content 401 | - automatic text summaries of article with summarization library 402 | - feature image extraction 403 | - http support (from my https-only domain) 404 | - try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader) 405 | - live updating from pocket/pinboard 406 | 407 | It's possible to pull links via the pocket API or public pocket RSS feeds instead of downloading an html export. 408 | Once I write a script to do that, we can stick this in `cron` and have it auto-update on it's own. 409 | 410 | For now you just have to download `ril_export.html` and run `archive.py` each time it updates. The script 411 | will run fast subsequent times because it only downloads new links that haven't been archived already. 412 | 413 | ## Links 414 | 415 | **Similar Projects:** 416 | - [Memex by Worldbrain.io](https://github.com/WorldBrain/Memex) a browser extension that saves all your history and does full-text search 417 | - [Hypothes.is](https://web.hypothes.is/) a web/pdf/ebook annotation tool that also archives content 418 | - [Perkeep](https://perkeep.org/) "Perkeep lets you permanently keep your stuff, for life." 419 | - [Fetching.io](http://fetching.io/) A personal search engine/archiver that lets you search through all archived websites that you've bookmarked 420 | - [Shaarchiver](https://github.com/nodiscc/shaarchiver) very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index 421 | - [Webrecorder.io](https://webrecorder.io/) Save full browsing sessions and archive all the content 422 | - [Wallabag](https://wallabag.org) Save articles you read locally or on your phone 423 | 424 | **Discussions:** 425 | - [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133) 426 | - [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/) 427 | - [Reddit r/datahoarder Discussion #1](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/) 428 | - [Reddit r/datahoarder Discussion #2](https://www.reddit.com/r/DataHoarder/comments/6kepv6/bookmarkarchiver_now_supports_archiving_all_major/) 429 | 430 | 431 | **Tools/Other:** 432 | - https://github.com/ikreymer/webarchiveplayer#auto-load-warcs 433 | - [Sheetsee-Pocket](http://jlord.us/sheetsee-pocket/) project that provides a pretty auto-updating index of your Pocket links (without archiving them) 434 | - [Pocket -> IFTTT -> Dropbox](https://christopher.su/2013/saving-pocket-links-file-day-dropbox-ifttt-launchd/) Post by Christopher Su on his Pocket saving IFTTT recipie 435 | 436 | ## Changelog 437 | 438 | - v0.1.0 released 439 | - support for browser history exporting added with `./bin/export-browser-history` 440 | - support for chrome `--dump-dom` to output full page HTML after JS executes 441 | - v0.0.3 released 442 | - support for chrome `--user-data-dir` to archive sites that need logins 443 | - fancy individual html & json indexes for each link 444 | - smartly append new links to 
existing index instead of overwriting 445 | - v0.0.2 released 446 | - proper HTML templating instead of format strings (thanks to https://github.com/bardisty!) 447 | - refactored into separate files, wip audio & video archiving 448 | - v0.0.1 released 449 | - Index links now work without nginx url rewrites, archive can now be hosted on github pages 450 | - added setup.sh script & docstrings & help commands 451 | - made Chromium the default instead of Google Chrome (yay free software) 452 | - added [env-variable](https://github.com/pirate/bookmark-archiver/pull/25) configuration (thanks to https://github.com/hannah98!) 453 | - renamed from **Pocket Archive Stream** -> **Bookmark Archiver** 454 | - added [Netscape-format](https://github.com/pirate/bookmark-archiver/pull/20) export support (thanks to https://github.com/ilvar!) 455 | - added [Pinboard-format](https://github.com/pirate/bookmark-archiver/pull/7) export support (thanks to https://github.com/sconeyard!) 456 | - front-page of HN, oops! apparently I have users to support now :grin:? 457 | - added Pocket-format export support 458 | - v0.0.0 released: created Pocket Archive Stream 2017/05/05 459 | 460 | ## Donations 461 | 462 | This project can really flourish with some more engineering effort, but unless it can support 463 | me financially I'm unlikely to be able to take it to the next level alone. It's already pretty 464 | functional and robust, but it really deserves to be taken to the next level with a few more 465 | talented engineers. If you or your foundation wants to sponsor this project long-term, contact 466 | me at bookmark-archiver@sweeting.me. 467 | 468 | [Grants / Donations](https://github.com/pirate/bookmark-archiver/blob/master/donate.md) 469 | -------------------------------------------------------------------------------- /archive: -------------------------------------------------------------------------------- 1 | bin/bookmark-archiver -------------------------------------------------------------------------------- /archiver/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TarekJor/bookmark-archiver/80c02bed5548b3429128cc699c2d462f51cd2df2/archiver/__init__.py -------------------------------------------------------------------------------- /archiver/archive.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # Bookmark Archiver 3 | # Nick Sweeting 2017 | MIT License 4 | # https://github.com/pirate/bookmark-archiver 5 | 6 | import os 7 | import sys 8 | 9 | from datetime import datetime 10 | from subprocess import run 11 | 12 | from parse import parse_links 13 | from links import validate_links 14 | from archive_methods import archive_links, _RESULTS_TOTALS 15 | from index import ( 16 | write_links_index, 17 | write_link_index, 18 | parse_json_links_index, 19 | parse_json_link_index, 20 | ) 21 | from config import ( 22 | OUTPUT_PERMISSIONS, 23 | OUTPUT_DIR, 24 | ANSI, 25 | TIMEOUT, 26 | GIT_SHA, 27 | ) 28 | from util import ( 29 | download_url, 30 | progress, 31 | cleanup_archive, 32 | pretty_path, 33 | migrate_data, 34 | ) 35 | 36 | __AUTHOR__ = 'Nick Sweeting ' 37 | __VERSION__ = GIT_SHA 38 | __DESCRIPTION__ = 'Bookmark Archiver: Create a browsable html archive of a list of links.' 
39 | __DOCUMENTATION__ = 'https://github.com/pirate/bookmark-archiver' 40 | 41 | def print_help(): 42 | print(__DESCRIPTION__) 43 | print("Documentation: {}\n".format(__DOCUMENTATION__)) 44 | print("Usage:") 45 | print(" ./bin/bookmark-archiver ~/Downloads/bookmarks_export.html\n") 46 | 47 | 48 | def merge_links(archive_path=OUTPUT_DIR, import_path=None): 49 | """get new links from file and optionally append them to links in existing archive""" 50 | all_links = [] 51 | if import_path: 52 | # parse and validate the import file 53 | raw_links = parse_links(import_path) 54 | all_links = validate_links(raw_links) 55 | 56 | # merge existing links in archive_path and new links 57 | existing_links = [] 58 | if archive_path: 59 | existing_links = parse_json_links_index(archive_path) 60 | all_links = validate_links(existing_links + all_links) 61 | 62 | num_new_links = len(all_links) - len(existing_links) 63 | if num_new_links: 64 | print('[{green}+{reset}] [{}] Adding {} new links from {} to {}/index.json'.format( 65 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 66 | num_new_links, 67 | pretty_path(import_path), 68 | pretty_path(archive_path), 69 | **ANSI, 70 | )) 71 | # else: 72 | # print('[*] [{}] No new links added to {}/index.json{}'.format( 73 | # datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 74 | # archive_path, 75 | # ' from {}'.format(import_path) if import_path else '', 76 | # **ANSI, 77 | # )) 78 | 79 | return all_links 80 | 81 | def update_archive(archive_path, links, source=None, resume=None, append=True): 82 | """update or create index.html+json given a path to an export file containing new links""" 83 | 84 | start_ts = datetime.now().timestamp() 85 | 86 | if resume: 87 | print('{green}[▶] [{}] Resuming archive downloading from {}...{reset}'.format( 88 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 89 | resume, 90 | **ANSI, 91 | )) 92 | else: 93 | print('{green}[▶] [{}] Updating files for {} links in archive...{reset}'.format( 94 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 95 | len(links), 96 | **ANSI, 97 | )) 98 | 99 | # loop over links and archive them 100 | archive_links(archive_path, links, source=source, resume=resume) 101 | 102 | # print timing information & summary 103 | end_ts = datetime.now().timestamp() 104 | seconds = end_ts - start_ts 105 | if seconds > 60: 106 | duration = '{0:.2f} min'.format(seconds / 60, 2) 107 | else: 108 | duration = '{0:.2f} sec'.format(seconds, 2) 109 | 110 | print('{}[√] [{}] Update of {} links complete ({}){}'.format( 111 | ANSI['green'], 112 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 113 | len(links), 114 | duration, 115 | ANSI['reset'], 116 | )) 117 | print(' - {} entries skipped'.format(_RESULTS_TOTALS['skipped'])) 118 | print(' - {} entries updated'.format(_RESULTS_TOTALS['succeded'])) 119 | print(' - {} errors'.format(_RESULTS_TOTALS['failed'])) 120 | 121 | 122 | if __name__ == '__main__': 123 | argc = len(sys.argv) 124 | 125 | if set(sys.argv).intersection(('-h', '--help', 'help')): 126 | print_help() 127 | raise SystemExit(0) 128 | 129 | migrate_data() 130 | 131 | source = sys.argv[1] if argc > 1 else None # path of links file to import 132 | resume = sys.argv[2] if argc > 2 else None # timestamp to resume dowloading from 133 | 134 | if argc == 1: 135 | source, resume = None, None 136 | elif argc == 2: 137 | if all(d.isdigit() for d in sys.argv[1].split('.')): 138 | # argv[1] is a resume timestamp 139 | source, resume = None, sys.argv[1] 140 | else: 141 | # argv[1] is a path to a file to import 142 | source, resume = 
sys.argv[1].strip(), None 143 | elif argc == 3: 144 | source, resume = sys.argv[1].strip(), sys.argv[2] 145 | else: 146 | print_help() 147 | raise SystemExit(1) 148 | 149 | # See if archive folder already exists 150 | for out_dir in (OUTPUT_DIR, 'bookmarks', 'pocket', 'pinboard', 'html'): 151 | if os.path.exists(out_dir): 152 | break 153 | else: 154 | out_dir = OUTPUT_DIR 155 | 156 | # Step 0: Download url to local file (only happens if a URL is specified instead of local path) 157 | if source and any(source.startswith(s) for s in ('http://', 'https://', 'ftp://')): 158 | source = download_url(source) 159 | 160 | # Step 1: Parse the links and dedupe them with existing archive 161 | links = merge_links(archive_path=out_dir, import_path=source) 162 | 163 | # Step 2: Write new index 164 | write_links_index(out_dir=out_dir, links=links) 165 | 166 | # Step 3: Verify folder structure is 1:1 with index 167 | # cleanup_archive(out_dir, links) 168 | 169 | # Step 4: Run the archive methods for each link 170 | update_archive(out_dir, links, source=source, resume=resume, append=True) 171 | -------------------------------------------------------------------------------- /archiver/archive_methods.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | 4 | from functools import wraps 5 | from collections import defaultdict 6 | from datetime import datetime 7 | from subprocess import run, PIPE, DEVNULL 8 | 9 | from peekable import Peekable 10 | 11 | from index import wget_output_path, parse_json_link_index, write_link_index 12 | from links import links_after_timestamp 13 | from config import ( 14 | CHROME_BINARY, 15 | FETCH_WGET, 16 | FETCH_WGET_REQUISITES, 17 | FETCH_PDF, 18 | FETCH_SCREENSHOT, 19 | FETCH_DOM, 20 | RESOLUTION, 21 | CHECK_SSL_VALIDITY, 22 | SUBMIT_ARCHIVE_DOT_ORG, 23 | FETCH_AUDIO, 24 | FETCH_VIDEO, 25 | FETCH_FAVICON, 26 | WGET_USER_AGENT, 27 | CHROME_USER_DATA_DIR, 28 | TIMEOUT, 29 | ANSI, 30 | ARCHIVE_DIR, 31 | ) 32 | from util import ( 33 | check_dependencies, 34 | progress, 35 | chmod_file, 36 | pretty_path, 37 | ) 38 | 39 | 40 | _RESULTS_TOTALS = { # globals are bad, mmkay 41 | 'skipped': 0, 42 | 'succeded': 0, 43 | 'failed': 0, 44 | } 45 | 46 | def archive_links(archive_path, links, source=None, resume=None): 47 | check_dependencies() 48 | 49 | to_archive = Peekable(links_after_timestamp(links, resume)) 50 | idx, link = 0, to_archive.peek(0) 51 | 52 | try: 53 | for idx, link in enumerate(to_archive): 54 | link_dir = os.path.join(ARCHIVE_DIR, link['timestamp']) 55 | archive_link(link_dir, link) 56 | 57 | except (KeyboardInterrupt, SystemExit, Exception) as e: 58 | print('{lightyellow}[X] [{now}] Downloading paused on link {timestamp} ({idx}/{total}){reset}'.format( 59 | **ANSI, 60 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 61 | idx=idx+1, 62 | timestamp=link['timestamp'], 63 | total=len(links), 64 | )) 65 | print(' Continue where you left off by running:') 66 | print(' {} {}'.format( 67 | pretty_path(sys.argv[0]), 68 | link['timestamp'], 69 | )) 70 | if not isinstance(e, KeyboardInterrupt): 71 | raise e 72 | raise SystemExit(1) 73 | 74 | 75 | def archive_link(link_dir, link, overwrite=True): 76 | """download the DOM, PDF, and a screenshot into a folder named after the link's timestamp""" 77 | 78 | update_existing = os.path.exists(link_dir) 79 | if update_existing: 80 | link = { 81 | **parse_json_link_index(link_dir), 82 | **link, 83 | } 84 | else: 85 | os.makedirs(link_dir) 86 | 87 | log_link_archive(link_dir, 
link, update_existing) 88 | 89 | if FETCH_WGET: 90 | link = fetch_wget(link_dir, link, overwrite=overwrite) 91 | 92 | if FETCH_PDF: 93 | link = fetch_pdf(link_dir, link, overwrite=overwrite) 94 | 95 | if FETCH_SCREENSHOT: 96 | link = fetch_screenshot(link_dir, link, overwrite=overwrite) 97 | 98 | if FETCH_DOM: 99 | link = fetch_dom(link_dir, link, overwrite=overwrite) 100 | 101 | if SUBMIT_ARCHIVE_DOT_ORG: 102 | link = archive_dot_org(link_dir, link, overwrite=overwrite) 103 | 104 | # if FETCH_AUDIO: 105 | # link = fetch_audio(link_dir, link, overwrite=overwrite) 106 | 107 | # if FETCH_VIDEO: 108 | # link = fetch_video(link_dir, link, overwrite=overwrite) 109 | 110 | if FETCH_FAVICON: 111 | link = fetch_favicon(link_dir, link, overwrite=overwrite) 112 | 113 | write_link_index(link_dir, link) 114 | # print() 115 | 116 | return link 117 | 118 | def log_link_archive(link_dir, link, update_existing): 119 | print('[{symbol_color}{symbol}{reset}] [{now}] "{title}"\n {blue}{url}{reset}'.format( 120 | symbol='*' if update_existing else '+', 121 | symbol_color=ANSI['black' if update_existing else 'green'], 122 | now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 123 | **link, 124 | **ANSI, 125 | )) 126 | 127 | print(' > {}{}'.format(pretty_path(link_dir), '' if update_existing else ' (new)')) 128 | if link['type']: 129 | print(' i {}'.format(link['type'])) 130 | 131 | 132 | 133 | def attach_result_to_link(method): 134 | """ 135 | Instead of returning a result={output:'...', status:'success'} object, 136 | attach that result to the links's history & latest fields, then return 137 | the updated link object. 138 | """ 139 | def decorator(fetch_func): 140 | @wraps(fetch_func) 141 | def timed_fetch_func(link_dir, link, overwrite=False, **kwargs): 142 | # initialize methods and history json field on link 143 | link['latest'] = link.get('latest') or {} 144 | link['latest'][method] = link['latest'].get(method) or None 145 | link['history'] = link.get('history') or {} 146 | link['history'][method] = link['history'].get(method) or [] 147 | 148 | start_ts = datetime.now().timestamp() 149 | 150 | # if a valid method output is already present, dont run the fetch function 151 | if link['latest'][method] and not overwrite: 152 | print(' √ {}'.format(method)) 153 | result = None 154 | else: 155 | print(' > {}'.format(method)) 156 | result = fetch_func(link_dir, link, **kwargs) 157 | 158 | end_ts = datetime.now().timestamp() 159 | duration = str(end_ts * 1000 - start_ts * 1000).split('.')[0] 160 | 161 | # append a history item recording fail/success 162 | history_entry = { 163 | 'timestamp': str(start_ts).split('.')[0], 164 | } 165 | if result is None: 166 | history_entry['status'] = 'skipped' 167 | elif isinstance(result.get('output'), Exception): 168 | history_entry['status'] = 'failed' 169 | history_entry['duration'] = duration 170 | history_entry.update(result or {}) 171 | link['history'][method].append(history_entry) 172 | else: 173 | history_entry['status'] = 'succeded' 174 | history_entry['duration'] = duration 175 | history_entry.update(result or {}) 176 | link['history'][method].append(history_entry) 177 | link['latest'][method] = result['output'] 178 | 179 | _RESULTS_TOTALS[history_entry['status']] += 1 180 | 181 | return link 182 | return timed_fetch_func 183 | return decorator 184 | 185 | 186 | @attach_result_to_link('wget') 187 | def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT): 188 | """download full site using wget""" 189 | 190 | domain_dir = os.path.join(link_dir, 
link['domain']) 191 | existing_file = wget_output_path(link) 192 | if os.path.exists(domain_dir) and existing_file: 193 | return {'output': existing_file, 'status': 'skipped'} 194 | 195 | CMD = [ 196 | # WGET CLI Docs: https://www.gnu.org/software/wget/manual/wget.html 197 | *'wget -N -E -np -x -H -k -K -S --restrict-file-names=unix'.split(' '), 198 | *(('-p',) if FETCH_WGET_REQUISITES else ()), 199 | *(('--user-agent="{}"'.format(WGET_USER_AGENT),) if WGET_USER_AGENT else ()), 200 | *((() if CHECK_SSL_VALIDITY else ('--no-check-certificate',))), 201 | link['url'], 202 | ] 203 | end = progress(timeout, prefix=' ') 204 | try: 205 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # index.html 206 | end() 207 | output = wget_output_path(link, look_in=domain_dir) 208 | 209 | # Check for common failure cases 210 | if result.returncode > 0: 211 | print(' got wget response code {}:'.format(result.returncode)) 212 | if result.returncode != 8: 213 | print('\n'.join(' ' + line for line in (result.stderr or result.stdout).decode().rsplit('\n', 10)[-10:] if line.strip())) 214 | if b'403: Forbidden' in result.stderr: 215 | raise Exception('403 Forbidden (try changing WGET_USER_AGENT)') 216 | if b'404: Not Found' in result.stderr: 217 | raise Exception('404 Not Found') 218 | if b'ERROR 500: Internal Server Error' in result.stderr: 219 | raise Exception('500 Internal Server Error') 220 | if result.returncode == 4: 221 | raise Exception('Failed wget download') 222 | except Exception as e: 223 | end() 224 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 225 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 226 | output = e 227 | 228 | return { 229 | 'cmd': CMD, 230 | 'output': output, 231 | } 232 | 233 | 234 | @attach_result_to_link('pdf') 235 | def fetch_pdf(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR): 236 | """print PDF of site to file using chrome --headless""" 237 | 238 | if link['type'] in ('PDF', 'image'): 239 | return {'output': wget_output_path(link)} 240 | 241 | if os.path.exists(os.path.join(link_dir, 'output.pdf')): 242 | return {'output': 'output.pdf', 'status': 'skipped'} 243 | 244 | CMD = [ 245 | *chrome_headless(user_data_dir=user_data_dir), 246 | '--print-to-pdf', 247 | link['url'] 248 | ] 249 | end = progress(timeout, prefix=' ') 250 | try: 251 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.pdf 252 | end() 253 | if result.returncode: 254 | print(' ', (result.stderr or result.stdout).decode()) 255 | raise Exception('Failed to print PDF') 256 | chmod_file('output.pdf', cwd=link_dir) 257 | output = 'output.pdf' 258 | except Exception as e: 259 | end() 260 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 261 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 262 | output = e 263 | 264 | return { 265 | 'cmd': CMD, 266 | 'output': output, 267 | } 268 | 269 | @attach_result_to_link('screenshot') 270 | def fetch_screenshot(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR, resolution=RESOLUTION): 271 | """take screenshot of site using chrome --headless""" 272 | 273 | if link['type'] in ('PDF', 'image'): 274 | return {'output': wget_output_path(link)} 275 | 276 | if os.path.exists(os.path.join(link_dir, 'screenshot.png')): 277 | return {'output': 'screenshot.png', 'status': 'skipped'} 278 | 279 | CMD = [ 280 | 
*chrome_headless(user_data_dir=user_data_dir), 281 | '--screenshot', 282 | '--window-size={}'.format(resolution), 283 | '--hide-scrollbars', 284 | # '--full-page', # TODO: make this actually work using ./bin/screenshot fullPage: true 285 | link['url'], 286 | ] 287 | end = progress(timeout, prefix=' ') 288 | try: 289 | result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # sreenshot.png 290 | end() 291 | if result.returncode: 292 | print(' ', (result.stderr or result.stdout).decode()) 293 | raise Exception('Failed to take screenshot') 294 | chmod_file('screenshot.png', cwd=link_dir) 295 | output = 'screenshot.png' 296 | except Exception as e: 297 | end() 298 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 299 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 300 | output = e 301 | 302 | return { 303 | 'cmd': CMD, 304 | 'output': output, 305 | } 306 | 307 | @attach_result_to_link('dom') 308 | def fetch_dom(link_dir, link, timeout=TIMEOUT, user_data_dir=CHROME_USER_DATA_DIR): 309 | """print HTML of site to file using chrome --dump-html""" 310 | 311 | if link['type'] in ('PDF', 'image'): 312 | return {'output': wget_output_path(link)} 313 | 314 | output_path = os.path.join(link_dir, 'output.html') 315 | 316 | if os.path.exists(output_path): 317 | return {'output': 'output.html', 'status': 'skipped'} 318 | 319 | CMD = [ 320 | *chrome_headless(user_data_dir=user_data_dir), 321 | '--dump-dom', 322 | link['url'] 323 | ] 324 | end = progress(timeout, prefix=' ') 325 | try: 326 | with open(output_path, 'w+') as f: 327 | result = run(CMD, stdout=f, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # output.html 328 | end() 329 | if result.returncode: 330 | print(' ', (result.stderr).decode()) 331 | raise Exception('Failed to fetch DOM') 332 | chmod_file('output.html', cwd=link_dir) 333 | output = 'output.html' 334 | except Exception as e: 335 | end() 336 | print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 337 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 338 | output = e 339 | 340 | return { 341 | 'cmd': CMD, 342 | 'output': output, 343 | } 344 | 345 | @attach_result_to_link('archive_org') 346 | def archive_dot_org(link_dir, link, timeout=TIMEOUT): 347 | """submit site to archive.org for archiving via their service, save returned archive url""" 348 | 349 | path = os.path.join(link_dir, 'archive.org.txt') 350 | if os.path.exists(path): 351 | archive_org_url = open(path, 'r').read().strip() 352 | return {'output': archive_org_url, 'status': 'skipped'} 353 | 354 | submit_url = 'https://web.archive.org/save/{}'.format(link['url'].split('?', 1)[0]) 355 | 356 | success = False 357 | CMD = ['curl', '-I', submit_url] 358 | end = progress(timeout, prefix=' ') 359 | try: 360 | result = run(CMD, stdout=PIPE, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # archive.org.txt 361 | end() 362 | 363 | # Parse archive.org response headers 364 | headers = defaultdict(list) 365 | 366 | # lowercase all the header names and store in dict 367 | for header in result.stdout.splitlines(): 368 | if b':' not in header or not header.strip(): 369 | continue 370 | name, val = header.decode().split(':', 1) 371 | headers[name.lower().strip()].append(val.strip()) 372 | 373 | # Get successful archive url in "content-location" header or any errors 374 | content_location = headers['content-location'] 375 | errors = headers['x-archive-wayback-runtime-error'] 376 | 
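        # Outcomes of the archive.org submission: a Content-Location header means the
        # snapshot was accepted and its path is returned; a lone RobotAccessControlException
        # means the site's robots.txt blocks the Wayback Machine, so the bare submit URL is
        # kept; any other error (or a missing header) is treated as a failed submission.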
377 | if content_location: 378 | saved_url = 'https://web.archive.org{}'.format(content_location[0]) 379 | success = True 380 | elif len(errors) == 1 and 'RobotAccessControlException' in errors[0]: 381 | output = submit_url 382 | # raise Exception('Archive.org denied by {}/robots.txt'.format(link['domain'])) 383 | elif errors: 384 | raise Exception(', '.join(errors)) 385 | else: 386 | raise Exception('Failed to find "content-location" URL header in Archive.org response.') 387 | except Exception as e: 388 | end() 389 | print(' Visit url to see output:', ' '.join(CMD)) 390 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 391 | output = e 392 | 393 | if success: 394 | with open(os.path.join(link_dir, 'archive.org.txt'), 'w', encoding='utf-8') as f: 395 | f.write(saved_url) 396 | chmod_file('archive.org.txt', cwd=link_dir) 397 | output = saved_url 398 | 399 | return { 400 | 'cmd': CMD, 401 | 'output': output, 402 | } 403 | 404 | @attach_result_to_link('favicon') 405 | def fetch_favicon(link_dir, link, timeout=TIMEOUT): 406 | """download site favicon from google's favicon api""" 407 | 408 | if os.path.exists(os.path.join(link_dir, 'favicon.ico')): 409 | return {'output': 'favicon.ico', 'status': 'skipped'} 410 | 411 | CMD = ['curl', 'https://www.google.com/s2/favicons?domain={domain}'.format(**link)] 412 | fout = open('{}/favicon.ico'.format(link_dir), 'w') 413 | end = progress(timeout, prefix=' ') 414 | try: 415 | run(CMD, stdout=fout, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # favicon.ico 416 | fout.close() 417 | end() 418 | chmod_file('favicon.ico', cwd=link_dir) 419 | output = 'favicon.ico' 420 | except Exception as e: 421 | fout.close() 422 | end() 423 | print(' Run to see full output:', ' '.join(CMD)) 424 | print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 425 | output = e 426 | 427 | return { 428 | 'cmd': CMD, 429 | 'output': output, 430 | } 431 | 432 | # @attach_result_to_link('audio') 433 | # def fetch_audio(link_dir, link, timeout=TIMEOUT): 434 | # """Download audio rip using youtube-dl""" 435 | 436 | # if link['type'] not in ('soundcloud',)\ 437 | # and 'audio' not in link['tags']: 438 | # return 439 | 440 | # path = os.path.join(link_dir, 'audio') 441 | 442 | # if not os.path.exists(path) or overwrite: 443 | # print(' - Downloading audio') 444 | # CMD = [ 445 | # "youtube-dl -x --audio-format mp3 --audio-quality 0 -o '%(title)s.%(ext)s'", 446 | # link['url'], 447 | # ] 448 | # end = progress(timeout, prefix=' ') 449 | # try: 450 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # audio/audio.mp3 451 | # end() 452 | # if result.returncode: 453 | # print(' ', result.stderr.decode()) 454 | # raise Exception('Failed to download audio') 455 | # chmod_file('audio.mp3', cwd=link_dir) 456 | # return 'audio.mp3' 457 | # except Exception as e: 458 | # end() 459 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 460 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 461 | # raise 462 | # else: 463 | # print(' √ Skipping audio download') 464 | 465 | # @attach_result_to_link('video') 466 | # def fetch_video(link_dir, link, timeout=TIMEOUT): 467 | # """Download video rip using youtube-dl""" 468 | 469 | # if link['type'] not in ('youtube', 'youku', 'vimeo')\ 470 | # and 'video' not in link['tags']: 471 | # return 472 | 473 | # path = os.path.join(link_dir, 'video') 474 | 475 | # if not 
os.path.exists(path) or overwrite: 476 | # print(' - Downloading video') 477 | # CMD = [ 478 | # "youtube-dl -x --video-format mp4 --audio-quality 0 -o '%(title)s.%(ext)s'", 479 | # link['url'], 480 | # ] 481 | # end = progress(timeout, prefix=' ') 482 | # try: 483 | # result = run(CMD, stdout=DEVNULL, stderr=DEVNULL, cwd=link_dir, timeout=timeout + 1) # video/movie.mp4 484 | # end() 485 | # if result.returncode: 486 | # print(' ', result.stderr.decode()) 487 | # raise Exception('Failed to download video') 488 | # chmod_file('video.mp4', cwd=link_dir) 489 | # return 'video.mp4' 490 | # except Exception as e: 491 | # end() 492 | # print(' Run to see full output:', 'cd {}; {}'.format(link_dir, ' '.join(CMD))) 493 | # print(' {}Failed: {} {}{}'.format(ANSI['red'], e.__class__.__name__, e, ANSI['reset'])) 494 | # raise 495 | # else: 496 | # print(' √ Skipping video download') 497 | 498 | 499 | def chrome_headless(binary=CHROME_BINARY, user_data_dir=CHROME_USER_DATA_DIR): 500 | args = [binary, '--headless'] # '--disable-gpu' 501 | default_profile = os.path.expanduser('~/Library/Application Support/Google/Chrome/Default') 502 | if user_data_dir: 503 | args.append('--user-data-dir={}'.format(user_data_dir)) 504 | elif os.path.exists(default_profile): 505 | args.append('--user-data-dir={}'.format(default_profile)) 506 | return args 507 | -------------------------------------------------------------------------------- /archiver/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import shutil 4 | 5 | from subprocess import run, PIPE 6 | 7 | # ****************************************************************************** 8 | # * TO SET YOUR CONFIGURATION, EDIT THE VALUES BELOW, or use the 'env' command * 9 | # * e.g. * 10 | # * env USE_COLOR=True CHROME_BINARY=google-chrome ./archive.py export.html * 11 | # ****************************************************************************** 12 | 13 | IS_TTY = sys.stdout.isatty() 14 | USE_COLOR = os.getenv('USE_COLOR', str(IS_TTY) ).lower() == 'true' 15 | SHOW_PROGRESS = os.getenv('SHOW_PROGRESS', str(IS_TTY) ).lower() == 'true' 16 | FETCH_WGET = os.getenv('FETCH_WGET', 'True' ).lower() == 'true' 17 | FETCH_WGET_REQUISITES = os.getenv('FETCH_WGET_REQUISITES', 'True' ).lower() == 'true' 18 | FETCH_AUDIO = os.getenv('FETCH_AUDIO', 'False' ).lower() == 'true' 19 | FETCH_VIDEO = os.getenv('FETCH_VIDEO', 'False' ).lower() == 'true' 20 | FETCH_PDF = os.getenv('FETCH_PDF', 'True' ).lower() == 'true' 21 | FETCH_SCREENSHOT = os.getenv('FETCH_SCREENSHOT', 'True' ).lower() == 'true' 22 | FETCH_DOM = os.getenv('FETCH_DOM', 'True' ).lower() == 'true' 23 | FETCH_FAVICON = os.getenv('FETCH_FAVICON', 'True' ).lower() == 'true' 24 | SUBMIT_ARCHIVE_DOT_ORG = os.getenv('SUBMIT_ARCHIVE_DOT_ORG', 'True' ).lower() == 'true' 25 | RESOLUTION = os.getenv('RESOLUTION', '1440,1200' ) 26 | CHECK_SSL_VALIDITY = os.getenv('CHECK_SSL_VALIDITY', 'True' ).lower() == 'true' 27 | OUTPUT_PERMISSIONS = os.getenv('OUTPUT_PERMISSIONS', '755' ) 28 | CHROME_BINARY = os.getenv('CHROME_BINARY', 'chromium-browser' ) # change to google-chrome browser if using google-chrome 29 | WGET_BINARY = os.getenv('WGET_BINARY', 'wget' ) 30 | WGET_USER_AGENT = os.getenv('WGET_USER_AGENT', 'Bookmark Archiver') 31 | CHROME_USER_DATA_DIR = os.getenv('CHROME_USER_DATA_DIR', None) 32 | TIMEOUT = int(os.getenv('TIMEOUT', '60')) 33 | FOOTER_INFO = os.getenv('FOOTER_INFO', 'Content is hosted for personal archiving purposes only. 
Contact server owner for any takedown requests.',) 34 | 35 | ### Paths 36 | REPO_DIR = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..')) 37 | 38 | OUTPUT_DIR = os.path.join(REPO_DIR, 'output') 39 | ARCHIVE_DIR = os.path.join(OUTPUT_DIR, 'archive') 40 | SOURCES_DIR = os.path.join(OUTPUT_DIR, 'sources') 41 | 42 | PYTHON_PATH = os.path.join(REPO_DIR, 'archiver') 43 | TEMPLATES_DIR = os.path.join(PYTHON_PATH, 'templates') 44 | 45 | # ****************************************************************************** 46 | # ********************** Do not edit below this point ************************** 47 | # ****************************************************************************** 48 | 49 | ### Terminal Configuration 50 | TERM_WIDTH = shutil.get_terminal_size((100, 10)).columns 51 | ANSI = { 52 | 'reset': '\033[00;00m', 53 | 'lightblue': '\033[01;30m', 54 | 'lightyellow': '\033[01;33m', 55 | 'lightred': '\033[01;35m', 56 | 'red': '\033[01;31m', 57 | 'green': '\033[01;32m', 58 | 'blue': '\033[01;34m', 59 | 'white': '\033[01;37m', 60 | 'black': '\033[01;30m', 61 | } 62 | if not USE_COLOR: 63 | # dont show colors if USE_COLOR is False 64 | ANSI = {k: '' for k in ANSI.keys()} 65 | 66 | ### Confirm Environment Setup 67 | try: 68 | GIT_SHA = run(["git", "rev-list", "-1", "HEAD", "./"], stdout=PIPE, cwd=REPO_DIR).stdout.strip().decode() 69 | except Exception: 70 | GIT_SHA = 'unknown' 71 | print('[!] Warning, you need git installed for some archiving features to save correct version numbers!') 72 | 73 | if sys.stdout.encoding.upper() != 'UTF-8': 74 | print('[X] Your system is running python3 scripts with a bad locale setting: {} (it should be UTF-8).'.format(sys.stdout.encoding)) 75 | print(' To fix it, add the line "export PYTHONIOENCODING=utf8" to your ~/.bashrc file (without quotes)') 76 | print('') 77 | print(' Confirm that it\'s fixed by opening a new shell and running:') 78 | print(' python3 -c "import sys; print(sys.stdout.encoding)" # should output UTF-8') 79 | print('') 80 | print(' Alternatively, run this script with:') 81 | print(' env PYTHONIOENCODING=utf8 ./archive.py export.html') 82 | -------------------------------------------------------------------------------- /archiver/index.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | 4 | from datetime import datetime 5 | from string import Template 6 | from distutils.dir_util import copy_tree 7 | 8 | from config import ( 9 | TEMPLATES_DIR, 10 | OUTPUT_PERMISSIONS, 11 | ANSI, 12 | GIT_SHA, 13 | FOOTER_INFO, 14 | ) 15 | from util import ( 16 | chmod_file, 17 | wget_output_path, 18 | derived_link_info, 19 | pretty_path, 20 | ) 21 | 22 | 23 | ### Homepage index for all the links 24 | 25 | def write_links_index(out_dir, links): 26 | """create index.html file for a given list of links""" 27 | 28 | if not os.path.exists(out_dir): 29 | os.makedirs(out_dir) 30 | 31 | write_json_links_index(out_dir, links) 32 | write_html_links_index(out_dir, links) 33 | 34 | print('{green}[√] [{}] Updated main index files:{reset}'.format( 35 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 36 | **ANSI)) 37 | print(' > {}/index.json'.format(pretty_path(out_dir))) 38 | print(' > {}/index.html'.format(pretty_path(out_dir))) 39 | 40 | def write_json_links_index(out_dir, links): 41 | """write the json link index to a given path""" 42 | 43 | path = os.path.join(out_dir, 'index.json') 44 | 45 | index_json = { 46 | 'info': 'Bookmark Archiver Index', 47 | 'help': 
'https://github.com/pirate/bookmark-archiver', 48 | 'version': GIT_SHA, 49 | 'num_links': len(links), 50 | 'updated': str(datetime.now().timestamp()), 51 | 'links': links, 52 | } 53 | 54 | with open(path, 'w', encoding='utf-8') as f: 55 | json.dump(index_json, f, indent=4, default=str) 56 | 57 | chmod_file(path) 58 | 59 | def parse_json_links_index(out_dir): 60 | """load the index in a given directory and merge it with the given link""" 61 | index_path = os.path.join(out_dir, 'index.json') 62 | if os.path.exists(index_path): 63 | with open(index_path, 'r', encoding='utf-8') as f: 64 | return json.load(f)['links'] 65 | 66 | return [] 67 | 68 | def write_html_links_index(out_dir, links): 69 | """write the html link index to a given path""" 70 | 71 | path = os.path.join(out_dir, 'index.html') 72 | 73 | copy_tree(os.path.join(TEMPLATES_DIR, 'static'), os.path.join(out_dir, 'static')) 74 | 75 | with open(os.path.join(TEMPLATES_DIR, 'index.html'), 'r', encoding='utf-8') as f: 76 | index_html = f.read() 77 | 78 | with open(os.path.join(TEMPLATES_DIR, 'index_row.html'), 'r', encoding='utf-8') as f: 79 | link_row_html = f.read() 80 | 81 | link_rows = '\n'.join( 82 | Template(link_row_html).substitute(**derived_link_info(link)) 83 | for link in links 84 | ) 85 | 86 | template_vars = { 87 | 'num_links': len(links), 88 | 'date_updated': datetime.now().strftime('%Y-%m-%d'), 89 | 'time_updated': datetime.now().strftime('%Y-%m-%d %H:%M'), 90 | 'footer_info': FOOTER_INFO, 91 | 'git_sha': GIT_SHA, 92 | 'short_git_sha': GIT_SHA[:8], 93 | 'rows': link_rows, 94 | } 95 | 96 | with open(path, 'w', encoding='utf-8') as f: 97 | f.write(Template(index_html).substitute(**template_vars)) 98 | 99 | chmod_file(path) 100 | 101 | 102 | ### Individual link index 103 | 104 | def write_link_index(out_dir, link): 105 | link['updated'] = str(datetime.now().timestamp()) 106 | write_json_link_index(out_dir, link) 107 | write_html_link_index(out_dir, link) 108 | 109 | def write_json_link_index(out_dir, link): 110 | """write a json file with some info about the link""" 111 | 112 | path = os.path.join(out_dir, 'index.json') 113 | 114 | print(' √ index.json') 115 | 116 | with open(path, 'w', encoding='utf-8') as f: 117 | json.dump(link, f, indent=4, default=str) 118 | 119 | chmod_file(path) 120 | 121 | def parse_json_link_index(out_dir): 122 | """load the json link index from a given directory""" 123 | existing_index = os.path.join(out_dir, 'index.json') 124 | if os.path.exists(existing_index): 125 | with open(existing_index, 'r', encoding='utf-8') as f: 126 | return json.load(f) 127 | return {} 128 | 129 | def write_html_link_index(out_dir, link): 130 | with open(os.path.join(TEMPLATES_DIR, 'link_index_fancy.html'), 'r', encoding='utf-8') as f: 131 | link_html = f.read() 132 | 133 | path = os.path.join(out_dir, 'index.html') 134 | 135 | print(' √ index.html') 136 | 137 | with open(path, 'w', encoding='utf-8') as f: 138 | f.write(Template(link_html).substitute({ 139 | **link, 140 | **link['latest'], 141 | 'type': link['type'] or 'website', 142 | 'tags': link['tags'] or 'untagged', 143 | 'bookmarked': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'), 144 | 'updated': datetime.fromtimestamp(float(link['updated'])).strftime('%Y-%m-%d %H:%M'), 145 | 'bookmarked_ts': link['timestamp'], 146 | 'updated_ts': link['updated'], 147 | 'archive_org': link['latest'].get('archive_org') or 'https://web.archive.org/save/{}'.format(link['url']), 148 | 'wget': link['latest'].get('wget') or wget_output_path(link), 149 | })) 
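    # note (added comment): string.Template.substitute() above fills the $title, $base_url, etc.
    # placeholders in link_index_fancy.html, and raises KeyError if the template references a key
    # that is not passed in the dict above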
150 | 151 | chmod_file(path) 152 | -------------------------------------------------------------------------------- /archiver/links.py: -------------------------------------------------------------------------------- 1 | """ 2 | In Bookmark Archiver, a Link represents a single entry that we track in the 3 | json index. All links pass through all archiver functions and the latest, 4 | most up-to-date canonical output for each is stored in "latest". 5 | 6 | 7 | Link { 8 | timestamp: str, (how we uniquely id links) _ _ _ _ ___ 9 | url: str, | \ / \ |\| ' | 10 | base_url: str, |_/ \_/ | | | 11 | domain: str, _ _ _ _ _ _ 12 | tags: str, |_) /| |\| | / ` 13 | type: str, | /"| | | | \_, 14 | title: str, ,-'"`-. 15 | sources: [str], /// / @ @ \ \\\\ 16 | latest: { \ :=| ,._,. |=: / 17 | ..., || ,\ \_../ /. || 18 | pdf: 'output.pdf', ||','`-._))'`.`|| 19 | wget: 'example.com/1234/index.html' `-' (/ `-' 20 | }, 21 | history: { 22 | ... 23 | pdf: [ 24 | {timestamp: 15444234325, status: 'skipped', result='output.pdf'}, 25 | ... 26 | ], 27 | wget: [ 28 | {timestamp: 11534435345, status: 'succeded', result='donuts.com/eat/them.html'} 29 | ] 30 | }, 31 | } 32 | 33 | """ 34 | 35 | import datetime 36 | from html import unescape 37 | 38 | from util import ( 39 | domain, 40 | base_url, 41 | str_between, 42 | get_link_type, 43 | merge_links, 44 | wget_output_path, 45 | ) 46 | from config import ANSI 47 | 48 | 49 | def validate_links(links): 50 | links = archivable_links(links) # remove chrome://, about:, mailto: etc. 51 | links = uniquefied_links(links) # merge/dedupe duplicate timestamps & urls 52 | links = sorted_links(links) # deterministically sort the links based on timstamp, url 53 | 54 | if not links: 55 | print('[X] No links found :(') 56 | raise SystemExit(1) 57 | 58 | for link in links: 59 | link['title'] = unescape(link['title']) 60 | link['latest'] = link.get('latest') or {} 61 | 62 | if not link['latest'].get('wget'): 63 | link['latest']['wget'] = wget_output_path(link) 64 | 65 | if not link['latest'].get('pdf'): 66 | link['latest']['pdf'] = None 67 | 68 | if not link['latest'].get('screenshot'): 69 | link['latest']['screenshot'] = None 70 | 71 | if not link['latest'].get('dom'): 72 | link['latest']['dom'] = None 73 | 74 | return list(links) 75 | 76 | 77 | def archivable_links(links): 78 | """remove chrome://, about:// or other schemed links that cant be archived""" 79 | return ( 80 | link 81 | for link in links 82 | if any(link['url'].startswith(s) for s in ('http://', 'https://', 'ftp://')) 83 | ) 84 | 85 | def uniquefied_links(sorted_links): 86 | """ 87 | ensures that all non-duplicate links have monotonically increasing timestamps 88 | """ 89 | 90 | unique_urls = {} 91 | 92 | lower = lambda url: url.lower().strip() 93 | without_www = lambda url: url.replace('://www.', '://', 1) 94 | without_trailing_slash = lambda url: url[:-1] if url[-1] == '/' else url.replace('/?', '?') 95 | 96 | for link in sorted_links: 97 | fuzzy_url = without_www(without_trailing_slash(lower(link['url']))) 98 | if fuzzy_url in unique_urls: 99 | # merge with any other links that share the same url 100 | link = merge_links(unique_urls[fuzzy_url], link) 101 | unique_urls[fuzzy_url] = link 102 | 103 | unique_timestamps = {} 104 | for link in unique_urls.values(): 105 | link['timestamp'] = lowest_uniq_timestamp(unique_timestamps, link['timestamp']) 106 | unique_timestamps[link['timestamp']] = link 107 | 108 | return unique_timestamps.values() 109 | 110 | def sorted_links(links): 111 | sort_func = lambda link: 
(link['timestamp'], link['url']) 112 | return sorted(links, key=sort_func, reverse=True) 113 | 114 | def links_after_timestamp(links, timestamp=None): 115 | if not timestamp: 116 | yield from links 117 | return 118 | 119 | for link in links: 120 | try: 121 | if float(link['timestamp']) <= float(timestamp): 122 | yield link 123 | except (ValueError, TypeError): 124 | print('Resume value and all timestamp values must be valid numbers.') 125 | 126 | def lowest_uniq_timestamp(used_timestamps, timestamp): 127 | """resolve duplicate timestamps by appending a decimal 1234, 1234 -> 1234.1, 1234.2""" 128 | 129 | timestamp = timestamp.split('.')[0] 130 | nonce = 0 131 | 132 | # first try 152323423 before 152323423.0 133 | if timestamp not in used_timestamps: 134 | return timestamp 135 | 136 | new_timestamp = '{}.{}'.format(timestamp, nonce) 137 | while new_timestamp in used_timestamps: 138 | nonce += 1 139 | new_timestamp = '{}.{}'.format(timestamp, nonce) 140 | 141 | return new_timestamp 142 | -------------------------------------------------------------------------------- /archiver/parse.py: -------------------------------------------------------------------------------- 1 | """ 2 | Everything related to parsing links from bookmark services. 3 | 4 | For a list of supported services, see the README.md. 5 | For examples of supported files see examples/. 6 | 7 | Parsed link schema: { 8 | 'url': 'https://example.com/example/?abc=123&xyc=345#lmnop', 9 | 'domain': 'example.com', 10 | 'base_url': 'example.com/example/', 11 | 'timestamp': '15442123124234', 12 | 'tags': 'abc,def', 13 | 'title': 'Example.com Page Title', 14 | 'sources': ['ril_export.html', 'downloads/getpocket.com.txt'], 15 | } 16 | """ 17 | 18 | import re 19 | import json 20 | import xml.etree.ElementTree as etree 21 | 22 | from datetime import datetime 23 | 24 | from util import ( 25 | domain, 26 | base_url, 27 | str_between, 28 | get_link_type, 29 | ) 30 | 31 | 32 | def get_parsers(file): 33 | """return all parsers that work on a given file, defaults to all of them""" 34 | 35 | return { 36 | 'pocket': parse_pocket_export, 37 | 'pinboard': parse_json_export, 38 | 'bookmarks': parse_bookmarks_export, 39 | 'rss': parse_rss_export, 40 | 'pinboard_rss': parse_pinboard_rss_feed, 41 | 'medium_rss': parse_medium_rss_feed, 42 | } 43 | 44 | def parse_links(path): 45 | """parse a list of links dictionaries from a bookmark export file""" 46 | 47 | links = [] 48 | with open(path, 'r', encoding='utf-8') as file: 49 | for parser_func in get_parsers(file).values(): 50 | # otherwise try all parsers until one works 51 | try: 52 | links += list(parser_func(file)) 53 | if links: 54 | break 55 | except (ValueError, TypeError, IndexError, AttributeError, etree.ParseError): 56 | # parser not supported on this file 57 | pass 58 | 59 | return links 60 | 61 | 62 | def parse_pocket_export(html_file): 63 | """Parse Pocket-format bookmarks export files (produced by getpocket.com/export/)""" 64 | 65 | html_file.seek(0) 66 | pattern = re.compile("^\\s*
<li><a href=\"(.+)\" time_added=\"(\\d+)\" tags=\"(.*?)\">(.+)</a></li>", re.UNICODE)
 67 |     for line in html_file:
 68 |         # example line
 69 |         # <li><a href="..." time_added="..." tags="...">example title</a></li>
  • 70 | match = pattern.search(line) 71 | if match: 72 | fixed_url = match.group(1).replace('http://www.readability.com/read?url=', '') # remove old readability prefixes to get original url 73 | time = datetime.fromtimestamp(float(match.group(2))) 74 | info = { 75 | 'url': fixed_url, 76 | 'domain': domain(fixed_url), 77 | 'base_url': base_url(fixed_url), 78 | 'timestamp': str(time.timestamp()), 79 | 'tags': match.group(3), 80 | 'title': match.group(4).replace(' — Readability', '').replace('http://www.readability.com/read?url=', '') or base_url(fixed_url), 81 | 'sources': [html_file.name], 82 | } 83 | info['type'] = get_link_type(info) 84 | yield info 85 | 86 | def parse_json_export(json_file): 87 | """Parse JSON-format bookmarks export files (produced by pinboard.in/export/)""" 88 | 89 | json_file.seek(0) 90 | json_content = json.load(json_file) 91 | for line in json_content: 92 | # example line 93 | # {"href":"http:\/\/www.reddit.com\/r\/example","description":"title here","extended":"","meta":"18a973f09c9cc0608c116967b64e0419","hash":"910293f019c2f4bb1a749fb937ba58e3","time":"2014-06-14T15:51:42Z","shared":"no","toread":"no","tags":"reddit android"}] 94 | if line: 95 | erg = line 96 | if erg.get('timestamp'): 97 | timestamp = str(erg['timestamp']/10000000) # chrome/ff histories use a very precise timestamp 98 | elif erg.get('time'): 99 | timestamp = str(datetime.strptime(erg['time'].split(',', 1)[0], '%Y-%m-%dT%H:%M:%SZ').timestamp()) 100 | else: 101 | timestamp = str(datetime.now().timestamp()) 102 | info = { 103 | 'url': erg['href'], 104 | 'domain': domain(erg['href']), 105 | 'base_url': base_url(erg['href']), 106 | 'timestamp': timestamp, 107 | 'tags': erg.get('tags') or '', 108 | 'title': (erg.get('description') or '').replace(' — Readability', ''), 109 | 'sources': [json_file.name], 110 | } 111 | info['type'] = get_link_type(info) 112 | yield info 113 | 114 | def parse_rss_export(rss_file): 115 | """Parse RSS XML-format files into links""" 116 | 117 | rss_file.seek(0) 118 | items = rss_file.read().split('\n') 119 | for item in items: 120 | # example item: 121 | # 122 | # <![CDATA[How JavaScript works: inside the V8 engine]]> 123 | # Unread 124 | # https://blog.sessionstack.com/how-javascript-works-inside 125 | # https://blog.sessionstack.com/how-javascript-works-inside 126 | # Mon, 21 Aug 2017 14:21:58 -0500 127 | # 128 | 129 | trailing_removed = item.split('', 1)[0] 130 | leading_removed = trailing_removed.split('', 1)[-1] 131 | rows = leading_removed.split('\n') 132 | 133 | def get_row(key): 134 | return [r for r in rows if r.startswith('<{}>'.format(key))][0] 135 | 136 | title = str_between(get_row('title'), '', '') 138 | ts_str = str_between(get_row('pubDate'), '', '') 139 | time = datetime.strptime(ts_str, "%a, %d %b %Y %H:%M:%S %z") 140 | 141 | info = { 142 | 'url': url, 143 | 'domain': domain(url), 144 | 'base_url': base_url(url), 145 | 'timestamp': str(time.timestamp()), 146 | 'tags': '', 147 | 'title': title, 148 | 'sources': [rss_file.name], 149 | } 150 | info['type'] = get_link_type(info) 151 | 152 | yield info 153 | 154 | def parse_bookmarks_export(html_file): 155 | """Parse netscape-format bookmarks export files (produced by all browsers)""" 156 | 157 | html_file.seek(0) 158 | pattern = re.compile("]*>(.+)", re.UNICODE | re.IGNORECASE) 159 | for line in html_file: 160 | # example line 161 | #
    example bookmark title 162 | 163 | match = pattern.search(line) 164 | if match: 165 | url = match.group(1) 166 | time = datetime.fromtimestamp(float(match.group(2))) 167 | 168 | info = { 169 | 'url': url, 170 | 'domain': domain(url), 171 | 'base_url': base_url(url), 172 | 'timestamp': str(time.timestamp()), 173 | 'tags': "", 174 | 'title': match.group(3), 175 | 'sources': [html_file.name], 176 | } 177 | info['type'] = get_link_type(info) 178 | 179 | yield info 180 | 181 | def parse_pinboard_rss_feed(rss_file): 182 | """Parse Pinboard RSS feed files into links""" 183 | 184 | rss_file.seek(0) 185 | root = etree.parse(rss_file).getroot() 186 | items = root.findall("{http://purl.org/rss/1.0/}item") 187 | for item in items: 188 | url = item.find("{http://purl.org/rss/1.0/}link").text 189 | tags = item.find("{http://purl.org/dc/elements/1.1/}subject").text 190 | title = item.find("{http://purl.org/rss/1.0/}title").text 191 | ts_str = item.find("{http://purl.org/dc/elements/1.1/}date").text 192 | # = 🌈🌈🌈🌈 193 | # = 🌈🌈🌈🌈 194 | # = 🏆🏆🏆🏆 195 | 196 | # Pinboard includes a colon in its date stamp timezone offsets, which 197 | # Python can't parse. Remove it: 198 | if ":" == ts_str[-3:-2]: 199 | ts_str = ts_str[:-3]+ts_str[-2:] 200 | time = datetime.strptime(ts_str, "%Y-%m-%dT%H:%M:%S%z") 201 | info = { 202 | 'url': url, 203 | 'domain': domain(url), 204 | 'base_url': base_url(url), 205 | 'timestamp': str(time.timestamp()), 206 | 'tags': tags, 207 | 'title': title, 208 | 'sources': [rss_file.name], 209 | } 210 | info['type'] = get_link_type(info) 211 | yield info 212 | 213 | def parse_medium_rss_feed(rss_file): 214 | """Parse Medium RSS feed files into links""" 215 | 216 | rss_file.seek(0) 217 | root = etree.parse(rss_file).getroot() 218 | items = root.find("channel").findall("item") 219 | for item in items: 220 | # for child in item: 221 | # print(child.tag, child.text) 222 | url = item.find("link").text 223 | title = item.find("title").text 224 | ts_str = item.find("pubDate").text 225 | time = datetime.strptime(ts_str, "%a, %d %b %Y %H:%M:%S %Z") 226 | info = { 227 | 'url': url, 228 | 'domain': domain(url), 229 | 'base_url': base_url(url), 230 | 'timestamp': str(time.timestamp()), 231 | 'tags': "", 232 | 'title': title, 233 | 'sources': [rss_file.name], 234 | } 235 | info['type'] = get_link_type(info) 236 | yield info 237 | -------------------------------------------------------------------------------- /archiver/peekable.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | 3 | _marker = object() 4 | 5 | class Peekable(object): 6 | """Peekable version of a normal python generator. 7 | Useful when you don't want to evaluate the entire iterable to look at 8 | a specific item at a given idx. 9 | """ 10 | def __init__(self, iterable): 11 | self._it = iter(iterable) 12 | self._cache = deque() 13 | 14 | def __iter__(self): 15 | return self 16 | 17 | def __bool__(self): 18 | try: 19 | self.peek() 20 | except StopIteration: 21 | return False 22 | return True 23 | 24 | def __nonzero__(self): 25 | # For Python 2 compatibility 26 | return self.__bool__() 27 | 28 | def peek(self, default=_marker): 29 | """Return the item that will be next returned from ``next()``. 30 | Return ``default`` if there are no items left. If ``default`` is not 31 | provided, raise ``StopIteration``. 
32 | """ 33 | if not self._cache: 34 | try: 35 | self._cache.append(next(self._it)) 36 | except StopIteration: 37 | if default is _marker: 38 | raise 39 | return default 40 | return self._cache[0] 41 | 42 | def prepend(self, *items): 43 | """Stack up items to be the next ones returned from ``next()`` or 44 | ``self.peek()``. The items will be returned in 45 | first in, first out order:: 46 | >>> p = peekable([1, 2, 3]) 47 | >>> p.prepend(10, 11, 12) 48 | >>> next(p) 49 | 10 50 | >>> list(p) 51 | [11, 12, 1, 2, 3] 52 | It is possible, by prepending items, to "resurrect" a peekable that 53 | previously raised ``StopIteration``. 54 | >>> p = peekable([]) 55 | >>> next(p) 56 | Traceback (most recent call last): 57 | ... 58 | StopIteration 59 | >>> p.prepend(1) 60 | >>> next(p) 61 | 1 62 | >>> next(p) 63 | Traceback (most recent call last): 64 | ... 65 | StopIteration 66 | """ 67 | self._cache.extendleft(reversed(items)) 68 | 69 | def __next__(self): 70 | if self._cache: 71 | return self._cache.popleft() 72 | 73 | return next(self._it) 74 | 75 | next = __next__ # For Python 2 compatibility 76 | 77 | def _get_slice(self, index): 78 | # Normalize the slice's arguments 79 | step = 1 if (index.step is None) else index.step 80 | if step > 0: 81 | start = 0 if (index.start is None) else index.start 82 | stop = maxsize if (index.stop is None) else index.stop 83 | elif step < 0: 84 | start = -1 if (index.start is None) else index.start 85 | stop = (-maxsize - 1) if (index.stop is None) else index.stop 86 | else: 87 | raise ValueError('slice step cannot be zero') 88 | 89 | # If either the start or stop index is negative, we'll need to cache 90 | # the rest of the iterable in order to slice from the right side. 91 | if (start < 0) or (stop < 0): 92 | self._cache.extend(self._it) 93 | # Otherwise we'll need to find the rightmost index and cache to that 94 | # point. 95 | else: 96 | n = min(max(start, stop) + 1, maxsize) 97 | cache_len = len(self._cache) 98 | if n >= cache_len: 99 | self._cache.extend(islice(self._it, n - cache_len)) 100 | 101 | return list(self._cache)[index] 102 | 103 | def __getitem__(self, index): 104 | if isinstance(index, slice): 105 | return self._get_slice(index) 106 | 107 | cache_len = len(self._cache) 108 | if index < 0: 109 | self._cache.extend(self._it) 110 | elif index >= cache_len: 111 | self._cache.extend(islice(self._it, index + 1 - cache_len)) 112 | 113 | return self._cache[index] 114 | -------------------------------------------------------------------------------- /archiver/templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Archived Sites 5 | 94 | 95 | 96 |
    97 |
    98 | 99 | 100 | 101 |
    102 | 103 | Github 104 | 105 |
    106 |
    107 |

    108 |  Archived Sites 109 |
    110 | 111 | Last updated $time_updated
    112 |
    113 |

    114 |
    115 |
    116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | $rows 130 |
    BookmarkedFilesSaved Link ($num_links)PNGPDFHTMLA.orgOriginal URL
    131 |
    132 |
    133 |
    134 | 135 | Archive created using Bookmark Archiver 136 | version $short_git_sha   |   137 | Download index as JSON 138 |

    139 | $footer_info 140 |
    141 |
    142 |
    143 |
    144 | 145 | 146 | -------------------------------------------------------------------------------- /archiver/templates/index_row.html: -------------------------------------------------------------------------------- 1 | 2 | $date 3 | 4 | 5 | 6 | 7 | 8 | 9 | $title $tags 10 | 11 | 🖼 12 | 📜 13 | 📄 14 | 🏛 15 | $url 16 | 17 | -------------------------------------------------------------------------------- /archiver/templates/link_index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | $title 5 | 6 | 7 | 8 |
    9 |

    10 | $title
    11 | 12 | $base_url 13 | 14 |

    15 |
    16 |
    17 |
    18 | Tags: $tags
    19 | Type: $type
    20 |
    21 | Bookmarked:
    22 | $bookmarked
    23 | Archived:
    24 | $updated
    25 |
    26 |
    27 |
      28 |
    • 29 | Original
      30 | $base_url
        31 |
    • 32 |
    • 33 | Local Archive
      34 | archive/$timestamp/$domain
        35 |
    • 36 |
    • 37 | PDF
      38 | archive/$timestamp/output.pdf
        39 |
    • 40 |
    • 41 | Screenshot
      42 | archive/$timestamp/screenshot.png
        43 |
    • 44 |
    • 45 | HTML
      46 | archive/$timestamp/output.html
        47 |
    • 48 |
    • 49 | Archive.Org
      50 | web.archive.org/web/$base_url
        51 |
    • 52 |
    53 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /archiver/templates/link_index_fancy.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | $title 5 | 155 | 156 | 157 | 158 |
    159 |

    160 | 161 | Archive Icon 162 | 163 | 164 | ▾ 165 | 166 | $title
    167 | 168 | $base_url 169 | 170 |

    171 |
    172 | 266 | 267 | 268 | 272 | 273 | 274 | 316 | 317 | 318 | -------------------------------------------------------------------------------- /archiver/templates/static/archive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TarekJor/bookmark-archiver/80c02bed5548b3429128cc699c2d462f51cd2df2/archiver/templates/static/archive.png -------------------------------------------------------------------------------- /archiver/templates/static/external.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TarekJor/bookmark-archiver/80c02bed5548b3429128cc699c2d462f51cd2df2/archiver/templates/static/external.png -------------------------------------------------------------------------------- /archiver/templates/static/spinner.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TarekJor/bookmark-archiver/80c02bed5548b3429128cc699c2d462f51cd2df2/archiver/templates/static/spinner.gif -------------------------------------------------------------------------------- /archiver/tests/firefox_export.html: -------------------------------------------------------------------------------- 1 | 2 | 5 | 6 | Bookmarks 7 |

    Bookmarks Menu

    8 | 9 |

    10 |

    Recently Bookmarked 11 |
    Recent Tags 12 |

    Mozilla Firefox

    13 |

    14 |

    Help and Tutorials 15 |
    Customize Firefox 16 |
    Get Involved 17 |
    About Us 18 |

    19 |

    [Folder Name]

    20 |

    21 |

    firefox export bookmarks at DuckDuckGo 22 |
    archive firefox bookmarks at DuckDuckGo 23 |
    nodiscc (nodiscc) · GitHub 24 |
    pirate/bookmark-archiver · Github 25 |
    Phonotactic Reconstruction of Encrypted VoIP Conversations 26 |
    Firefox Bookmarks Archiver - gHacks Tech News 27 |

    28 |

    Bookmarks Toolbar

    29 |
    Add bookmarks to this folder to see them displayed on the Bookmarks Toolbar 30 |

    31 |

    Most Visited 32 |
    Getting Started 33 |

    34 |

    35 | -------------------------------------------------------------------------------- /archiver/tests/pinboard_export.json: -------------------------------------------------------------------------------- 1 | [{"href":"https:\/\/en.wikipedia.org\/wiki\/International_Typographic_Style","description":"International Typographic Style - Wikipedia, the free encyclopedia","extended":"","meta":"32f4cc916e6f5919cc19aceb10559cc1","hash":"3dd64e155e16731d20350bec6bef7cb5","time":"2016-06-07T11:27:08Z","shared":"no","toread":"yes","tags":""}, 2 | {"href":"https:\/\/news.ycombinator.com\/item?id=11686984","description":"Announcing Certbot: EFF's Client for Let's Encrypt | Hacker News","extended":"","meta":"4a49602ba5d20ec3505c75d38ebc1d63","hash":"1c1acb53a5bd520e8529ce4f9600abee","time":"2016-05-13T05:46:16Z","shared":"no","toread":"yes","tags":""}, 3 | {"href":"https:\/\/github.com\/google\/styleguide","description":"GitHub - google\/styleguide: Style guides for Google-originated open-source projects","extended":"","meta":"15a8d50f7295f18ccb6dd19cb689c68a","hash":"1028bf9872d8e4ea1b1858f4044abb58","time":"2016-02-24T08:49:25Z","shared":"no","toread":"no","tags":"code.style.guide programming reference web.dev"}, 4 | {"href":"http:\/\/en.wikipedia.org\/wiki\/List_of_XML_and_HTML_character_entity_references","description":"List of XML and HTML character entity references - Wikipedia, the free encyclopedia","extended":"","meta":"6683a70f0f59c92c0bfd0bce653eab69","hash":"344d975c6251a8d460971fa2c43d9bbb","time":"2014-06-16T04:17:15Z","shared":"no","toread":"no","tags":"html reference web.dev typography"}, 5 | {"href":"https:\/\/pushover.net\/","description":"Pushover: Simple Notifications for Android, iOS, and Desktop","extended":"","meta":"1e68511234d9390d10b7772c8ccc4b9e","hash":"bb93374ead8a937b18c7c46e13168a7d","time":"2014-06-14T15:51:42Z","shared":"no","toread":"no","tags":"app android"}, 6 | {"href":"http:\/\/www.reddit.com\/r\/Android","description":"r\/android","extended":"","meta":"18a973f09c9cc0608c116967b64e0419","hash":"910293f019c2f4bb1a749fb937ba58e3","time":"2014-06-14T15:51:42Z","shared":"no","toread":"no","tags":"reddit android 1"}, 7 | {"href":"http:\/\/www.reddit.com\/r\/Android2","description":"r\/android","extended":"","meta":"18a973f09c9cc0608c116967b64e0419","hash":"910293f019c2f4bb1a749fb937ba58e2","time":"2014-06-14T15:51:42Z","shared":"no","toread":"no","tags":"reddit android 2"}, 8 | {"href":"http:\/\/www.reddit.com\/r\/Android3","description":"r\/android","extended":"","meta":"18a973f09c9cc0608c116967b64e0419","hash":"910293f019c2f4bb1a749fb937ba58e4","time":"2014-06-14T15:51:42Z","shared":"no","toread":"no","tags":"reddit android 3"}] 9 | -------------------------------------------------------------------------------- /archiver/tests/pocket_export.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Pocket Export 7 | 8 | 9 |

    Unread

    10 | 22 | 23 |

    Read Archive

    24 | 37 | 38 | 39 | -------------------------------------------------------------------------------- /archiver/tests/rss_export.xml: -------------------------------------------------------------------------------- 1 | 7 | 8 | 9 | 10 | My Reading List: Read and Unread 11 | Items I've saved to read 12 | http://readitlaterlist.com/users/nikisweeting/feed/all 13 | 14 | 15 | 16 | 17 | <![CDATA[Cell signaling]]> 18 | Unread 19 | https://en.wikipedia.org/wiki/Cell_signaling 20 | https://en.wikipedia.org/wiki/Cell_signaling 21 | Mon, 30 Oct 2017 01:12:10 -0500 22 | 23 | 24 | <![CDATA[Hayflick limit]]> 25 | Unread 26 | https://en.wikipedia.org/wiki/Hayflick_limit 27 | https://en.wikipedia.org/wiki/Hayflick_limit 28 | Mon, 30 Oct 2017 01:11:38 -0500 29 | 30 | 31 | <![CDATA[Even moderate drinking by parents can upset children – study]]> 32 | Unread 33 | https://theguardian.com/society/2017/oct/18/even-moderate-drinking-by-parents-can-upset-children-study?CMP=Share_AndroidApp_Signal 34 | https://theguardian.com/society/2017/oct/18/even-moderate-drinking-by-parents-can-upset-children-study?CMP=Share_AndroidApp_Signal 35 | Mon, 30 Oct 2017 01:11:30 -0500 36 | 37 | 38 | <![CDATA[How Merkle trees enable the decentralized Web]]> 39 | Unread 40 | https://taravancil.com/blog/how-merkle-trees-enable-decentralized-web 41 | https://taravancil.com/blog/how-merkle-trees-enable-decentralized-web 42 | Mon, 30 Oct 2017 01:11:30 -0500 43 | 44 | 45 | <![CDATA[Inertial navigation system]]> 46 | Unread 47 | https://en.wikipedia.org/wiki/Inertial_navigation_system 48 | https://en.wikipedia.org/wiki/Inertial_navigation_system 49 | Mon, 30 Oct 2017 01:10:10 -0500 50 | 51 | 52 | <![CDATA[Dead reckoning]]> 53 | Unread 54 | https://en.wikipedia.org/wiki/Dead_reckoning 55 | https://en.wikipedia.org/wiki/Dead_reckoning 56 | Mon, 30 Oct 2017 01:10:08 -0500 57 | 58 | 59 | <![CDATA[Calling Rust From Python]]> 60 | Unread 61 | https://bheisler.github.io/post/calling-rust-in-python 62 | https://bheisler.github.io/post/calling-rust-in-python 63 | Mon, 30 Oct 2017 01:04:33 -0500 64 | 65 | 66 | <![CDATA[Why would anyone choose Docker over fat binaries?]]> 67 | Unread 68 | http://smashcompany.com/technology/why-would-anyone-choose-docker-over-fat-binaries 69 | http://smashcompany.com/technology/why-would-anyone-choose-docker-over-fat-binaries 70 | Sun, 29 Oct 2017 14:57:25 -0500 71 | 72 | 73 | <![CDATA[]]> 74 | Unread 75 | https://heml.io 76 | https://heml.io 77 | Sun, 29 Oct 2017 14:55:26 -0500 78 | 79 | 80 | <![CDATA[A surprising amount of people want to be in North Korea]]> 81 | Unread 82 | https://blog.benjojo.co.uk/post/north-korea-dprk-bgp-geoip-fruad 83 | https://blog.benjojo.co.uk/post/north-korea-dprk-bgp-geoip-fruad 84 | Sat, 28 Oct 2017 05:41:41 -0500 85 | 86 | 87 | <![CDATA[Learning a Hierarchy]]> 88 | Unread 89 | https://blog.openai.com/learning-a-hierarchy 90 | https://blog.openai.com/learning-a-hierarchy 91 | Thu, 26 Oct 2017 16:43:48 -0500 92 | 93 | 94 | <![CDATA[High Performance Browser Networking]]> 95 | Unread 96 | https://hpbn.co 97 | https://hpbn.co 98 | Wed, 25 Oct 2017 19:05:24 -0500 99 | 100 | 101 | <![CDATA[What tender and juicy drama is going on at your school/workplace?]]> 102 | Unread 103 | https://reddit.com/r/AskReddit/comments/78nc2a/what_tender_and_juicy_drama_is_going_on_at_your/dovab2v 104 | https://reddit.com/r/AskReddit/comments/78nc2a/what_tender_and_juicy_drama_is_going_on_at_your/dovab2v 105 | Wed, 25 Oct 2017 18:05:58 -0500 106 | 107 | 108 | <![CDATA[Using an SSH Bastion Host]]> 109 | Unread 
110 | https://blog.scottlowe.org/2015/11/21/using-ssh-bastion-host 111 | https://blog.scottlowe.org/2015/11/21/using-ssh-bastion-host 112 | Wed, 25 Oct 2017 11:38:47 -0500 113 | 114 | 115 | <![CDATA[Let's Define "undefined" | NathanShane.me]]> 116 | Unread 117 | https://nathanshane.me/blog/let's-define-undefined 118 | https://nathanshane.me/blog/let's-define-undefined 119 | Wed, 25 Oct 2017 11:32:59 -0500 120 | 121 | 122 | <![CDATA[Control theory]]> 123 | Unread 124 | https://en.wikipedia.org/wiki/Control_theory#Closed-loop_transfer_function 125 | https://en.wikipedia.org/wiki/Control_theory#Closed-loop_transfer_function 126 | Tue, 24 Oct 2017 22:57:43 -0500 127 | 128 | 129 | <![CDATA[J012-86-intractable.pdf]]> 130 | Unread 131 | http://mit.edu/~jnt/Papers/J012-86-intractable.pdf 132 | http://mit.edu/~jnt/Papers/J012-86-intractable.pdf 133 | Tue, 24 Oct 2017 22:56:32 -0500 134 | 135 | 136 | <![CDATA[Dynamic Programming: First Principles]]> 137 | Unread 138 | http://flawlessrhetoric.com/Dynamic-Programming-First-Principles 139 | http://flawlessrhetoric.com/Dynamic-Programming-First-Principles 140 | Tue, 24 Oct 2017 22:56:30 -0500 141 | 142 | 143 | <![CDATA[What Would Happen If There Were No Number 6?]]> 144 | Unread 145 | https://fivethirtyeight.com/features/what-would-happen-if-there-were-no-number-6 146 | https://fivethirtyeight.com/features/what-would-happen-if-there-were-no-number-6 147 | Tue, 24 Oct 2017 22:21:59 -0500 148 | 149 | 150 | <![CDATA[Ten Basic Rules for Adventure]]> 151 | Unread 152 | https://outsideonline.com/2252916/10-basic-rules-adventure 153 | https://outsideonline.com/2252916/10-basic-rules-adventure 154 | Tue, 24 Oct 2017 20:56:25 -0500 155 | 156 | 157 | <![CDATA[Insects Are In Serious Trouble]]> 158 | Unread 159 | https://theatlantic.com/science/archive/2017/10/oh-no/543390?single_page=true 160 | https://theatlantic.com/science/archive/2017/10/oh-no/543390?single_page=true 161 | Mon, 23 Oct 2017 23:10:10 -0500 162 | 163 | 164 | <![CDATA[Netflix/bless]]> 165 | Unread 166 | https://github.com/Netflix/bless 167 | https://github.com/Netflix/bless 168 | Mon, 23 Oct 2017 23:04:46 -0500 169 | 170 | 171 | <![CDATA[Getting Your First 10 Customers]]> 172 | Unread 173 | https://stripe.com/atlas/guides/starting-sales 174 | https://stripe.com/atlas/guides/starting-sales 175 | Mon, 23 Oct 2017 22:27:36 -0500 176 | 177 | 178 | <![CDATA[GPS Hardware]]> 179 | Unread 180 | https://novasummits.com/gps-hardware 181 | https://novasummits.com/gps-hardware 182 | Mon, 23 Oct 2017 04:44:40 -0500 183 | 184 | 185 | <![CDATA[Bicycle Tires and Tubes]]> 186 | Unread 187 | http://sheldonbrown.com/tires.html#pressure 188 | http://sheldonbrown.com/tires.html#pressure 189 | Mon, 23 Oct 2017 01:28:32 -0500 190 | 191 | 192 | <![CDATA[Tire light is on]]> 193 | Unread 194 | https://reddit.com/r/Justrolledintotheshop/comments/77zm9e/tire_light_is_on/doqbshe 195 | https://reddit.com/r/Justrolledintotheshop/comments/77zm9e/tire_light_is_on/doqbshe 196 | Mon, 23 Oct 2017 01:21:42 -0500 197 | 198 | 199 | <![CDATA[Bad_Salish_Boo ?? 
on Twitter]]> 200 | Unread 201 | https://t.co/PDLlNjACv9 202 | https://t.co/PDLlNjACv9 203 | Sat, 21 Oct 2017 06:48:07 -0500 204 | 205 | 206 | <![CDATA[Is an Open Marriage a Happier Marriage?]]> 207 | Unread 208 | https://nytimes.com/2017/05/11/magazine/is-an-open-marriage-a-happier-marriage.html 209 | https://nytimes.com/2017/05/11/magazine/is-an-open-marriage-a-happier-marriage.html 210 | Fri, 20 Oct 2017 13:08:52 -0500 211 | 212 | 213 | <![CDATA[The Invention of Monogamy]]> 214 | Unread 215 | https://thenib.com/the-invention-of-monogamy 216 | https://thenib.com/the-invention-of-monogamy 217 | Fri, 20 Oct 2017 12:19:00 -0500 218 | 219 | 220 | <![CDATA[Google Chrome May Add a Permission to Stop In-Browser Cryptocurrency Miners]]> 221 | Unread 222 | https://bleepingcomputer.com/news/google/google-chrome-may-add-a-permission-to-stop-in-browser-cryptocurrency-miners 223 | https://bleepingcomputer.com/news/google/google-chrome-may-add-a-permission-to-stop-in-browser-cryptocurrency-miners 224 | Fri, 20 Oct 2017 03:57:41 -0500 225 | 226 | 227 | 228 | 229 | -------------------------------------------------------------------------------- /archiver/util.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import time 5 | import json 6 | import requests 7 | 8 | from datetime import datetime 9 | from subprocess import run, PIPE, DEVNULL 10 | from multiprocessing import Process 11 | from urllib.parse import quote 12 | 13 | from config import ( 14 | IS_TTY, 15 | OUTPUT_PERMISSIONS, 16 | REPO_DIR, 17 | SOURCES_DIR, 18 | OUTPUT_DIR, 19 | ARCHIVE_DIR, 20 | TIMEOUT, 21 | TERM_WIDTH, 22 | SHOW_PROGRESS, 23 | ANSI, 24 | CHROME_BINARY, 25 | FETCH_WGET, 26 | FETCH_PDF, 27 | FETCH_SCREENSHOT, 28 | FETCH_DOM, 29 | FETCH_FAVICON, 30 | FETCH_AUDIO, 31 | FETCH_VIDEO, 32 | SUBMIT_ARCHIVE_DOT_ORG, 33 | ) 34 | 35 | # URL helpers 36 | without_scheme = lambda url: url.replace('http://', '').replace('https://', '').replace('ftp://', '') 37 | without_query = lambda url: url.split('?', 1)[0] 38 | without_hash = lambda url: url.split('#', 1)[0] 39 | without_path = lambda url: url.split('/', 1)[0] 40 | domain = lambda url: without_hash(without_query(without_path(without_scheme(url)))) 41 | base_url = lambda url: without_scheme(url) # uniq base url used to dedupe links 42 | 43 | short_ts = lambda ts: ts.split('.')[0] 44 | 45 | 46 | def check_dependencies(): 47 | """Check that all necessary dependencies are installed, and have valid versions""" 48 | 49 | python_vers = float('{}.{}'.format(sys.version_info.major, sys.version_info.minor)) 50 | if python_vers < 3.5: 51 | print('{}[X] Python version is not new enough: {} (>3.5 is required){}'.format(ANSI['red'], python_vers, ANSI['reset'])) 52 | print(' See https://github.com/pirate/bookmark-archiver#troubleshooting for help upgrading your Python installation.') 53 | raise SystemExit(1) 54 | 55 | if FETCH_PDF or FETCH_SCREENSHOT or FETCH_DOM: 56 | if run(['which', CHROME_BINARY], stdout=DEVNULL).returncode: 57 | print('{}[X] Missing dependency: {}{}'.format(ANSI['red'], CHROME_BINARY, ANSI['reset'])) 58 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format(CHROME_BINARY)) 59 | print(' See https://github.com/pirate/bookmark-archiver for help.') 60 | raise SystemExit(1) 61 | 62 | # parse chrome --version e.g. 
Google Chrome 61.0.3114.0 canary / Chromium 59.0.3029.110 built on Ubuntu, running on Ubuntu 16.04 63 | try: 64 | result = run([CHROME_BINARY, '--version'], stdout=PIPE) 65 | version_str = result.stdout.decode('utf-8') 66 | version_lines = re.sub("(Google Chrome|Chromium) (\\d+?)\\.(\\d+?)\\.(\\d+?).*?$", "\\2", version_str).split('\n') 67 | version = [l for l in version_lines if l.isdigit()][-1] 68 | if int(version) < 59: 69 | print(version_lines) 70 | print('{red}[X] Chrome version must be 59 or greater for headless PDF, screenshot, and DOM saving{reset}'.format(**ANSI)) 71 | print(' See https://github.com/pirate/bookmark-archiver for help.') 72 | raise SystemExit(1) 73 | except (IndexError, TypeError, OSError): 74 | print('{red}[X] Failed to parse Chrome version, is it installed properly?{reset}'.format(**ANSI)) 75 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format(CHROME_BINARY)) 76 | print(' See https://github.com/pirate/bookmark-archiver for help.') 77 | raise SystemExit(1) 78 | 79 | if FETCH_WGET: 80 | if run(['which', 'wget'], stdout=DEVNULL).returncode or run(['wget', '--version'], stdout=DEVNULL).returncode: 81 | print('{red}[X] Missing dependency: wget{reset}'.format(**ANSI)) 82 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('wget')) 83 | print(' See https://github.com/pirate/bookmark-archiver for help.') 84 | raise SystemExit(1) 85 | 86 | if FETCH_FAVICON or SUBMIT_ARCHIVE_DOT_ORG: 87 | if run(['which', 'curl'], stdout=DEVNULL).returncode or run(['curl', '--version'], stdout=DEVNULL).returncode: 88 | print('{red}[X] Missing dependency: curl{reset}'.format(**ANSI)) 89 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('curl')) 90 | print(' See https://github.com/pirate/bookmark-archiver for help.') 91 | raise SystemExit(1) 92 | 93 | if FETCH_AUDIO or FETCH_VIDEO: 94 | if run(['which', 'youtube-dl'], stdout=DEVNULL).returncode or run(['youtube-dl', '--version'], stdout=DEVNULL).returncode: 95 | print('{red}[X] Missing dependency: youtube-dl{reset}'.format(**ANSI)) 96 | print(' Run ./setup.sh, then confirm it was installed with: {} --version'.format('youtube-dl')) 97 | print(' See https://github.com/pirate/bookmark-archiver for help.') 98 | raise SystemExit(1) 99 | 100 | 101 | def chmod_file(path, cwd='.', permissions=OUTPUT_PERMISSIONS, timeout=30): 102 | """chmod -R /""" 103 | 104 | if not os.path.exists(os.path.join(cwd, path)): 105 | raise Exception('Failed to chmod: {} does not exist (did the previous step fail?)'.format(path)) 106 | 107 | chmod_result = run(['chmod', '-R', permissions, path], cwd=cwd, stdout=DEVNULL, stderr=PIPE, timeout=timeout) 108 | if chmod_result.returncode == 1: 109 | print(' ', chmod_result.stderr.decode()) 110 | raise Exception('Failed to chmod {}/{}'.format(cwd, path)) 111 | 112 | 113 | def progress(seconds=TIMEOUT, prefix=''): 114 | """Show a (subprocess-controlled) progress bar with a timeout, 115 | returns end() function to instantly finish the progress 116 | """ 117 | 118 | if not SHOW_PROGRESS: 119 | return lambda: None 120 | 121 | chunk = '█' if sys.stdout.encoding == 'UTF-8' else '#' 122 | chunks = TERM_WIDTH - len(prefix) - 20 # number of progress chunks to show (aka max bar width) 123 | 124 | def progress_bar(seconds=seconds, prefix=prefix): 125 | """show timer in the form of progress bar, with percentage and seconds remaining""" 126 | try: 127 | for s in range(seconds * chunks): 128 | progress = s / chunks / seconds * 100 129 | bar_width = 
round(progress/(100/chunks)) 130 | 131 | # ████████████████████ 0.9% (1/60sec) 132 | sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)'.format( 133 | prefix, 134 | ANSI['green'], 135 | (chunk * bar_width).ljust(chunks), 136 | ANSI['reset'], 137 | round(progress, 1), 138 | round(s/chunks), 139 | seconds, 140 | )) 141 | sys.stdout.flush() 142 | time.sleep(1 / chunks) 143 | 144 | # ██████████████████████████████████ 100.0% (60/60sec) 145 | sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)\n'.format( 146 | prefix, 147 | ANSI['red'], 148 | chunk * chunks, 149 | ANSI['reset'], 150 | 100.0, 151 | seconds, 152 | seconds, 153 | )) 154 | sys.stdout.flush() 155 | except KeyboardInterrupt: 156 | print() 157 | pass 158 | 159 | p = Process(target=progress_bar) 160 | p.start() 161 | 162 | def end(): 163 | """immediately finish progress and clear the progressbar line""" 164 | p.terminate() 165 | sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH), ANSI['reset'])) # clear whole terminal line 166 | sys.stdout.flush() 167 | 168 | return end 169 | 170 | def pretty_path(path): 171 | """convert paths like .../bookmark-archiver/archiver/../output/abc into output/abc""" 172 | return path.replace(REPO_DIR + '/', '') 173 | 174 | 175 | def download_url(url): 176 | """download a given url's content into downloads/domain.txt""" 177 | 178 | if not os.path.exists(SOURCES_DIR): 179 | os.makedirs(SOURCES_DIR) 180 | 181 | ts = str(datetime.now().timestamp()).split('.', 1)[0] 182 | 183 | source_path = os.path.join(SOURCES_DIR, '{}-{}.txt'.format(domain(url), ts)) 184 | 185 | print('[*] [{}] Downloading {} > {}'.format( 186 | datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 187 | url, 188 | pretty_path(source_path), 189 | )) 190 | end = progress(TIMEOUT, prefix=' ') 191 | try: 192 | downloaded_xml = requests.get(url).content.decode() 193 | end() 194 | except Exception as e: 195 | end() 196 | print('[!] Failed to download {}\n'.format(url)) 197 | print(' ', e) 198 | raise SystemExit(1) 199 | 200 | with open(source_path, 'w', encoding='utf-8') as f: 201 | f.write(downloaded_xml) 202 | 203 | return source_path 204 | 205 | def str_between(string, start, end=None): 206 | """(12345, , ) -> 12345""" 207 | 208 | content = string.split(start, 1)[-1] 209 | if end is not None: 210 | content = content.rsplit(end, 1)[0] 211 | 212 | return content 213 | 214 | def get_link_type(link): 215 | """Certain types of links need to be handled specially, this figures out when that's the case""" 216 | 217 | if link['base_url'].endswith('.pdf'): 218 | return 'PDF' 219 | elif link['base_url'].rsplit('.', 1) in ('pdf', 'png', 'jpg', 'jpeg', 'svg', 'bmp', 'gif', 'tiff', 'webp'): 220 | return 'image' 221 | elif 'wikipedia.org' in link['domain']: 222 | return 'wiki' 223 | elif 'youtube.com' in link['domain']: 224 | return 'youtube' 225 | elif 'soundcloud.com' in link['domain']: 226 | return 'soundcloud' 227 | elif 'youku.com' in link['domain']: 228 | return 'youku' 229 | elif 'vimeo.com' in link['domain']: 230 | return 'vimeo' 231 | return None 232 | 233 | def merge_links(a, b): 234 | """deterministially merge two links, favoring longer field values over shorter, 235 | and "cleaner" values over worse ones. 
236 | """ 237 | longer = lambda key: a[key] if len(a[key]) > len(b[key]) else b[key] 238 | earlier = lambda key: a[key] if a[key] < b[key] else b[key] 239 | 240 | url = longer('url') 241 | longest_title = longer('title') 242 | cleanest_title = a['title'] if '://' not in a['title'] else b['title'] 243 | link = { 244 | 'timestamp': earlier('timestamp'), 245 | 'url': url, 246 | 'domain': domain(url), 247 | 'base_url': base_url(url), 248 | 'tags': longer('tags'), 249 | 'title': longest_title if '://' not in longest_title else cleanest_title, 250 | 'sources': list(set(a.get('sources', []) + b.get('sources', []))), 251 | } 252 | link['type'] = get_link_type(link) 253 | return link 254 | 255 | def find_link(folder, links): 256 | """for a given archive folder, find the corresponding link object in links""" 257 | url = parse_url(folder) 258 | if url: 259 | for link in links: 260 | if (link['base_url'] in url) or (url in link['url']): 261 | return link 262 | 263 | timestamp = folder.split('.')[0] 264 | for link in links: 265 | if link['timestamp'].startswith(timestamp): 266 | if link['domain'] in os.listdir(os.path.join(ARCHIVE_DIR, folder)): 267 | return link # careful now, this isn't safe for most ppl 268 | if link['domain'] in parse_url(folder): 269 | return link 270 | return None 271 | 272 | 273 | def parse_url(folder): 274 | """for a given archive folder, figure out what url it's for""" 275 | link_json = os.path.join(ARCHIVE_DIR, folder, 'index.json') 276 | if os.path.exists(link_json): 277 | with open(link_json, 'r') as f: 278 | try: 279 | link_json = f.read().strip() 280 | if link_json: 281 | link = json.loads(link_json) 282 | return link['base_url'] 283 | except ValueError: 284 | print('File contains invalid JSON: {}!'.format(link_json)) 285 | 286 | archive_org_txt = os.path.join(ARCHIVE_DIR, folder, 'archive.org.txt') 287 | if os.path.exists(archive_org_txt): 288 | with open(archive_org_txt, 'r') as f: 289 | original_link = f.read().strip().split('/http', 1)[-1] 290 | with_scheme = 'http{}'.format(original_link) 291 | return with_scheme 292 | 293 | return '' 294 | 295 | def manually_merge_folders(source, target): 296 | """prompt for user input to resolve a conflict between two archive folders""" 297 | 298 | if not IS_TTY: 299 | return 300 | 301 | fname = lambda path: path.split('/')[-1] 302 | 303 | print(' {} and {} have conflicting files, which do you want to keep?'.format(fname(source), fname(target))) 304 | print(' - [enter]: do nothing (keep both)') 305 | print(' - a: prefer files from {}'.format(source)) 306 | print(' - b: prefer files from {}'.format(target)) 307 | print(' - q: quit and resolve the conflict manually') 308 | try: 309 | answer = input('> ').strip().lower() 310 | except KeyboardInterrupt: 311 | answer = 'q' 312 | 313 | assert answer in ('', 'a', 'b', 'q'), 'Invalid choice.' 
314 | 315 | if answer == 'q': 316 | print('\nJust run Bookmark Archiver again to pick up where you left off.') 317 | raise SystemExit(0) 318 | elif answer == '': 319 | return 320 | 321 | files_in_source = set(os.listdir(source)) 322 | files_in_target = set(os.listdir(target)) 323 | for file in files_in_source: 324 | if file in files_in_target: 325 | to_delete = target if answer == 'a' else source 326 | run(['rm', '-Rf', os.path.join(to_delete, file)]) 327 | run(['mv', os.path.join(source, file), os.path.join(target, file)]) 328 | 329 | if not set(os.listdir(source)): 330 | run(['rm', '-Rf', source]) 331 | 332 | def fix_folder_path(archive_path, link_folder, link): 333 | """given a folder, merge it to the canonical 'correct' path for the given link object""" 334 | source = os.path.join(archive_path, link_folder) 335 | target = os.path.join(archive_path, link['timestamp']) 336 | 337 | url_in_folder = parse_url(source) 338 | if not (url_in_folder in link['base_url'] 339 | or link['base_url'] in url_in_folder): 340 | raise ValueError('The link does not match the url for this folder.') 341 | 342 | if not os.path.exists(target): 343 | # target doesn't exist so nothing needs merging, simply move A to B 344 | run(['mv', source, target]) 345 | else: 346 | # target folder exists, check for conflicting files and attempt manual merge 347 | files_in_source = set(os.listdir(source)) 348 | files_in_target = set(os.listdir(target)) 349 | conflicting_files = files_in_source & files_in_target 350 | 351 | if not conflicting_files: 352 | for file in files_in_source: 353 | run(['mv', os.path.join(source, file), os.path.join(target, file)]) 354 | 355 | if os.path.exists(source): 356 | files_in_source = set(os.listdir(source)) 357 | if files_in_source: 358 | manually_merge_folders(source, target) 359 | else: 360 | run(['rm', '-R', source]) 361 | 362 | 363 | def migrate_data(): 364 | # migrate old folder to new OUTPUT folder 365 | old_dir = os.path.join(REPO_DIR, 'html') 366 | if os.path.exists(old_dir): 367 | print('[!] WARNING: Moved old output folder "html" to new location: {}'.format(OUTPUT_DIR)) 368 | run(['mv', old_dir, OUTPUT_DIR], timeout=10) 369 | 370 | 371 | def cleanup_archive(archive_path, links): 372 | """move any incorrectly named folders to their canonical locations""" 373 | 374 | # for each folder that exists, see if we can match it up with a known good link 375 | # if we can, then merge the two folders (TODO: if not, move it to lost & found) 376 | 377 | unmatched = [] 378 | bad_folders = [] 379 | 380 | if not os.path.exists(archive_path): 381 | return 382 | 383 | for folder in os.listdir(archive_path): 384 | try: 385 | files = os.listdir(os.path.join(archive_path, folder)) 386 | except NotADirectoryError: 387 | continue 388 | 389 | if files: 390 | link = find_link(folder, links) 391 | if link is None: 392 | unmatched.append(folder) 393 | continue 394 | 395 | if folder != link['timestamp']: 396 | bad_folders.append((folder, link)) 397 | else: 398 | # delete empty folders 399 | run(['rm', '-R', os.path.join(archive_path, folder)]) 400 | 401 | if bad_folders and IS_TTY and input('[!] Cleanup archive? y/[n]: ') == 'y': 402 | print('[!] Fixing {} improperly named folders in archive...'.format(len(bad_folders))) 403 | for folder, link in bad_folders: 404 | fix_folder_path(archive_path, folder, link) 405 | elif bad_folders: 406 | print('[!] Warning! {} folders need to be merged, fix by running bookmark archiver.'.format(len(bad_folders))) 407 | 408 | if unmatched: 409 | print('[!] Warning! 
{} unrecognized folders in html/archive/'.format(len(unmatched))) 410 | print(' '+ '\n '.join(unmatched)) 411 | 412 | 413 | def wget_output_path(link, look_in=None): 414 | """calculate the path to the wgetted .html file, since wget may 415 | adjust some paths to be different than the base_url path. 416 | 417 | See docs on wget --adjust-extension (-E) 418 | """ 419 | 420 | # if we have it stored, always prefer the actual output path to computed one 421 | if link.get('latest', {}).get('wget'): 422 | return link['latest']['wget'] 423 | 424 | urlencode = lambda s: quote(s, encoding='utf-8', errors='replace') 425 | 426 | if link['type'] in ('PDF', 'image'): 427 | return urlencode(link['base_url']) 428 | 429 | # Since the wget algorithm to for -E (appending .html) is incredibly complex 430 | # instead of trying to emulate it here, we just look in the output folder 431 | # to see what html file wget actually created as the output 432 | wget_folder = link['base_url'].rsplit('/', 1)[0].split('/') 433 | look_in = os.path.join(ARCHIVE_DIR, link['timestamp'], *wget_folder) 434 | 435 | if look_in and os.path.exists(look_in): 436 | html_files = [ 437 | f for f in os.listdir(look_in) 438 | if re.search(".+\\.[Hh][Tt][Mm][Ll]?$", f, re.I | re.M) 439 | ] 440 | if html_files: 441 | return urlencode(os.path.join(*wget_folder, html_files[0])) 442 | 443 | return None 444 | 445 | # If finding the actual output file didn't work, fall back to the buggy 446 | # implementation of the wget .html appending algorithm 447 | # split_url = link['url'].split('#', 1) 448 | # query = ('%3F' + link['url'].split('?', 1)[-1]) if '?' in link['url'] else '' 449 | 450 | # if re.search(".+\\.[Hh][Tt][Mm][Ll]?$", split_url[0], re.I | re.M): 451 | # # already ends in .html 452 | # return urlencode(link['base_url']) 453 | # else: 454 | # # .html needs to be appended 455 | # without_scheme = split_url[0].split('://', 1)[-1].split('?', 1)[0] 456 | # if without_scheme.endswith('/'): 457 | # if query: 458 | # return urlencode('#'.join([without_scheme + 'index.html' + query + '.html', *split_url[1:]])) 459 | # return urlencode('#'.join([without_scheme + 'index.html', *split_url[1:]])) 460 | # else: 461 | # if query: 462 | # return urlencode('#'.join([without_scheme + '/index.html' + query + '.html', *split_url[1:]])) 463 | # elif '/' in without_scheme: 464 | # return urlencode('#'.join([without_scheme + '.html', *split_url[1:]])) 465 | # return urlencode(link['base_url'] + '/index.html') 466 | 467 | 468 | def derived_link_info(link): 469 | """extend link info with the archive urls and other derived data""" 470 | 471 | link_info = { 472 | **link, 473 | 'date': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'), 474 | 'google_favicon_url': 'https://www.google.com/s2/favicons?domain={domain}'.format(**link), 475 | 'favicon_url': 'archive/{timestamp}/favicon.ico'.format(**link), 476 | 'files_url': 'archive/{timestamp}/index.html'.format(**link), 477 | 'archive_url': 'archive/{}/{}'.format(link['timestamp'], wget_output_path(link) or 'index.html'), 478 | 'pdf_link': 'archive/{timestamp}/output.pdf'.format(**link), 479 | 'screenshot_link': 'archive/{timestamp}/screenshot.png'.format(**link), 480 | 'dom_link': 'archive/{timestamp}/output.html'.format(**link), 481 | 'archive_org_url': 'https://web.archive.org/web/{base_url}'.format(**link), 482 | } 483 | 484 | # PDF and images are handled slightly differently 485 | # wget, screenshot, & pdf urls all point to the same file 486 | if link['type'] in ('PDF', 'image'): 487 | 
link_info.update({ 488 | 'archive_url': 'archive/{timestamp}/{base_url}'.format(**link), 489 | 'pdf_link': 'archive/{timestamp}/{base_url}'.format(**link), 490 | 'screenshot_link': 'archive/{timestamp}/{base_url}'.format(**link), 491 | 'dom_link': 'archive/{timestamp}/{base_url}'.format(**link), 492 | 'title': '{title} ({type})'.format(**link), 493 | }) 494 | return link_info 495 | -------------------------------------------------------------------------------- /bin/bookmark-archiver: -------------------------------------------------------------------------------- 1 | ../archiver/archive.py -------------------------------------------------------------------------------- /bin/export-browser-history: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | REPO_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )"; cd .. && pwd )" 4 | 5 | if [[ "$1" == "--chrome" ]]; then 6 | # Google Chrome / Chromium 7 | default=$(ls ~/Library/Application\ Support/Google/Chrome/Default/History) 8 | if [[ -e "$2" ]]; then 9 | cp "$2" "$REPO_DIR/output/sources/chrome_history.db.tmp" 10 | else 11 | echo "Defaulting to history db: $default" 12 | echo "Optionally specify the path to a different sqlite history database as the 2nd argument." 13 | cp "$default" "$REPO_DIR/output/sources/chrome_history.db.tmp" 14 | fi 15 | sqlite3 "$REPO_DIR/output/sources/chrome_history.db.tmp" "SELECT \"[\" || group_concat(json_object('timestamp', last_visit_time, 'description', title, 'href', url)) || \"]\" FROM urls;" > "$REPO_DIR/output/sources/chrome_history.json" 16 | rm "$REPO_DIR"/output/sources/chrome_history.db.* 17 | echo "Chrome history exported to:" 18 | echo " output/sources/chrome_history.json" 19 | fi 20 | 21 | if [[ "$1" == "--firefox" ]]; then 22 | # Firefox 23 | default=$(ls ~/Library/Application\ Support/Firefox/Profiles/*.default/places.sqlite) 24 | if [[ -e "$2" ]]; then 25 | cp "$2" "$REPO_DIR/output/sources/firefox_history.db.tmp" 26 | else 27 | echo "Defaulting to history db: $default" 28 | echo "Optionally specify the path to a different sqlite history database as the 2nd argument." 29 | cp "$default" "$REPO_DIR/output/sources/firefox_history.db.tmp" 30 | fi 31 | sqlite3 "$REPO_DIR/output/sources/firefox_history.db.tmp" "SELECT \"[\" || group_concat(json_object('timestamp', last_visit_date, 'description', title, 'href', url)) || \"]\" FROM moz_places;" > "$REPO_DIR/output/sources/firefox_history.json" 32 | rm "$REPO_DIR"/output/sources/firefox_history.db.* 33 | echo "Firefox history exported to:" 34 | echo " output/sources/firefox_history.json" 35 | fi 36 | -------------------------------------------------------------------------------- /bin/setup-bookmark-archiver: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Bookmark Archiver Setup Script 3 | # Nick Sweeting 2017 | MIT License 4 | # https://github.com/pirate/bookmark-archiver 5 | 6 | echo "[i] Installing bookmark-archiver dependencies. 📦" 7 | echo "" 8 | echo " You may be prompted for a password in order to install the following dependencies:" 9 | echo " - Chromium Browser (see README for Google-Chrome instructions instead)" 10 | echo " - python3" 11 | echo " - wget" 12 | echo " - curl" 13 | echo "" 14 | echo " You may follow Manual Setup instructions in README.md instead if you prefer not to run an unknown script." 15 | echo " Press enter to continue, or Ctrl+C to cancel..." 
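# (added comment) a bare 'read' just waits here for the user to press enter; nothing is installed before this point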
16 | read 17 | 18 | echo "" 19 | 20 | # On Linux: 21 | if which apt-get > /dev/null; then 22 | echo "[+] Updating apt repos..." 23 | apt update -q 24 | if which google-chrome; then 25 | echo "[i] You already have google-chrome installed, if you would like to download chromium-browser instead (they work pretty much the same), follow the Manual Setup instructions" 26 | echo "[+] Linking $(which google-chrome) -> /usr/bin/chromium-browser (press enter to continue, or Ctrl+C to cancel...)" 27 | read 28 | sudo ln -s "$(which google-chrome)" /usr/bin/chromium-browser 29 | elif which chromium-browser; then 30 | echo "[i] chromium-browser already installed, using existing installation." 31 | chromium-browser --version 32 | else 33 | echo "[+] Installing chromium-browser..." 34 | apt install chromium-browser -y 35 | fi 36 | echo "[+] Installing python3, wget, curl..." 37 | apt install python3 wget curl 38 | 39 | # On Mac: 40 | elif which brew > /dev/null; then # 🐍 eye of newt 41 | if ls /Applications/Google\ Chrome.app > /dev/null; then 42 | echo "[+] Linking /usr/local/bin/google-chrome -> /Applications/Google Chrome.app" 43 | echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/chromium-browser 44 | chmod +x /usr/local/bin/chromium-browser 45 | 46 | elif which chromium-browser; then 47 | brew cask upgrade chromium-browser 48 | echo "[+] Linking /usr/local/bin/chromium-browser -> /Applications/Chromium.app" 49 | echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser 50 | chmod +x /usr/local/bin/chromium-browser 51 | 52 | else 53 | echo "[+] Installing chromium-browser..." 54 | brew cask install chromium 55 | echo "[+] Linking /usr/local/bin/chromium-browser -> /Applications/Chromium.app" 56 | echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser 57 | chmod +x /usr/local/bin/chromium-browser 58 | fi 59 | echo "[+] Installing python3, wget, curl (ignore 'already installed' warnings)..." 60 | brew install python3 wget curl 61 | else 62 | echo "[X] Could not find aptitude or homebrew! ‼️" 63 | echo "" 64 | echo " If you're on macOS, make sure you have homebrew installed: https://brew.sh/" 65 | echo " If you're on Ubuntu/Debian, make sure you have apt installed: https://help.ubuntu.com/lts/serverguide/apt.html" 66 | echo " (those are the only currently supported systems)" 67 | echo "" 68 | echo "See the README.md for Manual Setup & Troubleshooting instructions." 69 | exit 1 70 | fi 71 | 72 | # Check: 73 | echo "" 74 | echo "[*] Checking installed versions:" 75 | which chromium-browser && 76 | chromium-browser --version && 77 | which wget && 78 | which python3 && 79 | which curl && 80 | echo "[√] All dependencies installed. ✅" && 81 | exit 0 82 | 83 | echo "" 84 | echo "[X] Failed to install some dependencies! ‼️" 85 | echo " - Try the Manual Setup instructions in the README.md" 86 | echo " - Try the Troubleshooting: Dependencies instructions in the README.md" 87 | echo " - Open an issue on github to get help: https://github.com/pirate/bookmark-archiver/issues" 88 | exit 1 89 | -------------------------------------------------------------------------------- /setup: -------------------------------------------------------------------------------- 1 | bin/setup-bookmark-archiver --------------------------------------------------------------------------------
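
# Example usage — a minimal sketch based on the comments in archiver/config.py and the bin/ scripts
# above; the export filename below is a placeholder for whatever bookmarks export file you actually have:
#
#   ./setup                                          # install chromium-browser, wget, curl, python3
#   ./bin/export-browser-history --chrome            # optional: dump browser history to output/sources/
#   ./archive ~/Downloads/bookmarks_export.html      # parse the export and archive every link found
#   env FETCH_VIDEO=True CHROME_BINARY=google-chrome ./archive ~/Downloads/bookmarks_export.html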