├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
1 | Creative Commons Legal Code
2 |
3 | CC0 1.0 Universal
4 |
5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
12 | HEREUNDER.
13 |
14 | Statement of Purpose
15 |
16 | The laws of most jurisdictions throughout the world automatically confer
17 | exclusive Copyright and Related Rights (defined below) upon the creator
18 | and subsequent owner(s) (each and all, an "owner") of an original work of
19 | authorship and/or a database (each, a "Work").
20 |
21 | Certain owners wish to permanently relinquish those rights to a Work for
22 | the purpose of contributing to a commons of creative, cultural and
23 | scientific works ("Commons") that the public can reliably and without fear
24 | of later claims of infringement build upon, modify, incorporate in other
25 | works, reuse and redistribute as freely as possible in any form whatsoever
26 | and for any purposes, including without limitation commercial purposes.
27 | These owners may contribute to the Commons to promote the ideal of a free
28 | culture and the further production of creative, cultural and scientific
29 | works, or to gain reputation or greater distribution for their Work in
30 | part through the use and efforts of others.
31 |
32 | For these and/or other purposes and motivations, and without any
33 | expectation of additional consideration or compensation, the person
34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she
35 | is an owner of Copyright and Related Rights in the Work, voluntarily
36 | elects to apply CC0 to the Work and publicly distribute the Work under its
37 | terms, with knowledge of his or her Copyright and Related Rights in the
38 | Work and the meaning and intended legal effect of CC0 on those rights.
39 |
40 | 1. Copyright and Related Rights. A Work made available under CC0 may be
41 | protected by copyright and related or neighboring rights ("Copyright and
42 | Related Rights"). Copyright and Related Rights include, but are not
43 | limited to, the following:
44 |
45 | i. the right to reproduce, adapt, distribute, perform, display,
46 | communicate, and translate a Work;
47 | ii. moral rights retained by the original author(s) and/or performer(s);
48 | iii. publicity and privacy rights pertaining to a person's image or
49 | likeness depicted in a Work;
50 | iv. rights protecting against unfair competition in regards to a Work,
51 | subject to the limitations in paragraph 4(a), below;
52 | v. rights protecting the extraction, dissemination, use and reuse of data
53 | in a Work;
54 | vi. database rights (such as those arising under Directive 96/9/EC of the
55 | European Parliament and of the Council of 11 March 1996 on the legal
56 | protection of databases, and under any national implementation
57 | thereof, including any amended or successor version of such
58 | directive); and
59 | vii. other similar, equivalent or corresponding rights throughout the
60 | world based on applicable law or treaty, and any national
61 | implementations thereof.
62 |
63 | 2. Waiver.
To the greatest extent permitted by, but not in contravention
64 | of, applicable law, Affirmer hereby overtly, fully, permanently,
65 | irrevocably and unconditionally waives, abandons, and surrenders all of
66 | Affirmer's Copyright and Related Rights and associated claims and causes
67 | of action, whether now known or unknown (including existing as well as
68 | future claims and causes of action), in the Work (i) in all territories
69 | worldwide, (ii) for the maximum duration provided by applicable law or
70 | treaty (including future time extensions), (iii) in any current or future
71 | medium and for any number of copies, and (iv) for any purpose whatsoever,
72 | including without limitation commercial, advertising or promotional
73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
74 | member of the public at large and to the detriment of Affirmer's heirs and
75 | successors, fully intending that such Waiver shall not be subject to
76 | revocation, rescission, cancellation, termination, or any other legal or
77 | equitable action to disrupt the quiet enjoyment of the Work by the public
78 | as contemplated by Affirmer's express Statement of Purpose.
79 |
80 | 3. Public License Fallback. Should any part of the Waiver for any reason
81 | be judged legally invalid or ineffective under applicable law, then the
82 | Waiver shall be preserved to the maximum extent permitted taking into
83 | account Affirmer's express Statement of Purpose.
In addition, to the
84 | extent the Waiver is so judged Affirmer hereby grants to each affected
85 | person a royalty-free, non transferable, non sublicensable, non exclusive,
86 | irrevocable and unconditional license to exercise Affirmer's Copyright and
87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the
88 | maximum duration provided by applicable law or treaty (including future
89 | time extensions), (iii) in any current or future medium and for any number
90 | of copies, and (iv) for any purpose whatsoever, including without
91 | limitation commercial, advertising or promotional purposes (the
92 | "License"). The License shall be deemed effective as of the date CC0 was
93 | applied by Affirmer to the Work. Should any part of the License for any
94 | reason be judged legally invalid or ineffective under applicable law, such
95 | partial invalidity or ineffectiveness shall not invalidate the remainder
96 | of the License, and in such case Affirmer hereby affirms that he or she
97 | will not (i) exercise any of his or her remaining Copyright and Related
98 | Rights in the Work or (ii) assert any associated claims and causes of
99 | action with respect to the Work, in either case contrary to Affirmer's
100 | express Statement of Purpose.
101 |
102 | 4. Limitations and Disclaimers.
103 |
104 | a. No trademark or patent rights held by Affirmer are waived, abandoned,
105 | surrendered, licensed or otherwise affected by this document.
106 | b. Affirmer offers the Work as-is and makes no representations or
107 | warranties of any kind concerning the Work, express, implied,
108 | statutory or otherwise, including without limitation warranties of
109 | title, merchantability, fitness for a particular purpose, non
110 | infringement, or the absence of latent or other defects, accuracy, or
111 | the present or absence of errors, whether or not discoverable, all to
112 | the greatest extent permissible under applicable law.
113 | c.
Affirmer disclaims responsibility for clearing rights of other persons
114 | that may apply to the Work or any use thereof, including without
115 | limitation any person's Copyright and Related Rights in the Work.
116 | Further, Affirmer disclaims responsibility for obtaining any necessary
117 | consents, permissions or other rights required for any use of the
118 | Work.
119 | d. Affirmer understands and acknowledges that Creative Commons is not a
120 | party to this document and has no duty or obligation with respect to
121 | this CC0 or use of the Work.
122 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome digital preservation
2 |
3 | [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
4 |
5 | An awesome list of digital preservation tools.
6 |
7 | ## Table of contents
8 |
9 | * [Web archiving](#web-archiving)
10 | * [Social networks](#social-networks)
11 | * [Other digital objects](#other-digital-objects)
12 | * [Standards and specifications](#standards-and-specifications)
13 | * [Organizations](#organizations)
14 | * [Knowledge bases](#knowledge-bases)
15 | * [Major digital archives](#major-digital-archives)
16 | * [Related lists](#related-lists)
17 |
18 |
19 | ## Web archiving
20 |
21 | ### Crawlers
22 | * [Wget](https://www.gnu.org/software/wget/) - A free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols.
23 | * [WPull](https://github.com/ArchiveTeam/wpull) - Wget-compatible web downloader and crawler.
24 | * [Conifer](https://github.com/Rhizome-Conifer/conifer) - Collect and revisit web pages.
25 | * [grab-site](https://github.com/ArchiveTeam/grab-site) - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
26 | * [Heritrix3](https://github.com/internetarchive/heritrix3) - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
27 | * [WAIL](https://github.com/machawk1/wail) - Web Archiving Integration Layer: One-Click User Instigated Preservation
28 | * [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) - Run a high-fidelity browser-based crawler in a single Docker container
29 |
30 | ### Replay tools
31 | * [ArchiveWeb.page](https://github.com/webrecorder/archiveweb.page) - A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers
32 | * [ReplayWeb.page](https://github.com/webrecorder/replayweb.page) - Serverless Web Archive Replay directly in the browser
33 | * [pywb](https://github.com/webrecorder/pywb) - Core Python Web Archiving Toolkit for replay and recording of web archives
34 | * [webrecorder-player](https://github.com/webrecorder/webrecorder-player) - Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
35 | * [ipwb](https://github.com/oduwsdl/ipwb) - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
36 |
37 |
38 | ### Analysis and data processing
39 | * [AUT](https://github.com/archivesunleashed/aut/) - The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
40 | * [AUT Notebooks](https://github.com/archivesunleashed/notebooks) - Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
41 | * [WARCIO](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO
42 | * [Metawarc](https://github.com/datacoon/metawarc) - Metadata extractor for WARC files
43 | * [WarcDB](https://github.com/Florents-Tselai/WarcDB) - WarcDB: Web crawl data as SQLite databases
44 | * [ArchiveSpark](https://github.com/helgeho/ArchiveSpark) - An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
45 | * [CDX Toolkit](https://github.com/cocrawler/cdx_toolkit) - A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
46 |
47 | ### Page pushers
48 | * [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) - Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more
49 | * [Wayback](https://github.com/wabarc/wayback) - A self-hosted toolkit for archiving webpages to the Internet Archive, archive.today, IPFS, and local file systems
50 | * [Archivenow](https://github.com/oduwsdl/archivenow) - A Tool To Push Web Resources Into Web Archives
51 | * [iagitup](https://github.com/gdamdam/iagitup) - A command line tool to archive a git repository from GitHub to the Internet Archive.
52 |
53 | ### Online services
54 | * [Archive-It](https://archive-it.org/) - Online web archiving service
55 |
56 | ## Social networks
57 |
58 | ### Twitter
59 | * [twarc](https://github.com/DocNow/twarc) - A command line tool (and Python library) for archiving Twitter JSON
60 |
61 | ### Instagram
62 | * [instaloader](https://github.com/instaloader/instaloader) - Download pictures (or videos) along with their captions and other metadata from Instagram.
63 |
64 |
65 | ### Universal
66 | * [sfm-ui](https://github.com/gwu-libraries/sfm-ui) - Social Feed Manager user interface application.
67 | * [Media downloader](https://github.com/awesome-yasin/Media-Downloader) - Download Instagram Reels, Stories and posts, view Instagram profiles, and download public Facebook videos, YouTube videos (with a YouTube-to-MP3 converter), SoundCloud MP3s and Dailymotion videos. Built with Node.js, Express.js, React and RapidAPI.
68 |
69 |
70 | ## Other digital objects
71 |
72 | ### Online storage
73 | * [ydiskarc](https://github.com/ruarxive/ydiskarc) - A command-line tool to back up public resources from the Yandex.Disk (disk.yandex.ru / yadi.sk) file storage service
74 | * [filegetter](https://github.com/ruarxive/filegetter) - A command-line tool to collect files from public data sources using URL patterns and config files
75 |
76 | ### Messengers and chats
77 | * [tgarc](https://github.com/ruarxive/tgarc) - A command line tool for archiving Telegram JSON
78 |
79 | ### Specific CMS
80 | * [wparc](https://github.com/ruarxive/wparc) - A command-line tool for archiving WordPress API data and files
81 | * [spcrawler](https://github.com/ruarxive/spcrawler) - A command-line tool to back up data from public SharePoint installations via their open API endpoints
82 |
83 | ### Public data API
84 | * [apibackuper](https://github.com/ruarxive/apibackuper) - A Python library and command-line tool to back up API calls
85 |
86 | ## Standards and specifications
87 |
88 | * [The WARC Format 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) - The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
89 | * [CDX file format](https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/) - The format of CDX files, which index the records stored in WARC files
90 | * [WARC Specifications](https://iipc.github.io/warc-specifications/) - A collection of WARC-related specifications and formats
91 | * [The WACZ Format 1.1.1](https://specs.webrecorder.net/wacz/1.1.1/) - Web Archive Collection Zipped.
WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file.
92 |
93 | ## Organizations
94 |
95 | * [Digital Preservation Coalition](https://www.dpconline.org/) - The DPC is a not-for-profit company dedicated to digital preservation initiatives
96 | * [International Internet Preservation Consortium](https://netpreserve.org/) - A leading consortium for web archiving
97 |
98 | ## Knowledge bases
99 | * [Archiveteam Wiki](https://wiki.archiveteam.org/) - A wiki about various archival topics and file formats
100 |
101 | ## Major digital archives
102 |
103 | * [Internet Archive](https://archive.org/) - The biggest digital archive, with large web archives
104 | * [Common Crawl](https://commoncrawl.org) - An open repository of web crawl data collected from across the Internet
105 |
106 | ## Related lists
107 |
108 | * [Awesome Web Archiving](https://github.com/iipc/awesome-web-archiving) - An Awesome List for getting started with web archiving
109 | * [Awesome data takeout](https://github.com/ivbeg/awesome-data-takeout) - An awesome list of services for taking out your personal data from major online services and providers
--------------------------------------------------------------------------------