├── .gitattributes
├── .github
    └── ISSUE_TEMPLATE
    │   └── protocol.md
├── .gitignore
├── README.md
├── _config.yml
├── _data
    ├── refs.yml
    └── warc_fields.yml
├── _includes
    ├── _about.md
    ├── _issues.md
    └── _toc.md
├── _layouts
    ├── default.html
    └── plaintext.txt
├── assets
    ├── bootstrap
    │   ├── css
    │   │   ├── bootstrap-theme.css
    │   │   ├── bootstrap-theme.css.map
    │   │   ├── bootstrap-theme.min.css
    │   │   ├── bootstrap.css
    │   │   ├── bootstrap.css.map
    │   │   └── bootstrap.min.css
    │   ├── fonts
    │   │   ├── glyphicons-halflings-regular.eot
    │   │   ├── glyphicons-halflings-regular.svg
    │   │   ├── glyphicons-halflings-regular.ttf
    │   │   ├── glyphicons-halflings-regular.woff
    │   │   └── glyphicons-halflings-regular.woff2
    │   └── js
    │   │   ├── bootstrap.js
    │   │   ├── bootstrap.min.js
    │   │   └── npm.js
    ├── fonts
    │   ├── glyphicons-halflings-regular.eot
    │   ├── glyphicons-halflings-regular.svg
    │   ├── glyphicons-halflings-regular.ttf
    │   └── glyphicons-halflings-regular.woff
    ├── javascripts
    │   └── scale.fix.js
    └── stylesheets
    │   ├── pygment_trac.css
    │   └── styles.css
├── guidelines
    ├── cdx-non-get-requests
    │   └── index.md
    ├── warc-fields
    │   └── index.md
    └── warc-implementation-guidelines
    │   └── index.md
├── index.md
├── primers
    └── web-archive-formats
    │   ├── cdx.unsorted.out
    │   ├── hello-world.txt
    │   ├── hello-world.warc
    │   ├── hello-world.warc.cdx
    │   ├── hello-world.warc.gz
    │   └── index.md
└── specifications
    ├── cdx-format
        ├── cdx-2006
        │   └── index.md
        └── cdx-2015
        │   └── index.md
    ├── warc-deduplication
        ├── recording-arbitrary-duplicates-1.0.md
        └── samples
        │   ├── 20130729-heritrix-original.warc.gz
        │   ├── 20130729-heritrix-revisit-with-http-headers.warc.gz
        │   ├── 20141124-heritrix-server-not-modified.warc.gz
        │   ├── 20141129-heritrix-original.warc.gz
        │   └── 20141129-heritrix-revisit-with-http-headers-and-new-warc-headers.warc.gz
    ├── warc-format
        ├── meetings
        │   └── 2015-05-01-IIPC-GA-WARC-Meeting-Minutes.md
        ├── warc-1.0
        │   ├── The_WARC_Format.md
        │   ├── WARC_ISO_28500_version1_latestdraft.doc
        │   ├── WARC_ISO_28500_version1_latestdraft.pdf
        │   └── index.md
        ├── warc-1.1-annotated
        │   └── index.md
        └── warc-1.1
        │   └── index.md
    ├── warc-rendered-targets
        └── warc-rendered-targets-1.0.md
    └── warc-zstd
        └── index.md

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.warc -text
2 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/protocol.md:
--------------------------------------------------------------------------------
1 | ### Protocol name
2 |
3 | > e.g. FTP, HTTP/2 over cleartext TCP
4 |
5 | ### Protocol identifier
6 |
7 | > Guidelines: When in doubt identifiers should be a lowercase ASCII string
8 | > of the form "name/version". The slash character and version should be
9 | > omitted if the protocol doesn't have multiple wire versions.
10 | >
11 | > Consider reusing identifiers from the following registry when possible:
12 | > https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml#alpn-protocol-ids
13 |
14 | ### Specification URL (optional)
15 |
16 | > URL of a document describing the protocol such as an RFC.
17 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | _site/
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | warc-specifications
2 | ===================
3 |
4 | Centralised repository for WARC usage specifications.
5 |
6 | TODO:
7 |
8 | * Explain this repo briefly here.
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | baseurl: /warc-specifications
2 |
3 | markdown: kramdown
4 |
5 | permalink: pretty
6 |
7 | defaults:
8 |   -
9 |     scope:
10 |       path: "" # an empty string here means all files in the project
11 |     values:
12 |       layout: "default"
13 |
14 | exclude:
15 |   - "*/Makefile"
16 |
17 | gems:
18 |   - jekyll-redirect-from
19 |
20 |
--------------------------------------------------------------------------------
/_data/refs.yml:
--------------------------------------------------------------------------------
1 | WARC Implementation Guidelines: https://iipc.github.io/warc-specifications/guidelines/warc-implementation-guidelines/
2 | issues: https://github.com/iipc/warc-specifications/issues
3 | WARC 1.0: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
4 | WARC 1.1: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
5 | warcit: https://github.com/webrecorder/warcit
6 |
--------------------------------------------------------------------------------
/_data/warc_fields.yml:
--------------------------------------------------------------------------------
1 | - name: Content-Length
2 |   since: 1.0
3 |   spec: WARC 1.1
4 | - name: Content-Type
5 |   since: 1.0
6 |   spec: WARC 1.1
7 | - name: WARC-Block-Digest
8 |   since: 1.0
9 |   spec: WARC 1.1
10 | - name: WARC-Cipher-Suite
11 |   spec: issues/86
12 |   status: proposed
13 | - name: WARC-Concurrent-To
14 |   since: 1.0
15 |   spec: WARC 1.1
16 | - name: WARC-Creation-Date
17 |   spec: warcit#warc-structure-and-format
18 |   status: wild
19 | - name: WARC-Date
20 |   since: 1.0
21 |   spec: WARC 1.1
22 | - name: WARC-Filename
23 |   since: 1.0
24 |   spec: WARC 1.1
25 | - name: WARC-Identified-Payload-Type
26 |   since: 1.0
27 |   spec: WARC 1.1
28 | - name: WARC-IP-Address
29 |   since: 1.0
30 |   spec: WARC 1.1
31 | - name: WARC-Payload-Digest
32 |   since: 1.0
33 |   spec: WARC 1.1
34 | - name: WARC-Previous-Record-ID
35 |   spec: WARC Implementation Guidelines#link-with-previous-record
36 |   status: proposed
37 | - name: WARC-Profile
38 |   since: 1.0
39 |   spec: WARC 1.1
40 | - name: WARC-Protocol
41 |   spec: issues/42
42 |   status: proposed
43 | - name: WARC-Push-Promised-From
44 |   spec: issues/43
45 |   status: proposed
46 | - name: WARC-Record-ID
47 |   since: 1.0
48 |   spec: WARC 1.1
49 | - name: WARC-Refers-To
50 |   since: 1.0
51 |   spec: WARC 1.1
52 | - name: WARC-Refers-To-Date
53 |   since: 1.1
54 |   spec: WARC 1.1
55 | - name: WARC-Refers-To-Target-URI
56 |   since: 1.1
57 |   spec: WARC 1.1
58 | - name: WARC-Segment-Number
59 |   since: 1.0
60 |   spec: WARC 1.1
61 | - name: WARC-Segment-Origin-ID
62 |   since: 1.0
63 |   spec: WARC 1.1
64 | - name: WARC-Segment-Total-Length
65 |   since: 1.0
66 |   spec: WARC 1.1
67 | - name: WARC-Source-URI
68 |   spec: warcit#warc-structure-and-format
69 |   status: wild
70 | - name: WARC-Target-URI
71 |   since: 1.0
72 |   spec: WARC 1.1
73 | - name: WARC-Transcluded-By
74 |   spec: issues/4
75 |   status: proposed
76 | - name: WARC-Truncated
77 |   since: 1.0
78 |   spec: WARC 1.1
79 | - name: WARC-Type
80 |   since: 1.0
81 |   spec: WARC 1.1
82 | - name: WARC-Warcinfo-ID
83 |   since: 1.0
84 |   spec: WARC 1.1
85 |
--------------------------------------------------------------------------------
/_includes/_about.md:
--------------------------------------------------------------------------------
1 | {% if page.status %}
2 | {% for op in site.pages %}
3 | {% if page.version-of == op.version-of %}
4 | {% if op.latest == true and op.version != page.version %}
5 | {% assign latest-version = op %}
6 | {% endif %}
7 | {% if op.version == page.previous-version %}
8 | {% assign previous-version = op %}
9 | {% endif %}
10 | {% endif %}
11 | {% endfor %}
12 |
Field | Status | Since | Specification |
---|---|---|---|
{{ field.name }} | {{ status }} | {{ field.since }} | {{ field.spec | split: "#" | first }} |
25 |  CDX N b a m s k r M S V g
26 | au,gov,financeminister)/ 20150914222034 http://www.financeminister.gov.au/ text/html 200 ZMSA5TNJUKKRYAIM5PRUJLL24DV7QYOO - - 83848 117273 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
27 |
28 |
29 | ----
30 |
31 | A CDX file consists of individual lines of text, each of which summarizes a single web document.
32 | The first line in the file is a legend for interpreting the data, and the following lines contain the data for referencing the corresponding pages within the host. The first character of the file is the field delimiter used in the rest of the file. This is followed by the literal "CDX" and then individual field markers as defined below.
33 |
34 | The following is a sample from a CDX file:
35 |
36 | ~~~
37 |  CDX A b e a m s c k r V v D d g M n
38 | 0-0-0checkmate.com/Bugs/Bug_Investigators.html 20010424210551 209.52.183.152 0-0-0checkmate.com:80/Bugs/Bug_Investigators.html text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 a725a64ad6bb7112c55ed26c9e4cef63 - 17130110 59129865 1927657 6501523 DE_crawl6.20010424210458 - 5750
39 | 0-0-0checkmate.com/Bugs/Insect_Habitats.html 20010424210312 209.52.183.152 0-0-0checkmate.com:80/Bugs/Insect_Habitats.html text/html 200 d520038e97d7538855715ddcba613d41 30025030eeb72e9345cc2ddf8b5ff218 - 47392928 145482381 4426829 15345336 DE_crawl3.20010424210104 - 6356
40 | 0-0-0checkmate.com/Hot/index.html 20010424212403 209.52.183.152 0-0-0checkmate.com:80/Hot/index.html text/html 200 52242643710547ff4ce2605ed03ed9e2 b06d037c06e7ffd7afc6db270aca7645 - 21301376 62305547 1855363 6627262 DE_crawl6.20010424212307 - 6317
41 | ~~~
42 |
43 | Field Specifications
44 | --------------------
45 |
46 | The default first line of a CDX file is:
47 |
48 | ~~~
49 |  CDX A b e a m s c k r V v D d g M n
50 | ~~~
51 |
52 | The letters used in dat files and cdx files are as follows:
53 |
54 | ~~~
55 | A canonized url
56 | B news group
57 | C rulespace category ***
58 | D compressed dat file offset
59 | F canonized frame
60 | G multi-column language description (* soon)
61 | H canonized host
62 | I canonized image
63 | J canonized jump point
64 | K Some weird FBIS what's changed kinda thing
65 | L canonized link
66 | M meta tags (AIF) *
67 | N massaged url
68 | P canonized path
69 | Q language string
70 | R canonized redirect
71 | S compressed record size
72 | U uniqueness ***
73 | V compressed arc file offset *
74 | X canonized url in other href tags
75 | Y canonized url in other src tags
76 | Z canonized url found in script
77 | a original url **
78 | b date **
79 | c old style checksum *
80 | d uncompressed dat file offset
81 | e IP **
82 | f frame *
83 | g file name
84 | h original host
85 | i image *
86 | j original jump point
87 | k new style checksum *
88 | l link *
89 | m mime type of original document *
90 | n arc document length *
91 | o port
92 | p original path
93 | r redirect *
94 | s response code *
95 | t title *
96 | v uncompressed arc file offset *
97 | x url in other href tags *
98 | y url in other src tags *
99 | z url found in script *
100 | # comment
101 |
102 | * in alexa-made dat file
103 | ** in alexa-made dat file meta-data line
104 | *** future data
105 | ~~~
106 |
107 | Document History
108 | ----------------
109 |
110 | *2020-09-26* -- Minor, fixed some typos.
111 |
112 | *2015-11-30* -- Added example CDX-11 record with tooltips and added 'S compressed record size' to the list.
113 |
114 | *2015-07-10* -- Copied from v.2006 and added notes from [Ilya Kreymer](https://github.com/ikreymer).
115 |
116 | *2015-07-09* -- Imported from the Internet Archive [CDX File Format](http://web.archive.org/web/20031226073353/http://www.archive.org/web/researcher/cdx_file_format.php) and [CDX Legend](http://web.archive.org/web/20031226073353/http://www.archive.org/web/researcher/cdx_legend.php) documents.
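
Editorial sketch (not part of the repository): the CDX specification above says that the first character of the file is the field delimiter, followed by the literal "CDX" and the field letters. A reader could interpret that legend roughly as follows in Python. The names `parse_cdx` and `SAMPLE` are hypothetical. The sketch assumes the legend line begins with a single space as the delimiter and that every data line has exactly as many fields as the legend has letters, and it skips blank lines, which the specification does not address.

```python
from typing import Dict, Iterable, Iterator


def parse_cdx(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CDX record, keyed by the field letters in the legend."""
    rows = iter(lines)
    legend = next(rows, "").rstrip("\r\n")
    if not legend:
        raise ValueError("empty CDX input")
    delimiter = legend[0]  # the first character of the file is the delimiter
    marker, *letters = legend[1:].split(delimiter)
    if marker != "CDX":
        raise ValueError("first line is not a CDX legend")
    for number, row in enumerate(rows, start=2):
        row = row.rstrip("\r\n")
        if not row:
            continue  # assumption: tolerate blank lines
        values = row.split(delimiter)
        if len(values) != len(letters):
            raise ValueError(
                f"line {number}: expected {len(letters)} fields, got {len(values)}"
            )
        yield dict(zip(letters, values))


# The CDX-11 example record from the specification, with the leading
# delimiter (a space) made explicit on the legend line.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "au,gov,financeminister)/ 20150914222034 http://www.financeminister.gov.au/ "
    "text/html 200 ZMSA5TNJUKKRYAIM5PRUJLL24DV7QYOO - - 83848 117273 "
    "WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz",
]

for record in parse_cdx(SAMPLE):
    # 'a' = original url, 'm' = mime type, 's' = response code (see the legend)
    print(record["a"], record["m"], record["s"])
```

Keying each record by its legend letter means the same function handles both the 11-field example and the older 16-field default legend without hard-coding column positions.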
117 |
--------------------------------------------------------------------------------
/specifications/warc-deduplication/recording-arbitrary-duplicates-1.0.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Proposal for Standardizing the Recording of Arbitrary Duplicates in WARC Files
3 | status: adopted
4 | type: specification
5 | latest: true
6 | version-of: warc-deduplication
7 | version: 1.0
8 | ---
9 | International Internet Preservation Consortium