├── .github ├── CONTRIBUTING.md └── ISSUE_TEMPLATE.md ├── .gitignore ├── LICENSE ├── README.md ├── capture.go ├── capture_test.go ├── digest.go ├── doc.go ├── encoding.go ├── encoding_test.go ├── fields.go ├── header.go ├── header_test.go ├── reader.go ├── reader_test.go ├── record.go ├── record_test.go ├── records.go ├── sanitize.go ├── sanitize_test.go ├── testdata ├── out.warc ├── response.warc ├── test-doc.docx ├── test-doc.docx.gz ├── test.warc ├── test.warc.gz └── warcio │ ├── bad.arc │ ├── example-bad-non-chunked.warc.gz │ ├── example-bad.warc.gz.bad │ ├── example-iana.org-chunked.warc │ ├── example-resource.warc.gz │ ├── example-trunc.txt │ ├── example-trunc.warc │ ├── example.arc │ ├── example.arc.gz │ ├── example.warc │ ├── example.warc.gz │ └── post-test.warc.gz ├── writer.go └── writer_test.go /.github/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | We love improvements to our tools! Take a moment to check out our organization-wide [Contributing Guidelines](https://github.com/datatogether/datatogether/blob/master/CONTRIBUTING.md) and [Code of Conduct](https://github.com/datatogether/datatogether/blob/master/CONDUCT.md). 4 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | Hey there, thank you for submitting an issue! 2 | 3 | We are trying to keep issues for feature requests and bug reports. Please 4 | complete the following checklist before creating a new one: 5 | 6 | - [ ] Is this a **bug report** (if so, is it something you can **debug and fix**? 7 | Send a pull request!) 8 | - [ ] feature request 9 | - [ ] support request => Please do not submit support requests here, ask your question 10 | on [Slack](https://archivers-slack.herokuapp.com/). 
11 | 12 | --- 13 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | coverage.txt 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU AFFERO GENERAL PUBLIC LICENSE 2 | Version 3, 19 November 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU Affero General Public License is a free, copyleft license for 11 | software and other kinds of works, specifically designed to ensure 12 | cooperation with the community in the case of network server software. 13 | 14 | The licenses for most software and other practical works are designed 15 | to take away your freedom to share and change the works. By contrast, 16 | our General Public Licenses are intended to guarantee your freedom to 17 | share and change all versions of a program--to make sure it remains free 18 | software for all its users. 19 | 20 | When we speak of free software, we are referring to freedom, not 21 | price. Our General Public Licenses are designed to make sure that you 22 | have the freedom to distribute copies of free software (and charge for 23 | them if you wish), that you receive source code or can get it if you 24 | want it, that you can change the software or use pieces of it in new 25 | free programs, and that you know you can do these things. 26 | 27 | Developers that use our General Public Licenses protect your rights 28 | with two steps: (1) assert copyright on the software, and (2) offer 29 | you this License which gives you legal permission to copy, distribute 30 | and/or modify the software. 
31 | 32 | A secondary benefit of defending all users' freedom is that 33 | improvements made in alternate versions of the program, if they 34 | receive widespread use, become available for other developers to 35 | incorporate. Many developers of free software are heartened and 36 | encouraged by the resulting cooperation. However, in the case of 37 | software used on network servers, this result may fail to come about. 38 | The GNU General Public License permits making a modified version and 39 | letting the public access it on a server without ever releasing its 40 | source code to the public. 41 | 42 | The GNU Affero General Public License is designed specifically to 43 | ensure that, in such cases, the modified source code becomes available 44 | to the community. It requires the operator of a network server to 45 | provide the source code of the modified version running there to the 46 | users of that server. Therefore, public use of a modified version, on 47 | a publicly accessible server, gives the public access to the source 48 | code of the modified version. 49 | 50 | An older license, called the Affero General Public License and 51 | published by Affero, was designed to accomplish similar goals. This is 52 | a different license, not a version of the Affero GPL, but Affero has 53 | released a new version of the Affero GPL which permits relicensing under 54 | this license. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | TERMS AND CONDITIONS 60 | 61 | 0. Definitions. 62 | 63 | "This License" refers to version 3 of the GNU Affero General Public License. 64 | 65 | "Copyright" also means copyright-like laws that apply to other kinds of 66 | works, such as semiconductor masks. 67 | 68 | "The Program" refers to any copyrightable work licensed under this 69 | License. Each licensee is addressed as "you". "Licensees" and 70 | "recipients" may be individuals or organizations. 
71 | 72 | To "modify" a work means to copy from or adapt all or part of the work 73 | in a fashion requiring copyright permission, other than the making of an 74 | exact copy. The resulting work is called a "modified version" of the 75 | earlier work or a work "based on" the earlier work. 76 | 77 | A "covered work" means either the unmodified Program or a work based 78 | on the Program. 79 | 80 | To "propagate" a work means to do anything with it that, without 81 | permission, would make you directly or secondarily liable for 82 | infringement under applicable copyright law, except executing it on a 83 | computer or modifying a private copy. Propagation includes copying, 84 | distribution (with or without modification), making available to the 85 | public, and in some countries other activities as well. 86 | 87 | To "convey" a work means any kind of propagation that enables other 88 | parties to make or receive copies. Mere interaction with a user through 89 | a computer network, with no transfer of a copy, is not conveying. 90 | 91 | An interactive user interface displays "Appropriate Legal Notices" 92 | to the extent that it includes a convenient and prominently visible 93 | feature that (1) displays an appropriate copyright notice, and (2) 94 | tells the user that there is no warranty for the work (except to the 95 | extent that warranties are provided), that licensees may convey the 96 | work under this License, and how to view a copy of this License. If 97 | the interface presents a list of user commands or options, such as a 98 | menu, a prominent item in the list meets this criterion. 99 | 100 | 1. Source Code. 101 | 102 | The "source code" for a work means the preferred form of the work 103 | for making modifications to it. "Object code" means any non-source 104 | form of a work. 
105 | 106 | A "Standard Interface" means an interface that either is an official 107 | standard defined by a recognized standards body, or, in the case of 108 | interfaces specified for a particular programming language, one that 109 | is widely used among developers working in that language. 110 | 111 | The "System Libraries" of an executable work include anything, other 112 | than the work as a whole, that (a) is included in the normal form of 113 | packaging a Major Component, but which is not part of that Major 114 | Component, and (b) serves only to enable use of the work with that 115 | Major Component, or to implement a Standard Interface for which an 116 | implementation is available to the public in source code form. A 117 | "Major Component", in this context, means a major essential component 118 | (kernel, window system, and so on) of the specific operating system 119 | (if any) on which the executable work runs, or a compiler used to 120 | produce the work, or an object code interpreter used to run it. 121 | 122 | The "Corresponding Source" for a work in object code form means all 123 | the source code needed to generate, install, and (for an executable 124 | work) run the object code and to modify the work, including scripts to 125 | control those activities. However, it does not include the work's 126 | System Libraries, or general-purpose tools or generally available free 127 | programs which are used unmodified in performing those activities but 128 | which are not part of the work. For example, Corresponding Source 129 | includes interface definition files associated with source files for 130 | the work, and the source code for shared libraries and dynamically 131 | linked subprograms that the work is specifically designed to require, 132 | such as by intimate data communication or control flow between those 133 | subprograms and other parts of the work. 
134 | 135 | The Corresponding Source need not include anything that users 136 | can regenerate automatically from other parts of the Corresponding 137 | Source. 138 | 139 | The Corresponding Source for a work in source code form is that 140 | same work. 141 | 142 | 2. Basic Permissions. 143 | 144 | All rights granted under this License are granted for the term of 145 | copyright on the Program, and are irrevocable provided the stated 146 | conditions are met. This License explicitly affirms your unlimited 147 | permission to run the unmodified Program. The output from running a 148 | covered work is covered by this License only if the output, given its 149 | content, constitutes a covered work. This License acknowledges your 150 | rights of fair use or other equivalent, as provided by copyright law. 151 | 152 | You may make, run and propagate covered works that you do not 153 | convey, without conditions so long as your license otherwise remains 154 | in force. You may convey covered works to others for the sole purpose 155 | of having them make modifications exclusively for you, or provide you 156 | with facilities for running those works, provided that you comply with 157 | the terms of this License in conveying all material for which you do 158 | not control copyright. Those thus making or running the covered works 159 | for you must do so exclusively on your behalf, under your direction 160 | and control, on terms that prohibit them from making any copies of 161 | your copyrighted material outside their relationship with you. 162 | 163 | Conveying under any other circumstances is permitted solely under 164 | the conditions stated below. Sublicensing is not allowed; section 10 165 | makes it unnecessary. 166 | 167 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 
168 | 169 | No covered work shall be deemed part of an effective technological 170 | measure under any applicable law fulfilling obligations under article 171 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 172 | similar laws prohibiting or restricting circumvention of such 173 | measures. 174 | 175 | When you convey a covered work, you waive any legal power to forbid 176 | circumvention of technological measures to the extent such circumvention 177 | is effected by exercising rights under this License with respect to 178 | the covered work, and you disclaim any intention to limit operation or 179 | modification of the work as a means of enforcing, against the work's 180 | users, your or third parties' legal rights to forbid circumvention of 181 | technological measures. 182 | 183 | 4. Conveying Verbatim Copies. 184 | 185 | You may convey verbatim copies of the Program's source code as you 186 | receive it, in any medium, provided that you conspicuously and 187 | appropriately publish on each copy an appropriate copyright notice; 188 | keep intact all notices stating that this License and any 189 | non-permissive terms added in accord with section 7 apply to the code; 190 | keep intact all notices of the absence of any warranty; and give all 191 | recipients a copy of this License along with the Program. 192 | 193 | You may charge any price or no price for each copy that you convey, 194 | and you may offer support or warranty protection for a fee. 195 | 196 | 5. Conveying Modified Source Versions. 197 | 198 | You may convey a work based on the Program, or the modifications to 199 | produce it from the Program, in the form of source code under the 200 | terms of section 4, provided that you also meet all of these conditions: 201 | 202 | a) The work must carry prominent notices stating that you modified 203 | it, and giving a relevant date. 
204 | 205 | b) The work must carry prominent notices stating that it is 206 | released under this License and any conditions added under section 207 | 7. This requirement modifies the requirement in section 4 to 208 | "keep intact all notices". 209 | 210 | c) You must license the entire work, as a whole, under this 211 | License to anyone who comes into possession of a copy. This 212 | License will therefore apply, along with any applicable section 7 213 | additional terms, to the whole of the work, and all its parts, 214 | regardless of how they are packaged. This License gives no 215 | permission to license the work in any other way, but it does not 216 | invalidate such permission if you have separately received it. 217 | 218 | d) If the work has interactive user interfaces, each must display 219 | Appropriate Legal Notices; however, if the Program has interactive 220 | interfaces that do not display Appropriate Legal Notices, your 221 | work need not make them do so. 222 | 223 | A compilation of a covered work with other separate and independent 224 | works, which are not by their nature extensions of the covered work, 225 | and which are not combined with it such as to form a larger program, 226 | in or on a volume of a storage or distribution medium, is called an 227 | "aggregate" if the compilation and its resulting copyright are not 228 | used to limit the access or legal rights of the compilation's users 229 | beyond what the individual works permit. Inclusion of a covered work 230 | in an aggregate does not cause this License to apply to the other 231 | parts of the aggregate. 232 | 233 | 6. Conveying Non-Source Forms. 
234 | 235 | You may convey a covered work in object code form under the terms 236 | of sections 4 and 5, provided that you also convey the 237 | machine-readable Corresponding Source under the terms of this License, 238 | in one of these ways: 239 | 240 | a) Convey the object code in, or embodied in, a physical product 241 | (including a physical distribution medium), accompanied by the 242 | Corresponding Source fixed on a durable physical medium 243 | customarily used for software interchange. 244 | 245 | b) Convey the object code in, or embodied in, a physical product 246 | (including a physical distribution medium), accompanied by a 247 | written offer, valid for at least three years and valid for as 248 | long as you offer spare parts or customer support for that product 249 | model, to give anyone who possesses the object code either (1) a 250 | copy of the Corresponding Source for all the software in the 251 | product that is covered by this License, on a durable physical 252 | medium customarily used for software interchange, for a price no 253 | more than your reasonable cost of physically performing this 254 | conveying of source, or (2) access to copy the 255 | Corresponding Source from a network server at no charge. 256 | 257 | c) Convey individual copies of the object code with a copy of the 258 | written offer to provide the Corresponding Source. This 259 | alternative is allowed only occasionally and noncommercially, and 260 | only if you received the object code with such an offer, in accord 261 | with subsection 6b. 262 | 263 | d) Convey the object code by offering access from a designated 264 | place (gratis or for a charge), and offer equivalent access to the 265 | Corresponding Source in the same way through the same place at no 266 | further charge. You need not require recipients to copy the 267 | Corresponding Source along with the object code. 
If the place to 268 | copy the object code is a network server, the Corresponding Source 269 | may be on a different server (operated by you or a third party) 270 | that supports equivalent copying facilities, provided you maintain 271 | clear directions next to the object code saying where to find the 272 | Corresponding Source. Regardless of what server hosts the 273 | Corresponding Source, you remain obligated to ensure that it is 274 | available for as long as needed to satisfy these requirements. 275 | 276 | e) Convey the object code using peer-to-peer transmission, provided 277 | you inform other peers where the object code and Corresponding 278 | Source of the work are being offered to the general public at no 279 | charge under subsection 6d. 280 | 281 | A separable portion of the object code, whose source code is excluded 282 | from the Corresponding Source as a System Library, need not be 283 | included in conveying the object code work. 284 | 285 | A "User Product" is either (1) a "consumer product", which means any 286 | tangible personal property which is normally used for personal, family, 287 | or household purposes, or (2) anything designed or sold for incorporation 288 | into a dwelling. In determining whether a product is a consumer product, 289 | doubtful cases shall be resolved in favor of coverage. For a particular 290 | product received by a particular user, "normally used" refers to a 291 | typical or common use of that class of product, regardless of the status 292 | of the particular user or of the way in which the particular user 293 | actually uses, or expects or is expected to use, the product. A product 294 | is a consumer product regardless of whether the product has substantial 295 | commercial, industrial or non-consumer uses, unless such uses represent 296 | the only significant mode of use of the product. 
297 | 298 | "Installation Information" for a User Product means any methods, 299 | procedures, authorization keys, or other information required to install 300 | and execute modified versions of a covered work in that User Product from 301 | a modified version of its Corresponding Source. The information must 302 | suffice to ensure that the continued functioning of the modified object 303 | code is in no case prevented or interfered with solely because 304 | modification has been made. 305 | 306 | If you convey an object code work under this section in, or with, or 307 | specifically for use in, a User Product, and the conveying occurs as 308 | part of a transaction in which the right of possession and use of the 309 | User Product is transferred to the recipient in perpetuity or for a 310 | fixed term (regardless of how the transaction is characterized), the 311 | Corresponding Source conveyed under this section must be accompanied 312 | by the Installation Information. But this requirement does not apply 313 | if neither you nor any third party retains the ability to install 314 | modified object code on the User Product (for example, the work has 315 | been installed in ROM). 316 | 317 | The requirement to provide Installation Information does not include a 318 | requirement to continue to provide support service, warranty, or updates 319 | for a work that has been modified or installed by the recipient, or for 320 | the User Product in which it has been modified or installed. Access to a 321 | network may be denied when the modification itself materially and 322 | adversely affects the operation of the network or violates the rules and 323 | protocols for communication across the network. 
324 | 325 | Corresponding Source conveyed, and Installation Information provided, 326 | in accord with this section must be in a format that is publicly 327 | documented (and with an implementation available to the public in 328 | source code form), and must require no special password or key for 329 | unpacking, reading or copying. 330 | 331 | 7. Additional Terms. 332 | 333 | "Additional permissions" are terms that supplement the terms of this 334 | License by making exceptions from one or more of its conditions. 335 | Additional permissions that are applicable to the entire Program shall 336 | be treated as though they were included in this License, to the extent 337 | that they are valid under applicable law. If additional permissions 338 | apply only to part of the Program, that part may be used separately 339 | under those permissions, but the entire Program remains governed by 340 | this License without regard to the additional permissions. 341 | 342 | When you convey a copy of a covered work, you may at your option 343 | remove any additional permissions from that copy, or from any part of 344 | it. (Additional permissions may be written to require their own 345 | removal in certain cases when you modify the work.) You may place 346 | additional permissions on material, added by you to a covered work, 347 | for which you have or can give appropriate copyright permission. 
348 | 349 | Notwithstanding any other provision of this License, for material you 350 | add to a covered work, you may (if authorized by the copyright holders of 351 | that material) supplement the terms of this License with terms: 352 | 353 | a) Disclaiming warranty or limiting liability differently from the 354 | terms of sections 15 and 16 of this License; or 355 | 356 | b) Requiring preservation of specified reasonable legal notices or 357 | author attributions in that material or in the Appropriate Legal 358 | Notices displayed by works containing it; or 359 | 360 | c) Prohibiting misrepresentation of the origin of that material, or 361 | requiring that modified versions of such material be marked in 362 | reasonable ways as different from the original version; or 363 | 364 | d) Limiting the use for publicity purposes of names of licensors or 365 | authors of the material; or 366 | 367 | e) Declining to grant rights under trademark law for use of some 368 | trade names, trademarks, or service marks; or 369 | 370 | f) Requiring indemnification of licensors and authors of that 371 | material by anyone who conveys the material (or modified versions of 372 | it) with contractual assumptions of liability to the recipient, for 373 | any liability that these contractual assumptions directly impose on 374 | those licensors and authors. 375 | 376 | All other non-permissive additional terms are considered "further 377 | restrictions" within the meaning of section 10. If the Program as you 378 | received it, or any part of it, contains a notice stating that it is 379 | governed by this License along with a term that is a further 380 | restriction, you may remove that term. 
If a license document contains 381 | a further restriction but permits relicensing or conveying under this 382 | License, you may add to a covered work material governed by the terms 383 | of that license document, provided that the further restriction does 384 | not survive such relicensing or conveying. 385 | 386 | If you add terms to a covered work in accord with this section, you 387 | must place, in the relevant source files, a statement of the 388 | additional terms that apply to those files, or a notice indicating 389 | where to find the applicable terms. 390 | 391 | Additional terms, permissive or non-permissive, may be stated in the 392 | form of a separately written license, or stated as exceptions; 393 | the above requirements apply either way. 394 | 395 | 8. Termination. 396 | 397 | You may not propagate or modify a covered work except as expressly 398 | provided under this License. Any attempt otherwise to propagate or 399 | modify it is void, and will automatically terminate your rights under 400 | this License (including any patent licenses granted under the third 401 | paragraph of section 11). 402 | 403 | However, if you cease all violation of this License, then your 404 | license from a particular copyright holder is reinstated (a) 405 | provisionally, unless and until the copyright holder explicitly and 406 | finally terminates your license, and (b) permanently, if the copyright 407 | holder fails to notify you of the violation by some reasonable means 408 | prior to 60 days after the cessation. 409 | 410 | Moreover, your license from a particular copyright holder is 411 | reinstated permanently if the copyright holder notifies you of the 412 | violation by some reasonable means, this is the first time you have 413 | received notice of violation of this License (for any work) from that 414 | copyright holder, and you cure the violation prior to 30 days after 415 | your receipt of the notice. 
416 | 417 | Termination of your rights under this section does not terminate the 418 | licenses of parties who have received copies or rights from you under 419 | this License. If your rights have been terminated and not permanently 420 | reinstated, you do not qualify to receive new licenses for the same 421 | material under section 10. 422 | 423 | 9. Acceptance Not Required for Having Copies. 424 | 425 | You are not required to accept this License in order to receive or 426 | run a copy of the Program. Ancillary propagation of a covered work 427 | occurring solely as a consequence of using peer-to-peer transmission 428 | to receive a copy likewise does not require acceptance. However, 429 | nothing other than this License grants you permission to propagate or 430 | modify any covered work. These actions infringe copyright if you do 431 | not accept this License. Therefore, by modifying or propagating a 432 | covered work, you indicate your acceptance of this License to do so. 433 | 434 | 10. Automatic Licensing of Downstream Recipients. 435 | 436 | Each time you convey a covered work, the recipient automatically 437 | receives a license from the original licensors, to run, modify and 438 | propagate that work, subject to this License. You are not responsible 439 | for enforcing compliance by third parties with this License. 440 | 441 | An "entity transaction" is a transaction transferring control of an 442 | organization, or substantially all assets of one, or subdividing an 443 | organization, or merging organizations. 
If propagation of a covered 444 | work results from an entity transaction, each party to that 445 | transaction who receives a copy of the work also receives whatever 446 | licenses to the work the party's predecessor in interest had or could 447 | give under the previous paragraph, plus a right to possession of the 448 | Corresponding Source of the work from the predecessor in interest, if 449 | the predecessor has it or can get it with reasonable efforts. 450 | 451 | You may not impose any further restrictions on the exercise of the 452 | rights granted or affirmed under this License. For example, you may 453 | not impose a license fee, royalty, or other charge for exercise of 454 | rights granted under this License, and you may not initiate litigation 455 | (including a cross-claim or counterclaim in a lawsuit) alleging that 456 | any patent claim is infringed by making, using, selling, offering for 457 | sale, or importing the Program or any portion of it. 458 | 459 | 11. Patents. 460 | 461 | A "contributor" is a copyright holder who authorizes use under this 462 | License of the Program or a work on which the Program is based. The 463 | work thus licensed is called the contributor's "contributor version". 464 | 465 | A contributor's "essential patent claims" are all patent claims 466 | owned or controlled by the contributor, whether already acquired or 467 | hereafter acquired, that would be infringed by some manner, permitted 468 | by this License, of making, using, or selling its contributor version, 469 | but do not include claims that would be infringed only as a 470 | consequence of further modification of the contributor version. For 471 | purposes of this definition, "control" includes the right to grant 472 | patent sublicenses in a manner consistent with the requirements of 473 | this License. 
474 | 475 | Each contributor grants you a non-exclusive, worldwide, royalty-free 476 | patent license under the contributor's essential patent claims, to 477 | make, use, sell, offer for sale, import and otherwise run, modify and 478 | propagate the contents of its contributor version. 479 | 480 | In the following three paragraphs, a "patent license" is any express 481 | agreement or commitment, however denominated, not to enforce a patent 482 | (such as an express permission to practice a patent or covenant not to 483 | sue for patent infringement). To "grant" such a patent license to a 484 | party means to make such an agreement or commitment not to enforce a 485 | patent against the party. 486 | 487 | If you convey a covered work, knowingly relying on a patent license, 488 | and the Corresponding Source of the work is not available for anyone 489 | to copy, free of charge and under the terms of this License, through a 490 | publicly available network server or other readily accessible means, 491 | then you must either (1) cause the Corresponding Source to be so 492 | available, or (2) arrange to deprive yourself of the benefit of the 493 | patent license for this particular work, or (3) arrange, in a manner 494 | consistent with the requirements of this License, to extend the patent 495 | license to downstream recipients. "Knowingly relying" means you have 496 | actual knowledge that, but for the patent license, your conveying the 497 | covered work in a country, or your recipient's use of the covered work 498 | in a country, would infringe one or more identifiable patents in that 499 | country that you have reason to believe are valid. 
500 | 501 | If, pursuant to or in connection with a single transaction or 502 | arrangement, you convey, or propagate by procuring conveyance of, a 503 | covered work, and grant a patent license to some of the parties 504 | receiving the covered work authorizing them to use, propagate, modify 505 | or convey a specific copy of the covered work, then the patent license 506 | you grant is automatically extended to all recipients of the covered 507 | work and works based on it. 508 | 509 | A patent license is "discriminatory" if it does not include within 510 | the scope of its coverage, prohibits the exercise of, or is 511 | conditioned on the non-exercise of one or more of the rights that are 512 | specifically granted under this License. You may not convey a covered 513 | work if you are a party to an arrangement with a third party that is 514 | in the business of distributing software, under which you make payment 515 | to the third party based on the extent of your activity of conveying 516 | the work, and under which the third party grants, to any of the 517 | parties who would receive the covered work from you, a discriminatory 518 | patent license (a) in connection with copies of the covered work 519 | conveyed by you (or copies made from those copies), or (b) primarily 520 | for and in connection with specific products or compilations that 521 | contain the covered work, unless you entered into that arrangement, 522 | or that patent license was granted, prior to 28 March 2007. 523 | 524 | Nothing in this License shall be construed as excluding or limiting 525 | any implied license or other defenses to infringement that may 526 | otherwise be available to you under applicable patent law. 527 | 528 | 12. No Surrender of Others' Freedom. 529 | 530 | If conditions are imposed on you (whether by court order, agreement or 531 | otherwise) that contradict the conditions of this License, they do not 532 | excuse you from the conditions of this License. 
If you cannot convey a 533 | covered work so as to satisfy simultaneously your obligations under this 534 | License and any other pertinent obligations, then as a consequence you may 535 | not convey it at all. For example, if you agree to terms that obligate you 536 | to collect a royalty for further conveying from those to whom you convey 537 | the Program, the only way you could satisfy both those terms and this 538 | License would be to refrain entirely from conveying the Program. 539 | 540 | 13. Remote Network Interaction; Use with the GNU General Public License. 541 | 542 | Notwithstanding any other provision of this License, if you modify the 543 | Program, your modified version must prominently offer all users 544 | interacting with it remotely through a computer network (if your version 545 | supports such interaction) an opportunity to receive the Corresponding 546 | Source of your version by providing access to the Corresponding Source 547 | from a network server at no charge, through some standard or customary 548 | means of facilitating copying of software. This Corresponding Source 549 | shall include the Corresponding Source for any work covered by version 3 550 | of the GNU General Public License that is incorporated pursuant to the 551 | following paragraph. 552 | 553 | Notwithstanding any other provision of this License, you have 554 | permission to link or combine any covered work with a work licensed 555 | under version 3 of the GNU General Public License into a single 556 | combined work, and to convey the resulting work. The terms of this 557 | License will continue to apply to the part which is the covered work, 558 | but the work with which it is combined will remain governed by version 559 | 3 of the GNU General Public License. 560 | 561 | 14. Revised Versions of this License. 562 | 563 | The Free Software Foundation may publish revised and/or new versions of 564 | the GNU Affero General Public License from time to time. 
Such new versions 565 | will be similar in spirit to the present version, but may differ in detail to 566 | address new problems or concerns. 567 | 568 | Each version is given a distinguishing version number. If the 569 | Program specifies that a certain numbered version of the GNU Affero General 570 | Public License "or any later version" applies to it, you have the 571 | option of following the terms and conditions either of that numbered 572 | version or of any later version published by the Free Software 573 | Foundation. If the Program does not specify a version number of the 574 | GNU Affero General Public License, you may choose any version ever published 575 | by the Free Software Foundation. 576 | 577 | If the Program specifies that a proxy can decide which future 578 | versions of the GNU Affero General Public License can be used, that proxy's 579 | public statement of acceptance of a version permanently authorizes you 580 | to choose that version for the Program. 581 | 582 | Later license versions may give you additional or different 583 | permissions. However, no additional obligations are imposed on any 584 | author or copyright holder as a result of your choosing to follow a 585 | later version. 586 | 587 | 15. Disclaimer of Warranty. 588 | 589 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 590 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 591 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 592 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 593 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 594 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 595 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 596 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 597 | 598 | 16. Limitation of Liability. 
599 | 600 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 601 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 602 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 603 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 604 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 605 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 606 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 607 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 608 | SUCH DAMAGES. 609 | 610 | 17. Interpretation of Sections 15 and 16. 611 | 612 | If the disclaimer of warranty and limitation of liability provided 613 | above cannot be given local legal effect according to their terms, 614 | reviewing courts shall apply local law that most closely approximates 615 | an absolute waiver of all civil liability in connection with the 616 | Program, unless a warranty or assumption of liability accompanies a 617 | copy of the Program in return for a fee. 618 | 619 | END OF TERMS AND CONDITIONS 620 | 621 | How to Apply These Terms to Your New Programs 622 | 623 | If you develop a new program, and you want it to be of the greatest 624 | possible use to the public, the best way to achieve this is to make it 625 | free software which everyone can redistribute and change under these terms. 626 | 627 | To do so, attach the following notices to the program. It is safest 628 | to attach them to the start of each source file to most effectively 629 | state the exclusion of warranty; and each file should have at least 630 | the "copyright" line and a pointer to where the full notice is found. 
631 | 632 | 633 | Copyright (C) 634 | 635 | This program is free software: you can redistribute it and/or modify 636 | it under the terms of the GNU Affero General Public License as published 637 | by the Free Software Foundation, either version 3 of the License, or 638 | (at your option) any later version. 639 | 640 | This program is distributed in the hope that it will be useful, 641 | but WITHOUT ANY WARRANTY; without even the implied warranty of 642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 643 | GNU Affero General Public License for more details. 644 | 645 | You should have received a copy of the GNU Affero General Public License 646 | along with this program. If not, see . 647 | 648 | Also add information on how to contact you by electronic and paper mail. 649 | 650 | If your software can interact with users remotely through a computer 651 | network, you should also make sure that it provides a way for users to 652 | get its source. For example, if your program is a web application, its 653 | interface could display a "Source" link that leads users to an archive 654 | of the code. There are many ways you could offer source, and different 655 | solutions will be better for different programs; see section 13 for the 656 | specific requirements. 657 | 658 | You should also get your employer (if you work as a programmer) or school, 659 | if any, to sign a "copyright disclaimer" for the program, if necessary. 660 | For more information on this, and how to apply and follow the GNU AGPL, see 661 | . 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # warc 2 | [![GitHub](https://img.shields.io/badge/project-Data_Together-487b57.svg?style=flat-square)](http://github.com/datatogether) 3 | [![Slack](https://img.shields.io/badge/slack-Archivers-b44e88.svg?style=flat-square)](https://archivers-slack.herokuapp.com/) 4 | [![GoDoc](https://godoc.org/github.com/datatogether/warc?status.svg)](http://godoc.org/github.com/datatogether/warc) 5 | [![License](https://img.shields.io/github/license/mashape/apistatus.svg)](./LICENSE) 6 | 7 | warc is an implementation of ISO28500 1.0, the WebARCive specification. 8 | It provides readers, writers, and structs for working with warc records. 9 | 10 | From the spec: 11 | > The WARC (Web ARChive) file format offers a convention for concatenating 12 | multiple resource records (data objects), each consisting of a set of 13 | simple text headers and an arbitrary data block into one long file. The 14 | WARC format is an extension of the ARC File Format [ARC] that has 15 | traditionally been used to store "web crawls" as sequences of content 16 | blocks harvested from the World Wide Web. Each capture in an ARC file is 17 | preceded by a one-line header that very briefly describes the harvested 18 | content and its length. This is directly followed by the retrieval 19 | protocol response messages and content. The original ARC format file is 20 | used by the Internet Archive (IA) since 1996 for managing billions of 21 | objects, and by several national libraries. 22 | package warc 23 | 24 | ## License & Copyright 25 | 26 | [Affero General Public License v3](http://www.gnu.org/licenses/agpl.html) 27 | 28 | ## Getting Involved 29 | 30 | We would love involvement from more people! If you notice any errors or would like to submit changes, please see our [Contributing Guidelines](./.github/CONTRIBUTING.md). 
31 | 32 | We use GitHub issues for [tracking bugs and feature requests](https://github.com/datatogether/warc/issues) and Pull Requests (PRs) for [submitting changes](https://github.com/datatogether/warc/pulls). 33 | 34 | ## Usage 35 | `import "github.com/datatogether/warc"` 36 | -------------------------------------------------------------------------------- /capture.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "context" 6 | "crypto/sha1" 7 | "fmt" 8 | "io" 9 | "io/ioutil" 10 | "net" 11 | "net/http" 12 | "net/url" 13 | "strconv" 14 | "strings" 15 | "time" 16 | 17 | "github.com/pkg/errors" 18 | ) 19 | 20 | // TimeFormat is time.RFC3339, but with no timezone (just a Z). 21 | const TimeFormat = "2006-01-02T15:04:05Z" 22 | 23 | // CaptureHelper is used for the NewRequestResponseRecords() method. Additional 24 | // fields may be added in the future. 25 | type CaptureHelper struct { 26 | WarcinfoID string 27 | RemoteAddr string 28 | 29 | // The request body will need to be read multiple times, so please provide 30 | // one of the following. (note: bytes.Reader and strings.Reader are 31 | // ReadSeekers.) 32 | ReqBodyReadSeeker io.ReadSeeker 33 | ReqBodyBytesBuffer *bytes.Buffer 34 | } 35 | 36 | // DialContext returns a wrapper around net.Dialer.DialContext that saves the 37 | // connected-to remote address in the CaptureHelper. 
38 | func (c *CaptureHelper) DialContext(dialer *net.Dialer) func(ctx context.Context, network, addr string) (net.Conn, error) { 39 | if dialer == nil { 40 | dialer = &net.Dialer{} 41 | } 42 | return func(ctx context.Context, network, addr string) (net.Conn, error) { 43 | conn, err := dialer.DialContext(ctx, network, addr) 44 | if err == nil { 45 | c.RemoteAddr = conn.RemoteAddr().String() 46 | } 47 | return conn, err 48 | } 49 | } 50 | 51 | func (c *CaptureHelper) resetRequestBody(req *http.Request) io.Reader { 52 | if req.Body == nil || req.Body == http.NoBody { 53 | return http.NoBody 54 | } 55 | if req.GetBody != nil { 56 | body, err := req.GetBody() 57 | if err != nil { 58 | panic(err) 59 | } 60 | return body 61 | } 62 | if c.ReqBodyReadSeeker != nil { 63 | c.ReqBodyReadSeeker.Seek(0, io.SeekStart) 64 | return c.ReqBodyReadSeeker 65 | } 66 | if c.ReqBodyBytesBuffer != nil { 67 | return c.ReqBodyBytesBuffer 68 | } 69 | panic("capturehelper: unable to rewind request body") 70 | } 71 | 72 | // NewRequestResponseRecords creates a new request/response record pair for the 73 | // provided HTTP request and response. 74 | // 75 | // Make sure to provide the request Body in the CaptureHelper so it can be read 76 | // from again. The response Body should not yet have been used; if the caller 77 | // needs the body, replace it with an ioutil.NopCloser(io.TeeReader) (the 78 | // caller is then responsible for calling body.Close()). 
79 | func NewRequestResponseRecords(info CaptureHelper, req *http.Request, resp *http.Response) (Record, Record, error) { 80 | reqRec := Record{Format: RecordFormatWarc, Type: RecordTypeRequest, Headers: make(Header)} 81 | respRec := Record{Format: RecordFormatWarc, Type: RecordTypeResponse, Headers: make(Header)} 82 | reqUID := NewUUID() 83 | respUID := NewUUID() 84 | reqRec.Headers.Set(FieldNameWARCRecordID, reqUID) 85 | respRec.Headers.Set(FieldNameWARCRecordID, respUID) 86 | respRec.Headers.Set(FieldNameWARCConcurrentTo, reqUID) 87 | eventStamp := time.Now().UTC().Format(TimeFormat) 88 | reqRec.Headers.Set(FieldNameWARCDate, eventStamp) 89 | respRec.Headers.Set(FieldNameWARCDate, eventStamp) 90 | u2 := new(url.URL) 91 | *u2 = *req.URL 92 | if u2.Host == "" { 93 | u2.Host = req.Host 94 | } 95 | reqRec.Headers.Set(FieldNameWARCTargetURI, u2.String()) 96 | respRec.Headers.Set(FieldNameWARCTargetURI, u2.String()) 97 | if info.WarcinfoID != "" { 98 | reqRec.Headers.Set(FieldNameWARCWarcinfoID, info.WarcinfoID) 99 | respRec.Headers.Set(FieldNameWARCWarcinfoID, info.WarcinfoID) 100 | } 101 | if info.RemoteAddr != "" { 102 | ip, _, err := net.SplitHostPort(info.RemoteAddr) 103 | if err != nil { 104 | if strings.Contains(err.Error(), "missing port in address") { 105 | ip = info.RemoteAddr 106 | err = nil 107 | } 108 | } 109 | if err != nil { 110 | return reqRec, respRec, errors.Wrap(err, "Bad RemoteAddr value in CaptureHelper") 111 | } 112 | reqRec.Headers.Set(FieldNameWARCIPAddress, ip) 113 | respRec.Headers.Set(FieldNameWARCIPAddress, ip) 114 | } 115 | 116 | // Write request using stdlib 117 | reqDigester := sha1.New() 118 | reqRec.Content = new(bytes.Buffer) 119 | clonedBody := info.resetRequestBody(req) 120 | if clonedBody != nil { 121 | teedBody := io.TeeReader(clonedBody, reqDigester) 122 | req.Body = ioutil.NopCloser(teedBody) 123 | } 124 | err := req.Write(reqRec.Content) 125 | reqRec.Headers.Set(FieldNameWARCPayloadDigest, formatDigest(reqDigester.Sum(nil))) 
126 | if err != nil { 127 | return reqRec, respRec, errors.Wrap(err, "writing request body") 128 | } 129 | reqRec.Headers[FieldNameWARCBlockDigest] = Sha1Digest(reqRec.Content.Bytes()) 130 | 131 | // Write response 132 | // Can't use stdlib, as it does extra processing of Content-Length, transfer encodings, etc 133 | respDigester := sha1.New() 134 | respRec.Content = new(bytes.Buffer) 135 | teedBody := io.TeeReader(resp.Body, respDigester) 136 | 137 | text := resp.Status 138 | text = strings.TrimPrefix(text, strconv.Itoa(resp.StatusCode)+" ") 139 | fmt.Fprintf(respRec.Content, "HTTP/%d.%d %03d %s\r\n", resp.ProtoMajor, resp.ProtoMinor, resp.StatusCode, text) 140 | resp.Header.Write(respRec.Content) 141 | io.WriteString(respRec.Content, "\r\n") 142 | _, err = io.Copy(respRec.Content, teedBody) 143 | respRec.Headers.Set(FieldNameWARCPayloadDigest, formatDigest(respDigester.Sum(nil))) 144 | if err != nil { 145 | return reqRec, respRec, errors.Wrap(err, "writing response body") 146 | } 147 | // block digest will be set in Record.Write() 148 | 149 | return reqRec, respRec, nil 150 | } 151 | -------------------------------------------------------------------------------- /capture_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "fmt" 6 | "net/http" 7 | "net/http/httptest" 8 | "strings" 9 | "testing" 10 | ) 11 | 12 | func TestRequestResponseRecords(t *testing.T) { 13 | const ( 14 | hdr1 = "custom header content fde7073d7b95" 15 | hdr2 = "custom header content 25b1be4eb31c" 16 | path = "/4f3f2471fd8d" 17 | responseBody = "Response body\n40f9fcaa4120" 18 | requestStr = "Request body\n1af849d49e58" 19 | ) 20 | var warcinfoID = NewUUID() 21 | srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { 22 | w.Header().Set("Custom-Header", hdr2) 23 | fmt.Fprint(w, responseBody) 24 | })) 25 | defer srv.Close() 26 | 27 | requestBody := 
strings.Repeat(requestStr, 50) 28 | body := strings.NewReader(requestBody) 29 | helper := CaptureHelper{ 30 | WarcinfoID: warcinfoID, 31 | ReqBodyReadSeeker: body, 32 | } 33 | req, err := http.NewRequest("PUT", srv.URL+path, body) 34 | if err != nil { 35 | t.Fatal(err) 36 | } 37 | req.Header.Set("Custom-Header", hdr1) 38 | client := &http.Client{ 39 | Transport: &http.Transport{ 40 | DialContext: helper.DialContext(nil), 41 | }, 42 | } 43 | resp, err := client.Do(req) 44 | if err != nil { 45 | t.Fatal(err) 46 | } 47 | 48 | reqRecord, respRecord, err := NewRequestResponseRecords(helper, req, resp) 49 | if err != nil { 50 | t.Fatal(err) 51 | } 52 | 53 | if reqRecord.Headers.Get(FieldNameWARCPayloadDigest) != Sha1Digest([]byte(requestBody)) { 54 | t.Error("Bad request body digest") 55 | } 56 | if respRecord.Headers.Get(FieldNameWARCPayloadDigest) != Sha1Digest([]byte(responseBody)) { 57 | t.Error("Bad response body digest") 58 | } 59 | 60 | var buf bytes.Buffer 61 | reqRecord.Write(&buf) 62 | str := buf.String() 63 | // fmt.Println(str) 64 | if !strings.Contains(str, path) { 65 | t.Error("Path not found in request record") 66 | } 67 | if !strings.Contains(str, hdr1) { 68 | t.Error("Headers not found in request record") 69 | } 70 | if !strings.Contains(str, requestBody) { 71 | t.Error("Body not found in request record") 72 | } 73 | if strings.Contains(str, "Transfer-Encoding: chunked") { 74 | t.Error("Request written with chunked Transfer-Encoding") 75 | } 76 | 77 | buf.Reset() 78 | respRecord.Write(&buf) 79 | str = buf.String() 80 | if !strings.Contains(str, path) { 81 | t.Error("Path not found in response record") 82 | } 83 | if !strings.Contains(str, srv.URL) { 84 | t.Error("Hostname (WARC-Request-URI) not found in response record") 85 | } 86 | if !strings.Contains(str, hdr2) { 87 | t.Error("Headers not found in response record") 88 | } 89 | if !strings.Contains(str, responseBody) { 90 | t.Error("Body not found in response record") 91 | } 92 | if 
!strings.Contains(str, warcinfoID) { 93 | t.Error("Warcinfo-ID not found in response record") 94 | } 95 | 96 | // t.Logf("%#v", respRecord.Content) 97 | } 98 | -------------------------------------------------------------------------------- /digest.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "crypto/sha1" 6 | "encoding/base32" 7 | "fmt" 8 | ) 9 | 10 | // Sha1Digest returns the "sha1:"-labelled, Base32-encoded SHA-1 digest of a slice of bytes 11 | func Sha1Digest(data []byte) string { 12 | hash := sha1.Sum(data) 13 | return formatDigest(hash[:]) 14 | } 15 | 16 | func formatDigest(h []byte) string { 17 | buf := &bytes.Buffer{} 18 | base32.NewEncoder(base32.StdEncoding, buf).Write(h) 19 | return fmt.Sprintf("sha1:%s", buf.String()) 20 | } 21 | -------------------------------------------------------------------------------- /doc.go: -------------------------------------------------------------------------------- 1 | // Package warc is an implementation of ISO28500 1.0, the WebARCive specification. 2 | // It provides readers, writers, and structs for working with warc records. 3 | // From the spec: 4 | // The WARC (Web ARChive) file format offers a convention for concatenating 5 | // multiple resource records (data objects), each consisting of a set of 6 | // simple text headers and an arbitrary data block into one long file. The 7 | // WARC format is an extension of the ARC File Format [ARC] that has 8 | // traditionally been used to store "web crawls" as sequences of content 9 | // blocks harvested from the World Wide Web. Each capture in an ARC file is 10 | // preceded by a one-line header that very briefly describes the harvested 11 | // content and its length. This is directly followed by the retrieval 12 | // protocol response messages and content. The original ARC format file is 13 | // used by the Internet Archive (IA) since 1996 for managing billions of 14 | // objects, and by several national libraries. 
15 | package warc 16 | -------------------------------------------------------------------------------- /encoding.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | ) 6 | 7 | // TODO - part of planned support for golang standard 8 | // encoding / decoding. 9 | // WARCMarshaler follows other golang encoding / decoding 10 | // interfaces. See encoding/json.Marshaler as an example. 11 | // 12 | // type WARCMarshaler interface { 13 | // MarshalWARC() (data []byte, err error) 14 | // } 15 | // 16 | // type WARCUnmarshaler interface { 17 | // UnmarshalWARC(data []byte) (err error) 18 | // } 19 | 20 | // UnmarshalRecord reads a single record from data 21 | func UnmarshalRecord(data []byte) (Record, error) { 22 | r, err := NewReader(bytes.NewReader(data)) 23 | if err != nil { 24 | return Record{}, err 25 | } 26 | return r.readRecord() 27 | } 28 | 29 | // UnmarshalRecords reads a slice of records from a slice of bytes 30 | func UnmarshalRecords(data []byte) (Records, error) { 31 | r, err := NewReader(bytes.NewReader(data)) 32 | if err != nil { 33 | return nil, err 34 | } 35 | return r.ReadAll() 36 | } 37 | -------------------------------------------------------------------------------- /encoding_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "io/ioutil" 5 | "os" 6 | "testing" 7 | ) 8 | 9 | func TestUnmarshalRecord(t *testing.T) { 10 | f, err := os.Open("testdata/test.warc") 11 | if err != nil { 12 | t.Error(err.Error()) 13 | return 14 | } 15 | defer f.Close() 16 | data, err := ioutil.ReadAll(f) 17 | if err != nil { 18 | t.Error(err.Error()) 19 | return 20 | } 21 | 22 | rec, err := UnmarshalRecord(data) 23 | if err != nil { 24 | t.Errorf("unexpected UnmarshalRecord error: %s", err.Error()) 25 | return 26 | } 27 | 28 | if rec.Type != RecordTypeResource { 29 | t.Errorf("wrong record type. 
expected: %s, got: %s", RecordTypeResource, rec.Type) 30 | } 31 | } 32 | 33 | func TestUnmarshalRecords(t *testing.T) { 34 | f, err := os.Open("testdata/test.warc") 35 | if err != nil { 36 | t.Error(err.Error()) 37 | return 38 | } 39 | defer f.Close() 40 | data, err := ioutil.ReadAll(f) 41 | if err != nil { 42 | t.Error(err.Error()) 43 | return 44 | } 45 | 46 | recs, err := UnmarshalRecords(data) 47 | if err != nil { 48 | t.Errorf("unexpected UnmarshalRecord error: %s", err.Error()) 49 | return 50 | } 51 | 52 | if len(recs) != 10 { 53 | t.Errorf("wrong number of records. expected: %d, got: %d", 10, len(recs)) 54 | } 55 | } 56 | -------------------------------------------------------------------------------- /fields.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | // Named fields within a WARC record provide information about the current 4 | // record, and allow additional per-record information. WARC both reuses 5 | // appropriate headers from other standards and defines new headers, all 6 | // beginning "WARC-", for WARC-specific purposes. 7 | // 8 | // WARC named fields of the same type shall not be repeated in the same 9 | // WARC record (for example, a WARC record shall not have several WARC-Date 10 | // or several WARC-Target-URI), except as noted (e.g., WARC-Concurrent-To). 11 | const ( 12 | // An identifier assigned to the current record that is globally unique for 13 | // its period of intended use. No identifier scheme is mandated by this 14 | // specification, but each record-id shall be a legal URI and clearly 15 | // indicate a documented and registered scheme to which it conforms (e.g., 16 | // via a URI scheme prefix such as "http:" or "urn:"). Care should be taken 17 | // to ensure that this value is written with no internal whitespace. 18 | FieldNameWARCRecordID = "WARC-Record-ID" 19 | // The number of octets in the block, similar to [RFC2616]. 
If no block is 20 | // present, a value of '0' (zero) shall be used. 21 | FieldNameContentLength = "Content-Length" 22 | // A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, 23 | // described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall 24 | // represent the instant that data capture for record creation began. 25 | // Multiple records written as part of a single capture event (see section 26 | // 5.7) shall use the same WARC-Date, even though the times of their 27 | // writing will not be exactly synchronized. 28 | FieldNameWARCDate = "WARC-Date" 29 | // The type of WARC record: one of 'warcinfo', 'response', 'resource', 30 | // 'request', 'metadata', 'revisit', 'conversion', or 'continuation'. Other 31 | // types of WARC records may be defined in extensions of the core format. 32 | // Types are further described in WARC Record Types. 33 | // A WARC file needs not contain any particular record types, though 34 | // starting all WARC files with a "warcinfo" record is recommended. 35 | FieldNameWARCType = "WARC-Type" 36 | // The MIME type [RFC2045] of the information contained in the record's 37 | // block. For example, in HTTP request and response records, this would be 38 | // 'application/http' as per section 19.1 of [RFC2616] (or 39 | // 'application/http; msgtype=request' and 'application/http; 40 | // msgtype=response' respectively). In particular, the content-type is not 41 | // the value of the HTTP Content-Type header in an HTTP response but a MIME 42 | // type to describe the full archived HTTP message (hence 43 | // 'application/http' if the block contains request or response headers). 44 | FieldNameContentType = "Content-Type" 45 | // The WARC-Record-IDs of any records created as part of the same capture 46 | // event as the current record. 
A capture event comprises the information 47 | // automatically gathered by a retrieval against a single target-URI; for 48 | // example, it might be represented by a 'response' or 'revisit' record 49 | // plus its associated 'request' record. 50 | // This field may be used to associate records of types 'request', 51 | // 'response', 'resource', 'metadata', and 'revisit' with one another when 52 | // they arise from a single capture event (When so used, any 53 | // WARC-Concurrent-To association shall be considered bidirectional even if 54 | // the  header only appears on one record.) The WARC Concurrent-to field 55 | // shall not be used in 'warcinfo', 'conversion', and 'continuation' 56 | // records. 57 | FieldNameWARCConcurrentTo = "WARC-Concurrent-To" 58 | // An optional parameter indicating the algorithm name and calculated value 59 | // of a digest applied to the full block of the record. 60 | // An example is a SHA-1 labelled Base32 ([RFC3548]) value: 61 | // WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ 62 | FieldNameWARCBlockDigest = "WARC-Block-Digest" 63 | // An optional parameter indicating the algorithm name and calculated value 64 | // of a digest applied to the payload referred to or contained by the 65 | // record - which is not necessarily equivalent to the record block. 66 | // The payload of an application/http block is its 'entity-body' (per 67 | // [RFC2616]). In contrast to WARC-Block-Digest, the WARC-Payload-Digest 68 | // field may also be used for data not actually present in the current 69 | // record block, for example when a block is left off in accordance with a 70 | // 'revisit' profile (see 'revisit'), or when a record is segmented (the 71 | // WARC-Payload-Digest recorded in the first segment of a segmented record 72 | // shall be the digest of the payload of the logical record). 73 | FieldNameWARCPayloadDigest = "WARC-Payload-Digest" 74 | // The numeric Internet address contacted to retrieve any included content. 
75 | // An IPv4 address shall be written as a "dotted quad"; an IPv6 address 76 | // shall be written as per [RFC1884]. For an HTTP retrieval, this will be 77 | // the IP address used at retrieval time corresponding to the hostname in 78 | // the record's target-URI. 79 | FieldNameWARCIPAddress = "WARC-IP-Address" 80 | // The WARC-Refers-To field may be used to associate a 'metadata' record to 81 | // another record it describes. The WARC-Refers-To field may also be used 82 | // to associate a record of type 'revisit' or 'conversion' with the 83 | // preceding record which helped determine the present record content. The 84 | // WARC-Refers-To field shall not be used in 'warcinfo', 'response', 85 | // ‘resource’, 'request', and 'continuation' records. 86 | FieldNameWARCRefersTo = "WARC-Refers-To" 87 | // The original URI whose capture gave rise to the information content in 88 | // this record. In the context of web harvesting, this is the URI that was 89 | // the target of a crawler's retrieval request. For a 'revisit' record, it 90 | // is the URI that was the target of a retrieval request.  Indirectly, such 91 | // as for a 'metadata', or 'conversion' record, it is a copy of the 92 | // WARC-Target-URI appearing in the original record to which the newer 93 | // record pertains. The URI in this value shall be properly escaped 94 | // according to [RFC3986] and written with no internal whitespace. 95 | FieldNameWARCTargetURI = "WARC-Target-URI" 96 | // For practical reasons, writers of the WARC format may place limits on 97 | // the time or storage allocated to archiving a single resource. As a 98 | // result, only a truncated portion of the original resource may be 99 | // available for saving into a WARC record. 100 | // 101 | // Any record may indicate that truncation of its content block has 102 | // occurred and give the reason with a 'WARC-Truncated' field. 
103 | FieldNameWARCTruncated = "WARC-Truncated" 104 | // When present, indicates the WARC-Record-ID of the associated 'warcinfo' 105 | // record for this record. Typically, the Warcinfo-ID parameter is used 106 | // when the context of the applicable 'warcinfo' record is unavailable, 107 | // such as after distributing single records into separate WARC files. WARC 108 | // writing applications (such as web crawlers) may choose to always record 109 | // this parameter. 110 | FieldNameWARCWarcinfoID = "WARC-Warcinfo-ID" 111 | // The WARC-Filename field may be used in 'warcinfo' type records and shall 112 | // not be used for other record types. 113 | FieldNameWARCFilename = "WARC-Filename" 114 | // A URI signifying the kind of analysis and handling applied in a 115 | // 'revisit' record. (Like an XML namespace, the URI may, but need not, 116 | // return human-readable or machine-readable documentation.) If reading 117 | // software does not recognize the given URI as a supported kind of 118 | // handling, it shall not attempt to interpret the associated record block. 119 | FieldNameWARCProfile = "WARC-Profile" 120 | // The content-type of the record's payload as determined by an independent 121 | // check. This string shall not be arrived at by blindly promoting an HTTP 122 | // Content-Type value up from a record block into the WARC header without 123 | // direct analysis of the payload, as such values may often be unreliable. 124 | FieldNameWARCIdentifiedPayloadType = "WARC-Identified-Payload-Type" 125 | // Reports the current record's relative ordering in a sequence of 126 | // segmented records. 127 | // In the first segment of any record that is completed in one or more 128 | // later 'continuation' WARC records, this parameter is mandatory. Its 129 | // value there is "1". In a 'continuation' record, this parameter is also 130 | // mandatory. 
Its value is the sequence number of the current segment in 131 | // the logical whole record, increasing by 1 in each next segment. 132 | FieldNameWARCSegmentNumber = "WARC-Segment-Number" 133 | // Identifies the starting record in a series of segmented records whose 134 | // content blocks are reassembled to obtain a logically complete content 135 | // block. 136 | // This field is mandatory on all 'continuation' records, and shall not be 137 | // used in other records. See the section below, Record segmentation, for 138 | // full details on the use of WARC record segmentation. 139 | FieldNameWARCSegmentOriginID = "WARC-Segment-Origin-ID" 140 | // In the final record of a segmented series, reports the total length of 141 | // all segment content blocks when concatenated together. 142 | // This field is mandatory on the last 'continuation' record of a series, 143 | // and shall not be used elsewhere. 144 | FieldNameWARCSegmentTotalLength = "WARC-Segment-Total-Length" 145 | ) 146 | -------------------------------------------------------------------------------- /header.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "net/textproto" 5 | ) 6 | 7 | // Header mimics net/http's header package, 8 | // but with string values 9 | // Users should use Get & Set methods instead of accessing 10 | // the map directly. 11 | type Header map[string]string 12 | 13 | // Get a key from the header map 14 | func (h Header) Get(key string) string { 15 | return h[CanonicalKey(key)] 16 | } 17 | 18 | // Set a key on the header map 19 | func (h Header) Set(key, value string) { 20 | h[CanonicalKey(key)] = value 21 | } 22 | 23 | // CanonicalKey conforms keys to CanonicalMIMEHeaderKey 24 | // (which is Capitals-For-First-Letter-Separated-By-Dashes) 25 | // for any general input with exceptions for capitalized "WARC" header keys. 
26 | // The WARC 1.0 spec calls for case-insensitive header keys, 27 | // but the spec token diagrams list headers as being case-sensitive, 28 | // so we'll accept any case on read, but write records 29 | // that match the spec token diagrams. 30 | func CanonicalKey(key string) string { 31 | key = textproto.CanonicalMIMEHeaderKey(key) 32 | if warcFieldMimeMap[key] != "" { 33 | key = warcFieldMimeMap[key] 34 | } 35 | return key 36 | } 37 | 38 | // warcFieldMimeMap maps CanonicalMIMEHeaderKey output to WARC field names 39 | var warcFieldMimeMap = map[string]string{ 40 | "Warc-Record-Id": FieldNameWARCRecordID, 41 | "Warc-Date": FieldNameWARCDate, 42 | "Warc-Type": FieldNameWARCType, 43 | "Warc-Concurrent-To": FieldNameWARCConcurrentTo, 44 | "Warc-Block-Digest": FieldNameWARCBlockDigest, 45 | "Warc-Payload-Digest": FieldNameWARCPayloadDigest, 46 | "Warc-Ip-Address": FieldNameWARCIPAddress, 47 | "Warc-Refers-To": FieldNameWARCRefersTo, 48 | "Warc-Target-Uri": FieldNameWARCTargetURI, 49 | "Warc-Truncated": FieldNameWARCTruncated, 50 | "Warc-Warcinfo-Id": FieldNameWARCWarcinfoID, 51 | "Warc-Filename": FieldNameWARCFilename, 52 | "Warc-Profile": FieldNameWARCProfile, 53 | "Warc-Identified-Payload-Type": FieldNameWARCIdentifiedPayloadType, 54 | "Warc-Segment-Number": FieldNameWARCSegmentNumber, 55 | "Warc-Segment-Origin-Id": FieldNameWARCSegmentOriginID, 56 | "Warc-Segment-Total-Length": FieldNameWARCSegmentTotalLength, 57 | } 58 | -------------------------------------------------------------------------------- /header_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "testing" 5 | ) 6 | 7 | func TestHeader(t *testing.T) { 8 | h := Header{} 9 | if h.Get("") != "" { 10 | t.Errorf("expected empty string for empty string get") 11 | return 12 | } 13 | h.Set("warc-record-id", "test_id") 14 | if h.Get("WARC-Record-ID") != "test_id" { 15 | t.Errorf("expected get WARC-Record-ID to return %s", "test_id")
16 | return 17 | } 18 | } 19 | 20 | func TestCanonicalKey(t *testing.T) { 21 | cases := []struct { 22 | in, expect string 23 | }{ 24 | {"warc-record-id", FieldNameWARCRecordID}, 25 | {"WARC-DATE", FieldNameWARCDate}, 26 | {"Warc-TYPE", FieldNameWARCType}, 27 | {"warc-CONCURRENt-to", FieldNameWARCConcurrentTo}, 28 | {"warC-block-digest", FieldNameWARCBlockDigest}, 29 | {"Warc-payload-Digest", FieldNameWARCPayloadDigest}, 30 | {"warc-ip-Address", FieldNameWARCIPAddress}, 31 | {"warc-refers-To", FieldNameWARCRefersTo}, 32 | {"warc-target-Uri", FieldNameWARCTargetURI}, 33 | {"warc-truncated", FieldNameWARCTruncated}, 34 | {"warc-warcinfo-Id", FieldNameWARCWarcinfoID}, 35 | {"warc-filename", FieldNameWARCFilename}, 36 | {"warc-profile", FieldNameWARCProfile}, 37 | {"warc-identified-payload-Type", FieldNameWARCIdentifiedPayloadType}, 38 | {"warc-segment-Number", FieldNameWARCSegmentNumber}, 39 | {"warc-segment-origin-Id", FieldNameWARCSegmentOriginID}, 40 | {"warc-segment-total-Length", FieldNameWARCSegmentTotalLength}, 41 | } 42 | 43 | for i, c := range cases { 44 | got := CanonicalKey(c.in) 45 | if got != c.expect { 46 | t.Errorf("case %d mismatch. expected: '%s', got: '%s'", i, c.expect, got) 47 | continue 48 | } 49 | } 50 | } 51 | -------------------------------------------------------------------------------- /reader.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bufio" 5 | "bytes" 6 | "compress/bzip2" 7 | "compress/gzip" 8 | "io" 9 | "io/ioutil" 10 | "strconv" 11 | 12 | "github.com/pkg/errors" 13 | ) 14 | 15 | // Reader parses WARC records from an underlying scanner. 
16 | // Create a new reader with NewReader 17 | type Reader struct { 18 | rc io.ReadCloser // raw io.readerCloser 19 | scanner *bufio.Scanner // scanner to pull tokens from 20 | phase scanPhase 21 | bodyLen int64 22 | } 23 | 24 | // NewReader creates a new WARC reader from an io.Reader 25 | // Always use NewReader, (instead of manually allocating a reader) 26 | func NewReader(r io.Reader) (*Reader, error) { 27 | rc, err := decompress(r) 28 | if err != nil { 29 | return nil, err 30 | } 31 | 32 | rdr := &Reader{ 33 | rc: rc, 34 | scanner: bufio.NewScanner(rc), 35 | } 36 | rdr.scanner.Split(rdr.split) 37 | return rdr, nil 38 | } 39 | 40 | // Read a record, will return nil, io.EOF to signal 41 | // no more records 42 | func (r *Reader) Read() (Record, error) { 43 | // rec, err := r.readRecord() 44 | // if err == nil { 45 | // fmt.Println(string(rec.(*Resource).Content)) 46 | // } 47 | return r.readRecord() 48 | } 49 | 50 | // ReadAll Consumes the entire reader, returning a slice of records 51 | func (r *Reader) ReadAll() (records Records, err error) { 52 | for { 53 | record, err := r.Read() 54 | if err == io.EOF { 55 | return records, nil 56 | } 57 | if err != nil { 58 | return nil, err 59 | } 60 | records = append(records, &record) 61 | } 62 | } 63 | 64 | // scanphase denotes different "modes" for scanning 65 | type scanPhase int 66 | 67 | const ( 68 | scanPhaseVersion scanPhase = iota 69 | scanPhaseHeaderKey 70 | scanPhaseHeaderValue 71 | scanPhaseContent 72 | ) 73 | 74 | func (r *Reader) readRecord() (rec Record, err error) { 75 | var key string 76 | rec = Record{ 77 | Headers: map[string]string{}, 78 | } 79 | 80 | for r.scanner.Scan() { 81 | token := r.scanner.Bytes() 82 | 83 | switch r.phase { 84 | case scanPhaseVersion: 85 | rec.Format = recordFormat(string(bytes.TrimSpace(token))) 86 | if rec.Format == RecordFormatUnknown { 87 | return rec, errors.Errorf("Unknown record format: '%s'", string(bytes.TrimSpace(token))) 88 | } 89 | r.phase = scanPhaseHeaderKey 90 | 
case scanPhaseHeaderKey: 91 | if bytes.Equal(token, crlf) { 92 | r.phase = scanPhaseContent 93 | r.bodyLen = -1 94 | r.checkContentLength(&rec) 95 | 96 | rec.Content = bytes.NewBuffer(nil) 97 | if r.bodyLen != -1 { 98 | rec.Content.Grow(int(r.bodyLen)) 99 | } 100 | } else { 101 | key = CanonicalKey(string(token)) 102 | r.phase = scanPhaseHeaderValue 103 | } 104 | case scanPhaseHeaderValue: 105 | rec.Headers[key] = string(bytes.TrimSpace(token)) 106 | if key == FieldNameWARCType { 107 | rec.Type = ParseRecordType(rec.Headers[key]) 108 | } 109 | r.phase = scanPhaseHeaderKey 110 | case scanPhaseContent: 111 | by := r.scanner.Bytes() 112 | bytes.NewReader(by).WriteTo(rec.Content) 113 | if len(by) == 0 { 114 | r.phase = scanPhaseVersion 115 | return 116 | } else if r.bodyLen != -1 { 117 | r.bodyLen -= int64(len(by)) 118 | } 119 | } 120 | } 121 | 122 | if r.scanner.Err() != nil { 123 | return rec, r.scanner.Err() 124 | } 125 | if r.phase != scanPhaseVersion && r.phase != scanPhaseContent { 126 | return rec, io.ErrUnexpectedEOF 127 | } 128 | return rec, io.EOF 129 | } 130 | 131 | func (r *Reader) checkContentLength(rec *Record) error { 132 | if rec.Headers[FieldNameWARCSegmentNumber] != "" { 133 | // Segmented content 134 | return nil 135 | } 136 | if rec.Headers[FieldNameContentLength] != "" { 137 | // Non-segmented, have Content-Length => read the whole thing in one block 138 | length, err := strconv.ParseInt(rec.Headers[FieldNameContentLength], 10, 64) 139 | if err != nil { 140 | return errors.Wrap(err, "warc: Invalid Content-Length") 141 | } 142 | r.bodyLen = length 143 | } 144 | return nil 145 | } 146 | 147 | func (r *Reader) split(data []byte, atEOF bool) (advance int, token []byte, err error) { 148 | if atEOF && len(data) == 0 { 149 | return 0, nil, nil 150 | } 151 | 152 | switch r.phase { 153 | case scanPhaseVersion: 154 | return splitLine(data, atEOF) 155 | case scanPhaseHeaderKey: 156 | return splitKey(data, atEOF) 157 | case scanPhaseHeaderValue: 158 | return 
splitValue(data, atEOF) 159 | case scanPhaseContent: 160 | if r.bodyLen != -1 { 161 | return splitFull(data, atEOF, r.bodyLen) 162 | } 163 | fallthrough 164 | default: // default to scanPhaseContent 165 | return splitBlock(data, atEOF) 166 | } 167 | } 168 | 169 | var crlf = []byte("\r\n") 170 | var doubleCrlf = []byte("\r\n\r\n") 171 | 172 | func splitLine(data []byte, atEOF bool) (advance int, token []byte, err error) { 173 | if bytes.HasPrefix(data, crlf) { 174 | // Found block-end from previous record. Skip. 175 | if bytes.HasPrefix(data, doubleCrlf) { 176 | return len(doubleCrlf), nil, nil 177 | } 178 | return len(crlf), nil, nil 179 | } 180 | if i := bytes.IndexByte(data, '\n'); i >= 0 { 181 | // We have a full newline-terminated line. 182 | return i + 1, dropCR(data[0:i]), nil 183 | } 184 | // If we're at EOF, we have a final, non-terminated line. Return it. 185 | if atEOF { 186 | return len(data), dropCR(data), nil 187 | } 188 | // Request more data. 189 | return 0, nil, nil 190 | } 191 | 192 | func splitKey(data []byte, atEOF bool) (advance int, token []byte, err error) { 193 | if bytes.HasPrefix(data, crlf) { 194 | return len(crlf), crlf, nil 195 | } 196 | if i := bytes.IndexByte(data, ':'); i >= 0 { 197 | return i + 1, data[0:i], nil 198 | } 199 | if atEOF { 200 | return len(data), data, nil 201 | } 202 | return 0, nil, nil 203 | } 204 | 205 | func splitValue(data []byte, atEOF bool) (advance int, token []byte, err error) { 206 | // TODO - MULTILINE VALUES 207 | if i := bytes.Index(data, crlf); i == 0 { 208 | // if we hit a double CRLF, return 209 | return len(crlf), nil, nil 210 | } else if i > 0 { 211 | return i + len(crlf), data[0:i], nil 212 | } 213 | 214 | if atEOF { 215 | return len(data), data, nil 216 | } 217 | 218 | return 0, nil, nil 219 | } 220 | 221 | func splitBlock(data []byte, atEOF bool) (advance int, token []byte, err error) { 222 | if i := bytes.Index(data, doubleCrlf); i >= 0 { 223 | return i + len(doubleCrlf), data[0:i], nil 224 | } 225
| // If we're at EOF, we have a final, non-terminated line. Return it. 226 | if atEOF { 227 | return len(data), data, nil 228 | } 229 | return 0, nil, nil 230 | } 231 | 232 | func splitFull(data []byte, atEOF bool, bytesLeft int64) (advance int, token []byte, err error) { 233 | length := int(bytesLeft) 234 | if bytesLeft <= int64(len(data)) { 235 | return length, data[:length], nil 236 | } 237 | if atEOF { 238 | return len(data), data, errors.Errorf("warc: unexpected EOF in record content, got %v bytes (expected %v more)", len(data), bytesLeft) 239 | } 240 | if len(data) > 0 { 241 | return len(data), data, nil 242 | } 243 | return 0, nil, nil 244 | } 245 | 246 | // dropCR drops a terminal \r from the data. 247 | func dropCR(data []byte) []byte { 248 | if len(data) > 0 && data[len(data)-1] == '\r' { 249 | return data[0 : len(data)-1] 250 | } 251 | return data 252 | } 253 | 254 | // readBlockBody 255 | func readBlockBody(data []byte) ([]byte, error) { 256 | start := bytes.LastIndex(data, crlf) 257 | if start == -1 { 258 | return data, nil 259 | } 260 | r := bytes.NewReader(data[start+len(crlf):]) 261 | res, err := decompress(r) 262 | if err != nil { 263 | return nil, err 264 | } 265 | defer res.Close() 266 | return ioutil.ReadAll(res) 267 | } 268 | 269 | const ( 270 | compressionNone = iota 271 | compressionBZIP 272 | compressionGZIP 273 | ) 274 | 275 | // guessCompression returns the compression type of a data stream by matching 276 | // the first two bytes with the magic numbers of compression formats. 
277 | func guessCompression(b *bufio.Reader) (int, error) { 278 | magic, err := b.Peek(2) 279 | if err != nil { 280 | if err == io.EOF { 281 | err = nil 282 | } 283 | return compressionNone, err 284 | } 285 | switch { 286 | case magic[0] == 0x42 && magic[1] == 0x5a: 287 | return compressionBZIP, nil 288 | case magic[0] == 0x1f && magic[1] == 0x8b: 289 | return compressionGZIP, nil 290 | } 291 | return compressionNone, nil 292 | } 293 | 294 | // decompress automatically decompresses data streams and makes sure the result 295 | // obeys the io.ReadCloser interface. This way callers don't need to check 296 | // whether the underlying reader has a Close() function or not, they just call 297 | // defer Close() on the result. 298 | func decompress(r io.Reader) (res io.ReadCloser, err error) { 299 | // Create a buffered reader to peek the stream's magic number. 300 | dataReader := bufio.NewReader(r) 301 | compr, err := guessCompression(dataReader) 302 | if err != nil { 303 | return nil, err 304 | } 305 | switch compr { 306 | case compressionGZIP: 307 | gzipReader, err := gzip.NewReader(dataReader) 308 | if err != nil { 309 | return nil, err 310 | } 311 | res = gzipReader 312 | case compressionBZIP: 313 | bzipReader := bzip2.NewReader(dataReader) 314 | res = ioutil.NopCloser(bzipReader) 315 | case compressionNone: 316 | res = ioutil.NopCloser(dataReader) 317 | } 318 | return res, err 319 | } 320 | -------------------------------------------------------------------------------- /reader_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "io/ioutil" 5 | "os" 6 | "path/filepath" 7 | "testing" 8 | ) 9 | 10 | func TestReadAll(t *testing.T) { 11 | f, err := os.Open("testdata/test.warc") 12 | if err != nil { 13 | t.Error(err.Error()) 14 | return 15 | } 16 | defer f.Close() 17 | 18 | rdr, err := NewReader(f) 19 | if err != nil { 20 | t.Error(err) 21 | return 22 | } 23 | 24 | records, err := rdr.ReadAll() 
25 | if err != nil { 26 | t.Error(err) 27 | return 28 | } 29 | 30 | if len(records) <= 0 { 31 | t.Errorf("record length mismatch: %d isn't enough records", len(records)) 32 | return 33 | } 34 | 35 | // for _, r := range records { 36 | // fmt.Println(r.Type().String()) 37 | // } 38 | } 39 | 40 | func readTestFile(path string) ([]byte, error) { 41 | return ioutil.ReadFile(filepath.Join("testdata", path)) 42 | } 43 | -------------------------------------------------------------------------------- /record.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "io" 6 | "strconv" 7 | "strings" 8 | "time" 9 | ) 10 | 11 | // RecordType enumerates different types of WARC Records 12 | type RecordType int 13 | 14 | const ( 15 | // RecordTypeUnknown is the default type of record, which shouldn't be 16 | // accepted by anything that wants to know a type of record. 17 | RecordTypeUnknown RecordType = iota 18 | // RecordTypeWarcInfo describes the records that follow it, up through end 19 | // of file, end of input, or until next 'warcinfo' record. Typically, this 20 | // appears once and at the beginning of a WARC file. For a web archive, it 21 | // often contains information about the web crawl which generated the 22 | // following records. 23 | // The format of this descriptive record block may vary, though the use of 24 | // the "application/warc-fields" content-type is recommended. Allowable 25 | // fields include, but are not limited to, all \[DCMI\] plus the following 26 | // field definitions. All fields are optional. 27 | RecordTypeWarcInfo 28 | // RecordTypeResponse should contain a complete scheme-specific response, 29 | // including network protocol information where possible. The exact 30 | // contents of a 'response' record are determined not just by the record 31 | // type but also by the URI scheme of the record's target-URI, as described 32 | // below. 
33 | RecordTypeResponse 34 | // RecordTypeResource contains a resource, without full protocol response 35 | // information. For example: a file directly retrieved from a locally 36 | // accessible repository or the result of a networked retrieval where the 37 | // protocol information has been discarded. The exact contents of a 38 | // 'resource' record are determined not just by the record type but also by 39 | // the URI scheme of the record's target-URI, as described below. 40 | // For all 'resource' records, the payload is defined as the record block. 41 | // A 'resource' record, with a synthesized target-URI, may also be used to 42 | // archive other artefacts of a harvesting process inside WARC files. 43 | RecordTypeResource 44 | // RecordTypeRequest holds the details of a complete scheme-specific 45 | // request, including network protocol information where possible. The 46 | // exact contents of a 'request' record are determined not just by the 47 | // record type but also by the URI scheme of the record's target-URI, as 48 | // described below. 49 | RecordTypeRequest 50 | // RecordTypeMetadata contains content created in order to further 51 | // describe, explain, or accompany a harvested resource, in ways not 52 | // covered by other record types. A 'metadata' record will almost always 53 | // refer to another record of another type, with that other record holding 54 | // original harvested or transformed content. (However, it is allowable for 55 | // a 'metadata' record to refer to any record type, including other 56 | // 'metadata' records.) Any number of metadata records may reference one 57 | // specific other record. 58 | // The format of the metadata record block may vary. The 59 | // "application/warc-fields" format, defined earlier, may be used. 60 | // Allowable fields include all \[DCMI\] plus the following field 61 | // definitions. All fields are optional. 
62 | RecordTypeMetadata 63 | // RecordTypeRevisit describes the revisitation of content already 64 | // archived, and might include only an abbreviated content body which has 65 | // to be interpreted relative to a previous record. Most typically, a 66 | // 'revisit' record is used instead of a 'response' or 'resource' record to 67 | // indicate that the content visited was either a complete or substantial 68 | // duplicate of material previously archived. 69 | // Using a 'revisit' record instead of another type is optional, for when 70 | // benefits of reduced storage size or improved cross-referencing of 71 | // material are desired. 72 | RecordTypeRevisit 73 | // RecordTypeConversion shall contain an alternative version of another 74 | // record's content that was created as the result of an archival process. 75 | // Typically, this is used to hold content transformations that maintain 76 | // viability of content after widely available rendering tools for the 77 | // originally stored format disappear. As needed, the original content may 78 | // be migrated (transformed) to a more viable format in order to keep the 79 | // information usable with current tools while minimizing loss of 80 | // information (intellectual content, look and feel, etc). Any number of 81 | // 'conversion' records may be created that reference a specific source 82 | // record, which may itself contain transformed content. Each 83 | // transformation should result in a freestanding, complete record, with no 84 | // dependency on survival of the original record. 85 | // Metadata records may be used to further describe transformation records. 86 | // Wherever practical, a 'conversion' record should contain a 87 | // 'WARC-Refers-To' field to identify the prior material converted. 
88 | RecordTypeConversion 89 | // RecordTypeContinuation blocks from 'continuation' records must be appended to 90 | // corresponding prior record block(s) (e.g., from other WARC files) to 91 | // create the logically complete full-sized original record. That is, 92 | // 'continuation' records are used when a record that would otherwise cause 93 | // a WARC file size to exceed a desired limit is broken into segments. A 94 | // continuation record shall contain the named fields 95 | // 'WARC-Segment-Origin-ID' and 'WARC-Segment-Number', and the last 96 | // 'continuation' record of a series shall contain a 97 | // 'WARC-Segment-Total-Length' field. The full details of WARC record 98 | // segmentation are described in the below section Record Segmentation. See 99 | // also annex C.8 below for an example of a ‘continuation’ record. 100 | RecordTypeContinuation 101 | ) 102 | 103 | // RecordType satisfies the stringer interface 104 | func (r RecordType) String() string { 105 | switch r { 106 | case RecordTypeWarcInfo: 107 | return "warcinfo" 108 | case RecordTypeResponse: 109 | return "response" 110 | case RecordTypeResource: 111 | return "resource" 112 | case RecordTypeRequest: 113 | return "request" 114 | case RecordTypeMetadata: 115 | return "metadata" 116 | case RecordTypeRevisit: 117 | return "revisit" 118 | case RecordTypeConversion: 119 | return "conversion" 120 | case RecordTypeContinuation: 121 | return "continuation" 122 | default: 123 | return "" 124 | } 125 | } 126 | 127 | // ParseRecordType parses a RecordType from a string 128 | func ParseRecordType(s string) RecordType { 129 | switch s { 130 | case RecordTypeWarcInfo.String(): 131 | return RecordTypeWarcInfo 132 | case RecordTypeResponse.String(): 133 | return RecordTypeResponse 134 | case RecordTypeResource.String(): 135 | return RecordTypeResource 136 | case RecordTypeRequest.String(): 137 | return RecordTypeRequest 138 | case RecordTypeMetadata.String(): 139 | return RecordTypeMetadata 140 | case 
RecordTypeRevisit.String(): 141 | return RecordTypeRevisit 142 | case RecordTypeConversion.String(): 143 | return RecordTypeConversion 144 | case RecordTypeContinuation.String(): 145 | return RecordTypeContinuation 146 | default: 147 | return RecordTypeUnknown 148 | } 149 | } 150 | 151 | // A Record consists of a version indicator (e.g. WARC/1.0), zero or more headers, 152 | // and possibly a content block. 153 | // The Type field identifies which kind of WARC record this is 154 | // (response, request, metadata, and so on). 155 | type Record struct { 156 | Format RecordFormat 157 | Type RecordType 158 | Headers Header 159 | Content *bytes.Buffer 160 | } 161 | 162 | // ID gives the ID for this record, without the "<urn:uuid:" and ">" wrapper 163 | func (r *Record) ID() string { 164 | return strings.TrimSuffix(strings.TrimPrefix(r.Headers[FieldNameWARCRecordID], "<urn:uuid:"), ">") 165 | } 166 | 167 | // TargetURI is a convenience method for getting the uri 168 | // that this record is targeting 169 | func (r *Record) TargetURI() string { 170 | return r.Headers[FieldNameWARCTargetURI] 171 | } 172 | 173 | // Date gives the time.Time of record creation, returns empty (zero) time if 174 | // no WARC-Date header is present, or if the header is an 175 | // invalid timestamp 176 | func (r *Record) Date() time.Time { 177 | t, err := time.Parse(time.RFC3339, r.Headers[FieldNameWARCDate]) 178 | if err != nil { 179 | return time.Time{} 180 | } 181 | return t 182 | } 183 | 184 | // ContentLength of content block in bytes, returns 0 if 185 | // Content-Length header is missing or invalid 186 | func (r *Record) ContentLength() int { 187 | length, err := strconv.ParseInt(r.Headers[FieldNameContentLength], 10, 64) 188 | if err != nil { 189 | return 0 190 | } 191 | return int(length) 192 | } 193 | 194 | // Write this record to the given writer. 195 | // 196 | // Automatically handles the Content-Length, WARC-Type headers, as well as 197 | // WARC-Block-Digest for Response and Revisit records.
198 | func (r *Record) Write(w io.Writer) error { 199 | r.Headers[FieldNameContentLength] = strconv.FormatInt(int64(r.Content.Len()), 10) 200 | r.Headers[FieldNameWARCType] = r.Type.String() 201 | switch r.Type { 202 | case RecordTypeResponse, RecordTypeRevisit: 203 | r.Headers[FieldNameWARCBlockDigest] = Sha1Digest(r.Content.Bytes()) 204 | } 205 | 206 | if err := writeHeader(w, r); err != nil { 207 | return err 208 | } 209 | return writeBlock(w, bytes.NewReader(r.Content.Bytes())) 210 | } 211 | 212 | // Bytes returns the record formatted as a byte slice 213 | func (r *Record) Bytes() ([]byte, error) { 214 | buf := &bytes.Buffer{} 215 | err := r.Write(buf) 216 | return buf.Bytes(), err 217 | } 218 | 219 | // Body returns a record's body with any HTTP headers omitted 220 | func (r *Record) Body() ([]byte, error) { 221 | // TODO - actually remove headers 222 | // buf := &bytes.Buffer{} 223 | // err := writeBlock(buf, r.Content) 224 | return readBlockBody(r.Content.Bytes()) 225 | } 226 | 227 | // SetBody sets the body of the record, leaving any written 228 | // http headers in record 229 | func (r *Record) SetBody(body []byte) error { 230 | repl, err := replaceBlockBody(r.Content.Bytes(), body) 231 | if err != nil { 232 | return err 233 | } 234 | r.Content = bytes.NewBuffer(repl) 235 | return nil 236 | } 237 | 238 | // RecordFormat determines different formats for records, this is 239 | // for any later support of ARC files, should we need to add it. 
type RecordFormat int 241 | 242 | const ( 243 | // RecordFormatWarc is the default: WARC format 1.0 244 | RecordFormatWarc RecordFormat = iota 245 | // RecordFormatUnknown represents an unknown or errored record format 246 | RecordFormatUnknown 247 | ) 248 | 249 | func (r RecordFormat) String() string { 250 | switch r { 251 | case RecordFormatWarc: 252 | return "WARC/1.0" 253 | default: 254 | return "" 255 | } 256 | } 257 | 258 | func recordFormat(s string) RecordFormat { 259 | switch s { 260 | case "WARC/1.0": 261 | return RecordFormatWarc 262 | default: 263 | return RecordFormatUnknown 264 | } 265 | } 266 | -------------------------------------------------------------------------------- /record_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "os" 6 | "testing" 7 | ) 8 | 9 | func TestRecordID(t *testing.T) { 10 | r := &Record{} 11 | if r.ID() != "" { 12 | t.Errorf("id mismatch. expected '', got: '%s'", r.ID()) 13 | } 14 | } 15 | 16 | func TestRecordBody(t *testing.T) { 17 | // TODO 18 | f, err := os.Open("testdata/response.warc") 19 | if err != nil { 20 | t.Error(err.Error()) 21 | return 22 | } 23 | defer f.Close() 24 | 25 | rdr, err := NewReader(f) 26 | if err != nil { 27 | t.Error(err) 28 | return 29 | } 30 | 31 | records, err := rdr.ReadAll() 32 | if err != nil { 33 | t.Error(err) 34 | return 35 | } 36 | // fmt.Println(records[1].Content.String()) 37 | 38 | body, err := records[1].Body() 39 | if err != nil { 40 | t.Error(err) 41 | return 42 | } 43 | 44 | if bytes.HasPrefix(body, crlf) { 45 | t.Errorf("content shouldn't have CRLF prefix") 46 | return 47 | } 48 | 49 | if bytes.HasSuffix(body, crlf) { 50 | t.Errorf("content shouldn't have CRLF suffix") 51 | return 52 | } 53 | 54 | // if len(body) != records[1].ContentLength() { 55 | // t.Errorf("content length mismatch: %d != %d", records[1].ContentLength(), len(body)) 56 | // return 57 | // } 58 | //
fmt.Println(string(b)) 59 | } 60 | -------------------------------------------------------------------------------- /records.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | // Records provides utility functions for slices of records. 4 | // 5 | // A WARC format file is the simple concatenation of one or more WARC records. 6 | // The first record usually describes the records to follow. In general, 7 | // record content is either the direct result of a retrieval attempt — web 8 | // pages, inline images, URL redirection information, DNS hostname lookup 9 | // results, standalone files, etc. — or is synthesized material (e.g., 10 | // metadata, transformed content) that provides additional information about 11 | // archived content. 12 | type Records []*Record 13 | 14 | // FilterTypes return all record types that match a provide 15 | // list of RecordTypes 16 | func (rs Records) FilterTypes(types ...RecordType) Records { 17 | res := Records{} 18 | for _, rec := range rs { 19 | for _, t := range types { 20 | if rec.Type == t { 21 | res = append(res, rec) 22 | } 23 | } 24 | } 25 | return res 26 | } 27 | 28 | // TargetURIRecord returns a record matching uri optionally filtered by 29 | // a list of record types. There are a number of "gotchas" if multiple 30 | // record types of the same url are in the list. 
31 | // TODO - eliminate "gotchas" 32 | func (rs Records) TargetURIRecord(uri string, types ...RecordType) *Record { 33 | for _, rec := range rs { 34 | if rec.TargetURI() == uri { 35 | if len(types) == 0 { 36 | return rec 37 | } 38 | for _, t := range types { 39 | if rec.Type == t { 40 | return rec 41 | } 42 | } 43 | } 44 | } 45 | return nil 46 | } 47 | 48 | // RemoveTargetURIRecords returns a new Records slice with all records 49 | // that refer to uri removed; the receiver is left unmodified 50 | func (rs Records) RemoveTargetURIRecords(uri string) (recs Records) { 51 | recs = make(Records, 0, len(rs)) // copy survivors; deleting in place while ranging shifts indices 52 | for _, rec := range rs { 53 | if rec.TargetURI() != uri { 54 | recs = append(recs, rec) 55 | } 56 | } 57 | return 58 | } 59 | -------------------------------------------------------------------------------- /sanitize.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "compress/gzip" 6 | ) 7 | 8 | // Sanitize removes any data from a warc record body 9 | // that may interfere with parsing 10 | func Sanitize(contentSniff string, body []byte) (sanitized []byte, err error) { 11 | switch contentSniff { 12 | case "application/pdf", "application/zip": 13 | // default to gzipping content 14 | buf := &bytes.Buffer{} 15 | w := gzip.NewWriter(buf) 16 | if _, err := w.Write(body); err != nil { 17 | return nil, err 18 | } 19 | if err := w.Close(); err != nil { 20 | return nil, err 21 | } 22 | return buf.Bytes(), nil 23 | default: 24 | return bytes.Replace(body, crlf, []byte("\n"), -1), nil 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /sanitize_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "testing" 6 | 7 | dmp "github.com/sergi/go-diff/diffmatchpatch" 8 | ) 9 | 10 | func TestSanitize(t *testing.T) { 11 | wordDoc, err := readTestFile("test-doc.docx") 12 | if err != nil { 13 |
t.Error(err.Error()) 14 | return 15 | } 16 | wordDocGz, err := readTestFile("test-doc.docx.gz") 17 | if err != nil { 18 | t.Error(err.Error()) 19 | return 20 | } 21 | 22 | cases := []struct { 23 | mime string 24 | body []byte 25 | expect []byte 26 | err string 27 | }{ 28 | {"application/zip", wordDoc, wordDocGz, ""}, 29 | } 30 | 31 | for i, c := range cases { 32 | got, err := Sanitize(c.mime, c.body) 33 | 34 | if !(err == nil && c.err == "" || err != nil && err.Error() == c.err) { 35 | t.Errorf("case %d error mismatch. expected: %s, got: %s", i, c.err, err.Error()) 36 | continue 37 | } 38 | 39 | if !bytes.Equal(got, c.expect) { 40 | dmp := dmp.New() 41 | diffs := dmp.DiffMain(string(c.expect), string(got), true) 42 | if len(diffs) == 0 { 43 | t.Logf("case %d bytes were unequal but computed no difference between results", i) 44 | continue 45 | } 46 | t.Errorf("case %d mismatch:\n%s", i, dmp.DiffPrettyText(diffs)) 47 | continue 48 | } 49 | 50 | if bytes.Contains(got, doubleCrlf) { 51 | t.Errorf("case %d can't contain double crlf", i) 52 | continue 53 | } 54 | } 55 | } 56 | -------------------------------------------------------------------------------- /testdata/out.warc: -------------------------------------------------------------------------------- 1 | WARC/1.0 2 | Content-Length: 306 3 | Content-Type: text/plain 4 | WARC-Block-Digest: sha1:28ee620ee6d9ed280505fa9faca0ba357db82ffd 5 | WARC-Date: 2015-07-29T20:10:45+02:00 6 | WARC-Type: resource 7 | 8 | Alice was beginning to get very tired of sitting by her sister on the 9 | bank, and of having nothing to do: once or twice she had peeped into the 10 | book her sister was reading, but it had no pictures or conversations in 11 | it, 'and what is the use of a book,' thought Alice 'without pictures or 12 | conversations?' 
13 | 14 | WARC/1.0 15 | Content-Length: 293 16 | Content-Type: text/plain 17 | WARC-Block-Digest: sha1:3c7c9a6136ff74eea4d08b11bcbdc16228f305d0 18 | WARC-Date: 2015-07-29T20:10:45+02:00 19 | WARC-Type: resource 20 | 21 | So she was considering in her own mind (as well as she could, for the 22 | hot day made her feel very sleepy and stupid), whether the pleasure 23 | of making a daisy-chain would be worth the trouble of getting up and 24 | picking the daisies, when suddenly a White Rabbit with pink eyes ran 25 | close by her. 26 | 27 | WARC/1.0 28 | Content-Length: 743 29 | Content-Type: text/plain 30 | WARC-Block-Digest: sha1:be198b8dbb187b92af0157a6dce786ded5b69629 31 | WARC-Date: 2015-07-29T20:10:45+02:00 32 | WARC-Type: resource 33 | 34 | There was nothing so VERY remarkable in that; nor did Alice think it so 35 | VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! 36 | Oh dear! I shall be late!' (when she thought it over afterwards, it 37 | occurred to her that she ought to have wondered at this, but at the time 38 | it all seemed quite natural); but when the Rabbit actually TOOK A WATCH 39 | OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on, 40 | Alice started to her feet, for it flashed across her mind that she had 41 | never before seen a rabbit with either a waistcoat-pocket, or a watch 42 | to take out of it, and burning with curiosity, she ran across the field 43 | after it, and fortunately was just in time to see it pop down a large 44 | rabbit-hole under the hedge. 45 | 46 | WARC/1.0 47 | Content-Length: 110 48 | Content-Type: text/plain 49 | WARC-Block-Digest: sha1:5c6c73b1c4a735ee3019bc73b948f1e9e4d32498 50 | WARC-Date: 2015-07-29T20:10:45+02:00 51 | WARC-Type: resource 52 | 53 | In another moment down went Alice after it, never once considering how 54 | in the world she was to get out again. 
55 | 56 | WARC/1.0 57 | Content-Length: 222 58 | Content-Type: text/plain 59 | WARC-Block-Digest: sha1:6435aa05dde53aee9c9f276bb857c869a6c9c70e 60 | WARC-Date: 2015-07-29T20:10:45+02:00 61 | WARC-Type: resource 62 | 63 | The rabbit-hole went straight on like a tunnel for some way, and then 64 | dipped suddenly down, so suddenly that Alice had not a moment to think 65 | about stopping herself before she found herself falling down a very deep 66 | well. 67 | 68 | WARC/1.0 69 | Content-Length: 714 70 | Content-Type: text/plain 71 | WARC-Block-Digest: sha1:4c85924e3f2f368e5acc0ee89bee295ab8002079 72 | WARC-Date: 2015-07-29T20:10:45+02:00 73 | WARC-Type: resource 74 | 75 | Either the well was very deep, or she fell very slowly, for she had 76 | plenty of time as she went down to look about her and to wonder what was 77 | going to happen next. First, she tried to look down and make out what 78 | she was coming to, but it was too dark to see anything; then she 79 | looked at the sides of the well, and noticed that they were filled with 80 | cupboards and book-shelves; here and there she saw maps and pictures 81 | hung upon pegs. She took down a jar from one of the shelves as 82 | she passed; it was labelled 'ORANGE MARMALADE', but to her great 83 | disappointment it was empty: she did not like to drop the jar for fear 84 | of killing somebody, so managed to put it into one of the cupboards as 85 | she fell past it. 86 | 87 | WARC/1.0 88 | Content-Length: 262 89 | Content-Type: text/plain 90 | WARC-Block-Digest: sha1:309231bd60e4ef85ed3a841a419aa59f1cb9b11b 91 | WARC-Date: 2015-07-29T20:10:45+02:00 92 | WARC-Type: resource 93 | 94 | 'Well!' thought Alice to herself, 'after such a fall as this, I shall 95 | think nothing of tumbling down stairs! How brave they'll all think me at 96 | home! Why, I wouldn't say anything about it, even if I fell off the top 97 | of the house!' (Which was very likely true.) 
98 | 99 | WARC/1.0 100 | Content-Length: 714 101 | Content-Type: text/plain 102 | WARC-Block-Digest: sha1:e6066ba6c9767cb43122993e6509664201a242fd 103 | WARC-Date: 2015-07-29T20:10:45+02:00 104 | WARC-Type: resource 105 | 106 | Down, down, down. Would the fall NEVER come to an end! 'I wonder how 107 | many miles I've fallen by this time?' she said aloud. 'I must be getting 108 | somewhere near the centre of the earth. Let me see: that would be four 109 | thousand miles down, I think--' (for, you see, Alice had learnt several 110 | things of this sort in her lessons in the schoolroom, and though this 111 | was not a VERY good opportunity for showing off her knowledge, as there 112 | was no one to listen to her, still it was good practice to say it over) 113 | '--yes, that's about the right distance--but then I wonder what Latitude 114 | or Longitude I've got to?' (Alice had no idea what Latitude was, or 115 | Longitude either, but thought they were nice grand words to say.) 116 | 117 | -------------------------------------------------------------------------------- /testdata/response.warc: -------------------------------------------------------------------------------- 1 | WARC/1.0 2 | Content-Length: 16 3 | Content-Type: application/http; msgtype=request 4 | WARC-Date: 2017-11-01T13:53:37-04:00 5 | WARC-Record-ID: 6 | WARC-Target-URI: http://datatogether.org 7 | WARC-Type: request 8 | 9 | GET / HTTP/1.1 10 | 11 | 12 | WARC/1.0 13 | Content-Length: 7060 14 | Content-Type: application/http; msgtype=response 15 | WARC-Block-Digest: sha1:76QR5IA47LIBAQYRFZ53UJXONDH5M3IJ 16 | WARC-Identified-Payload-Type: text/html; charset=utf-8 17 | WARC-Payload-Digest: sha1:YN5KEEDGQYJI5QJHIQTBWK6WWBOS6EQE 18 | WARC-Record-ID: 19 | WARC-Target-URI: https://datatogether.org/ 20 | WARC-Type: response 21 | 22 | Vary: Accept-Encoding 23 | Last-Modified: Wed, 01 Nov 2017 17:53:38 GMT 24 | X-Ipfs-Path: /ipns/datatogether.org/ 25 | Strict-Transport-Security: max-age=15768000; 
includeSubDomains; preload 26 | Connection: keep-alive 27 | Access-Control-Expose-Headers: Content-Range, X-Chunked-Output, X-Stream-Output 28 | Date: Wed, 01 Nov 2017 17:53:38 GMT 29 | Content-Type: text/html; charset=utf-8 30 | Etag: W/"QmSYFMvJkrHmekiFzPm7y2pcDsZfoi9XFt26aLpd4sKKd3" 31 | Access-Control-Allow-Headers: Content-Range, X-Chunked-Output, X-Stream-Output 32 | Access-Control-Allow-Methods: GET 33 | Access-Control-Allow-Origin: * 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | Data Together 52 | 53 | 54 | 55 | 56 | 57 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 |
90 |
91 | 103 | 112 | 113 |
114 |
115 |

Communities Stewarding Data Together

116 |
117 |
118 | 129 |
130 | 140 |
141 | 170 |
171 | 172 |
173 |
174 |
175 | 176 |
177 |
178 |
179 |
180 |
181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | -------------------------------------------------------------------------------- /testdata/test-doc.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/test-doc.docx -------------------------------------------------------------------------------- /testdata/test-doc.docx.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/test-doc.docx.gz -------------------------------------------------------------------------------- /testdata/test.warc: -------------------------------------------------------------------------------- 1 | WARC/1.0 2 | Warc-Date: 2015-07-29T20:10:45+02:00 3 | Warc-Type: resource 4 | Content-Type: text/plain 5 | Warc-Block-Digest: sha1:28ee620ee6d9ed280505fa9faca0ba357db82ffd 6 | Content-Length: 306 7 | 8 | Alice was beginning to get very tired of sitting by her sister on the 9 | bank, and of having nothing to do: once or twice she had peeped into the 10 | book her sister was reading, but it had no pictures or conversations in 11 | it, 'and what is the use of a book,' thought Alice 'without pictures or 12 | conversations?' 13 | 14 | WARC/1.0 15 | Warc-Type: resource 16 | Content-Type: text/plain 17 | Warc-Block-Digest: sha1:3c7c9a6136ff74eea4d08b11bcbdc16228f305d0 18 | Content-Length: 293 19 | Warc-Date: 2015-07-29T20:10:45+02:00 20 | 21 | So she was considering in her own mind (as well as she could, for the 22 | hot day made her feel very sleepy and stupid), whether the pleasure 23 | of making a daisy-chain would be worth the trouble of getting up and 24 | picking the daisies, when suddenly a White Rabbit with pink eyes ran 25 | close by her. 
26 | 27 | WARC/1.0 28 | Content-Type: text/plain 29 | Warc-Block-Digest: sha1:be198b8dbb187b92af0157a6dce786ded5b69629 30 | Content-Length: 743 31 | Warc-Date: 2015-07-29T20:10:45+02:00 32 | Warc-Type: resource 33 | 34 | There was nothing so VERY remarkable in that; nor did Alice think it so 35 | VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! 36 | Oh dear! I shall be late!' (when she thought it over afterwards, it 37 | occurred to her that she ought to have wondered at this, but at the time 38 | it all seemed quite natural); but when the Rabbit actually TOOK A WATCH 39 | OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on, 40 | Alice started to her feet, for it flashed across her mind that she had 41 | never before seen a rabbit with either a waistcoat-pocket, or a watch 42 | to take out of it, and burning with curiosity, she ran across the field 43 | after it, and fortunately was just in time to see it pop down a large 44 | rabbit-hole under the hedge. 45 | 46 | WARC/1.0 47 | Content-Length: 110 48 | Warc-Date: 2015-07-29T20:10:45+02:00 49 | Warc-Type: resource 50 | Content-Type: text/plain 51 | Warc-Block-Digest: sha1:5c6c73b1c4a735ee3019bc73b948f1e9e4d32498 52 | 53 | In another moment down went Alice after it, never once considering how 54 | in the world she was to get out again. 55 | 56 | WARC/1.0 57 | Content-Type: text/plain 58 | Warc-Block-Digest: sha1:6435aa05dde53aee9c9f276bb857c869a6c9c70e 59 | Content-Length: 222 60 | Warc-Date: 2015-07-29T20:10:45+02:00 61 | Warc-Type: resource 62 | 63 | The rabbit-hole went straight on like a tunnel for some way, and then 64 | dipped suddenly down, so suddenly that Alice had not a moment to think 65 | about stopping herself before she found herself falling down a very deep 66 | well. 
67 | 68 | WARC/1.0 69 | Content-Length: 714 70 | Warc-Date: 2015-07-29T20:10:45+02:00 71 | Warc-Type: resource 72 | Content-Type: text/plain 73 | Warc-Block-Digest: sha1:4c85924e3f2f368e5acc0ee89bee295ab8002079 74 | 75 | Either the well was very deep, or she fell very slowly, for she had 76 | plenty of time as she went down to look about her and to wonder what was 77 | going to happen next. First, she tried to look down and make out what 78 | she was coming to, but it was too dark to see anything; then she 79 | looked at the sides of the well, and noticed that they were filled with 80 | cupboards and book-shelves; here and there she saw maps and pictures 81 | hung upon pegs. She took down a jar from one of the shelves as 82 | she passed; it was labelled 'ORANGE MARMALADE', but to her great 83 | disappointment it was empty: she did not like to drop the jar for fear 84 | of killing somebody, so managed to put it into one of the cupboards as 85 | she fell past it. 86 | 87 | WARC/1.0 88 | Content-Type: text/plain 89 | Warc-Block-Digest: sha1:309231bd60e4ef85ed3a841a419aa59f1cb9b11b 90 | Content-Length: 262 91 | Warc-Date: 2015-07-29T20:10:45+02:00 92 | Warc-Type: resource 93 | 94 | 'Well!' thought Alice to herself, 'after such a fall as this, I shall 95 | think nothing of tumbling down stairs! How brave they'll all think me at 96 | home! Why, I wouldn't say anything about it, even if I fell off the top 97 | of the house!' (Which was very likely true.) 98 | 99 | WARC/1.0 100 | Content-Type: text/plain 101 | Warc-Block-Digest: sha1:e6066ba6c9767cb43122993e6509664201a242fd 102 | Content-Length: 714 103 | Warc-Date: 2015-07-29T20:10:45+02:00 104 | Warc-Type: resource 105 | 106 | Down, down, down. Would the fall NEVER come to an end! 'I wonder how 107 | many miles I've fallen by this time?' she said aloud. 'I must be getting 108 | somewhere near the centre of the earth. 
Let me see: that would be four 109 | thousand miles down, I think--' (for, you see, Alice had learnt several 110 | things of this sort in her lessons in the schoolroom, and though this 111 | was not a VERY good opportunity for showing off her knowledge, as there 112 | was no one to listen to her, still it was good practice to say it over) 113 | '--yes, that's about the right distance--but then I wonder what Latitude 114 | or Longitude I've got to?' (Alice had no idea what Latitude was, or 115 | Longitude either, but thought they were nice grand words to say.) 116 | 117 | WARC/1.0 118 | Content-Type: text/plain 119 | Warc-Block-Digest: sha1:162dea8c101a7634462896c5c1e138d315be8b63 120 | Content-Length: 690 121 | Warc-Date: 2015-07-29T20:10:45+02:00 122 | Warc-Type: resource 123 | 124 | Presently she began again. 'I wonder if I shall fall right THROUGH the 125 | earth! How funny it'll seem to come out among the people that walk with 126 | their heads downward! The Antipathies, I think--' (she was rather glad 127 | there WAS no one listening, this time, as it didn't sound at all the 128 | right word) '--but I shall have to ask them what the name of the country 129 | is, you know. Please, Ma'am, is this New Zealand or Australia?' (and 130 | she tried to curtsey as she spoke--fancy CURTSEYING as you're falling 131 | through the air! Do you think you could manage it?) 'And what an 132 | ignorant little girl she'll think me for asking! No, it'll never do to 133 | ask: perhaps I shall see it written up somewhere.' 134 | 135 | WARC/1.0 136 | Content-Type: text/plain 137 | Warc-Block-Digest: sha1:7adec3d60dc55bc888c427f66ba0fb3075695bef 138 | Content-Length: 980 139 | Warc-Date: 2015-07-29T20:10:45+02:00 140 | Warc-Type: resource 141 | 142 | Down, down, down. There was nothing else to do, so Alice soon began 143 | talking again. 'Dinah'll miss me very much to-night, I should think!' 144 | (Dinah was the cat.) 'I hope they'll remember her saucer of milk at 145 | tea-time. 
Dinah my dear! I wish you were down here with me! There are no 146 | mice in the air, I'm afraid, but you might catch a bat, and that's very 147 | like a mouse, you know. But do cats eat bats, I wonder?' And here Alice 148 | began to get rather sleepy, and went on saying to herself, in a dreamy 149 | sort of way, 'Do cats eat bats? Do cats eat bats?' and sometimes, 'Do 150 | bats eat cats?' for, you see, as she couldn't answer either question, 151 | it didn't much matter which way she put it. She felt that she was dozing 152 | off, and had just begun to dream that she was walking hand in hand with 153 | Dinah, and saying to her very earnestly, 'Now, Dinah, tell me the truth: 154 | did you ever eat a bat?' when suddenly, thump! thump! down she came upon 155 | a heap of sticks and dry leaves, and the fall was over. 156 | 157 | -------------------------------------------------------------------------------- /testdata/test.warc.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/test.warc.gz -------------------------------------------------------------------------------- /testdata/warcio/bad.arc: -------------------------------------------------------------------------------- 1 | filedesc://bad.arc.gz 127.0.0.1 20140301000000 text/plain -1 2 | 1 0 Bad Capture 3 | URL IP-address Archive-date Content-type Archive-length 4 | 5 | http://example.com/ 93.184.216.119 201404010000000000 text/html -1 6 | 7 | http://example.com/ 127.0.0.1 20140102000000 text/plain 1 8 | 9 | 10 | http://example.com/ 93.184.216.119 201404010000000000 text/html abc 11 | 12 | -------------------------------------------------------------------------------- /testdata/warcio/example-bad-non-chunked.warc.gz: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example-bad-non-chunked.warc.gz -------------------------------------------------------------------------------- /testdata/warcio/example-bad.warc.gz.bad: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example-bad.warc.gz.bad -------------------------------------------------------------------------------- /testdata/warcio/example-iana.org-chunked.warc: -------------------------------------------------------------------------------- 1 | WARC/1.0 2 | WARC-Record-ID: 3 | WARC-Type: warcinfo 4 | WARC-Filename: WARCPROX-20170306165409966-00000-13133-ilya-macbook.warc 5 | WARC-Date: 2017-03-06T16:54:09Z 6 | Content-Type: application/warc-fields 7 | Content-Length: 137 8 | 9 | software: webrecorder.io 2.0 (warcprox 1.4-20151006074455-78e4ecd) 10 | hostname: ilya-macbook 11 | ip: 127.0.0.1 12 | format: WARC File Format 1.0 13 | 14 | 15 | WARC/1.0 16 | WARC-Type: response 17 | WARC-Record-ID: 18 | WARC-Date: 2017-03-06T16:54:09Z 19 | WARC-Target-URI: http://www.iana.org/ 20 | WARC-IP-Address: 192.0.32.8 21 | Content-Type: application/http;msgtype=response 22 | Content-Length: 7566 23 | WARC-Block-Digest: sha1:a54fe86cc15cbb3c66f29596f26395bb2f7b5cc6 24 | WARC-Payload-Digest: sha1:b1f949b4920c773fd9c863479ae9a788b948c7ad 25 | 26 | HTTP/1.1 200 OK 27 | Vary: Accept-Encoding 28 | Last-Modified: Fri, 03 Feb 2017 02:44:29 GMT 29 | Content-Type: text/html; charset=UTF-8 30 | Cache-Control: public, max-age=3600 31 | Transfer-Encoding: chunked 32 | Date: Mon, 06 Mar 2017 16:54:09 GMT 33 | Connection: keep-alive 34 | Server: Apache 35 | X-Cache-Host: cache2.lax.icann.org 36 | X-Cache-Hits: 157098 37 | 38 | 001c37 39 | 40 | 41 | 42 | Internet Assigned Numbers Authority 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 
| 57 | 69 | 70 | 71 | 72 | 73 | 74 | 75 |
76 |
77 |

Internet Assigned Numbers Authority

78 |
79 |

The global coordination of the DNS Root, IP addressing, 80 | and other Internet protocol resources is performed as the 81 | Internet Assigned Numbers Authority (IANA) functions. 82 | Learn more.

83 |
84 |
85 | 86 |
87 | 88 |
89 |
90 |
91 |

Domain Names

92 |

Management of the DNS Root Zone (assignments of ccTLDs and gTLDs) along with other functions such as the .int and .arpa zones.

93 | 100 |
101 |
102 |

Number Resources

103 |

Coordination of the global IP and AS number spaces, such as allocations made to Regional Internet Registries.

104 | 108 |
109 |
110 |

Protocol Assignments

111 |

The central repository for protocol name and number registries used in many Internet protocols.

112 | 117 |
118 |
119 | 120 |
121 |

122 | New Root Zone trust anchor being distributed. 123 | Following successful key ceremonies in October 2016 and February 2017, 124 | a new key signing key for the DNS root zone has been generated that is expected 125 | to be used later in 2017. Software implementing DNSSEC validation should obtain 126 | the revised root anchors in our repository. 127 | This is an important step in the KSK Rollover project. 128 |

129 |

130 | 2016 Annual Customer Survey available. 131 | In September 2016, we invited our customers to participate in our 132 | fifth annual customer satisfaction survey. We have now published the 133 | compiled results, showing continued very positive satisfaction from 134 | our communities. The report describes the methodology and responses 135 | received. 136 |

137 |
138 | 139 | 142 | 143 |
144 |
145 | 146 | 200 |
201 | 202 | 203 | 204 | 205 | 206 | 0 207 | 208 | 209 | 210 | WARC/1.0 211 | WARC-Type: request 212 | WARC-Record-ID: 213 | WARC-Date: 2017-03-06T16:54:09Z 214 | WARC-Target-URI: http://www.iana.org/ 215 | WARC-Concurrent-To: 216 | WARC-Block-Digest: sha1:01a92c4b0e2c3d3f0e80e8e26cad07509ae8831a 217 | Content-Type: application/http;msgtype=request 218 | Content-Length: 76 219 | 220 | GET / HTTP/1.1 221 | Host: www.iana.org 222 | Accept: */* 223 | User-agent: curl/7.43.0 224 | 225 | 226 | 227 | -------------------------------------------------------------------------------- /testdata/warcio/example-resource.warc.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example-resource.warc.gz -------------------------------------------------------------------------------- /testdata/warcio/example-trunc.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example-trunc.txt -------------------------------------------------------------------------------- /testdata/warcio/example-trunc.warc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example-trunc.warc -------------------------------------------------------------------------------- /testdata/warcio/example.arc: -------------------------------------------------------------------------------- 1 | filedesc://live-web-example.arc.gz 127.0.0.1 20140216050221 text/plain 75 2 | 1 0 LiveWeb Capture 3 | URL IP-address Archive-date Content-type Archive-length 4 | 5 | http://example.com/ 93.184.216.119 20140216050221 text/html 1591 6 | HTTP/1.1 200 OK 7 | Accept-Ranges: bytes 8 | Cache-Control: max-age=604800 9 | 
Content-Type: text/html 10 | Date: Sun, 16 Feb 2014 05:02:20 GMT 11 | Etag: "359670651" 12 | Expires: Sun, 23 Feb 2014 05:02:20 GMT 13 | Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT 14 | Server: ECS (sjc/4FCE) 15 | X-Cache: HIT 16 | x-ec-custom-error: 1 17 | Content-Length: 1270 18 | 19 | 20 | 21 | 22 | Example Domain 23 | 24 | 25 | 26 | 27 | 58 | 59 | 60 | 61 |
62 |

Example Domain

63 |

This domain is established to be used for illustrative examples in documents. You may use this 64 | domain in examples without prior coordination or asking for permission.

65 |

More information...

66 |
67 | 68 | 69 | 70 | -------------------------------------------------------------------------------- /testdata/warcio/example.arc.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example.arc.gz -------------------------------------------------------------------------------- /testdata/warcio/example.warc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example.warc -------------------------------------------------------------------------------- /testdata/warcio/example.warc.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/example.warc.gz -------------------------------------------------------------------------------- /testdata/warcio/post-test.warc.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datatogether/warc/74ef3f5ea69fdd855fb51dd70d0ab4e7d515dda0/testdata/warcio/post-test.warc.gz -------------------------------------------------------------------------------- /writer.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3 | import ( 4 | "bytes" 5 | "compress/gzip" 6 | stdErrors "errors" 7 | "fmt" 8 | "io" 9 | "net/http" 10 | "sort" 11 | 12 | "github.com/pborman/uuid" 13 | "github.com/pkg/errors" 14 | ) 15 | 16 | // NewUUID generates a new version 4 uuid 17 | func NewUUID() string { 18 | return fmt.Sprintf("", uuid.New()) 19 | } 20 | 21 | type flusher interface { 22 | Flush() error 23 | } 24 | 25 | type closeResetWriter interface { 26 | Close() error 27 | Reset(w io.Writer) 28 | } 29 | 30 | // Writer provides 
functionality for writing WARC files in compressed and 31 | // uncompressed formats. 32 | // 33 | // To construct a Writer, call NewWriterCompressed or NewWriterRaw. 34 | type Writer struct { 35 | seekW io.WriteSeeker 36 | wr io.Writer 37 | cmprs bool 38 | 39 | // RecordCallback will be called after each record is written to the file. 40 | // If a WriteSeeker was not provided, the provided positions will be 41 | // invalid. 42 | RecordCallback func(r *Record, startPos, endPos int64) 43 | } 44 | 45 | // NewWriterCompressed initializes a WARC Writer writing to a compressed 46 | // stream. The first parameter should be the "backing stream" of the 47 | // compression. The second parameter is a compress/gzip writer writing to the 48 | // rawFile parameter. 49 | // 50 | // Seek will only be called with whence == io.SeekCurrent and offset == 0. 51 | // 52 | // See also CountWriter() if you need a "fake" Seek implementation. 53 | func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter *gzip.Writer) (*Writer, error) { 54 | w := &Writer{ 55 | seekW: rawFile, 56 | wr: cmprsWriter, 57 | cmprs: true, 58 | } 59 | return w, nil 60 | } 61 | 62 | // NewWriterRaw initializes a WARC Writer writing to an uncompressed stream. 63 | // If the provided Writer implements io.Seeker, the RecordCallback will be 64 | // available. If the provided Writer implements interface{Flush() error}, it 65 | // will be flushed after every written Record. 66 | // 67 | // See also CountWriter() if you need a "fake" Seek implementation. 68 | func NewWriterRaw(out io.Writer) (*Writer, error) { 69 | w := &Writer{ 70 | wr: out, 71 | } 72 | if wseeker, ok := out.(io.WriteSeeker); ok { 73 | w.seekW = wseeker 74 | } 75 | return w, nil 76 | } 77 | 78 | type countWriter struct { 79 | count int64 80 | w io.Writer 81 | } 82 | 83 | // CountWriter implements a limited version of io.Seeker around the provided 84 | // Writer. 
It only supports offset == 0 and whence == io.SeekCurrent or 85 | // io.SeekEnd, and returns the current number of written bytes in both cases. 86 | func CountWriter(w io.Writer) io.WriteSeeker { 87 | return &countWriter{count: 0, w: w} 88 | } 89 | 90 | // implements io.Writer 91 | func (c *countWriter) Write(p []byte) (int, error) { 92 | n, err := c.w.Write(p) 93 | if n >= 0 { 94 | c.count += int64(n) 95 | } 96 | return n, err 97 | } 98 | 99 | var errCountWriterNotImplemented = stdErrors.New("unsupported seek operation") 100 | 101 | // implements io.Seeker 102 | func (c *countWriter) Seek(offset int64, whence int) (int64, error) { 103 | if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) { 104 | return 0, errCountWriterNotImplemented 105 | } 106 | return c.count, nil 107 | } 108 | 109 | // WriteRecord adds the record to the WARC file and returns the file offsets 110 | // the record was written at. 111 | // 112 | // No processing is done to the Record contents beyond those mentioned in 113 | // Record.Write. If clients want extra processing (e.g. setting the 114 | // Warcinfo-Id header) they are encouraged to create a wrapper. 
115 | func (w *Writer) WriteRecord(rec *Record) (startPos, endPos int64, err error) { 116 | if w.seekW != nil { 117 | startPos, err = w.seekW.Seek(0, io.SeekCurrent) 118 | err = errors.Wrap(err, "warc writer: seek 0") 119 | if err != nil { 120 | return 121 | } 122 | } 123 | 124 | err = rec.Write(w.wr) 125 | err = errors.Wrap(err, "warc writer: write record") 126 | if err != nil { 127 | return 128 | } 129 | 130 | // flush is not sufficient for gzip writer, need to Close/Reset 131 | closeReset, crOK := w.wr.(closeResetWriter) 132 | crOK = crOK && w.cmprs 133 | if flusher, ok := w.wr.(flusher); ok && !crOK { 134 | err = errors.Wrap(flusher.Flush(), "warc writer: flush") 135 | if err != nil { 136 | return 137 | } 138 | } 139 | if crOK { 140 | err = errors.Wrap(closeReset.Close(), "warc writer: flush") 141 | if err != nil { 142 | return 143 | } 144 | } 145 | // check the position BETWEEN close / reset 146 | if w.seekW != nil { 147 | endPos, err = w.seekW.Seek(0, io.SeekCurrent) 148 | err = errors.Wrap(err, "warc writer: seek 0") 149 | if err != nil { 150 | return 151 | } 152 | } 153 | if crOK { 154 | closeReset.Reset(w.seekW) 155 | } 156 | 157 | return 158 | } 159 | 160 | // Close cleans up any resources the warc.Writer might be holding on to. 161 | func (w *Writer) Close() error { 162 | return nil 163 | } 164 | 165 | // WriteRecords calls Write on each record to w. 
166 | // Deprecated: see Writer type 167 | func WriteRecords(w io.Writer, records Records) error { 168 | for _, rec := range records { 169 | if err := rec.Write(w); err != nil { 170 | return err 171 | } 172 | } 173 | return nil 174 | } 175 | 176 | // writeHeader writes a fully formed header with version to w 177 | func writeHeader(w io.Writer, r *Record) error { 178 | if err := writeWarcVersion(w, r); err != nil { 179 | return err 180 | } 181 | if err := writeFields(w, r.Headers); err != nil { 182 | return err 183 | } 184 | if _, err := io.WriteString(w, "\r\n"); err != nil { 185 | return err 186 | } 187 | return nil 188 | } 189 | 190 | // writeBlock writes all of reader (record content) to w, followed by two CRLFs 191 | func writeBlock(w io.Writer, r io.Reader) error { 192 | if _, err := io.Copy(w, r); err != nil { 193 | return err 194 | } 195 | // write 2xCRLF 196 | _, err := io.WriteString(w, "\r\n\r\n") 197 | return err 198 | } 199 | 200 | // writeWarcVersion writes the warc version header 201 | func writeWarcVersion(w io.Writer, r *Record) error { 202 | _, err := io.WriteString(w, r.Format.String()+"\r\n") 203 | return err 204 | } 205 | 206 | // WriteRequestMethodAndHeaders calls req.Write(w).
(deprecated, see 207 | // NewRequestResponseRecords) 208 | func WriteRequestMethodAndHeaders(w io.Writer, req *http.Request) error { 209 | return req.Write(w) 210 | } 211 | 212 | // WriteHTTPHeaders writes all http headers to an io.Writer, separated by newlines 213 | // Used to add http headers to a record 214 | func WriteHTTPHeaders(w io.Writer, headers http.Header) error { 215 | for k := range headers { 216 | if _, err := io.WriteString(w, fmt.Sprintf("%s: %s\n", k, headers.Get(k))); err != nil { 217 | return err 218 | } 219 | } 220 | return nil 221 | } 222 | 223 | // replaceBlockBody replaces the body of a warc record, leaving 224 | // any written headers in place 225 | func replaceBlockBody(data, repl []byte) ([]byte, error) { 226 | start := bytes.LastIndex(data, crlf) 227 | if start == -1 { 228 | return repl, nil 229 | } 230 | return append(data[start:], repl...), nil 231 | } 232 | 233 | // writeFields takes a map of field names to values and writes them to w, sorted by name 234 | // it skips fields whose value is "" 235 | func writeFields(w io.Writer, fields map[string]string) error { 236 | keys := make([]string, len(fields)) 237 | i := 0 238 | for field := range fields { 239 | keys[i] = field 240 | i++ 241 | } 242 | 243 | // sort fields alphabetically 244 | sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] }) 245 | 246 | for _, key := range keys { 247 | if err := writeField(w, key, fields[key]); err != nil { 248 | return err 249 | } 250 | } 251 | return nil 252 | } 253 | 254 | func writeField(w io.Writer, key, value string) error { 255 | // don't write empty fields 256 | if value == "" { 257 | return nil 258 | } 259 | // format entry 260 | ln := fmt.Sprintf("%s: %s\r\n", key, value) 261 | _, err := io.WriteString(w, ln) 262 | return err 263 | } 264 | -------------------------------------------------------------------------------- /writer_test.go: -------------------------------------------------------------------------------- 1 | package warc 2 | 3
| import ( 4 | "bytes" 5 | "compress/gzip" 6 | "crypto/sha1" 7 | "encoding/base32" 8 | "fmt" 9 | "io" 10 | "os" 11 | "strings" 12 | "testing" 13 | 14 | dmp "github.com/sergi/go-diff/diffmatchpatch" 15 | ) 16 | 17 | func init() { 18 | for _, t := range []*[]byte{ 19 | &WARCInfoRecord, 20 | &ResponseRecord, 21 | &ResponseRecord2, 22 | &RequestRecord, 23 | &RequestRecord2, 24 | &RevisitRecord1, 25 | &RevisitRecord2, 26 | &ResourceRecord, 27 | &MetadataRecord, 28 | &DNSResponseRecord, 29 | &DNSResourceRecord, 30 | } { 31 | // need to replace '\r' from raw string literals with actual 32 | // carriage return character 33 | *t = bytes.Replace(*t, []byte{'\\', 'r'}, []byte{0x0d}, -1) 34 | } 35 | } 36 | 37 | func TestNewUUID(t *testing.T) { 38 | id := NewUUID() 39 | if !strings.HasPrefix(id, "= len(expect) || b != expect[i] { 268 | return fmt.Errorf("byte length mismatch. expected: %d, got: %d. first error at index %d: '%#v'", len(expect), len(buf.Bytes()), i, b) 269 | } 270 | } 271 | 272 | return fmt.Errorf("byte length mismatch. expected: %d, got: %d, ", len(expect), len(buf.Bytes())) 273 | } 274 | 275 | if !bytes.Equal(buf.Bytes(), expect) { 276 | return fmt.Errorf("byte mismatch: %s != %s", buf.String(), string(expect)) 277 | } 278 | 279 | if r.Headers[FieldNameWARCBlockDigest] != "" { 280 | checkSha1Hash(r.Content.Bytes(), r.Headers[FieldNameWARCBlockDigest]) 281 | } 282 | 283 | return nil 284 | } 285 | 286 | func checkSha1Hash(content []byte, hashstr string) error { 287 | hash := sha1.Sum(content) 288 | buf := &bytes.Buffer{} 289 | base32.NewEncoder(base32.StdEncoding, buf).Write(hash[:]) 290 | s := fmt.Sprintf("sha1:%s", buf.String()) 291 | if s != hashstr { 292 | return fmt.Errorf("hash mismatch. expected '%s'. 
got: '%s'", hashstr, s) 293 | } 294 | return nil 295 | } 296 | 297 | // func testRequestResponseConcur(t *testing.T) { 298 | // } 299 | 300 | // func testReadFromStreamNoContentLength(t *testing.T) { 301 | // } 302 | 303 | func validateResponse(r *Record) error { 304 | return nil 305 | } 306 | 307 | var WARCInfoRecordID = "" 308 | var WARCInfoRecord = []byte(`WARC/1.0\r 309 | Content-Length: 86\r 310 | Content-Type: application/warc-fields\r 311 | WARC-Date: 2000-01-01T00:00:00Z\r 312 | WARC-Filename: testfile.warc.gz\r 313 | WARC-Record-ID: \r 314 | WARC-Type: warcinfo\r 315 | \r 316 | software: recorder test\r 317 | format: WARC File Format 1.0\r 318 | json-metadata: {"foo": "bar"}\r 319 | \r 320 | \r 321 | `) 322 | 323 | var ResponseRecordID = "" 324 | var ResponseRecord = []byte(`WARC/1.0\r 325 | Content-Length: 97\r 326 | Content-Type: application/http; msgtype=response\r 327 | WARC-Block-Digest: sha1:OS3OKGCWQIJOAOC3PKXQOQFD52NECQ74\r 328 | WARC-Date: 2000-01-01T00:00:00Z\r 329 | WARC-Payload-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 330 | WARC-Record-ID: \r 331 | WARC-Target-URI: http://example.com/\r 332 | WARC-Type: response\r 333 | \r 334 | HTTP/1.0 200 OK\r 335 | Content-Type: text/plain; charset="UTF-8"\r 336 | Custom-Header: somevalue\r 337 | \r 338 | some 339 | text\r 340 | \r 341 | `) 342 | 343 | var ResponseRecord2ID = "" 344 | var ResponseRecord2 = []byte(` 345 | WARC/1.0\r 346 | WARC-Type: response\r 347 | WARC-Record-ID: \r 348 | WARC-Target-URI: http://example.com/\r 349 | WARC-Date: 2000-01-01T00:00:00Z\r 350 | WARC-Payload-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 351 | WARC-Block-Digest: sha1:U6KNJY5MVNU3IMKED7FSO2JKW6MZ3QUX\r 352 | Content-Type: application/http; msgtype=response\r 353 | Content-Length: 145\r 354 | \r 355 | HTTP/1.0 200 OK\r 356 | Content-Type: text/plain; charset="UTF-8"\r 357 | Content-Length: 9\r 358 | Custom-Header: somevalue\r 359 | Content-Encoding: x-unknown\r 360 | \r 361 | some 362 | text\r 363 | \r 364 
| `) 365 | 366 | var RequestRecordID = "" 367 | var RequestRecord = []byte(`WARC/1.0\r 368 | Content-Length: 54\r 369 | Content-Type: application/http; msgtype=request\r 370 | WARC-Block-Digest: sha1:ONEHF6PTXPTTHE3333XHTD2X45TZ3DTO\r 371 | WARC-Date: 2000-01-01T00:00:00Z\r 372 | WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ\r 373 | WARC-Record-ID: \r 374 | WARC-Target-URI: http://example.com/\r 375 | WARC-Type: request\r 376 | \r 377 | GET / HTTP/1.0\r 378 | User-Agent: foo\r 379 | Host: example.com\r 380 | \r 381 | \r 382 | \r 383 | `) 384 | 385 | var RequestRecord2ID = "" 386 | var RequestRecord2 = []byte(` 387 | WARC/1.0\r 388 | WARC-Type: request\r 389 | WARC-Record-ID: \r 390 | WARC-Target-URI: http://example.com/\r 391 | WARC-Date: 2000-01-01T00:00:00Z\r 392 | WARC-Payload-Digest: sha1:R5VZAKIE53UW5VGK43QJIFYS333QM5ZA\r 393 | WARC-Block-Digest: sha1:L7SVBUPPQ6RH3ANJD42G5JL7RHRVZ5DV\r 394 | Content-Type: application/http; msgtype=request\r 395 | Content-Length: 92\r 396 | \r 397 | POST /path HTTP/1.0\r 398 | Content-Type: application/json\r 399 | Content-Length: 17\r 400 | \r 401 | {"some": "value"}\r 402 | \r 403 | `) 404 | 405 | var RevisitRecord1ID = "" 406 | var RevisitRecord1 = []byte(` 407 | WARC/1.0\r 408 | WARC-Type: revisit\r 409 | WARC-Record-ID: \r 410 | WARC-Target-URI: http://example.com/\r 411 | WARC-Date: 2000-01-01T00:00:00Z\r 412 | WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest\r 413 | WARC-Refers-To-Target-URI: http://example.com/foo\r 414 | WARC-Refers-To-Date: 1999-01-01T00:00:00Z\r 415 | WARC-Payload-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 416 | WARC-Block-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ\r 417 | Content-Type: application/http; msgtype=response\r 418 | Content-Length: 0\r 419 | \r 420 | \r 421 | \r 422 | `) 423 | 424 | var RevisitRecord2ID = "" 425 | var RevisitRecord2 = []byte(` 426 | WARC/1.0\r 427 | WARC-Type: revisit\r 428 | WARC-Record-ID: \r 429 | WARC-Target-URI: 
http://example.com/\r 430 | WARC-Date: 2000-01-01T00:00:00Z\r 431 | WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest\r 432 | WARC-Refers-To-Target-URI: http://example.com/foo\r 433 | WARC-Refers-To-Date: 1999-01-01T00:00:00Z\r 434 | WARC-Payload-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 435 | WARC-Block-Digest: sha1:A6J5UTI2QHHCZFCFNHQHCDD3JJFKP53V\r 436 | Content-Type: application/http; msgtype=response\r 437 | Content-Length: 88\r 438 | \r 439 | HTTP/1.0 200 OK\r 440 | Content-Type: text/plain; charset="UTF-8"\r 441 | Custom-Header: somevalue\r 442 | \r 443 | \r 444 | \r 445 | `) 446 | 447 | var ResourceRecordID = "" 448 | var ResourceRecord = []byte(` 449 | WARC/1.0\r 450 | WARC-Type: resource\r 451 | WARC-Record-ID: \r 452 | WARC-Target-URI: ftp://example.com/\r 453 | WARC-Date: 2000-01-01T00:00:00Z\r 454 | WARC-Payload-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 455 | WARC-Block-Digest: sha1:B6QJ6BNJ3R4B23XXMRKZKHLPGJY2VE4O\r 456 | Content-Type: text/plain\r 457 | Content-Length: 9\r 458 | \r 459 | some 460 | text\r 461 | \r 462 | `) 463 | 464 | var MetadataRecordID = "" 465 | var MetadataRecord = []byte(` 466 | WARC/1.0\r 467 | WARC-Type: metadata\r 468 | WARC-Record-ID: \r 469 | WARC-Target-URI: http://example.com/\r 470 | WARC-Date: 2000-01-01T00:00:00Z\r 471 | WARC-Payload-Digest: sha1:ZOLBLKAQVZE5DXH56XE6EH6AI6ZUGDPT\r 472 | WARC-Block-Digest: sha1:ZOLBLKAQVZE5DXH56XE6EH6AI6ZUGDPT\r 473 | Content-Type: application/json\r 474 | Content-Length: 67\r 475 | \r 476 | {"metadata": {"nested": "obj", "list": [1, 2, 3], "length": "123"}}\r 477 | \r 478 | `) 479 | 480 | var DNSResponseRecordID = "" 481 | var DNSResponseRecord = []byte(` 482 | WARC/1.0\r 483 | WARC-Type: response\r 484 | WARC-Record-ID: \r 485 | WARC-Target-URI: dns:google.com\r 486 | WARC-Date: 2000-01-01T00:00:00Z\r 487 | WARC-Payload-Digest: sha1:2AAVJYKKIWK5CF6EWE7PH63EMNLO44TH\r 488 | WARC-Block-Digest: sha1:2AAVJYKKIWK5CF6EWE7PH63EMNLO44TH\r 489 | 
Content-Type: application/http; msgtype=response\r 490 | Content-Length: 147\r 491 | \r 492 | 20170509000739 493 | google.com. 185 IN A 209.148.113.239 494 | google.com. 185 IN A 209.148.113.238 495 | google.com. 185 IN A 209.148.113.250 496 | \r\r 497 | `) 498 | 499 | var DNSResourceRecordID = "" 500 | var DNSResourceRecord = []byte(` 501 | WARC/1.0\r 502 | WARC-Type: resource\r 503 | WARC-Record-ID: \r 504 | WARC-Target-URI: dns:google.com\r 505 | WARC-Date: 2000-01-01T00:00:00Z\r 506 | WARC-Payload-Digest: sha1:2AAVJYKKIWK5CF6EWE7PH63EMNLO44TH\r 507 | WARC-Block-Digest: sha1:2AAVJYKKIWK5CF6EWE7PH63EMNLO44TH\r 508 | Content-Type: application/warc-record\r 509 | Content-Length: 147\r 510 | \r 511 | 20170509000739 512 | google.com. 185 IN A 209.148.113.239 513 | google.com. 185 IN A 209.148.113.238 514 | google.com. 185 IN A 209.148.113.250 515 | \r\r 516 | `) 517 | --------------------------------------------------------------------------------