├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual identity 11 | and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 
15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the 27 | overall community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or 32 | advances of any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email 36 | address, without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 
56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at [code-of-conduct@pdfa.org](mailto:code-of-conduct@pdfa.org) 64 | or directly to the CEO or Chair of the Board of Directors. 65 | All complaints will be reviewed and investigated promptly and fairly. 66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series 87 | of actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or 94 | permanent ban. 95 | 96 | ### 3. 
Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within 114 | the community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.0, available at 120 | [https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available 127 | at [https://www.contributor-covenant.org/translations][translations]. 
128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | 135 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # Creative Commons Attribution 4.0 International 2 | 3 | Creative Commons Corporation ("Creative Commons") is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an "as-is" basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible. 4 | 5 | ### Using Creative Commons Public Licenses 6 | 7 | Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses. 8 | 9 | * __Considerations for licensors:__ Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. 
Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. [More considerations for licensors](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensors). 10 | 11 | * __Considerations for the public:__ By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor's permission is not necessary for any reason-for example, because of any applicable exception or limitation to copyright - then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. Although not required by our licenses, you are encouraged to respect those requests where reasonable. [More considerations for the public](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensees). 12 | 13 | ## Creative Commons Attribution 4.0 International Public License 14 | 15 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License"). 
To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. 16 | 17 | ### Section 1 - Definitions. 18 | 19 | a. __Adapted Material__ means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. 20 | 21 | b. __Adapter's License__ means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. 22 | 23 | c. __Copyright and Similar Rights__ means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. 24 | 25 | d. __Effective Technological Measures__ means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. 26 | 27 | e. 
__Exceptions and Limitations__ means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. 28 | 29 | f. __Licensed Material__ means the artistic or literary work, database, or other material to which the Licensor applied this Public License. 30 | 31 | g. __Licensed Rights__ means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. 32 | 33 | h. __Licensor__ means the individual(s) or entity(ies) granting rights under this Public License. 34 | 35 | i. __Share__ means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. 36 | 37 | j. __Sui Generis Database Rights__ means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. 38 | 39 | k. __You__ means the individual or entity exercising the Licensed Rights under this Public License. __Your__ has a corresponding meaning. 40 | 41 | ### Section 2 - Scope. 42 | 43 | a. ___License grant.___ 44 | 45 | 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: 46 | 47 | A. 
reproduce and Share the Licensed Material, in whole or in part; and 48 | 49 | B. produce, reproduce, and Share Adapted Material. 50 | 51 | 2. __Exceptions and Limitations.__ For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 52 | 53 | 3. __Term.__ The term of this Public License is specified in Section 6(a). 54 | 55 | 4. __Media and formats; technical modifications allowed.__ The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 56 | 57 | 5. __Downstream recipients.__ 58 | 59 | A. __Offer from the Licensor - Licensed Material.__ Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. 60 | 61 | B. __No downstream restrictions.__ You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 62 | 63 | 6. 
__No endorsement.__ Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). 64 | 65 | b. ___Other rights.___ 66 | 67 | 1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 68 | 69 | 2. Patent and trademark rights are not licensed under this Public License. 70 | 71 | 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. 72 | 73 | ### Section 3 - License Conditions. 74 | 75 | Your exercise of the Licensed Rights is expressly made subject to the following conditions. 76 | 77 | a. ___Attribution.___ 78 | 79 | 1. If You Share the Licensed Material (including in modified form), You must: 80 | 81 | A. retain the following if it is supplied by the Licensor with the Licensed Material: 82 | 83 | i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); 84 | 85 | ii. a copyright notice; 86 | 87 | iii. a notice that refers to this Public License; 88 | 89 | iv. a notice that refers to the disclaimer of warranties; 90 | 91 | v. 
a URI or hyperlink to the Licensed Material to the extent reasonably practicable; 92 | 93 | B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and 94 | 95 | C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 96 | 97 | 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 98 | 99 | 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. 100 | 101 | 4. If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License. 102 | 103 | ### Section 4 - Sui Generis Database Rights. 104 | 105 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: 106 | 107 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; 108 | 109 | b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and 110 | 111 | c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. 
112 | 113 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. 114 | 115 | ### Section 5 - Disclaimer of Warranties and Limitation of Liability. 116 | 117 | a. __Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.__ 118 | 119 | b. __To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.__ 120 | 121 | c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. 122 | 123 | ### Section 6 - Term and Termination. 124 | 125 | a. This Public License applies for the term of the Copyright and Similar Rights licensed here. 
However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. 126 | 127 | b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 128 | 129 | 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 130 | 131 | 2. upon express reinstatement by the Licensor. 132 | 133 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. 134 | 135 | c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. 136 | 137 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. 138 | 139 | ### Section 7 - Other Terms and Conditions. 140 | 141 | a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 142 | 143 | b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. 144 | 145 | ### Section 8 - Interpretation. 146 | 147 | a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. 148 | 149 | b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. 
150 | 151 | c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. 152 | 153 | d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. 154 | 155 | > Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the "Licensor." The text of the Creative Commons public licenses is dedicated to the public domain under the [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/legalcode). Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at [creativecommons.org/policies](http://creativecommons.org/policies), Creative Commons does not authorize the use of the trademark "Creative Commons" or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses. 156 | > 157 | > Creative Commons may be contacted at [creativecommons.org](creativecommons.org). 
158 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # **PDF Corpora** 2 | 3 | ![LinkedIn](https://img.shields.io/static/v1?style=social&label=LinkedIn&logo=linkedin&message=PDF-Association) 4 |     5 | ![Twitter Follow](https://img.shields.io/twitter/follow/PDFAssociation?style=social) 6 |     7 | ![YouTube Channel Subscribers](https://img.shields.io/youtube/channel/subscribers/UCJL_M0VH2lm65gvGVarUTKQ?style=social) 8 | 9 | This index references a number of the more significant public corpora (data sets) that may contain both valid and invalid, real and synthetic PDF files, reflecting the realities of processing PDF files 'from the wild'. In addition, targeted test suites for specific PDF features or ISO subsets of PDF are also listed. It is not intended to be a list of every website where PDFs may be obtained. 10 | 11 | This informative resource is freely provided to all interested PDF developers, end-users and researchers. It does not reflect an endorsement of any organization, website or corpus. If you curate, maintain or identify other corpora that you believe might be useful to the PDF industry and are freely available, please contact us. Some corpora may require registration for access. 12 | 13 | 14 | **CAUTION: as with any file downloaded from the internet, good computer security and hygiene practices should always be employed, as some of these corpora contain malicious files! Use at your own risk!** 15 | 16 | For groups interested in creating their own PDF-centric corpora using [GitHub](https://github.com/), please consider using [Git LFS](https://git-lfs.github.com/) so that large files can be easily supported. 
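As a rough illustration of the Git LFS suggestion above, the following Python sketch makes sure a corpus repository's `.gitattributes` file routes `*.pdf` files through Git LFS. The helper name `ensure_lfs_tracking` is hypothetical and not part of any tool. The attribute line is the one that `git lfs track "*.pdf"` writes. The sketch assumes Git LFS itself is already installed and enabled with `git lfs install`.

```python
from pathlib import Path

# The attribute line that `git lfs track "*.pdf"` writes to .gitattributes.
LFS_PDF_RULE = "*.pdf filter=lfs diff=lfs merge=lfs -text"


def ensure_lfs_tracking(repo_root: str, rule: str = LFS_PDF_RULE) -> bool:
    """Append `rule` to <repo_root>/.gitattributes if it is missing.

    Hypothetical helper for illustration only.
    Returns True if the file was modified, False if the rule was already present.
    """
    attributes = Path(repo_root) / ".gitattributes"
    existing = attributes.read_text(encoding="utf-8") if attributes.exists() else ""
    if rule in (line.strip() for line in existing.splitlines()):
        return False
    # Keep the file newline-terminated so the new rule starts on its own line.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    attributes.write_text(existing + rule + "\n", encoding="utf-8")
    return True
```

Calling `ensure_lfs_tracking(".")` from the root of a corpus repository and then committing `.gitattributes` before adding any PDFs should have the same effect as running `git lfs track "*.pdf"` by hand. PDFs committed before the rule exists stay as ordinary Git objects.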
17 | 18 | ## CC-MAIN-2021-31-PDF-UNTRUNCATED (SafeDocs) 19 | 20 | - https://digitalcorpora.org/cc-main-2021-31-pdf-untruncated/ 21 | - https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/ 22 | 23 | Announced in May 2023, this corpus contains nearly 8 million PDFs gathered from across the web in July/August of 2021. The PDF files were initially identified by [Common Crawl](https://commoncrawl.org/) as part of their July/August 2021 crawl (identified as "CC-MAIN-2021-31") with all truncated PDFs (those >1MB) subsequently updated and collated as part of the DARPA SafeDocs program. As far as we know, this is the largest corpus of PDFs that is an unbiased representation of the current global public web, overcoming truncation limitations, top-level domain constraints or other limitations that prior web-crawled corpora impose. See https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/ for a more detailed description. 24 | 25 | 26 | ## UNSAFE-DOCS (CC-MAIN-2021-31-UNSAFE) (SafeDocs) 27 | 28 | - https://digitalcorpora.org/corpora/file-corpora/unsafe-docs-cc-main-2021-31-unsafe/ 29 | - https://pdfa.org/unsafe-docs-a-new-safedocs-corpus-including-the-voice-of-the-offense/ (_announcement_) 30 | 31 | **CAUTION: this dataset is known to contain malformed and malicious PDF files that impact various PDF processors!** 32 | 33 | "_Following the release of the curated CC-MAIN-2021-31-PDF-UNTRUNCATED corpus, the UNSAFE-DOCS corpus contains over 5.0 million files collected by a team at NASA’s Jet Propulsion Laboratory (JPL) or synthetically generated by Kudu Dynamics’ “Voice of the Offense” (VoO) team for the Defense Advanced Research Projects Agency (DARPA)’s SafeDocs Program. 
34 | The common-crawl component of the UNSAFE-DOCS corpus (over 3.9 million PDF files as collected by a team at NASA’s Jet Propulsion Laboratory (JPL)) includes some truncated PDF files, as described in the paper “Building a Wide Reach Corpus for Secure Parser Development” by Allison et al ([[slides](http://spw20.langsec.org/slides/corpus_LangSec2020.pdf)] [[paper](http://spw20.langsec.org/papers/corpus_LangSec2020.pdf)]). This component was utilized within the SafeDocs program as the starting point for many generated files and as the Program’s evaluation corpus. It was later expanded and improved to become CC-MAIN-2021-31-PDF-UNTRUNCATED._" 35 | 36 | ## GovDocs1 37 | - https://digitalcorpora.org/corpora/files 38 | 39 | This is a very well-studied and large corpus created in 2010, comprising almost 1 million documents from the USA .gov TLD, of which more than 231,000 are PDF documents. The above URL provides lots of detailed information and references. For reference, the 1000 ZIP files total about 308GB. As of November 2020, GovDocs1 has joined the AWS Open Data Sponsorship Program and data is now available directly in AWS S3 (see https://digitalcorpora.org/s3_browser.html#corpora/files/govdocs1/zipfiles/) 40 | 41 | 42 | ## CommonCrawl.org 43 | - http://commoncrawl.org/ 44 | 45 | The Common Crawl corpus contains petabytes of data collected since 2008 and is the core data behind the Wayback Machine (https://web.archive.org/). It contains raw web page data, extracted metadata and text extractions and, of course, millions and millions of PDF files gathered from across the web. 46 | 47 | NOTE: Based on published SafeDocs research, many PDFs in the CommonCrawl database are known to be truncated. See: T. Allison et al., “*Building a Wide Reach Corpus*,” in LangSec 2020, May 2020. http://spw20.langsec.org/papers/corpus_LangSec2020.pdf: 48 | 49 | > Common Crawl truncates files at 1 MB. 
If researchers require intact files, they must re-pull truncated files from the original websites. In the December, 2019 crawl, nearly 430,000 PDFs (22%) were truncated. 50 | 51 | 52 | ## SafeDocs "Issue Tracker" Corpus 53 | - Individual PDFs can be downloaded from the Apache Tika regression test server: https://corpora.tika.apache.org/base/docs/bug_trackers/ 54 | - The November 2020 update conveniently pre-packages all PDFs into six compressed tarballs (.tgz files): https://corpora.tika.apache.org/base/packaged/pdfs/ 55 | - The initial release from Sept 2020 is also still available from https://corpora.tika.apache.org/base/packaged/pdfs/archive/ 56 | 57 | An outcome of the DARPA-funded SafeDocs research program, a large and growing corpus (>32K files, >31GB) collated by targeted deep-crawls of various issue trackers (e.g. Bugzilla, JIRA, GitHub) to extract PDF attachments on public bug reports for various well-known open-source PDF-aware implementations. These PDFs are not directly discoverable via standard internet search engines. By its very nature, this corpus has a higher-than-normal quantity of unusual and malformed PDFs. Further technical details of this corpus can be found at https://www.pdfa.org/a-new-stressful-pdf-corpus/ and https://www.pdfa.org/stressful-pdf-corpus-grows/ as well as README files on the Apache Tika regression server. 58 | 59 | **NOTE: this unsanitized collated corpus contains a few PDFs that are known to trigger certain anti-malware/anti-virus programs.** 60 | 61 | ## Library of Congress 1000 .gov dataset 62 | 63 | - https://www.loc.gov/item/2020445568/ 64 | 65 | Effectively a smaller but updated dataset, similar to GovDocs1: "_This dataset of 1,000 PDF files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PDF files and hosted on .gov domains. 
The set includes 1,000 unique PDF files and minimal metadata about these PDFs, including links to their locations within the Library's web archive. Dataset originally created 11/6/2018_." 66 | 67 | ## FoxHex0ne Mutations 68 | - https://foxhex0ne.com/ 69 | 70 | A set of mutated PDFs (and other document formats) created via mutation-based fuzzing. See also ~http://foxhex0ne.blogspot.com/2020/01/lets-continue-with-corpus-distillation.html~ https://web.archive.org/web/20200325183758/http://foxhex0ne.blogspot.com/2020/01/lets-continue-with-corpus-distillation.html. 71 | 72 | 73 | ## OpenPreserve Foundation Format Corpus 74 | - https://github.com/openpreserve/format-corpus 75 | 76 | This is a digital preservation-focused corpus that is openly licensed and covers a wide range of formats and creation tools. 77 | 78 | 79 | ## The Archivist's PDF Cabinet of Horrors 80 | - https://github.com/openpreserve/format-corpus/tree/master/pdfCabinetOfHorrors 81 | 82 | A smaller sub-corpus of PDF test files, created for detecting PDF features that are generally undesirable in an archival setting. 83 | 84 | 85 | ## Cal Poly Graphic Communications PDF/VT Test File Suite 1.0.1 86 | - https://www.pdfa.org/resource/cal-poly-pdfvt-test-suite/ 87 | 88 | This suite provides a collection of four sets of graphically-rich, robust and valid ISO 16612-2 PDF/VT-1 files targeting high-volume variable data printing. Each set comprises PDFs with 10, 100, 500, 1000, 5000, 10000, and 15000 records and can be useful for examining how PDF technology scales as PDF file size and page counts increase. 89 | 90 | 91 | ## VeraPDF Test Suite for PDF/A 92 | - https://www.pdfa.org/resource/verapdf-test-suite/ 93 | - https://github.com/veraPDF/veraPDF-corpus 94 | 95 | The veraPDF test corpus targets the ISO 19005 family of PDF/A specifications (Versions 1B, 1A, 2B, 2U, 2A, 3B, 3U, 3A) as well as a number of additional PDF test files for ISO 32000-1:2008 (PDF 1.7). 
This test suite complements the Isartor and Bavaria PDF/A-1b test suites and follows their test file pattern: 96 | - all test files are atomic; 97 | - they are self-documented via the document outlines; and 98 | - the naming pattern and the directory structure indicate relevant parts of ISO 19005 specifications 99 | 100 | ## Isartor Test Suite for PDF/A-1b 101 | - https://www.pdfa.org/resource/isartor-test-suite/ 102 | 103 | This test suite comprises a set of files that are used to check the conformance with PDF/A-1. More precisely, the Isartor test suite can be used to "validate the validators": It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations. 104 | 105 | 106 | ## BFO PDF/A Test Suite 107 | - https://github.com/bfosupport/pdfa-testsuite 108 | 109 | A collection of about 40 PDF documents that should either pass or fail a conformance test against the specified ISO 19005 PDF/A profile. The description.txt file lists the reason they should pass or fail. This collection was inspired by the Isartor test suite and follows a similar layout with respect to test case names, including the section of the PDF/A specification to which each test refers. Unlike Isartor there are also valid documents that test a particular area of the specification. 110 | 111 | 112 | ## "Synthetic PDF Testset for File Format Validation" 113 | - https://www.radar-service.eu/radar/en/dataset/JtlOdwQquZWDqQdq 114 | 115 | (Abstract) "_This data set presents a corpus of light-weight files designed to test the validation criteria of JHOVE's PDF module against "well-formedness". Test cases are based on structural requirements for PDF files as per ISO 32000-1:2008 standard. The basis for all test files is a single page, one line document with no special features such as linearization. 
While such a light-weight document only allows to check against a fragment of standard requirements, the focus was put on basic structure violations at the header, trailer, document catalog, page tree node and cross-reference levels. The test set also checks for basic violations at the page node, page resource and stream object level. The accompanying spreadsheet briefly categorizes and describes the test set and includes the outcome when running the test set against JHOVE 1.16, PDF-hul 1.8 as well as Adobe Acrobat Professional XI Pro (11.0.15). The spreadsheet also includes a codecov coverage statistic for the test set in relation to the JHOVE 1.16, PDF-hul 1.8 module. Further information can be found in the paper "A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly", published in the proceedings of the 14th International Conference on Digital Preservation (Kyoto, Japan, September 25-29 2017). While the spreadsheet only contains results of running the test set against JHOVE, it can be used as a ground truth for any file format validation process._" 116 | 117 | ## Ghent Working Group (GWG) Output Test Suite 118 | - https://www.gwg.org/workflow-tools-downloads/test-suites/ghent-output-suite/ 119 | 120 | The Ghent Output Suite (currently v5.0) has been created for testing PDF processing in the graphic arts industry and to determine whether workflows are behaving as expected according to the ISO 15930 family of PDF/X standards. The PDF files provide a series of test patches that can be used by end users of graphic arts equipment as well as developers of applications that handle PDF files. This test suite is highly technical and a good understanding of ISO 15930 is essential. 
121 | 122 | 123 | ## PDF/UA Reference Suite 124 | - https://www.pdfa.org/resource/pdfua-reference-suite/ 125 | 126 | To serve as a reference for software developers and practitioners interested in best practices for creating tagged and accessible PDF files, the PDF Association's PDF/UA Competence Center has posted a set of 10 PDF documents conforming to ISO 14289-1 PDF/UA-1. 127 | 128 | 129 | ## Matterhorn Protocol 1.02 130 | - https://www.pdfa.org/resource/the-matterhorn-protocol-1-02/ 131 | 132 | The PDF Association's PDF/UA Competence Center developed the Matterhorn Protocol as a list of all the possible ways to fail PDF/UA. Following the requirements of PDF/UA, the document consists of 31 checkpoints comprised of 136 Failure Conditions. The Matterhorn Protocol 1.02 (PDF, 339kB, 2014-06-26) is delivered as a PDF file conforming to PDF/UA-1 (ISO 14289-1) and to PDF/A-2a (ISO 19005-2) and is a reference-quality PDF/UA file. 133 | 134 | 135 | ## Altona Test Suite 136 | - http://www.eci.org/en/downloads 137 | 138 | The Altona Test Suite is a set of highly technical PDF files and patches specifically designed for testing ISO 15930 PDF/X compliance and color accuracy including transparency blending, font handling, smooth shades, gray balance, overprinting, etc. 139 | 140 | 141 | ## 3D PDF Showcase 142 | - https://www.pdfa.org/3d-pdf-showcase/ 143 | 144 | Originally collated by members of the 3D PDF Consortium ([Wayback Machine link](https://web.archive.org/web/20220425214456/http://3dpdfconsortium.org/showcase/)), the 3D PDF showcase corpus provides about 20 PDFs containing different kinds of 3D content from various creators. All PDFs were transitioned to the PDF Association. 145 | 146 | 147 | ## Google pdfium Regression Test Suite 148 | - https://pdfium.googlesource.com/pdfium_tests/ 149 | 150 | A set of PDFs (both real and synthetic) used in regression testing Google's pdfium implementation used in Chrome and elsewhere. 
151 | 152 | 153 | ## Mozilla pdf.js Regression Test Suite 154 | - https://github.com/mozilla/pdf.js/tree/master/test/pdfs 155 | 156 | PDF.js is a PDF viewer supported by Mozilla that is built with HTML5. 157 | 158 | 159 | ## Ghent Working Group (GWG) Processing Steps 160 | - https://www.gwg.org/processing-steps-specification/ 161 | 162 | Three sample PDF files containing ISO 19593-1 compliant processing step data (i.e. PDF optional content layers describing cut contours (die lines), varnish, braille, legends, etc.). These sample files are fully compliant with the ISO standard and serve to illustrate the concepts discussed in the standard. 163 | 164 | 165 | ## iText Regression Test Suite 166 | - https://github.com/itext/itext7 167 | 168 | Among the iText Java source code is a sizeable corpus of just under 4,000 PDF files provided as the regression test suite for the iText PDF library. The PDF files are nicely classified as to the PDF feature being tested, including a [good suite of encrypted PDFs](https://github.com/itext/itext7/tree/develop/kernel/src/test/resources/com/itextpdf/kernel/crypto/PdfEncryptionTest) and many others. Each PDF tends to be nice and small for targeted testing of a specific feature with set permutations of PDF keys, values, etc. 169 | 170 | ## Asymptote gallery samples 171 | - https://asymptote.sourceforge.io/gallery/PDFs/index.html: advanced shadings, clips, etc. 172 | - https://asymptote.sourceforge.io/gallery/animations/index.html: various PDFs containing embedded movies 173 | 174 | Asymptote is an open-source vector graphics language with advanced rendering capabilities. It can output to PDF using many of the advanced PDF graphics capabilities such as complex shadings/patterns, complex clips, etc. (rather than dumbing down to pre-rendered bitmap images like other packages). The Asymptote Gallery provides many examples in PDF format.
175 | 176 | ## Legacy Adobe Acrobat and Reader Engineering team test files (via Wayback Machine) 177 | - https://web.archive.org/web/20130503115947/http://acroeng.adobe.com/wp/ 178 | - https://web.archive.org/web/20130717074145/http://acroeng.adobe.com/wp/?page_id=208 179 | 180 | Note: all data is from the Wayback Machine and many of the PDFs were not archived. Sometimes files of the same name and size can be located elsewhere on the internet. 181 | 182 | Various small test sets (such as "Fast Web View/Linearization", "PDF Page Elements", "PDF Files Elements", "Acrobat Sanity Testing", "PDFs in Browser Testing", "Viewer Testing", and "(ISO) PDF Standards Testing") from the Adobe engineering teams responsible for Adobe Reader and Acrobat. 183 | 184 | ## Ground-truthed data sets for PDF table recognition 185 | - http://www.tamirhassan.com/html/dataset.html 186 | 187 | Two ground-truthed datasets of natively-digital PDF documents containing tables. These documents have been collected systematically from the European Union and US Government websites. 188 | 189 | 190 | ## HisDoc1B (Chinese) 191 | - https://github.com/SCUT-DLVCLab/HisDoc1B?tab=readme-ov-file 192 | - Shi, Y. et al. (January 2025) ‘A large-scale dataset for Chinese historical document recognition and analysis’, Scientific Data, 12(1), p. 169. https://doi.org/10.1038/s41597-025-04495-x. 193 | 194 | HisDoc1B is a large-scale annotated dataset for Chinese historical document recognition and analysis. The HisDoc1B corpus comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. The ZIP download referenced from the GitHub link above is 15.8GB. The HisDoc1B dataset consists of two main folders, one dedicated to storing historical documents in the form of e-books and the other containing their respective annotation files. The e-books are archived in PDF or DjVu formats.
The annotation files are stored in JavaScript Object Notation (JSON) format, aligning with each corresponding e-book. The dataset employs a unique book identifier (ID) to pair e-books with their annotation files, ensuring precise alignment. 195 | 196 | 197 | ## PDF-TREX Table Recognition and Extraction data set 198 | - http://staff.icar.cnr.it/ruffolo/pdf-trex 199 | 200 | The freely available PDF-TREX dataset is a standard dataset in the TREX (Table Recognition and EXtraction) field. The dataset contains 100 PDF documents and 164 tables having different layouts. 201 | 202 | 203 | ## Derek Willis' "The Worst PDFs on the Planet" 204 | - https://github.com/dwillis/nicar25-pdfs 205 | - https://arstechnica.com/ai/2025/03/why-extracting-data-from-pdfs-is-still-a-nightmare-for-data-experts/ 206 | 207 | A small number of PDFs that supposedly push the current limits of visual understanding and OCR-based LLMs. 208 | 209 | 210 | ## US National Library of Medicine - National Institutes of Health 211 | - https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ 212 | 213 | A collection of scientific PDF publications is included in the PubMed Central Open Access Subset (commercial use collection). 214 | 215 | 216 | ## US Incident Management Situation Reports from 1990 to 2022 217 | 218 | - https://famprod.nwcg.gov/batchout/IMSRS_from_1990_to_2022/ 219 | - https://www.nature.com/articles/s41597-023-02876-8 (_article_) 220 | 221 | Dataset of Incident Management Situation Reports (IMSR) from 1990 to 2022, which document daily wildland fire situations across ten geographical regions in the United States. 222 | 223 | ## NEPATEC1.0 - US Environmental Impact Statement (EIS) PDF dataset 224 | 225 | - Sharma, S. et al. (2024) ‘NEPATEC1.0: First Large-Scale Text Corpus of National Environmental Policy Act PDF Documents’, p. 7. https://www.pnnl.gov/sites/default/files/media/file/PNNL_PolicyAI_Dataset_Model_Release_IR_06_26.pdf.
226 | - https://huggingface.co/datasets/PolicyAI/NEPATEC1.0 227 | 228 | "_NEPATEC1.0 is an AI-ready dataset that contains data extracted from the Environmental Impact 229 | Statement (EIS) Database provided by U.S. Environmental Protection Agency (EPA). An EIS is a 230 | government document that analyzes the potential environmental effects of a proposed project and 231 | identifies ways to mitigate those effects. NEPATEC1.0 contains textual data and associated metadata 232 | extracted from 2,917 projects. These projects contain 28,212 documents, 4.6 million pages, and over 233 | 3.6 billion tokens of textual data (using GPT2 tokenizer)._" 234 | 235 | 236 | ## OpenLibrary.org 237 | - https://openlibrary.org/ 238 | 239 | Scanned books and other publications can be downloaded in PDF format. Note that many such PDFs are quite large in size. 240 | 241 | 242 | ## International Labour Organization 243 | - https://www.ilo.org/public/libdoc/ilo/ILO-SR/ 244 | 245 | This is a corpus of approximately 730 legacy PDF documents (from 2008-), however it has somewhat limited variability in PDF technical and syntactic constructs. 246 | 247 | 248 | ## New York State Regents Exams 249 | 250 | - https://www.nysl.nysed.gov/regentsexams.htm (overview) 251 | - https://nysl.ptfs.com/knowvation/app/consolidatedSearch/#search/v=list,c=1,sm=s,l=library1_lib%2Clibrary4_lib%2Clibrary5_lib (Search page) 252 | 253 | "_NYS Regents Exams in PDF format are part of the Library's Digital Collections. In addition to current exams, many historical ones have also been digitized: some of the oldest Regents Exams currently available online are in Physical Geography (1884) and Astronomy (1893)._". See https://www.nysl.nysed.gov/collections/regentsexams/ for some images of content. Corpus has a variety of complex content such as music scores, mathematics, etc. 
254 | 255 | 256 | ## Mikal's "pdfdb" 257 | 258 | - http://www.stillhq.com/pdfdb/db.html (dead link) 259 | - Wayback Machine link: https://web.archive.org/web/20190212183538/http://www.stillhq.com/pdfdb/db.html 260 | - https://www.madebymikal.com/pdfdb/db.html (also a dead link) 261 | 262 | A database referenced from StackOverflow (https://stackoverflow.com/questions/14386393/pdf-specification-compliance-testing-sample-files) that is no longer directly available. 263 | 264 | 265 | ## Vasulka PDF Archive (history of video and electronic art) 266 | - https://vasulka.org/archive/sitemap.html 267 | 268 | From https://vasulka.org/about_archive.html: "_The Vasulka Archive currently consists of over 27,000 pages of documents relevant to the history of video and electronic art. These include articles, essays, interviews, reviews, schematics, diagrams, illustrations, posters, concert programs, photographs, and correspondence. While a large percentage of this material directly relates to the art and careers of Steina and Woody Vasulka, there are well over 200 artists and scholars represented. Some of the material has been taken from periodicals that are both in and out of print. The rest has been taken from the personal collection of the Vasulkas that began over thirty years ago._" 269 | 270 | 271 | ## Artifex MuPDF Public Test Corpus 272 | 273 | - https://cgit.ghostscript.com/cgi-bin/cgit.cgi/tests.git/ 274 | 275 | Public test corpus for Ghostscript/MuPDF, maintained by Artifex (https://mupdf.com/). 276 | 277 | 278 | ## Qiqqa's "Evil Base" Test Corpus 279 | 280 | - https://github.com/GerHobbelt/Evil-PDF-Library-for-Qiqqa 281 | 282 | Test corpus used by the [Qiqqa PDF document management software](https://github.com/jimmejardine/qiqqa-open-source) for testing various PDF-centric processes (metadata extraction, text extraction and OCR for meta-search & ~-research, page rendering/viewing, ...). 
283 | 284 | **WARNING: Be aware that this corpus includes *malformed*, *invalid* and *malicious* PDFs**, which serve as an acid test for robustness testing production-level PDF processors. *Cave canem.* 285 | 286 | 287 | ## Digital Library of Slovenia 288 | 289 | - https://dlib.si/?&language=eng (English website) 290 | - As described in ["_A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection_"](https://aclanthology.org/2024.lrec-main.61/) by Filip Dobranić, Bojan Evkoski and Nikola Ljubešić. 291 | 292 | The Digital Library of Slovenia includes various kinds of PDF documents, including scans of historic documents with overlaid invisible text (text render mode 3). Many PDFs also include pure text equivalents. It is a useful corpus for non-English (Slovenian) documents. 293 | 294 | 295 | ## Italian Institute for Environmental Protection and Research (ISPRA) regional snowfall records 296 | 297 | - http://www.bio.isprambiente.it/annalipdf/ 298 | - Described by the article: ["Historical snowfall precipitation data in the Apennine Mountains, Italy" by Capozzi Vincenzo; Serrapica Francesco; Rocco Armando; Annella Clizia and Budillon Giorgio, 2024-01-01](https://ricerca.uniparthenope.it/handle/11367/133936) 299 | 300 | Scans of historical Italian snowfall records by region stored as PDF: "_... the original data sources of this database, the Hydrological Yearbooks of the NHMS, are freely accessible in printed version (i.e.
as scanned images in portable document format) through the Italian Institute for Environmental Protection and Research (ISPRA) website (http://www.bio.isprambiente.it/annalipdf)._" 301 | 302 | 303 | ## IUST-PDFCorpus 304 | 305 | - https://zenodo.org/records/3484013 306 | - described in "[Format-aware Learn&Fuzz: Deep Test Data Generation for Efficient Fuzzing](https://arxiv.org/abs/1812.09961v2)" 307 | 308 | "_IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files, to test, debug, and improve the qualification of real-world PDF readers [snip]. IUST-PDFCorpus contains 6,141 PDF complete files in various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in the textual format while streams have a binary format and together they make PDF files. In addition, we attached the code coverage of each PDF file when it used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. The dataset is spilled into train, test and validation set which is useful for using in the machine learning tasks. The third category is the same as the second category but in a smaller size for using in the developing stage of different algorithms. IUST-PDFCorpus is collected from various sources including the Mozilla PDF.js open test corpus, some PDFs which are used in AFL as initial seed, and PDFs gathered from existing e-books, software documents, and public web in different languages. 
We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing” where we used it to build an intelligent file format fuzzer, called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications._". 309 | 310 | ## Public Affairs Layout database (PALdb) 311 | 312 | - https://github.com/BiDAlab/PALdb - requires personal registration 313 | - As referenced in Peña et al, "Continuous document layout analysis: Human-in-the-loop AI-based data curation, database, and evaluation in the domain of public affairs", Information Fusion, 2024, ISSN 1566-2535, https://doi.org/10.1016/j.inffus.2024.102398. (https://www.sciencedirect.com/science/article/pii/S1566253524001763) 314 | 315 | "_The database was collected from a set of 24 data sources from the Spanish administration and comprises nearly 37,910 documents, 441.3K document pages, 138.1M tokens and 8M layout labels. ... 1M images, 118.7K tables, 14.4K links, and 7.1M text blocks_". Intended use of this dataset is for document layout analysis (DLA). 316 | 317 | ## Various US Government Corpora 318 | 319 | In addition to GovDocs1 mentioned above, the following specialized archives provide 320 | many PDFs (largely in US English) but do not provide pre-packaged ZIP files: 321 | 322 | * https://governmentattic.org/ 323 | 324 | > The US Government Attic "_provides electronic copies of thousands of interesting Federal Government documents obtained under the Freedom of Information Act. Fascinating historical documents, reports on items in the news, oddities and fun stuff and government bloopers, they're all here._". Files are stored as PDFs with many having redactions. There appears to be no combined download of the entire corpus but the web pages are sufficiently simple to recursively fetch using something like `curl` or `wget`. 
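For archives like this one that expose plain HTML index pages, the link-harvesting step of a recursive fetch can be sketched in Python as an alternative to `curl` or `wget`. This is a minimal sketch under stated assumptions: the names `PdfLinkParser`, `extract_pdf_links` and `fetch_index` are hypothetical helpers introduced here, and it assumes PDFs are linked through ordinary `<a href="...">` elements whose path ends in `.pdf`. Check each site's terms of use and `robots.txt` before crawling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                absolute = urljoin(self.base_url, value)
                if urlparse(absolute).path.lower().endswith(".pdf"):
                    self.pdf_links.append(absolute)


def extract_pdf_links(html, base_url):
    """Return the PDF links found in one HTML page, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


def fetch_index(url, timeout=30):
    """Download one index page as text (hypothetical helper; use sparingly)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A caller would pass the text returned by `fetch_index(page_url)` to `extract_pdf_links(text, page_url)` and then download each returned URL, ideally with a delay between requests.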
325 | 326 | * https://www.oversight.gov/ 327 | 328 | > "_Oversight.gov is a publicly accessible, searchable website containing the latest public reports from Federal Inspectors General who are members of the Council of the Inspectors General on Integrity and Efficiency (CIGIE)._" There appears to be no combined download of the entire corpus. 329 | 330 | * https://nsarchive.gwu.edu/digital-national-security-archive 331 | 332 | > The Digital National Security Archive (DNSA) "_is an invaluable online collection of more than 100,000 declassified records documenting historic U.S. policy decisions._" 333 | 334 | * https://archive.org/details/usfederalcourts 335 | 336 | > CourtListener’s RECAP court document archive. "_The documents in this collection are from the US Federal Courts. A large collection come from the federal government's project for Public Access to Court Electronic Records (PACER). The PACER Service Center is the Federal Judiciary's centralized registration, billing, and technical support center for electronic access to U.S. District, Bankruptcy, and Appellate court records. For more information on the RECAP project, visit https://www.recapthelaw.org_" 337 | 338 | 339 | # Malicious Datasets 340 | 341 | **These datasets contain malicious PDFs and should be treated with extreme caution!!** 342 | 343 | ## Canadian Institute for Cybersecurity "CIC-Evasive-PDFMal2022" dataset 344 | 345 | * https://www.unb.ca/cic/datasets/pdfmal-2022.html 346 | * https://www.kaggle.com/datasets/dhoogla/cic-evasive-pdfmal2022 347 | 348 | > "_Evasive-PDFMal2022 consists of 10,025 records with 5557 malicious and 4468 benign records that tend to evade the common significant features found in each class. This makes them harder to detect by common learning algorithms. We have collected 11,173 malicious files from Contagio, 20,000 malicious files from VirusTotal, and 9,109 benign files from Contagio_." 
349 | 350 | # *Other formats* 351 | 352 | Internally PDF supports many so-called nested or embedded formats, such as JPEG, JPEG 2000, JBIG2, ICC, OpenType and other font programs, as well as conversion from other formats. Thus sources of corpora in other formats may also be of interest to the broader PDF community. Note that PDF can and does technically limit the scope of what certain nested formats can contain, so **do not** assume that all files in these corpora are valid for nesting inside PDF! Always refer to the latest PDF specification \([ISO 32000-2, PDF 2.0](https://www.pdfa.org/resource/iso-32000-pdf/)) for all technical requirements. 353 | 354 | ## PRImA Labs 355 | 356 | The University of Salford Pattern Recognition & Image Analysis Research Lab (PRImA) provide many image-based data sets "_ranging from historical books and newspapers to contemporary documents_", that have been "_collected, ground-truthed and organised \[as] a number of datasets which are available for research and/or personal use_". 357 | 358 | - https://www.primaresearch.org/ 359 | 360 | ## IEEE DataPort 361 | 362 | The IEEE DataPort contains some open access data sets. Although mainly focused on machine learning, many data sets are image-based. 363 | 364 | - https://ieee-dataport.org/datasets 365 | 366 | ## AWS Open Data Program 367 | 368 | Both GovDocs1 and CommonCrawl are part of the AWS Open Data Program (see above), but there are also many other data sets (mostly image and video related). Data is stored in AWS S3. 369 | 370 | - https://registry.opendata.aws/ 371 | 372 | ## Microsoft Research Open Data 373 | 374 | The MSR Open Data datasets provide a convenient UI for selecting datasets based on a format (file type) which includes PDF, docx and png. Datasets are stored in Azure. 375 | 376 | - https://msropendata.com/ 377 | 378 | ## Kaggle 379 | 380 | [Kaggle](https://www.kaggle.com/) provides various datasets mainly intended for machine learning and AI applications. 
Some include PDFs: 381 | 382 | - Scanned-to-PDF receipts from around the world [https://www.kaggle.com/datasets/jenswalter/receipts](https://www.kaggle.com/datasets/jenswalter/receipts) 383 | - 2400+ Resumes [https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset) 384 | - arXiv PDF publication dataset (stored in Amazon S3) [https://www.kaggle.com/datasets/Cornell-University/arxiv](https://www.kaggle.com/datasets/Cornell-University/arxiv) 385 | 386 | ## ICC Profiles 387 | 388 | PDF files can contain ICC profiles to provide device-independent definitions for color. As a result, all PDF viewers, PDF renderers and many other PDF processors need to have robust handling of ICC profiles. The following links provide ICC-based corpora that PDF developers may therefore find useful (both valid and invalid ICC profiles): 389 | 390 | - the ICC website [color.org](https://color.org/profiles.xalter) has many ICC profiles, including probe profiles which can test correct handling of rendering intents 391 | - [Google's SKIA skcms corpus](https://skia.googlesource.com/skcms/+/refs/heads/main/profiles/) includes many good and bad ICC profiles 392 | - [LittleCMS test bed](https://github.com/mm2/Little-CMS/tree/master/testbed) 393 | - a collection of [monitor ICC profiles](https://tftcentral.co.uk/articles/icc_profiles.htm) 394 | 395 | # *Legal* 396 | In accordance with Title 17 U.S.C. Section 107, the material in this document is distributed without profit to those who have an interest in understanding the interoperability of PDF files, including for research and educational purposes. If you wish to use the copyrighted material of others that is referenced in this document for purposes of your own that go beyond 'fair use', it is your responsibility to obtain permission from the relevant copyright owner.
397 | 398 | The PDF Association does not warrant the accuracy, timeliness or completeness of the information contained in this document. All copyright and trademarks remain with their respective owners. If you have a particular complaint about something you’ve read here, please contact us. 399 | --------------------------------------------------------------------------------