├── README.md
└── corpus-list.yaml


/README.md:
--------------------------------------------------------------------------------
# Corpus List
A structured list of textual corpora. The list is read by [corpus-downloader](https://github.com/DH-Box/corpus-downloader), a Python module and command-line program originally developed for use in [DHBox](http://dhbox.org).

## How to Add Your Corpus

Do you own or maintain a textual corpus? Please fork this repository and add it to the list. Here is a list of the current fields and their descriptions:

`shortname`: a short and easy-to-type name for your corpus, e.g. `shc` for the corpus “Shakespeare His Contemporaries.”
`title`: the full name of your corpus, e.g. “Shakespeare His Contemporaries.”
`categories`: a disciplinary label you can assign to your corpus. If it’s mostly of interest to classics scholars, write “classics.” Current values include “literature,” “classics,” and “history,” but feel free to add your own.
`languages`: the ISO 639-2 language code(s) for the language(s) of your corpus. Examples include `deu` (German), `eng` (English), `enm` (Middle English), and `fra` (French). For a more complete list, [check here](https://www.loc.gov/standards/iso639-2/php/code_list.php).
`text`: information about the actual text(s) of your corpus, given in the following subfields.
`markup`: how the text is encoded, e.g. `TXT` (plain text), `HTML` (HTML files), or `TEI` (TEI XML).
`url`: the URL for your textual corpus, e.g. `http://www.folgerdigitaltexts.org/download/txt/FolgerDigitalTexts_TXT_Complete.zip`
`file-format`: the format of the file at the URL above. In the above example, `zip`.
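
The field descriptions above can be turned into a quick sanity check before an entry is submitted. What follows is a minimal sketch, not part of corpus-downloader: the function name `check_entry`, the choice of which fields count as required, and the three-letter test for language codes are all assumptions made for illustration. It expects an entry that has already been parsed into a Python dict.

```python
import re

# Assumed for illustration: the README does not say which fields are mandatory.
REQUIRED_FIELDS = ("shortname", "title", "languages", "text")
TEXT_FIELDS = ("markup", "url", "file-format")


def check_entry(entry):
    """Return a list of human-readable problems found in one corpus entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            problems.append(f"missing field: {field}")

    # Languages are written as comma-separated codes, e.g. "eng, fra, deu".
    for code in str(entry.get("languages", "")).split(","):
        code = code.strip()
        if code and not re.fullmatch(r"[a-z]{3}", code):
            problems.append(f"not a three-letter language code: {code}")

    # `text` may be a single mapping or a list of mappings (one per markup).
    text = entry.get("text", [])
    sources = text if isinstance(text, list) else [text]
    for source in sources:
        if not isinstance(source, dict):
            problems.append("text must be a mapping or a list of mappings")
            continue
        for field in TEXT_FIELDS:
            if field not in source:
                problems.append(f"text is missing: {field}")
    return problems


example = {
    "shortname": "perseus-c-greek",
    "title": "Perseus Canonical Greek",
    "categories": "classics",
    "authors": "multiple",
    "languages": "grc",
    "text": {
        "markup": "TEI",
        "url": "https://github.com/PerseusDL/canonical-greekLit.git",
        "file-format": "git",
    },
}
print(check_entry(example))  # prints []
```

An empty list only means the entry has every field this sketch looks for; it says nothing about whether the URL is reachable.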

```
- shortname: perseus-c-greek
  title: Perseus Canonical Greek
  categories: classics
  authors: multiple
  languages: grc
  text:
    markup: TEI
    url: https://github.com/PerseusDL/canonical-greekLit.git
    file-format: git
```


--------------------------------------------------------------------------------
/corpus-list.yaml:
--------------------------------------------------------------------------------
- shortname: shc
  title: Shakespeare His Contemporaries
  categories: literature
  authors: multiple
  centuries: 16th, 17th
  languages: eng
  text:
    markup: TEI-Simple
    url: 'https://github.com/JonathanReeve/corpus-SHC.git'
    file-format: git

- shortname: folger-shakespeare
  title: Folger Shakespeare Library Digital Texts
  categories: literature
  centuries: 16th, 17th
  authors: single
  languages: eng
  homepage: http://www.folgerdigitaltexts.org/
  url-source: http://www.folgerdigitaltexts.org/download/
  text:
    - markup: TEI
      url: http://www.folgerdigitaltexts.org/download/xml/FolgerDigitalTexts_XML_Complete.zip
      file-format: zip
    - markup: HTML
      url: http://www.folgerdigitaltexts.org/download/html/FolgerDigitalTexts_HTML_Complete.zip
      file-format: zip
    - markup: TXT
      url: http://www.folgerdigitaltexts.org/download/txt/FolgerDigitalTexts_TXT_Complete.zip
      file-format: zip

- shortname: perseus-c-greek
  title: Perseus Canonical Greek
  categories: classics
  authors: multiple
  languages: grc
  text:
    markup: TEI
    url: https://github.com/PerseusDL/canonical-greekLit.git
    file-format: git

- shortname: perseus-c-latin
  title: Perseus Canonical Latin
  categories: classics
  authors: multiple
  languages: lat
  text:
    markup: TEI
    url: https://github.com/PerseusDL/canonical-latinLit.git
    file-format: git

- shortname: stanford-1880s
  title: 'Adult British Fiction of the 1880s, Assembled by the Stanford Literary Lab'
  categories: literature
  centuries: 19th
  languages: eng
  text:
    markup: TXT
    url: https://github.com/JonathanReeve/corpus-1880s-all.git
    file-format: git
  subcorpora:
    - shortname: stanford-1880s-male
      title: 'Adult British fiction of the 1880s, male authors. Assembled by the Stanford Literary Lab'
      text:
        markup: txt
        url: https://github.com/JonathanReeve/corpus-1880s-male.git
        file-format: git

    - shortname: stanford-1880s-female
      title: 'Adult British fiction of the 1880s, female authors. Assembled by the Stanford Literary Lab'
      text:
        markup: txt
        url: https://github.com/JonathanReeve/corpus-1880s-female.git
        file-format: git

- shortname: reuters-21578
  title: Reuters-21578
  homepage: http://www.daviddlewis.com/resources/testcollections/reuters21578/
  categories: history
  languages: eng
  text:
    markup: txt
    url: http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
    file-format: tar.gz

- shortname: ecco-tcp
  title: Eighteenth Century Collections Online / Text Creation Partnership ECCO-TCP
  homepage: http://www.textcreationpartnership.org/tcp-ecco/
  categories: literature
  centuries: 18th
  languages: eng
  text:
    markup: xml
    file-format: zip
    url:
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200510.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200601.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200604.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200609.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200702.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200802.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200809.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200902.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200909.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-201004.ecco.zip
      - http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-201106.ecco.zip

- shortname: dta
  title: Deutsches Textarchiv (German Text Archive)
  homepage: http://www.deutschestextarchiv.de/
  categories: literature, science, history
  centuries: 16th, 17th, 18th, 19th
  languages: deu
  text:
    markup: TEI
    file-format: zip
    url: http://media.dwds.de/dta/download/dta_komplett_2016-02-11.zip

- shortname: ota
  title: Oxford Text Archive
  homepage: https://ota.ox.ac.uk/
  categories: literature, history
  centuries: 16th, 17th, 18th, 19th, 20th
  languages: eng, enm, fra, deu, lat, grc
  text:
    markup: TXT
    file-format: git
    url: https://github.com/mimno/ota.git

- shortname: txtLAB450
  title: txtLAB450, a Multilingual Data Set of Novels
  categories: literature
  authors: multiple
  centuries: 17th, 18th, 19th
  languages: eng, fra, deu
  text:
    markup: TXT
    url: 'https://ndownloader.figshare.com/files/3686778'
    file-format: zip

- shortname: cenlab
  title: CENLab
  homepage: https://github.com/JonathanReeve/cenlab
  categories: literature
  centuries: 18th, 19th, 20th
  languages: eng
  text:
    markup: TXT
    file-format: git
    url: https://github.com/JonathanReeve/cenlab.git

- shortname: brown
  title: Brown Corpus
  homepage: http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html
  categories: linguistics
  languages: eng
  centuries: 20th
  text:
    markup: TXT
    file-format: zip
    url: https://github.com/nltk/nltk_data/raw/gh-pages/packages/corpora/brown.zip

--------------------------------------------------------------------------------
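
In the list above, `text` is sometimes a single mapping and sometimes a list of mappings (as in `folger-shakespeare`), and `url` is sometimes a list (as in `ecco-tcp`). Any consumer of the file has to normalize these shapes before downloading. The following is a minimal sketch, assuming the entry has already been parsed into Python dicts and lists; `iter_downloads` is a hypothetical helper, not a function from corpus-downloader, and it does not descend into `subcorpora`.

```python
def iter_downloads(entry):
    """Yield (markup, url, file_format) for every download target of an entry."""
    text = entry.get("text", [])
    # Accept either one mapping or a list of mappings.
    sources = text if isinstance(text, list) else [text]
    for source in sources:
        urls = source.get("url", [])
        # Accept either one URL string or a list of URL strings.
        if isinstance(urls, str):
            urls = [urls]
        for url in urls:
            yield source.get("markup"), url, source.get("file-format")


# The ecco-tcp entry, abridged here to its first two URLs.
ecco = {
    "shortname": "ecco-tcp",
    "text": {
        "markup": "xml",
        "file-format": "zip",
        "url": [
            "http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200510.ecco.zip",
            "http://www.lib.umich.edu/tcp/docs/texts/ecco/xml-200601.ecco.zip",
        ],
    },
}
for markup, url, file_format in iter_downloads(ecco):
    print(markup, file_format, url)
```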