├── .gitignore ├── info └── mod date.png ├── drop-caps.asciidoc ├── .travis.yml ├── questions-DP.asciidoc ├── unicoding.asciidoc ├── metadata ├── README.asciidoc ├── editions.asciidoc ├── pgdata5.asciidoc ├── pgdata3.asciidoc ├── pgdata4.asciidoc ├── pgdata2.asciidoc ├── pgdata.asciidoc └── pandata_attribute_dictionary.yaml ├── encoding.asciidoc ├── filetypes.asciidoc ├── versioning.asciidoc ├── index.asciidoc ├── adopting.asciidoc ├── how_to.asciidoc ├── activerepos.csv ├── sectioning.asciidoc ├── Travis-CI.asciidoc └── README.asciidoc /.gitignore: -------------------------------------------------------------------------------- 1 | # html generated from asciidoc 2 | *.asciidoc.html -------------------------------------------------------------------------------- /info/mod date.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitenberg-dev/documentation/HEAD/info/mod date.png -------------------------------------------------------------------------------- /drop-caps.asciidoc: -------------------------------------------------------------------------------- 1 | = Drop Caps 2 | 3 | Drop caps are an important stylistic feature of many books. 
4 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: ruby 2 | before_install: 3 | - gem install asciidoctor 4 | script: 5 | - asciidoctor index.asciidoc 6 | after_success: 7 | - build/push.sh 8 | env: 9 | global: 10 | secure: MVsv9XYEjlgRT2gWwATBva0YQ1e3XEF8b3lVUynY7MgL548iA9TYNze1G2eC3v1YKtjs+VeoHx5rRhIVY+pc79sTuW41Xo2bJImgsySYxJ6FwoR0AOu7DwMTgGIol40yNELBnUF0NyXFVVqMx9aHZ4E8VxZSe3bG90iBMIZZOhk= 11 | -------------------------------------------------------------------------------- /questions-DP.asciidoc: -------------------------------------------------------------------------------- 1 | = Questions about Distributed Proofreaders 2 | 3 | * Where can we get a database of DP projects > PG ebooks? 4 | ** can we get access or a link to the scans a DP book was created from? 5 | ** Can we get a link to the forum thread for the DP book? 6 | ** Can we get the original html product of DP? 7 | * How does PG change DP generated html? 8 | * How would you like your html to be distributed/managed? 9 | -------------------------------------------------------------------------------- /unicoding.asciidoc: -------------------------------------------------------------------------------- 1 | GITenberg uses unicode characters for typography where appropriate. 2 | Common unicode characters that we use are curved quotes (aka “smart quotes”) and https://en.wikipedia.org/wiki/Dash#Common_dashes[dashes]. 3 | 4 | Contributor @rsperberg has written a script that will convert many of these characters automatically, available in the https://github.com/gitenberg-dev/punctuation-cleanup[gitenberg-dev/punctuation-cleanup] repo with instructions on use. 
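As a rough illustration of the kind of conversion described above, here is a minimal Python sketch. It is not the punctuation-cleanup script; the function name `smarten` and its rules are assumptions made for this example, and it handles only straight quotes and double hyphens.

```python
import re


def smarten(text):
    """Naively convert ASCII punctuation to Unicode equivalents.

    Hypothetical helper for illustration only; it is not the
    gitenberg-dev/punctuation-cleanup script.
    """
    # A double hyphen becomes an em dash (U+2014).
    text = text.replace("--", "\u2014")
    # A double quote at the start of the text, or after whitespace,
    # an opening bracket or a dash, opens a quotation (U+201C).
    text = re.sub(r'(^|[\s(\[\u2014])"', "\\1\u201c", text)
    # Every remaining double quote closes one (U+201D).
    text = text.replace('"', "\u201d")
    # Same rule for single quotes (U+2018 opens).
    text = re.sub(r"(^|[\s(\[\u2014])'", "\\1\u2018", text)
    # Leftover single quotes are apostrophes or closing quotes (U+2019).
    text = text.replace("'", "\u2019")
    return text


converted = smarten('"Wait--don\'t go," she said.')
```

Rules this simple get some cases wrong (a leading apostrophe as in 'tis is treated as an opening quote), so the result still needs a human read-through.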
-------------------------------------------------------------------------------- /metadata/README.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | 3 | - Part 1. link:pgdata.asciidoc[Boiling down the Gutenberg RDF] 4 | - Part 2. link:pgdata2.asciidoc[Combing through the Gutenberg metadata] 5 | - Part 3. link:pgdata3.asciidoc[Other sources of metadata] 6 | - Part 4. link:pgdata4.asciidoc[Metadata that's needed, but missing. Also, covers.] 7 | - Part 5. link:pgdata5.asciidoc[Metadata targets] 8 | ** Dublin Core / EPUB 9 | ** Bibframe / MARC 10 | ** Schema.org 11 | ** TEI 12 | ** ONIX 13 | ** BIBO 14 | - Part 6. link:editions.asciidoc[Editions] 15 | -------------------------------------------------------------------------------- /encoding.asciidoc: -------------------------------------------------------------------------------- 1 | The https://en.wikipedia.org/wiki/Character_encoding[character encoding] of GITenberg books 2 | should always be https://en.wikipedia.org/wiki/UTF-8[UTF-8] (an encoding of Unicode). 3 | Project Gutenberg source files were often created before Unicode existed and need to be converted. 4 | 5 | If you get an error like the following, you will need to do this step: 6 | 7 | pandoc: Cannot decode byte '\xb0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream 8 | 9 | 10 | To re-encode a file as UTF-8, use the `iconv` command-line tool. 11 | 12 | iconv -c -t "UTF-8" < input.html > output.asciidoc 13 | 14 | For instance, if 164-h/164-h.htm 15 | -------------------------------------------------------------------------------- /filetypes.asciidoc: -------------------------------------------------------------------------------- 1 | = Filetypes 2 | 3 | There are many types of files that make up GITenberg repos. 4 | 5 | == Types of files 6 | == Types of PG records 7 | There are also many types of records from Project Gutenberg.
8 | It makes sense to identify some of these repo types here. 9 | 10 | === Audiobooks 11 | PG includes at least two types of audiobooks: 12 | 13 | * Computer-read audiobooks 14 | ** https://github.com/GITenberg/The-Call-of-the-Wild_9115[call of the wild 9115] 15 | * Human-read audiobooks produced by LibriVox 16 | ** https://github.com/GITenberg/The-War-of-the-Worlds_26291[war of the worlds 26291] 17 | ** https://github.com/GITenberg/Call-of-the-Wild_19678[call of the wild 19678] 18 | 19 | Here is a list of LibriVox books: 20 | 21 | 22 | -------------------------------------------------------------------------------- /versioning.asciidoc: -------------------------------------------------------------------------------- 1 | = Versioning GITenberg books 2 | 3 | GITenberg uses http://semver.org/[Semantic Versioning] to describe the changes to books over time. 4 | The current version of a GITenberg book is stored in three places: 5 | 6 | * As a git tag 7 | * As the value for `_version` in the repo's link:metadata/pandata_attribute_dictionary.yaml[metadata.yaml] 8 | * On the 3rd line of the main book asciidoc file (prefixed with `v`) 9 | 10 | [source, asciidoc] 11 | ---- 12 | = Title 13 | Author 14 | RevisionNumber, RevisionDate 15 | ---- 16 | 17 | A semantic version has the format `Major.Minor.Patch`. 18 | The only agreed-upon minor revision is a complete epub file built by Travis as described in the https://github.com/gitenberg-dev/Second-Folio/[Second-Folio] collection. 19 | 20 | For now, assume that your changes increment the `Patch` value.
21 | `v0.0.23` would become `v0.0.24` 22 | 23 | * http://asciidoc.org/userguide.html#X95[Asciidoc documentation] 24 | * http://asciidoctor.org/docs/user-manual/#revision-number-date-and-remark[Asciidoctor documentation] 25 | -------------------------------------------------------------------------------- /index.asciidoc: -------------------------------------------------------------------------------- 1 | = GITenberg documentation 2 | Gitenberg team 3 | v0.0.1 4 | :toc: left 5 | 6 | [WARNING] 7 | .ALPHA level documentation 8 | ==== 9 | This documentation is at an alpha level of quality. 10 | Please help us improve this 11 | https://github.com/gitenberg-dev/documentation[documentation on github] 12 | ==== 13 | 14 | :leveloffset: +1 15 | 16 | include::how_to.asciidoc[] 17 | 18 | // include::metadata[] 19 | 20 | 21 | include::adopting.asciidoc[] 22 | 23 | include::filetypes.asciidoc[] 24 | 25 | include::drop-caps.asciidoc[] 26 | 27 | include::encoding.asciidoc[] 28 | 29 | include::sectioning.asciidoc[] 30 | 31 | include::Travis-CI.asciidoc[] 32 | 33 | include::unicoding.asciidoc[] 34 | 35 | include::versioning.asciidoc[] 36 | 37 | [appendix] 38 | == Active Repositories 39 | 40 | [WARNING] 41 | .Warning about this section 42 | ==== 43 | This section is out of date as of 2015-07-30, see https://github.com/gitenberg-dev/Second-Folio/blob/master/Gitenberg%20Book%20List.csv[Second-Folio] for a more up to date list. 44 | ==== 45 | 46 | [format="csv", options="header"] 47 | |=== 48 | include::activerepos.csv[] 49 | |=== 50 | 51 | :leveloffset: -1 52 | -------------------------------------------------------------------------------- /metadata/editions.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata: Editions 2 | 3 | There is a need to be able to express metadata for multiple editions in one repository. Different editions will share most metadata, but may differ in ISBN, publisher, cover, rights, etc. 
4 | 5 | To support this need, pandata provides the following properties: "_edition", "edition_list", "edition_note" "edition_identifiers", and any other property beginning with "edition_" 6 | 7 | "_edition" provides the name for the edition, which should be a string that can be used for file naming, and identifying the edition. If not present, the first isbn is used as the edition name, or if no isbn, then the repo name is used. 8 | 9 | "edition_list" should contain a list of objects (dicts), each with an "_edition" entry and any other properties that are distinctive for the edition. These properties replace any properties specified at the top level of the pandata. 10 | 11 | "edition_identifiers" provides an object (dict) whose contents replace corresponding entries in the identifiers property at the top level. Some identifiers, such as isbn and oclc numbers, typically identify an edition. 12 | 13 | "edition_note" is a descriptive note about the edition. Can be used for MARC 250 field. 14 | -------------------------------------------------------------------------------- /adopting.asciidoc: -------------------------------------------------------------------------------- 1 | = Help us select a starter set of PG texts! 2 | 3 | Ideally, we'd like to have a hundred texts to serve as a testbed for 4 | the GITenberg toolchain. 5 | 6 | . you need to have a github account. sign up at https://github.com/join 7 | 8 | . pick out a GITenberg repo you'd like to help out on. You can search 9 | the repos with the search box *below* Johannes at 10 | https://github.com/GITenberg 11 | 12 | . "star" the repo you've selected. Just click on the star at the top of the repo page. 13 | 14 | . If you're new to Github, create an "issue" in gitenberg-dev at 15 | 16 | .. copy the url of the repo you've selected 17 | .. go to https://github.com/gitenberg-dev/wiki/issues and click the 18 | green "new issue" button 19 | 20 | .. paste the url of your repo and add any comments you'd like 21 | .. 
click the green "Submit New Issue" button. 22 | .. congratulations, you're now a GITenberger! 23 | .. someone will add your repo to the list and "close" your issue 24 | 25 | NOTE: following instructions need a lot of elaboration! 26 | 27 | . If you want to add directly to the list: 28 | .. "fork" the gitenberg-dev repo 29 | .. add the url of the repo you've selected, along with your github 30 | username to your copy of activerepos.csv 31 | .. commit the change to your copy 32 | .. submit a pull request to gitenberg-dev/master 33 | -------------------------------------------------------------------------------- /how_to.asciidoc: -------------------------------------------------------------------------------- 1 | = How To 2 | 3 | This document sets out the basics of how to prepare works for GITenberg. 4 | 5 | == Overview 6 | 7 | The tasks that need to be done for these books are documented on the following pages: 8 | 9 | * link:encoding.asciidoc[Encoding] 10 | * link:unicoding.asciidoc[Unicoding] 11 | * link:sectioning.asciidoc[Sectioning] 12 | * link:drop-caps.asciidoc[Drop Caps] 13 | * link:versioning.asciidoc[Versioning] 14 | 15 | == Help us select a starter set of PG texts! 16 | 17 | Ideally, we'd like to have a hundred texts to serve as a testbed for the GITenberg toolchain. 18 | 19 | 1. you need to have a github account. sign up at https://github.com/join 20 | 2. pick out a GITenberg repo you'd like to help out on. You can search the repos with the search box *below* Johannes at https://github.com/GITenberg 21 | 3. "star" the repo you've selected. Just click on the star at the top of the repo page. 22 | 4. If you're new to Github, create an "issue" in gitenberg-dev at 23 | a. copy the url of the repo you've selected 24 | b. go to https://github.com/gitenberg-dev/wiki/issues and click the green "new issue" button 25 | c. paste the url of your repo and add any comments you'd like 26 | d. click the green "Submit New Issue" button. 27 | e. 
congratulations, you're now a GITenberger! 28 | f. someone will add your repo to the list and "close" your issue 29 | Note: the following instructions need a lot of elaboration! 30 | 5. If you want to add directly to the list: 31 | a. "fork" the gitenberg-dev repo 32 | b. add the url of the repo you've selected, along with your github username to your copy of activerepos.csv 33 | c. commit the change to your copy 34 | d. submit a pull request to gitenberg-dev/master 35 | 36 | -------------------------------------------------------------------------------- /activerepos.csv: -------------------------------------------------------------------------------- 1 | PG ID,Title,participants,Status,Repo 2 | 103,Around the World in 80 Days,eshellman,in-progress,https://github.com/GITenberg/Around-the-World-in-80-Days_103 3 | 151,The Rime of the Ancient Mariner,sethwoodworth,in-progress,https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151 4 | 201,Flatland - A Romance of Many Dimensions,sethwoodworth,awaiting PR,https://github.com/sethwoodworth/Flatland--A-Romance-of-Many-Dimensions--Illustrated-_201 5 | 219,Heart of Darkness,rsperberg,awaiting review,https://github.com/GITenberg/Heart-of-Darkness_219 6 | 525,"Youth, A Narrative",rsperberg,unicoded,https://github.com/GITenberg/Youth-a-Narrative_525 7 | 1213,The Man That Corrupted Hadleyburg,tmtmtmtm,available,https://github.com/GITenberg/The-Man-That-Corrupted-Hadleyburg_1213 8 | 2376,Up From Slavery,eshellman,available,https://github.com/GITenberg/Up-from-Slavery--An-Autobiography_2376 9 | 2814,Dubliners,rdhyee,asciidoc,https://github.com/GITenberg/Dubliners_2814 10 | 4298,The Paying Guest,samwilson,asciidoc,https://github.com/GITenberg/The-Paying-Guest_4298 11 | 5614,Chess Strategy,aputtu,available,https://github.com/GITenberg/Chess-Strategy_5614 12 | 8121,Ghosts,eshellman,available,https://github.com/GITenberg/Ghosts_8121 13 | 10662,The Night Land,rhythmus,markdown,https://github.com/GITenberg/The-Night-Land_10662
14 | 12297,Slave Narratives- a Folk History of Slavery in the United States,cliotropic,available,https://github.com/GITenberg/Slave-Narratives--a-Folk-History-of-Slavery-in-the-United-StatesFlorida-Narratives_12297 15 | 23174,A Good For Nothing,NickCarneiro,available,https://github.com/GITenberg/A-Good-For-Nothing1876_23174 16 | 25308,Eugenics and Other Evils,tmtmtmtm,available,https://github.com/GITenberg/Eugenics-and-Other-Evils_25308 17 | 28054,The Brothers Karamazov,rdhyee,asciidoc epub,https://github.com/GITenberg/The-Brothers-Karamazov_28054 18 | 4302,Thyrza,samwilson,asciidoc,https://github.com/GITenberg/Thyrza_4302 19 | 3261,News From Nowhere,samwilson,asciidoc,https://github.com/GITenberg/News-from-Nowhere--Or-An-Epoch-of-Rest--13-Being-Some-Chapters-from-a-Utopian-Romance_3261 20 | -------------------------------------------------------------------------------- /sectioning.asciidoc: -------------------------------------------------------------------------------- 1 | = Sectioning 2 | 3 | The project's preferred markup format is http://asciidoc.org[Asciidoc]. 4 | Asciidoc has a lot of great features, but denoting title and chapter information is the first step in making beautiful ebooks. 5 | 6 | == Titles 7 | The title of the book should be on the first line of the asciidoc file. 8 | It should be prefixed with a single `=`. 9 | This turns it into a level-one header (+
<h1>
+ in HTML). 10 | 11 | = Alice in Wonderland 12 | 13 | Subtitles are not well supported in Asciidoc, 14 | but can be included in the title separated by a colon. 15 | They will be included in works' metadata. 16 | 17 | == Chapters, sections, and subsections 18 | Chapter headings should have two equals signs. This turns into a level-two header (+
<h2>
+ tag in HTML). 19 | 20 | == Chapter 1 21 | 22 | Paragraph text &c. 23 | 24 | == Chapter 2 25 | 26 | If the book is broken up into smaller subsections than chapters, you can use more equal signs: 27 | 28 | == Chapter 9 29 | === Section A 30 | 31 | If a chapter or section heading has a title as well as a number, put them all on the same line and separated by a colon: 32 | 33 | == Chapter 5: Advice from a Caterpillar 34 | 35 | 36 | == Pages 37 | 38 | Where a Gitenberg book has scans of the original available, 39 | you can choose to include page-markers 40 | (http://asciidoc.org/userguide.html#X30[anchors] in Asciidoc and http://asciidoctor.org/docs/user-manual/#anchordef[asciidoctor]) 41 | to make it easier to relate the text to the original images. 42 | These are invisible in the formatted output. 43 | 44 | First came ten soldiers carrying clubs; these were all shaped like the three 45 | gardeners, oblong and flat, with their hands and feet at the corners: next the 46 | ten courtiers; these were ornamented all over with diamonds, and walked two 47 | and [[source-page-115]] two, as the soldiers did. After these came the royal children; 48 | there were ten of them, and the little dears came jumping merrily along hand in 49 | hand, in couples: they were all ornamented with hearts. 50 | 51 | The anchor should be of the form `[[source-page-x]]` where `x` is the number of the page. 52 | If a word is hyphenated across the page, put the anchor in the next available whitespace. 53 | -------------------------------------------------------------------------------- /Travis-CI.asciidoc: -------------------------------------------------------------------------------- 1 | = Travis CI and GITenberg 2 | 3 | == Building Asciidoc after each commit 4 | I started integrating GITenberg with Travis-CI this weekend. 5 | Travis-CI is an open source continuous integration server. 
6 | Typically, a CI server watches for changes on github, then checks out the changed code, 7 | runs tests and/or tries to compile the software. 8 | Importantly, the hook for a CI server to run is any change being made 9 | on Github triggering a 'build' via post_commit hooks. 10 | I've taken my asciidoc fork of 11 | link:https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151[Rime of the Ancient Mariner], 12 | told Travis-CI about my repo, and added this 13 | link:https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/blob/gh-pages/.travis.yml[Travis config file]. 14 | 15 | Now, whenever I make a commit to Rime it triggers 16 | a build on Travis that 17 | link:https://travis-ci.org/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/builds/55742441[looks like this]. 18 | In this case, I am installing asciidoctor 19 | and using it to transform the Rime asciidoc file into html. 20 | But I could just as easily build epubs with a slightly different command. 21 | 22 | Any files generated by the travis-ci build are 23 | automatically uploaded to the amazon file storage cloud. 24 | The Rime html is 25 | link:https://s3.amazonaws.com/gitenberg-build/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/7/7.1/rime.html[available here]. 26 | 27 | This is a preliminary result. 28 | And more work needs to be done before this is ready for primetime. 29 | 30 | 31 | == Any other format 32 | 33 | Asciidoc isn't the only type of file we can build this way. 34 | I've taken the bi-lingual book that Tom mentioned, 35 | added it to GITenberg, and forked it to my repo: 36 | link:https://github.com/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562/[_Jesuit Relations_]. 37 | I think the PG html version isn't as awesome as the 38 | link:https://github.com/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562/blob/master/48562-0.txt[raw text version], 39 | which has english and french side by side. 
40 | But I attempted to build the html of Jesuit Relations into an epub with Project Gutenberg's 41 | link:https://pypi.python.org/pypi/epubmaker/0.3.20[epubmaker]. 42 | I have epubmaker and the python requirements installed, 43 | but I am having a pathing issue in a bash script that is causing the 44 | link:https://travis-ci.org/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562[build to fail]. 45 | 46 | 47 | == Conclusion 48 | Nevertheless, this has been an enlightening experiment and I am very hopeful that we can build PG html edition ebooks easily. 49 | 50 | 51 | This contributes to an overall point I would like to make: 52 | I like asciidoc, I think we have the best tools for asciidoc. 53 | I want to try everything and compare them. 54 | PG has ~400 books in ReStructured Text (a format I have researched thoroughly and consider it second to only Asciidoctor). I would love to auto-build these books as part of the GITenberg infrastructure. 55 | 56 | 57 | == Last few points: 58 | There are currently tradeoffs and limitations to using Travis-ci: 59 | 60 | * the GITenberg organization is too large for Travis to list all of our repos (a common issue, but should be fixable) 61 | * using travis means including a .travis.yml file in every repo 62 | * we will have to enable each repo by hand on the travis-ci site (there may be an api for this) 63 | 64 | 65 | --Seth 66 | -------------------------------------------------------------------------------- /metadata/pgdata5.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | === Part 5. Implementation / Targets 3 | 4 | Code implementing a GITenberg metadata is in the https://github.com/gitenberg-dev/metadata[gitenberg-dev/metadata repo]. The generic metadata object is called https://github.com/gitenberg-dev/metadata/pandata.py[pandata], by analogy to pandoc. 
5 | 6 | 7 | Property names are defined with the aim of bridging the chasm between Gutenberg/GITenberg and MARC and other metadata target formats. In the YAML file, properties are presented in alphabetical order to facilitate version control. 8 | 9 | We've used leading underscores to denote GITenberg-specific metadata; thus _repo denotes the string identifying the repo. A sample "pandata" file is https://github.com/gitenberg-dev/metadata/blob/master/samples/pandata.yaml[here]. (We'll use "metadata.yaml" as our standard file name when this is in a GITenberg repo.) 10 | 11 | 12 | ==== Bibframe/MARC 13 | 14 | Code implementing a conversion from MARC is in the https://github.com/gitenberg-dev/metadata/marc.py[marc module]. 15 | 16 | ===== Creators and Contributors 17 | Creators and Contributors are handled with property names based on http://www.loc.gov/marc/relators/relaterm.htm[MARC relator codes], grouped under a creator and contributor top-level attribute. There are 25 types used in Project Gutenberg; we'll add "Book_producer" so we can add credits for the people who worked on producing the edition. See the link:pandata_attribute_dictionary.yaml[attribute dictionary] for details. 18 | 19 | We choose to provide singular and plural versions of these. 20 | 21 | It's expected that contributors will eventually be independent entities with their own metadata, linked with a URL, but for now we keep everything inline, with properties such as author_name and author_sortname formed by adding on to the MARC relator name. 22 | 23 | ===== Publisher 24 | 25 | MARC regards "Project Gutenberg" as the publisher of these editions, so we're not going to rock the boat. 26 | 27 | ===== subjects 28 | 29 | Here we use custom constructors to add type information to subject headings. Library of Congress headings and classes are supported. 30 | 31 | ===== rights 32 | 33 | For GITenberg, we focus on standard rights statements that allow free use. Possible values, along with their defining URLs, are listed below.
(How do we manage versioning?) 34 | 35 | - BY-NC-ND, Creative Commons Attribution-NonCommercial-NoDerivs, http://creativecommons.org/licenses/by-nc-nd/3.0/ 36 | - BY-NC-SA, Creative Commons Attribution-NonCommercial-ShareAlike, http://creativecommons.org/licenses/by-nc-sa/3.0/ 37 | - BY-NC, Creative Commons Attribution-NonCommercial, http://creativecommons.org/licenses/by-nc/3.0/ 38 | - BY-ND, Creative Commons Attribution-NoDerivs, http://creativecommons.org/licenses/by-nd/3.0/ 39 | - BY-SA, Creative Commons Attribution-ShareAlike, http://creativecommons.org/licenses/by-sa/3.0/ 40 | - BY, Creative Commons Attribution, http://creativecommons.org/licenses/by/3.0/ 41 | - CC0, No Rights Reserved, http://creativecommons.org/about/cc0 42 | - GFDL, GNU Free Documentation License, http://www.gnu.org/licenses/fdl-1.3-standalone.html 43 | - LAL, Licence Art Libre, http://artlibre.org/licence/lal/ 44 | - PD-US, Public Domain US, http://creativecommons.org/about/pdm 45 | 46 | ==== Covers 47 | 48 | The property 'covers' holds a list of covers. Each cover should have an 'image_path' giving its path from the root of the repo. If the rights for a cover differ from those of the rest of the book, a 'rights' property is specified on the cover; otherwise rights are inherited from the book. 49 | 50 | Covers can be of different types. To start, we define: 51 | 52 | - archival: a scan or image of a print edition 53 | - archival-back: a scan or image of the back cover of a print edition 54 | - original: a new cover image designed specifically for an ebook 55 | - generated: a cover produced automatically or semiautomatically using software. 56 | - titlepage_image: sometimes the best image for a cover is an image of the titlepage, which was often ornate.
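The rights inheritance described for covers can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the gitenberg-dev/metadata repo: only 'covers', 'image_path' and 'rights' come from the text above, while the `cover_type` key, the function name, and the sample paths are hypothetical.

```python
# Cover types listed in the section above.
COVER_TYPES = {
    "archival",
    "archival-back",
    "original",
    "generated",
    "titlepage_image",
}


def resolve_cover_rights(pandata):
    """Return (image_path, cover_type, rights) for each cover.

    A cover's own 'rights' wins; otherwise the book-level 'rights'
    is inherited. 'cover_type' is a hypothetical key name.
    """
    book_rights = pandata.get("rights")
    resolved = []
    for cover in pandata.get("covers", []):
        cover_type = cover.get("cover_type")
        if cover_type is not None and cover_type not in COVER_TYPES:
            raise ValueError("unknown cover type: %r" % cover_type)
        rights = cover.get("rights", book_rights)
        resolved.append((cover["image_path"], cover_type, rights))
    return resolved


# Hypothetical sample data: one inherited and one overridden rights value.
sample = {
    "rights": "PD-US",
    "covers": [
        {"image_path": "images/cover.jpg", "cover_type": "archival"},
        {
            "image_path": "images/new_cover.png",
            "cover_type": "original",
            "rights": "CC0",
        },
    ],
}

resolved = resolve_cover_rights(sample)
```

Resolving inheritance at read time keeps the YAML short: a cover only carries a 'rights' entry when it actually differs from the book.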
57 | 58 | ==== Schema.org 59 | 60 | ==== TEI 61 | 62 | ==== ONIX 63 | 64 | ==== BIBO 65 | 66 | -------------------------------------------------------------------------------- /README.asciidoc: -------------------------------------------------------------------------------- 1 | = Welcome to the GITenberg documentation! 2 | 3 | For more information about GITenberg, please see our https://gitenberg.github.io[website]. 4 | 5 | == How can you contribute? 6 | 7 | We are currently working on converting and editing our first 100 books. 8 | 9 | * Help us link:how_to.asciidoc[format books] into link:sectioning.asciidoc[asciidoc] 10 | * Suggest books we should improve 11 | * Join our announcement list 12 | 13 | For now there are a few things you can do depending on your interest and 14 | skill level. Firstly, if you find an error or typo in any of the books, 15 | report it in the 'Issues' tab on that repo. If you would like to offer 16 | changes: fork, edit and create a Pull Request. If you would like to make 17 | suggestions, help in another way, or would like to get more involved, 18 | you can join the project 19 | https://groups.google.com/forum/#!forum/gitenberg-project[mailing list]. 20 | 21 | == Project Status 22 | 23 | You can read the full text of the last project status report 24 | https://groups.google.com/d/msg/gitenberg-project/i3gV2OjEeAQ/m8bC81tBhokJ[here]. 25 | 26 | == Documentation Index 27 | 28 | * Infrastructure Documentation - Descriptions of the various systems (current and future) that run GITenberg. 29 | ** [Github post-commit hooks] - How we know when to rebuild ebooks when changes are made to repos 30 | ** [Book repo digestor] - Receives a book repo, decides what to do based on the content of the repo 31 | * link:filetypes.asciidoc[Filetypes] 32 | * link:metadata/README.asciidoc[Metadata Development] - How GITenberg describes the data in a repo.
33 | 34 | == Active goals 35 | 36 | Step 1 is to create a list of ebook repos we'll use as a testbed for the 37 | GITenberg tool chain. The current list is maintained in 38 | link:activerepos.csv[this file]. Instructions on including a new repo are 39 | link:how_to.asciidoc[available]. 40 | 41 | == Open questions 42 | 43 | === How to generate covers for the books? 44 | 45 | === What will be the source format? 46 | 47 | Discussion is still open, but asciidoc is the current best candidate. 48 | 49 | === How are the repositories created? 50 | 51 | === What is the reference format from PG? 52 | 53 | === How to cope with punctuation? 54 | 55 | There is consensus to convert ASCII punctuation to more precise 56 | Unicode equivalents when possible. 57 | 58 | === What is the licensing big schema? 59 | 60 | === What about Metadata? 61 | 62 | ==== How to read metadata from PG? 63 | 64 | Eric Hellman is working to understand metadata dumps provided by PG. It 65 | is a work in progress, and preliminary results are available in 66 | https://gist.github.com/eshellman/40d85be01acf1172a5c1[this document]. 67 | 68 | ==== How to archive metadata for GITenberg usage? 69 | 70 | Discussion is ongoing; the current state is 71 | https://gist.github.com/eshellman/7a6d34c88e797b439938[here]. 72 | 73 | == Repositories 74 | 75 | === Books 76 | 77 | Every book on PG gets its own repository. All are listed on the 78 | GITenberg organization GitHub page, https://github.com/GITenberg/[here]. 79 | 80 | There's a tsv file with a full list of repo names at 81 | https://github.com/gitenberg-dev/giten_site/blob/master/assets/GITenberg_repos_list_2.tsv 82 | 83 | The main GITenberg page, which is for the moment a developer's page, is 84 | hosted http://gitenberg.github.io/[here], and its source is 85 | https://github.com/GITenberg/gitenberg.github.com/blob/master/index.html[here]. 86 | 87 | === Development 88 | 89 | Work is going on in several repositories.
The central point for 90 | development is the gitenberg-dev organization, whose repos can be seen 91 | on its GitHub https://github.com/gitenberg-dev[page]. 92 | 93 | There are other repositories in which some work for GITenberg happened: 94 | https://github.com/sethwoodworth/GITenmake[gitenmake], 95 | https://github.com/sethwoodworth/GITenberg[sethwoodworth/gitenberg] 96 | 97 | == Readmes, introductions and FAQs 98 | 99 | Several pieces of information are scattered around. This one is intended 100 | to replace all of them except the main GITenberg page, which serves as 101 | first contact. All the others are being amended to point to this 102 | document. 103 | 104 | http://gitenberg.github.io/[Main Gitenberg page] 105 | -------------------------------------------------------------------------------- /metadata/pgdata3.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | === Part 3. Other sources of metadata 3 | 4 | In link:pgdata2.asciidoc[Part 2] we looked at the data already in Project Gutenberg. Now we're going to want to bring in metadata from other sources. OpenLibrary is a source of metadata with a reasonably well designed API. It returns JSON, which can be readily converted to YAML. 5 | 6 | The OpenLibrary metadata for one edition (manifestation) of Space Viking, converted to YAML, is here: 7 | 8 | [source, yaml] 9 | ---- 10 | olid:OL7526155M: 11 | bib_key: olid:OL7526155M 12 | details: 13 | authors: 14 | - key: /authors/OL42322A 15 | name: H.
Beam Piper 16 | classifications: {} 17 | covers: 18 | - 284580 19 | created: 20 | type: /type/datetime 21 | value: '2008-04-29T13:35:46.876380' 22 | identifiers: 23 | goodreads: 24 | - '1440159' 25 | librarything: 26 | - '138032' 27 | isbn_10: 28 | - 0441602258 29 | isbn_13: 30 | - '9780441602254' 31 | key: /books/OL7526155M 32 | languages: 33 | - key: /languages/eng 34 | last_modified: 35 | type: /type/datetime 36 | value: '2012-06-18T00:43:42.710342' 37 | latest_revision: 7 38 | number_of_pages: 191 39 | publish_date: '1964' 40 | publishers: 41 | - Ace Books 42 | revision: 7 43 | series: 44 | - VIntage Ace SF, F-225 45 | title: Space Viking 46 | type: 47 | key: /type/edition 48 | works: 49 | - key: /works/OL579067W 50 | info_url: https://openlibrary.org/books/OL7526155M/Space_Viking 51 | preview: noview 52 | preview_url: https://openlibrary.org/books/OL7526155M/Space_Viking 53 | thumbnail_url: https://covers.openlibrary.org/b/id/284580-S.jpg 54 | ---- 55 | 56 | Things to note about the OpenLibrary metadata: 57 | 58 | * Some things it says are duplicates of what PG says, but in a different format. So "/languages/eng" instead of "en". The OpenLibrary value is technically a pointer to https://openlibrary.org/languages/eng, which gives the full name of the language, English, but not much else. Except for the sad fact that Aaron Swartz had to revert the name from a mistaken change to "Malayalam". 59 | * The Goodreads and LibraryThing IDs are very useful for linking, in both directions. Both of these services have difficulties linking to PG because their data models center around ISBN. If they could load using a local ID, making links to our ebooks would be a lot easier for them. Same story for OpenLibrary itself. 60 | * The main difficulty for use of this metadata in GITenberg is that it's edition-specific. Occasionally there might be an edition that coincides with the edition used for digitization, but that determination would have to be done by hand.
61 | * ISBNs can be very useful to get more info. LibraryThing and OCLC have APIs to get related ISBNs, which is useful for libraries because it helps them group editions into works. This is especially useful for GITenberg because there aren't ISBNs for the editions. It's not impossible that we could get a block of ISBNs specifically for the PG text. 62 | 63 | Another aggregation of metadata with a well-supported API is Google Books. You can obtain a Google Books id with an ISBN. So for example https://www.googleapis.com/books/v1/volumes/j6NrcDkA2wYC 64 | The YAML version is here: https://gist.github.com/eshellman/a4e6c3a671db98b56b07 Not much of it is useful to us, because of terms-of-service restrictions. 65 | 66 | Moving into the library world, most everything in PG is cataloged in some way by the Library of Congress: http://lccn.loc.gov/75000422 https://gist.github.com/eshellman/c2879061d753bcde63e1[Here are the files.] 67 | 68 | The MARC-XML is what many libraries would use; the MODS is somewhat less mystifying to the uninitiated. The most useful bits here are the Dewey Decimal numbers and the subject headings. 69 | [source,xml] 70 | ---- 71 | PZ4.P666 Sp4 72 | PS3566.I58 73 | 813/.5/4 74 | ---- 75 | and, wouldn't you know it, it's a bit different from what's in the Gutenberg RDF, which uses PS as the top level class, which LC has as an alternate. So which is right? A pull request to change it should include a reference, or a discussion of why it ought to be changed, so that an admin needn't research the matter. 76 | 77 | Finally, Wikipedia has a lot of good data in its "infoboxes". The infobox for _Space Viking_ includes info on the series the book is part of, as well as links (denoted by square brackets) to genre and author. The image field provides a cover.
78 | 79 | [source] 80 | ---- 81 | {{Infobox book 82 | | name = Space Viking 83 | | title_orig = 84 | | translator = 85 | | image = Image:Space Viking (pb cover 02ba).jpg 86 | | caption = First edition 87 | | author = [[H. Beam Piper]] 88 | | illustrator = 89 | | cover_artist = 90 | | country = United States 91 | | language = English 92 | | series = The Tanith Series 93 | | genre = [[Science fiction novel]] 94 | | publisher = [[Ace Books]] 95 | | release_date = [[1963 in literature|1963]] 96 | | media_type = Print ([[Paperback]]) 97 | | pages = 191 98 | | isbn = 99 | | preceded_by = 100 | | followed_by = Prince of Tanith 101 | }} 102 | ---- 103 | 104 | There's plenty of data in various places that could be added to a GITenberg record. 105 | 106 | The link:pgdata4.asciidoc[next chapter] will discuss metadata that is needed to make GITenberg work for us all, but doesn't exist in the available metadata. -------------------------------------------------------------------------------- /metadata/pgdata4.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | === Part 4. Metadata that's needed, but missing. Also, covers. 3 | 4 | In link:pgdata3.asciidoc[Part 3], I looked at the metadata that's available via API from other metadata sources. But there's a bunch of metadata internal to GITenberg (and to a lesser extent, Project Gutenberg) that should probably be included in a GITenberg metadata file. 5 | 6 | ==== Housekeeping data 7 | 8 | * the Repo URL. For Space Viking, that would be https://github.com/GITenberg/Space-Viking_20728 Unfortunately, because of Unicode and OS level issues, it's not as simple as you might think to derive the URL from other metadata. And maybe for portability, we should just keep "Space-Viking_20728". This has been implemented as "_repo" 9 | * Download URLs. Well maybe not. It probably makes more sense to have a resolver service separate from the repo. 10 | * version info.
Again, maybe not. It probably makes sense to use Github's versioning to keep track of this. On the other hand, downstream sites will need to know this stuff. But a MARC record builder could grab version info directly from Github rather than a version-controlled metadata file. 11 | * process/toolchain info. This requirement is best understood with an example. We have a tool that changes ascii quotes into typographic quotes. It might not be obvious from inspection of a file that this process has been completed. One way to record this would be to add something like "quotification_completed: March 23, 2015" to the metadata file. I'm not sure if this would be the best way to do this. 12 | 13 | ==== Cover data 14 | 15 | This could probably be a post on its own, but covers, and to a lesser extent, illustrations, present a set of new issues when combined with a version control system. The traditional view in the book world is that a book may have many editions (or manifestations), but each edition has a single cover associated with it. For public domain books, it's common for tens, hundreds, or even thousands of editions to exist, each with its own cover. The text inside might be the text from Gutenberg, perhaps with some editorial enhancement. If GITenberg succeeds, this will no doubt continue, but with improved quality all around. 16 | 17 | What's changed with public domain digital books, is that there's no need for a "manifestation" to fix a single cover. For some purposes, a scan of the original printed cover might be the best choice for a cover image. For other purposes, a branded, generated cover might do a better job of communicating to a potential reader what's "behind" the cover. Projects such as http://shop.thecreativeactionnetwork.com/collections/recovering-the-classics[Recovering the Classics] are creating new composite works by letting artists and designers breathe new life into covers for old books. 
And it's simple to swap in an alternate cover on-demand during a digital download. 18 | 19 | The consensus in the GITenberg core team has been that we need to make room for different types of covers. Each cover included in a repo will need its own metadata- license, cover type, attribution, and perhaps other things. And there needs to be available for every book an attractive public domain or CC0 cover that works well in a digital display. 20 | 21 | Because covers are used to identify editions in the marketplace, it's appropriate, in the GITenberg context, to assign ISBNs to editions on a cover-by-cover basis. Since GITenberg will offer multiple formats and won't need to track them, if we use ISBNs, we won't use a different one for each format, but instead fix the ISBN to the cover. In the http://www.bic-media.com/dmrn/codelists/onix-codelist-10.htm[ONIX file type code list], we would use '000', the Epublication "content package". 22 | 23 | ==== Source data 24 | 25 | Most PG texts I've seen have some indicators of the production process. For example, in the text of PG's Space Viking there is: 26 | 27 | [source] 28 | ---- 29 | Produced by Greg Weeks, William Woods and the Online 30 | Distributed Proofreading Team at http://www.pgdp.net 31 | 32 | [Transcriber's note: 33 | This etext was produced from Analog Science Fact--Science Fiction 34 | November 1962, December 1962, January 1963, February 1963. 35 | Extensive research did not uncover any evidence that the copyright 36 | on this publication was renewed.] 37 | ---- 38 | These notes and credits should be moved to the metadata, and the ebook builder should present these notes appropriately. And bibliographic data for the original edition should get recorded, if available. I can't imagine that there isn't a worldcat record for 99% of the source editions in PG; these ought to be identified and added somehow. 
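
As a rough illustration of moving these notes into metadata, here is a minimal Python sketch. It assumes the credit line always begins with "Produced by" and that the note is bracketed exactly as in the _Space Viking_ example; plenty of PG texts won't follow that pattern. The key names `production_credit` and `transcriber_note` are made up for this example and are not part of any GITenberg vocabulary.

```python
import re

# Assumed patterns, modelled only on the Space Viking example above.
CREDIT = re.compile(r"^Produced by (.+?)(?:\n\s*\n|\Z)", re.M | re.S)
NOTE = re.compile(r"\[Transcriber's note:\s*(.+?)\]", re.S | re.I)


def extract_production_notes(text):
    """Return whatever production credit and transcriber's note can be found."""
    found = {}
    credit = CREDIT.search(text)
    if credit:
        # Collapse the hard line wrapping into single spaces.
        found["production_credit"] = " ".join(credit.group(1).split())
    note = NOTE.search(text)
    if note:
        found["transcriber_note"] = " ".join(note.group(1).split())
    return found
```

Anything the patterns miss is simply left out of the result, so a human would still need to review texts where the function returns an empty dict.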
39 | 40 | * Much better provenance, including links to DP projects, scanned source files, Internet Archive mirrors, etc., would be useful metadata to add (from Tom Morris) 41 | 42 | ==== Quality data 43 | 44 | A library, bookseller or other sort of distributor should be able to select based on the quality of the ebooks available. A set of objective quality criteria would be useful. Here are some examples: 45 | 46 | * passes epubcheck 47 | * includes a clickable toc/index 48 | * includes semantic markup 49 | 50 | ==== Relations within PG 51 | 52 | (from Tom Morris) 53 | 54 | * PG has multiple editions of many works and often the later ones are of higher quality than the older "editionless" editions, yet the earlier ones get downloaded way more. Enhancing the bibliographic information to help with this issue would be useful to readers. For example, this early editionless Pride & Prejudice http://www.gutenberg.org/ebooks/1342#bibrec is downloaded over 30 times more often than this later high quality transcription of the 1932 R.W. Chapman edition http://www.gutenberg.org/ebooks/42671#bibrec donated by Distributed Proofreaders. #76 & #32325 are another example. It would be good to be able to link the various editions together. 55 | 56 | ==== External IDs 57 | 58 | In prior chapters we talked about external sources of metadata. We should include ids for: 59 | 60 | * OpenLibrary 61 | * OCLC 62 | * Google Books 63 | * Wikipedia 64 | * Goodreads 65 | * LibraryThing 66 | * Unglue.it 67 | * ManyBooks 68 | * FeedBooks 69 | 70 | ==== More 71 | 72 | I'm sure there are types of metadata I haven't thought of! Suggestions and comments are both welcome and needed. 73 | 74 | In the link:pgdata5.asciidoc[next chapter], I'll discuss target vocabularies for the metadata.
75 | -------------------------------------------------------------------------------- /metadata/pgdata2.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | === Part 2. Combing through the Gutenberg metadata 3 | 4 | To recap link:pgdata.asciidoc[Part 1], the Project Gutenberg metadata boils down to the following, expressed in YAML: 5 | 6 | [source,yaml] 7 | ---- 8 | # Project Gutenberg Metadata 9 | pgterms:ebook: 10 | url: http://www.gutenberg.org/ebooks/20728 11 | marcrel:ill: 12 | - pgterms:agent: http://www.gutenberg.org/2009/agents/25396 13 | pgterms:alias: 14 | - Schoenherr, John (John Carl) 15 | - Schoenherr, Jack 16 | pgterms:birthdate: 1935 17 | pgterms:deathdate: 2010 18 | pgterms:name: Schoenherr, John 19 | pgterms:webpage: 20 | - url: http://en.wikipedia.org/wiki/John_Schoenherr 21 | dcterms:description: Wikipedia 22 | 23 | dcterms:creator: 24 | - pgterms:agent: http://www.gutenberg.org/2009/agents/8301 25 | pgterms:alias: Piper, Henry Beam 26 | pgterms:birthdate: 1904 27 | pgterms:deathdate: 1964 28 | pgterms:name: Piper, H. Beam 29 | pgterms:webpage: 30 | - url: http://en.wikipedia.org/wiki/H._Beam_Piper 31 | dcterms:description: Wikipedia 32 | 33 | dcterms:issued: '2007-03-03' 34 | dcterms:language: !dcterms:RFC4646 en 35 | dcterms:license: http://www.gutenberg.org/license 36 | dcterms:publisher: Project Gutenberg 37 | dcterms:rights: Public domain in the USA. 38 | dcterms:subject: 39 | - !dcterms:LCSH Space warfare -- Fiction 40 | - !dcterms:LCSH Revenge -- Fiction 41 | - !dcterms:LCC PS 42 | - !dcterms:LCSH Science fiction 43 | dcterms:title: Space Viking 44 | dcterms:type: !dcterms:DCMIType Text 45 | pgterms:downloads: 299 46 | ---- 47 | 48 | From the top. 49 | 50 | * "pgterms" is a vocabulary specific to Project Gutenberg. The URL for this vocabulary given in the RDF, http://www.gutenberg.org/2009/pgterms/ is a 404. 51 | * "pgterms:agent" refers to creator entities.
http://www.gutenberg.org/2009/agents/8301 is the identifier given to H. Beam Piper; that's also a 404, but you might want to look at http://www.gutenberg.org/ebooks/author/8301 or http://www.gutenberg.org/ebooks/author/25396 52 | ** In a relational database, the metadata for authors and illustrators would be represented with a foreign key to an agent table. 53 | ** For GITenberg, it makes sense to maintain author metadata separately, on the theory that when an author dies, you shouldn't have to update the metadata for every single book the author has written. 54 | ** It also makes sense to link to "authority files" for agent metadata. So for example, we could enter http://viaf.org/viaf/81055793/ into the H. Beam Piper agent field and pull back the other agent metadata as needed. 55 | ** The PG corpus has a number of duplicate entries for authors, despite the nominally canonical IDs. 56 | + 57 | Tom Morris did a quick recheck of this. Here is a sample of some of his corrections after using the clustering facility of OpenRefine. The (nominally) corrected version is in the first column. 58 | + 59 | |==== 60 | |American Sunday-School Union |Union, American Sunday-School 61 | |Bakunin, Mikhail Aleksandrovic |Bakunin, Mikhail Aleksandrovich 62 | |Barine, Arvède |Barine, Arvede 63 | |Ditchfield, P. H. (Peter Hampson) |Ditchfield, P. H. Peter Hampson) 64 | |Gerhard, J. W. |Gerhard, J.W. 65 | |Haapanen-Tallgren, Tyyni |Haapanen-Tallgren, Tyyne 66 | |Knatchbull-Hugesson, Edward Hugessen |Knatchbull-Hugessen, Edward Hugessen 67 | |La Monte, Robert Rives |Monte, Robert Rives la 68 | |Levett-Yeats, S. (Sidney) |Levett Yeats, S. (Sidney) 69 | |Library of Congress. Copyright Office |Copyright Office. Library of Congress. 70 | |==== 71 | + 72 | On the plus side, there are only a couple dozen of these records in the 20k+ authors, so it's a pretty small problem, but it does indicate that the PG author records can't be relied upon to be unique.
73 | 74 | * The agents in the PG metadata use "relations". For illustrators, the relation used is "marcrel:ill" which comes from MARC's relators: http://www.loc.gov/marc/relators/relaterm.html, while for authors the dcterms:creator (Dublin Core) relation is used. MARC has the "aut" relator which means the same thing. 75 | * dcterms:issued: and dcterms:publisher: refer to Project Gutenberg's publication of the ebook, not to the publication of the print original. Surprisingly, the metadata makes no attempt to identify or refer to the original print edition it's made from. When GITenberg starts making versions of the ebook, what should it be saying in these fields, and should it even be trying? 76 | * dcterms:subject: refers to Library of Congress subject headings and class codes. As Alex points out in the Google Group, the values aren't URIs, they're text. We should be able to do better at normalization. 77 | * pgterms:downloads: the downloads number refers to a prior week. There's no date context for this number, so it doesn't seem very useful. I told you RDF wasn't very good at representing dynamic state, and here's a good example. You can _do_ it, but it's more work than you really want. 78 | * dcterms:type: !dcterms:DCMIType Text. Apparently this is either audio or text. A verbose bit. 79 | 80 | Conspicuous by its absence is a dcterms:description element. To see how representative _Space Viking_ is, Raymond did a predicate analysis of the entire PG RDF corpus. It's here: https://gist.github.com/rdhyee/8f84142f808d36796fa3 81 | 82 | You can see that the file manifests take up a big chunk of the metadata as there are 654,523 files in all. 83 | 84 | Apparently 37,199 of the 48,538 ebooks have descriptions. Only 10 of them don't have titles. 3127 of them include a table of contents in the metadata. There are plenty more relators; editors and translators head the list.
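
To give a feel for what such a tally involves, here is a minimal Python sketch. It is not the analysis linked above: it assumes each ebook has already been boiled down to a dict keyed by predicate (the shape of the simple YAML at the top of this chapter), and it counts how many records carry each predicate at least once.

```python
from collections import Counter


def predicate_counts(records):
    """Count, for each predicate, how many records use it at least once."""
    counts = Counter()
    for record in records:
        # set() makes sure a predicate is counted once per ebook.
        counts.update(set(record))
    return counts


def missing(records, predicate):
    """Number of records that lack the given predicate entirely."""
    return len(records) - predicate_counts(records)[predicate]
```

With the whole corpus loaded, `missing(records, "dcterms:title")` would be the way to reproduce a figure like "only 10 of them don't have titles".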
85 | 86 | The other category of metadata is taken up by a bunch of MARC-related fields, the most common being 9,238 appearances of marc901 and 3,067 appearances of marc010. MARC 901 is bizarre because it's a local data field, used by libraries for strictly local purposes. MARC 010 is the Library of Congress Control Number, or LCCN. Other information in MARC fields includes some publication info, uniform title, series info, production credits, and some edition notes. 87 | 88 | In the link:pgdata3.asciidoc[next chapter], I'll look at other metadata sources that we could bring into the GITenberg metadata, including data from library catalogs. 89 | -------------------------------------------------------------------------------- /metadata/pgdata.asciidoc: -------------------------------------------------------------------------------- 1 | == GITenberg metadata 2 | === Part 1. Boiling down the Gutenberg RDF 3 | 4 | One of the objectives of GITenberg is to provide a GitHub-flavored pathway for the improvement of the metadata for Project Gutenberg ebooks. This runs in two directions: 5 | . Improving the accessibility and usability of PG metadata 6 | . Improving the quality and completeness of PG metadata 7 | 8 | The first step in this effort is to figure out what metadata already exists in Project Gutenberg. 9 | 10 | Project Gutenberg provides periodic dumps of its metadata in a set of RDF files. These are the metadata used to make the "bibrec" pages and also to make ebook files (an EPUB package, for example, stores this metadata in its "OPF" file). The dump consists of a zipped tarfile containing one RDF file per PG text. In the second tranche of repos moved to GitHub (roughly those above 10,000), Seth added the RDF file for each text to the corresponding archive. This saves us from having to deal with opening the archive and letting our operating systems deal with 50,000 directories in a directory.
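
For anyone who does want to read the dump directly, it can be streamed without unpacking it to disk. The following is a minimal Python sketch using only the standard library; the only thing it assumes about the archive's layout is that the per-text files end in `.rdf`, and the function name is made up for this example.

```python
import tarfile


def iter_rdf_members(fileobj):
    """Yield (member name, raw bytes) for each .rdf file in a tar archive.

    fileobj is a seekable binary file object; mode "r:*" lets tarfile
    detect the compression (gzip, bzip2, ...) by itself.
    """
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".rdf"):
                yield member.name, archive.extractfile(member).read()
```

Because members are read one at a time, nothing is written to the file system, which sidesteps the 50,000-directories problem.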
11 | 12 | The rdf file corresponding to #20728, "Space Viking" is here: https://gist.github.com/eshellman/90f1996b33e20e069040 13 | 14 | it starts out 15 | [source,xml] 16 | ---- 17 | 18 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | en 33 | 34 | 35 | ---- 36 | 37 | So far, the translation of this into English is: 38 | 39 | ========================== 40 | The resource identified by http://www.gutenberg.org/ has a GPL License. 41 | The resource identified by http://www.gutenberg.org/ebooks/20728 is an ebook in English. 42 | ========================== 43 | The goal of GITenberg is to present metadata in a way that humans can read and edit, and git can track changes. RDF is a poor answer to these requirements, at least in this xml serialization. 44 | 45 | We can take a step towards making this more usable by using the json-ld serialization for the rdf. I re-serialized the RDF XML file for Space Viking using the rdf translator at http://rdf-translator.appspot.com/ The result is available here: https://gist.github.com/eshellman/140258d9958c97580b42 46 | 47 | it starts out: 48 | [source,json] 49 | ---- 50 | { 51 | "@context": { 52 | "cc": "http://web.resource.org/cc/", 53 | "dcam": "http://purl.org/dc/dcam/", 54 | "dcterms": "http://purl.org/dc/terms/", 55 | "marcrel": "http://id.loc.gov/vocabulary/relators/", 56 | "pgterms": "http://www.gutenberg.org/2009/pgterms/", 57 | "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#", 58 | "rdfs": "http://www.w3.org/2000/01/rdf-schema#", 59 | "xsd": "http://www.w3.org/2001/XMLSchema#" 60 | }, 61 | "@graph": [ 62 | { 63 | "@id": "_:N2a40c4f7b6b8445b86d117253d79e17f", 64 | "dcam:memberOf": { 65 | "@id": "dcterms:IMT" 66 | }, 67 | "rdf:value": { 68 | "@type": "dcterms:IMT", 69 | "@value": "application/x-mobipocket-ebook" 70 | } 71 | }, 72 | ---- 73 | 74 | The translation of this is: 75 | 76 | ========================== 77 | There is a mime-type with value "application/x-mobipocket-ebook" 78 | ========================== 79 | This is 
significantly better for use with git, but not much of an improvement for readability. The biggest difficulty for both these serializations is the explicit "blank nodes" with id's that look like "_:N2a40c4f7b6b8445b86d117253d79e17f". Blank nodes are what the RDF model uses to convert hierarchical data structures into super-simple data triples. RDF processors easily ingest the triples, but humans find them inscrutable. There are also some deep problems with blank nodes, which I've blogged about. http://go-to-hellman.blogspot.com/2009/11/blank-node-bother-and-rdf-copymess.html 80 | 81 | Fortunately, you can remove the blank nodes without changing the representation a bit. I wrote a little script to do that. https://gist.github.com/eshellman/b717d6b1b49498140218 82 | 83 | The result is here: https://gist.github.com/eshellman/b8fbd310c5ec2d77d949 84 | 85 | It starts 86 | [source,json] 87 | ---- 88 | { 89 | "@context": { 90 | "cc": "http://web.resource.org/cc/", 91 | "dcam": "http://purl.org/dc/dcam/", 92 | "dcterms": "http://purl.org/dc/terms/", 93 | "marcrel": "http://id.loc.gov/vocabulary/relators/", 94 | "pgterms": "http://www.gutenberg.org/2009/pgterms/", 95 | "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#", 96 | "rdfs": "http://www.w3.org/2000/01/rdf-schema#", 97 | "xsd": "http://www.w3.org/2001/XMLSchema#" 98 | }, 99 | "@graph": [ 100 | { 101 | "@id": "http://www.gutenberg.org/files/20728/20728.txt", 102 | "@type": "pgterms:file", 103 | "dcterms:extent": 414640, 104 | "dcterms:format": { 105 | "dcam:memberOf": { 106 | "@id": "dcterms:IMT" 107 | }, 108 | "rdf:value": { 109 | "@type": "dcterms:IMT", 110 | "@value": "text/plain; charset=us-ascii" 111 | } 112 | }, 113 | "dcterms:isFormatOf": { 114 | "@id": "http://www.gutenberg.org/ebooks/20728" 115 | }, 116 | "dcterms:modified": { 117 | "@type": "xsd:dateTime", 118 | "@value": "2012-07-02T13:51:54" 119 | } 120 | }, 121 | ---- 122 | 123 | As you can see, it's starting to be slightly comprehensible. 
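
The idea behind removing blank nodes can be sketched in a few lines of Python. This is not the script linked above, just a minimal illustration under some assumptions: the JSON-LD document has a flat `@graph` list, blank node ids start with `_:`, and a reference to a blank node is a dict whose only key is `@id`.

```python
def inline_blank_nodes(doc):
    """Return a copy of a JSON-LD document with blank nodes folded into
    the nodes that reference them."""
    graph = doc.get("@graph", [])
    blank = {n["@id"]: n for n in graph if n.get("@id", "").startswith("_:")}

    def resolve(value, seen=()):
        if isinstance(value, list):
            return [resolve(v, seen) for v in value]
        if isinstance(value, dict):
            ref = value.get("@id")
            # A bare {"@id": "_:..."} is a pointer to a blank node;
            # `seen` guards against reference cycles.
            if len(value) == 1 and ref in blank and ref not in seen:
                target = blank[ref]
                return {k: resolve(v, seen + (ref,))
                        for k, v in target.items() if k != "@id"}
            return {k: resolve(v, seen) for k, v in value.items()}
        return value

    # Blank nodes disappear from the top level of the graph.
    named = [n for n in graph if n.get("@id") not in blank]
    return {**doc, "@graph": [resolve(n) for n in named]}
```

Two consequences of this simple approach are worth knowing: a blank node referenced from several places gets copied into each of them, and a blank node that nothing references is dropped.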
124 | 125 | Before I translate this, I'll go one more step, converting the JSON to YAML: https://gist.github.com/eshellman/0577a24406be521dfe78 126 | 127 | YAML is a data serialization designed to be reader friendly. It pretty much nails our application; it's also used in applications such as Pandoc, the open-source swiss-army knife of document converters. 128 | Take a look at it on GitHub; you'll see that GitHub does a lovely job presenting the YAML file. 129 | 130 | Now we have, to start: 131 | [source,yaml] 132 | ---- 133 | '@context': 134 | cc: http://web.resource.org/cc/ 135 | dcam: http://purl.org/dc/dcam/ 136 | dcterms: http://purl.org/dc/terms/ 137 | marcrel: http://id.loc.gov/vocabulary/relators/ 138 | pgterms: http://www.gutenberg.org/2009/pgterms/ 139 | rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# 140 | rdfs: http://www.w3.org/2000/01/rdf-schema# 141 | xsd: http://www.w3.org/2001/XMLSchema# 142 | '@graph': 143 | - '@id': http://www.gutenberg.org/files/20728/20728.txt 144 | '@type': pgterms:file 145 | dcterms:extent: 414640 146 | dcterms:format: 147 | dcam:memberOf: 148 | '@id': dcterms:IMT 149 | rdf:value: 150 | '@type': dcterms:IMT 151 | '@value': text/plain; charset=us-ascii 152 | dcterms:isFormatOf: 153 | '@id': http://www.gutenberg.org/ebooks/20728 154 | dcterms:modified: 155 | '@type': xsd:dateTime 156 | '@value': '2012-07-02T13:51:54' 157 | ---- 158 | This says: 159 | 160 | ========================== 161 | There is a file with id http://www.gutenberg.org/files/20728/20728.txt 162 | which has size 414640 163 | and mime-type text/plain; charset=us-ascii 164 | which is a format of http://www.gutenberg.org/ebooks/20728 165 | and was last modified on '2012-07-02T13:51:54' 166 | ========================== 167 | and if you are a metadata-head, it's reasonably straightforward to understand.
168 | 169 | Now that we know what the metadata is saying, the obvious question is: _Why on earth is PG reporting file manifests in RDF???_ 170 | 171 | I think it's appropriate to forgive the very smart people who decided to put a file manifest in RDF, because it was a very different time and who could have foreseen back then which way technology would go. 172 | 173 | RDF's strength is moving data from silo to silo. One of its weaknesses is representing dynamic state. Today we have file systems in the cloud that can report a file's size, minute by minute. Trying to mirror the information in a file system is a set of bugs waiting to happen. 174 | 175 | Our next step is to toss out all that metadata about files so we can focus on the content metadata. 176 | 177 | Also on the chopping block is the assertion about http://www.gutenberg.org/ being GPL. First of all, I think the assertion is *trying* to say that the RDF file itself is GPL, because why would you put an assertion about PG as a whole in every RDF file? Even if it's trying to make an assertion about the file itself, it doesn't make much sense: metadata is not generally copyrightable, so the GPL has no force. 178 | 179 | In addition, I swap in YAML's value typing machinery (see the !values below) for RDF's somewhat clumsy machinery. I factor out the context, as it's machine oriented, giving precise meaning to the prefixes used, and it'll be the same for every repo. The graph id, needed for RDF technicalities, does more harm than good in the YAML because GitHub gives us all the id machinery we need.
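
The pruning just described might look roughly like this in Python. It's a sketch of the first two steps only (dropping file nodes and the GPL assertion, and leaving the context behind), assuming the blank-node-free JSON-LD from earlier in this chapter; the YAML retyping and reshaping are not shown, and the function names are made up for this example.

```python
PG_SITE = "http://www.gutenberg.org/"


def _types(node):
    """Return a node's @type as a list, whether it was a string or a list."""
    declared = node.get("@type", [])
    return declared if isinstance(declared, list) else [declared]


def content_nodes(doc):
    """Keep only the graph nodes that describe content rather than files."""
    kept = []
    for node in doc.get("@graph", []):
        if "pgterms:file" in _types(node):
            continue  # file manifest entry
        if node.get("@id") == PG_SITE:
            continue  # the GPL assertion about gutenberg.org
        kept.append(node)
    # "@context" is deliberately not carried over.
    return kept
```

For a typical PG record this should leave the ebook node plus the agent nodes, which is the material the simple YAML is built from.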
180 | 181 | We're left with what I call simple PG YAML: https://gist.github.com/eshellman/2863fc5ffb129714f617 182 | 183 | It's boiled down enough that I can quote it here in its entirety: 184 | 185 | [source,yaml] 186 | ---- 187 | # Project Gutenberg Metadata 188 | pgterms:ebook: 189 | url: http://www.gutenberg.org/ebooks/20728 190 | marcrel:ill: 191 | - pgterms:agent: http://www.gutenberg.org/2009/agents/25396 192 | pgterms:alias: 193 | - Schoenherr, John (John Carl) 194 | - Schoenherr, Jack 195 | pgterms:birthdate: 1935 196 | pgterms:deathdate: 2010 197 | pgterms:name: Schoenherr, John 198 | pgterms:webpage: 199 | - url: http://en.wikipedia.org/wiki/John_Schoenherr 200 | dcterms:description: Wikipedia 201 | 202 | dcterms:creator: 203 | - pgterms:agent: http://www.gutenberg.org/2009/agents/8301 204 | pgterms:alias: Piper, Henry Beam 205 | pgterms:birthdate: 1904 206 | pgterms:deathdate: 1964 207 | pgterms:name: Piper, H. Beam 208 | pgterms:webpage: 209 | - url: http://en.wikipedia.org/wiki/H._Beam_Piper 210 | dcterms:description: Wikipedia 211 | 212 | dcterms:issued: '2007-03-03' 213 | dcterms:language: !dcterms:RFC4646 en 214 | dcterms:license: http://www.gutenberg.org/license 215 | dcterms:publisher: Project Gutenberg 216 | dcterms:rights: Public domain in the USA. 217 | dcterms:subject: 218 | - !dcterms:LCSH Space warfare -- Fiction 219 | - !dcterms:LCSH Revenge -- Fiction 220 | - !dcterms:LCC PS 221 | - !dcterms:LCSH Science fiction 222 | dcterms:title: Space Viking 223 | dcterms:type: !dcterms:DCMIType Text 224 | pgterms:downloads: 299 225 | ---- 226 | 227 | I think I'll stop here, because link:pgdata2.asciidoc[the next chapter] will be about the metadata rather than how it might be presented, which is a different topic. 
228 | -------------------------------------------------------------------------------- /metadata/pandata_attribute_dictionary.yaml: -------------------------------------------------------------------------------- 1 | 2 | # pandata design notes 3 | # "objects" are attribute sets. 4 | # plural names for plural creators used because it seemed more natural to have author and authors attributes, for example. 5 | # it's a mistake to have both. 6 | 7 | _edition: Provides the name for the edition, which should be a string that can be used for file naming and for identifying the edition. 8 | _repo: the GITenberg repo name. 9 | _version: the version string. Should come from the repo tag. 10 | agent_name: the name of the creator/contributor, for use in creator/contributor objects 11 | agent_sortname: the agent_name, arranged for sorting, typically last name first 12 | alias: A name or list of names for a creator/contributor 13 | alternative_title: An alternative title. 14 | attribution_url: for Creative Commons licenses, the url for attribution 15 | birthdate: Value should be a date. For creator/contributor objects. 16 | contributor: From Dublin Core, a contributor who is not the principal creator. Value is an object (dict) with one or more contributors. In principle, could be any role from https://www.loc.gov/marc/relators/relaterm.html 17 | adapter: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code adp 18 | adapters: A type of creators/contributors. Value should be a list of adapters. 19 | annotator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ann 20 | annotators: A type of creators/contributors. Value should be a list of annotators. 21 | arranger: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code arr 22 | arrangers: A type of creators/contributors. Value should be a list of arrangers.
23 | artist: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code art 24 | artists: A type of creators/contributors. Value should be a list of artists. 25 | author_of_afterword: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aft 26 | author_of_afterwords: A type of creators/contributors. Value should be a list of author_of_afterwords. 27 | author_of_introduction: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aui 28 | author_of_introductions: A type of creators/contributors. Value should be a list of author_of_introductions. 29 | book_producer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code bkp 30 | book_producers: A type of creator/contributor. Value should be a list of book_producers 31 | collaborator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code clb 32 | collaborators: A type of creators/contributors. Value should be a list of contributors. 33 | commentator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code cmm 34 | commentators: A type of creators/contributors. Value should be a list of commentators. 35 | compiler: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code com 36 | compilers: A type of creators/contributors. Value should be a list of compilers. 37 | composer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code cmp 38 | composers: A type of creators/contributors. Value should be a list of composers. 39 | conductor: A type of creator/contributor. Value should be a creator/contributor object. 
Defined by a MARC relation, code cnd 40 | conductors: A type of creators/contributors. Value should be a list of conductors. 41 | contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ctb 42 | dubious_author: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code dub 43 | dubious_authors: A type of creators/contributors. Value should be a list of dubious_authors. 44 | editor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code edt 45 | editors: A type of creators/contributors. Value should be a list of editors. 46 | engineer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code egr 47 | engineers: A type of creators/contributors. Value should be a list of engineers. 48 | illustrator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ill 49 | illustrators: A type of creators/contributors. Value should be a list of illustrators. 50 | librettist: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code lbt 51 | librettists: A type of creators/contributors. Value should be a list of librettists. 52 | other_contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code oth 53 | other_contributors: A type of creators/contributors. Value should be a list of other_contributors. 54 | performer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code prf 55 | performers: A type of creators/contributors. Value should be a list of performers. 56 | photographer: A type of creator/contributor. Value should be a creator/contributor object. 
Defined by a MARC relation, code pht 57 | photographers: A type of creators/contributors. Value should be a list of photographers. 58 | printer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code prt 59 | printers: A type of creators/contributors. Value should be a list of printers. 60 | publisher: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code pbl 61 | publisher: A type of creators/contributors. Value should be a list of publisher_contributors. 62 | researcher: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code res 63 | researchers: A type of creators/contributors. Value should be a list of researchers. 64 | transcriber: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code trc 65 | transcribers: A type of creators/contributors. Value should be a list of transcribers. 66 | translator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code trl 67 | translators: A type of creators/contributors. Value should be a list of translators. 68 | unknown_contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code unk 69 | unknown_contributors: A type of creators/contributors. Value should be a list of unknown_contributors. 70 | covers: Value should be a list of cover objects. 
71 | attribution_url: For Creative Commons licenses, the url for attribution. 72 | cover_type: archival | archival-back | original | generated | titlepage_image 73 | image_path: A file path to the image file, relative to the directory containing the metadata file. 74 | image_url: A url for the cover image. 75 | rights: Rights pertaining to the cover. 76 | rights_url: A url defining the rights. 77 | creator: From Dublin Core, the principal creators: usually the author, sometimes editors. Value is an object (dict) with one or more creators. In principle, could be any role from https://www.loc.gov/marc/relators/relaterm.html 78 | author: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aut 79 | authors: A type of creators/contributors. Value should be a list of authors. 80 | editor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code edt 81 | editors: A type of creators/contributors. Value should be a list of editors. 82 | deathdate: Value should be a date. For creator/contributor objects. 83 | description: Text description. 84 | download_url_epub: A url to download an EPUB file. 85 | download_url_mobi: A url to download a MOBI (Kindle) file. 86 | download_url_pdf: A url to download a PDF file. 87 | edition_identifiers: An object (dict) with one or more identifiers. If there is more than one identifier of a given type, the value for that identifier type should be a list of values. 88 | isbn: An identifier, International Standard Book Number. Always 13 digits. No dashes please. 89 | oclc: A number. Can link to WorldCat with http://www.worldcat.org/oclc/{%oclc%} 90 | edition_list: A list of objects (dicts), each with an "edition" entry and any other properties that are distinctive for the edition. These properties replace any properties specified at the top level of the pandata. 91 | edition_note: Descriptive note about the edition. For MARC 250 field.
92 | funding_info: Text describing the funding for this edition. Used for MARC 536 field. 93 | gutenberg_agent_id: An agent identifier used in Project Gutenberg. 94 | gutenberg_bookshelf: Project Gutenberg assigns some books to one or more "bookshelves". 95 | gutenberg_issued: Project Gutenberg's issuance date. 96 | gutenberg_transcribers_note: Project Gutenberg includes notes from the transcriber in text; this should be moved into metadata. 97 | gutenberg_type: Within Project Gutenberg, the item type. Usually "Text" 98 | identifiers: An object (dict) with one or more identifiers. If there is more than one identifier of a given type, the value for that identifier type should be a list of values. 99 | goodreads: A number. Can link to Goodreads with https://www.goodreads.com/book/show/{%goodreads%} 100 | googlebooks: A case-sensitive base64 string. Can link to Google Books with https://books.google.com/books?id={%googlebooks%} 101 | gutenberg: A number. Can link to Project Gutenberg with https://www.gutenberg.org/ebooks/{%gutenberg%} 102 | isbns_related: A list of ISBNs for related editions. 103 | isbn_electronic: ISBN for an electronic edition (various formats). 104 | isbn_epub: ISBN for an EPUB edition. 105 | isbn_hard: ISBN for a hardcover edition. 106 | isbn_mobi: ISBN for a MOBI (Kindle) edition. 107 | isbn_paper: ISBN for a paperback edition. 108 | isbn_pdf: ISBN for a PDF edition. 109 | lccn: An identifier, Library of Congress Control Number. For MARC 010 field. 110 | librarything: A number. Can link to LibraryThing with https://www.librarything.com/work/{%librarything%} 111 | openlibrary: A case-sensitive string like '/works/OL1892624W'. Can link to Open Library with https://openlibrary.org{%openlibrary%} 112 | unglueit: A number. Can link to unglue.it with https://unglue.it/work/{%unglueit%} 113 | language_note: Text for MARC 546 field. Usually used to note multiple languages, etc.
114 | language: An ISO code for the item's language, usually 2 letters, but could be 3, and might have a locale (zh-TW, for example). 115 | ocr_file_link: A url leading to the OCR files or OCRed image related to this book. 116 | physical_description_note: Text for MARC 300 field. 117 | publication_date: The publication date of the edition. 118 | publication_date_original: The publication date of the edition from which the book is derived. 119 | production_note: Text for use in MARC 508 field. 120 | publication_note: Text for MARC 260 field. 121 | publisher_original: For Gutenberg, the original publisher of the edition from which the book is derived. 122 | publisher: The name of the publisher of an edition. 123 | rights: A Dublin Core term, a string describing the usage rights. For example, "Public Domain, US" 124 | rights_url: A url defining the rights. For example, http://creativecommons.org/about/pdm 125 | series_note: Text for use in MARC 440 field. 126 | subjects: A list of subjects. These can be typed. !lcsh is Library of Congress Subject Headings, !lcc is Library of Congress Classification. Use this list for sources: http://www.loc.gov/standards/sourcelist/subject.html. If there's no authority, just leave the value untyped. 127 | summary: Text for use in MARC 520 field. 128 | tableOfContents: Value should be text. 129 | title: Plain text title. Use asciidoc for subscripts, etc. 130 | url: An identifying url for the object. 131 | wikipedia: A URL or list of URLs for a Wikipedia page. 132 | --------------------------------------------------------------------------------