├── .gitignore
├── info
    └── mod date.png
├── drop-caps.asciidoc
├── .travis.yml
├── questions-DP.asciidoc
├── unicoding.asciidoc
├── metadata
    ├── README.asciidoc
    ├── editions.asciidoc
    ├── pgdata5.asciidoc
    ├── pgdata3.asciidoc
    ├── pgdata4.asciidoc
    ├── pgdata2.asciidoc
    ├── pgdata.asciidoc
    └── pandata_attribute_dictionary.yaml
├── encoding.asciidoc
├── filetypes.asciidoc
├── versioning.asciidoc
├── index.asciidoc
├── adopting.asciidoc
├── how_to.asciidoc
├── activerepos.csv
├── sectioning.asciidoc
├── Travis-CI.asciidoc
└── README.asciidoc


/.gitignore:
--------------------------------------------------------------------------------
1 | # html generated from asciidoc
2 | *.asciidoc.html


--------------------------------------------------------------------------------
/info/mod date.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitenberg-dev/documentation/HEAD/info/mod date.png


--------------------------------------------------------------------------------
/drop-caps.asciidoc:
--------------------------------------------------------------------------------
1 | = Drop Caps
2 | 
3 | Drop caps are an important stylistic feature of many books.
4 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: ruby
 2 | before_install:
 3 |   - gem install asciidoctor
 4 | script:
 5 |   - asciidoctor index.asciidoc
 6 | after_success:
 7 |   - build/push.sh
 8 | env:
 9 |   global:
10 |     secure: MVsv9XYEjlgRT2gWwATBva0YQ1e3XEF8b3lVUynY7MgL548iA9TYNze1G2eC3v1YKtjs+VeoHx5rRhIVY+pc79sTuW41Xo2bJImgsySYxJ6FwoR0AOu7DwMTgGIol40yNELBnUF0NyXFVVqMx9aHZ4E8VxZSe3bG90iBMIZZOhk=
11 | 


--------------------------------------------------------------------------------
/questions-DP.asciidoc:
--------------------------------------------------------------------------------
1 | = Questions about Distributed Proofreaders
2 | 
3 | * Where can we get a database of DP projects > PG ebooks?
4 | ** can we get access or a link to the scans a DP book was created from?
5 | ** Can we get a link to the forum thread for the DP book?
6 | ** Can we get the original html product of DP?
7 | * How does PG change DP generated html?
8 | * How would you like your html to be distributed/managed?
9 | 


--------------------------------------------------------------------------------
/unicoding.asciidoc:
--------------------------------------------------------------------------------
1 | GITenberg uses Unicode characters for typography where appropriate.  
2 | Common Unicode characters that we use are curved quotes (aka &ldquo;smart quotes&rdquo;) and https://en.wikipedia.org/wiki/Dash#Common_dashes[dashes].
3 | 
4 | Contributor @rsperberg has written a script that will convert many of these characters automatically, available in the https://github.com/gitenberg-dev/punctuation-cleanup[gitenberg-dev/punctuation-cleanup] repo with instructions on use.


--------------------------------------------------------------------------------
/metadata/README.asciidoc:
--------------------------------------------------------------------------------
 1 | == GITenberg metadata
 2 | 
 3 | - Part 1. link:pgdata.asciidoc[Boiling down the Gutenberg RDF]
 4 | - Part 2. link:pgdata2.asciidoc[Combing through the Gutenberg metadata]
 5 | - Part 3. link:pgdata3.asciidoc[Other sources of metadata]
 6 | - Part 4. link:pgdata4.asciidoc[Metadata that's needed, but missing. Also, covers.]
 7 | - Part 5. link:pgdata5.asciidoc[Metadata targets]
 8 | ** Dublin Core / EPUB
 9 | ** Bibframe / MARC
10 | ** Schema.org
11 | ** TEI
12 | ** ONIX
13 | ** BIBO
14 | - Part 6. link:editions.asciidoc[Editions]
15 | 


--------------------------------------------------------------------------------
/encoding.asciidoc:
--------------------------------------------------------------------------------
 1 | The https://en.wikipedia.org/wiki/Character_encoding[character encoding] of GITenberg books 
 2 | should always be https://en.wikipedia.org/wiki/UTF-8[UTF-8] (an encoding of Unicode).  
 3 | Project Gutenberg source files were often created before Unicode was in common use and need to be converted.
 4 | 
 5 | If you get an error like the following, you will need to do this step:
 6 | 
 7 |     pandoc: Cannot decode byte '\xb0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
 8 | 
 9 | 
10 | To re-encode a file as UTF-8, use the `iconv` command-line tool, naming the file's current encoding with `-f` (ISO-8859-1 is shown here as an example).
11 | 
12 |     iconv -c -f "ISO-8859-1" -t "UTF-8" < input.html > output.asciidoc
13 | 
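The same re-encoding step can be sketched in Python. This is a minimal sketch, assuming the source file is Latin-1 (ISO-8859-1); the function names `to_utf8` and `reencode_file` are hypothetical and are not part of any GITenberg tool.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8.

    The default of Latin-1 is an assumption; pass the real encoding if known.
    """
    return raw.decode(source_encoding).encode("utf-8")


def reencode_file(src: str, dst: str, source_encoding: str = "latin-1") -> None:
    """Read src, convert its bytes to UTF-8, and write the result to dst."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes(), source_encoding))
```

The byte `\xb0` from the pandoc error above is the degree sign in Latin-1, so `to_utf8(b"32\xb0F")` yields the two-byte UTF-8 form of that character between `32` and `F`.
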
14 | For instance, if 164-h/164-h.htm
15 | 


--------------------------------------------------------------------------------
/filetypes.asciidoc:
--------------------------------------------------------------------------------
 1 | = Filetypes
 2 | 
 3 | There are many types of files that make up GITenberg repos.
 4 | 
 5 | == Types of files
 6 | == Types of PG records
 7 | There are also many types of records from Project Gutenberg.
 8 | It makes sense to identify some of these repo types here.
 9 | 
10 | === Audiobooks
11 | PG includes at least two types of audiobooks:
12 | 
13 | * Computer-read audiobooks
14 | ** https://github.com/GITenberg/The-Call-of-the-Wild_9115[call of the wild 9115]
15 | * Human-read audiobooks produced by LibriVox
16 | ** https://github.com/GITenberg/The-War-of-the-Worlds_26291[war of the worlds 26291]
17 | ** https://github.com/GITenberg/Call-of-the-Wild_19678[call of the wild 19678]
18 | 
19 | Here is a list of LibriVox books:
20 | 
21 | 
22 | 


--------------------------------------------------------------------------------
/versioning.asciidoc:
--------------------------------------------------------------------------------
 1 | = Versioning GITenberg books
 2 | 
 3 | GITenberg uses http://semver.org/[Semantic Versioning] to describe the changes to books over time.
 4 | The current version of a GITenberg book is stored in three places:
 5 | 
 6 | * As a git tag
 7 | * As the value for `_version` in the repo's link:metadata/pandata_attribute_dictionary.yaml[metadata.yaml]
 8 | * On the 3rd line of the main book asciidoc file (prefixed with `v`)
 9 | 
10 | [source, asciidoc]
11 | ----
12 | = Title
13 | Author
14 | RevisionNumber, RevisionDate
15 | ----
16 | 
17 | A semantic version is in the format of `Major.Minor.Patch`.
18 | The only agreed-upon reason so far to increment `Minor` is a complete EPUB file built by Travis, as described in the https://github.com/gitenberg-dev/Second-Folio/[Second-Folio] collection.
19 | 
20 | For now, assume that your changes increment the `Patch` value.
21 | `v0.0.23` would become `v0.0.24`
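
The patch increment described above can be sketched as a small helper. This is only an illustration: the function name `bump_patch` is hypothetical, and it assumes a plain `Major.Minor.Patch` string with an optional leading `v`.

```python
def bump_patch(version: str) -> str:
    """Return version with its Patch component incremented by one.

    Assumes the form "Major.Minor.Patch", optionally prefixed with "v".
    """
    prefix = "v" if version.startswith("v") else ""
    major, minor, patch = (int(part) for part in version[len(prefix):].split("."))
    return f"{prefix}{major}.{minor}.{patch + 1}"
```

The components are compared as integers, not as text, so `v0.0.9` becomes `v0.0.10`.
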
22 | 
23 | * http://asciidoc.org/userguide.html#X95[Asciidoc documentation]
24 | * http://asciidoctor.org/docs/user-manual/#revision-number-date-and-remark[Asciidoctor documentation]
25 | 


--------------------------------------------------------------------------------
/index.asciidoc:
--------------------------------------------------------------------------------
 1 | = GITenberg documentation
 2 | Gitenberg team
 3 | v0.0.1
 4 | :toc: left
 5 | 
 6 | [WARNING]
 7 | .ALPHA level documentation
 8 | ====
 9 | This documentation is at an alpha level of quality.
10 | Please help us improve this
11 | https://github.com/gitenberg-dev/documentation[documentation on github]
12 | ====
13 | 
14 | :leveloffset: +1
15 | 
16 | include::how_to.asciidoc[]
17 | 
18 | // include::metadata[]
19 | 
20 | 
21 | include::adopting.asciidoc[]
22 | 
23 | include::filetypes.asciidoc[]
24 | 
25 | include::drop-caps.asciidoc[]
26 | 
27 | include::encoding.asciidoc[]
28 | 
29 | include::sectioning.asciidoc[]
30 | 
31 | include::Travis-CI.asciidoc[]
32 | 
33 | include::unicoding.asciidoc[]
34 | 
35 | include::versioning.asciidoc[]
36 | 
37 | [appendix]
38 | == Active Repositories
39 | 
40 | [WARNING]
41 | .Warning about this section
42 | ====
43 | This section is out of date as of 2015-07-30, see https://github.com/gitenberg-dev/Second-Folio/blob/master/Gitenberg%20Book%20List.csv[Second-Folio] for a more up to date list.
44 | ====
45 | 
46 | [format="csv", options="header"]
47 | |===
48 | include::activerepos.csv[]
49 | |===
50 | 
51 | :leveloffset: -1
52 | 


--------------------------------------------------------------------------------
/metadata/editions.asciidoc:
--------------------------------------------------------------------------------
 1 | == GITenberg metadata: Editions
 2 | 
 3 | There is a need to be able to express metadata for multiple editions in one repository. Different editions will share most metadata, but may differ in ISBN, publisher, cover, rights, etc.
 4 | 
 5 | To support this need, pandata provides the following properties: "_edition", "edition_list", "edition_note", "edition_identifiers", and any other property beginning with "edition_".
 6 | 
 7 | "_edition" provides the name for the edition, which should be a string that can be used for file naming, and identifying the edition. If not present, the first isbn is used as the edition name, or if no isbn, then the repo name is used. 
 8 | 
 9 | "edition_list" should contain a list of objects (dicts), each with an "_edition" entry and any other properties that are distinctive for the edition. These properties replace any properties specified at the top level of the pandata.
10 | 
11 | "edition_identifiers" provides an object (dict) whose contents replace corresponding entries in the identifiers property at the top level. Some identifiers, such as isbn and oclc numbers, typically identify an edition.
12 | 
13 | "edition_note" is a descriptive note about the edition. It can be used for the MARC 250 field.
14 | 
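The override rules above can be sketched in Python. This is a minimal sketch under stated assumptions, not the pandata implementation: the function name `resolve_edition` is hypothetical, the metadata is assumed to be already loaded into a plain dict, and "edition_identifiers" is assumed to appear inside each entry of "edition_list".

```python
import copy


def resolve_edition(pandata: dict, edition_name: str) -> dict:
    """Return the metadata that applies to one named edition."""
    for edition in pandata.get("edition_list", []):
        if edition.get("_edition") == edition_name:
            break
    else:
        raise KeyError(edition_name)

    # Start from the top-level properties, leaving out the edition list itself.
    resolved = {
        key: copy.deepcopy(value)
        for key, value in pandata.items()
        if key != "edition_list"
    }
    for key, value in edition.items():
        if key == "edition_identifiers":
            # Replace only the corresponding entries of the top-level identifiers.
            resolved["identifiers"] = {**resolved.get("identifiers", {}), **value}
        else:
            # Any other edition property replaces the top-level one.
            resolved[key] = copy.deepcopy(value)
    return resolved
```

The input dict is left unmodified, so the same top-level metadata can be resolved against each edition in turn.
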


--------------------------------------------------------------------------------
/adopting.asciidoc:
--------------------------------------------------------------------------------
 1 | = Help us select a starter set of PG texts!
 2 | 
 3 | Ideally, we'd like to have a hundred texts to serve as a testbed for
 4 | the GITenberg toolchain.
 5 | 
 6 | . You need a GitHub account. Sign up at https://github.com/join
 7 | 
 8 | . pick out a GITenberg repo you'd like to help out on. You can search
 9 | the repos with the search box *below* Johannes at
10 | https://github.com/GITenberg
11 | 
12 | . "star" the repo you've selected. Just click on the star at the top of the repo page.
13 | 
14 | . If you're new to GitHub, create an "issue" in gitenberg-dev:
15 | 
16 | .. copy the url of the repo you've selected
17 | .. go to https://github.com/gitenberg-dev/wiki/issues and click the
18 | green "new issue" button
19 | 
20 | .. paste the url of your repo and add any comments you'd like
21 | .. click the green "Submit New Issue" button.
22 | .. congratulations, you're now a GITenberger!
23 | .. someone will add your repo to the list and "close" your issue
24 | 
25 | NOTE: The following instructions need a lot of elaboration!
26 | 
27 | . If you want to add directly to the list:
28 | .. "fork" the gitenberg-dev repo
29 | .. add the url of the repo you've selected, along with your github
30 | username to your copy of activerepos.csv
31 | .. commit the change to your copy
32 | .. submit a pull request to gitenberg-dev/master
33 | 


--------------------------------------------------------------------------------
/how_to.asciidoc:
--------------------------------------------------------------------------------
 1 | = How To
 2 | 
 3 | This document sets out the basics of how to prepare works for GITenberg.
 4 | 
 5 | == Overview
 6 | 
 7 | The tasks that need to be done for these books are documented on the following pages:
 8 | 
 9 | * link:encoding.asciidoc[Encoding]
10 | * link:unicoding.asciidoc[Unicoding]
11 | * link:sectioning.asciidoc[Sectioning]
12 | * link:drop-caps.asciidoc[Drop Caps]
13 | * link:versioning.asciidoc[Versioning]
14 | 
15 | == Help us select a starter set of PG texts!
16 | 
17 | Ideally, we'd like to have a hundred texts to serve as a testbed for the GITenberg toolchain.
18 | 
19 | 1. You need a GitHub account. Sign up at https://github.com/join
20 | 2. pick out a GITenberg repo you'd like to help out on. You can search the repos with the search box *below* Johannes at https://github.com/GITenberg
21 | 3. "star" the repo you've selected. Just click on the star at the top of the repo page.
22 | 4. If you're new to GitHub, create an "issue" in gitenberg-dev:  
23 |     a. copy the url of the repo you've selected
24 |     b. go to https://github.com/gitenberg-dev/wiki/issues and click the green "new issue" button
25 |     c. paste the url of your repo and add any comments you'd like
26 |     d. click the green "Submit New Issue" button. 
27 |     e. congratulations, you're now a GITenberger!
28 |     f. someone will add your repo to the list and "close" your issue  
29 |        Note: The following instructions need a lot of elaboration!
30 | 5. If you want to add directly to the list:  
31 |     a. "fork" the gitenberg-dev repo
32 |     b. add the url of the repo you've selected, along with your github username to your copy of activerepos.csv
33 |     c. commit the change to your copy
34 |     d. submit a pull request to gitenberg-dev/master
35 | 
36 | 


--------------------------------------------------------------------------------
/activerepos.csv:
--------------------------------------------------------------------------------
 1 | PG ID,Title,Status,participants,Repo
 2 | 103,Around the World in 80 Days,in-progress,eshellman,https://github.com/GITenberg/Around-the-World-in-80-Days_103
 3 | 151,The Rime of the Ancient Mariner,in-progress,sethwoodworth,https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151
 4 | 201,Flatland - A Romance of Many Dimensions,awaiting PR,sethwoodworth,https://github.com/sethwoodworth/Flatland--A-Romance-of-Many-Dimensions--Illustrated-_201
 5 | 219,Heart of Darkness,awaiting review,rsperberg,https://github.com/GITenberg/Heart-of-Darkness_219
 6 | 525,"Youth, A Narrative",unicoded,rsperberg,https://github.com/GITenberg/Youth-a-Narrative_525
 7 | 1213,The Man That Corrupted Hadleyburg,available,tmtmtmtm,https://github.com/GITenberg/The-Man-That-Corrupted-Hadleyburg_1213
 8 | 2376,Up From Slavery,available,eshellman,https://github.com/GITenberg/Up-from-Slavery--An-Autobiography_2376
 9 | 2814,Dubliners,asciidoc,rdhyee,https://github.com/GITenberg/Dubliners_2814
10 | 4298,The Paying Guest,asciidoc,samwilson,https://github.com/GITenberg/The-Paying-Guest_4298
11 | 5614,Chess Strategy,available,aputtu,https://github.com/GITenberg/Chess-Strategy_5614
12 | 8121,Ghosts,available,eshellman,https://github.com/GITenberg/Ghosts_8121
13 | 10662,The Night Land,markdown,rhythmus,https://github.com/GITenberg/The-Night-Land_10662
14 | 12297,Slave Narratives- a Folk History of Slavery in the United States,available,cliotropic,https://github.com/GITenberg/Slave-Narratives--a-Folk-History-of-Slavery-in-the-United-StatesFlorida-Narratives_12297
15 | 23174,A Good For Nothing,available,NickCarneiro,https://github.com/GITenberg/A-Good-For-Nothing1876_23174
16 | 25308,Eugenics and Other Evils,available,tmtmtmtm,https://github.com/GITenberg/Eugenics-and-Other-Evils_25308
17 | 28054,The Brothers Karamazov,asciidoc epub,rdhyee,https://github.com/GITenberg/The-Brothers-Karamazov_28054
18 | 4302,Thyrza,asciidoc,samwilson,https://github.com/GITenberg/Thyrza_4302
19 | 3261,News From Nowhere,asciidoc,samwilson,https://github.com/GITenberg/News-from-Nowhere--Or-An-Epoch-of-Rest--13-Being-Some-Chapters-from-a-Utopian-Romance_3261
20 | 


--------------------------------------------------------------------------------
/sectioning.asciidoc:
--------------------------------------------------------------------------------
 1 | = Sectioning
 2 | 
 3 | The project's preferred markup format is http://asciidoc.org[Asciidoc].
 4 | Asciidoc has a lot of great features, but denoting title and chapter information is the first step in making beautiful ebooks.
 5 | 
 6 | == Titles
 7 | The title of the book should be on the first line of the asciidoc file.
 8 | It should be prefixed with a single `=`.
 9 | This turns it into a level-one header (+<h1>+ in HTML).
10 | 
11 |   = Alice in Wonderland
12 | 
13 | Subtitles are not well supported in Asciidoc,
14 | but can be included in the title separated by a colon.
15 | They will be included in works' metadata.
16 | 
17 | == Chapters, sections, and subsections
18 | Chapter headings should have two equals signs.  This turns into a level-two header (+<h2>+ tag in HTML).
19 | 
20 |   == Chapter 1
21 | 
22 |   Paragraph text &c.
23 | 
24 |   == Chapter 2
25 | 
26 | If the book is broken up into subsections smaller than chapters, you can use more equals signs:
27 | 
28 |   == Chapter 9
29 |   === Section A
30 | 
31 | If a chapter or section heading has a title as well as a number, put them on the same line, separated by a colon:
32 | 
33 |   == Chapter 5: Advice from a Caterpillar
34 | 
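The heading conventions above can be checked with a short Python sketch. It is only an illustration: the function name `parse_heading` is hypothetical, and it assumes headings follow exactly the forms shown here (one or more `=` signs, a space, then an optional `label: title` pair).

```python
import re

# One or more "=" signs, whitespace, then the heading text.
HEADING = re.compile(r"^(=+)\s+(.*)$")


def parse_heading(line: str):
    """Return (level, label, title) for a heading line, or None otherwise.

    title is None when the heading contains no colon.
    """
    match = HEADING.match(line.strip())
    if match is None:
        return None
    level = len(match.group(1))
    label, colon, title = match.group(2).partition(":")
    return level, label.strip(), (title.strip() if colon else None)
```

A book title maps to level 1 and a chapter to level 2, matching the `<h1>` and `<h2>` levels described above.
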
35 | 
36 | == Pages
37 | 
38 | Where a GITenberg book has scans of the original available,
39 | you can choose to include page-markers 
40 | (http://asciidoc.org/userguide.html#X30[anchors] in Asciidoc and http://asciidoctor.org/docs/user-manual/#anchordef[asciidoctor])
41 | to make it easier to relate the text to the original images.
42 | These are invisible in the formatted output.
43 | 
44 |   First came ten soldiers carrying clubs; these were all shaped like the three
45 |   gardeners, oblong and flat, with their hands and feet at the corners: next the
46 |   ten courtiers; these were ornamented all over with diamonds, and walked two
47 |   and [[source-page-115]] two, as the soldiers did. After these came the royal children;
48 |   there were ten of them, and the little dears came jumping merrily along hand in
49 |   hand, in couples: they were all ornamented with hearts.
50 | 
51 | The anchor should be of the form `[[source-page-x]]` where `x` is the number of the page.
52 | If a word is hyphenated across the page, put the anchor in the next available whitespace.
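
Page anchors of this form are easy to locate with a regular expression. The following is a minimal sketch; the function name `page_numbers` is hypothetical, and it assumes page numbers are written as decimal digits.

```python
import re

# Matches anchors such as [[source-page-115]] and captures the page number.
PAGE_ANCHOR = re.compile(r"\[\[source-page-(\d+)\]\]")


def page_numbers(text: str) -> list[int]:
    """Return the page numbers of all source-page anchors, in document order."""
    return [int(number) for number in PAGE_ANCHOR.findall(text)]
```

A list like this can be used to check that every scanned page has a marker and that the markers appear in increasing order.
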
53 | 


--------------------------------------------------------------------------------
/Travis-CI.asciidoc:
--------------------------------------------------------------------------------
 1 | = Travis CI and GITenberg
 2 | 
 3 | == Building Asciidoc after each commit
 4 | I started integrating GITenberg with Travis-CI this weekend.  
 5 | Travis-CI is an open source continuous integration server.  
 6 | Typically, a CI server watches for changes on github, then checks out the changed code, 
 7 | runs tests and/or tries to compile the software.  
 8 | Importantly, any change made on GitHub triggers a 'build' 
 9 | on the CI server via post-commit hooks.
10 | I've taken my asciidoc fork of 
11 | link:https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151[Rime of the Ancient Mariner], 
12 | told Travis-CI about my repo, and added this 
13 | link:https://github.com/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/blob/gh-pages/.travis.yml[Travis config file].
14 | 
15 | Now, whenever I make a commit to Rime it triggers 
16 | a build on Travis that 
17 | link:https://travis-ci.org/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/builds/55742441[looks like this].  
18 | In this case, I am installing asciidoctor 
19 | and using it to transform the Rime asciidoc file into html.  
20 | But I could just as easily build epubs with a slightly different command.
21 | 
22 | Any files generated by the travis-ci build are 
23 | automatically uploaded to the amazon file storage cloud.  
24 | The Rime html is 
25 | link:https://s3.amazonaws.com/gitenberg-build/sethwoodworth/The-Rime-of-the-Ancient-Mariner_151/7/7.1/rime.html[available here].
26 | 
27 | This is a preliminary result.  
28 | And more work needs to be done before this is ready for primetime.
29 | 
30 | 
31 | == Any other format
32 | 
33 | Asciidoc isn't the only type of file we can build this way.  
34 | I've taken the bi-lingual book that Tom mentioned, 
35 | added it to GITenberg, and forked it to my repo: 
36 | link:https://github.com/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562/[_Jesuit Relations_].  
37 | I think the PG html version isn't as awesome as the 
38 | link:https://github.com/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562/blob/master/48562-0.txt[raw text version], 
39 | which has english and french side by side.  
40 | But I attempted to build the html of Jesuit Relations into an epub with Project Gutenberg's 
41 | link:https://pypi.python.org/pypi/epubmaker/0.3.20[epubmaker].  
42 | I have epubmaker and the python requirements installed, 
43 | but I am having a pathing issue in a bash script that is causing the 
44 | link:https://travis-ci.org/sethwoodworth/The-Jesuit-Relations-and-Allied-Documents-Vol.-V--Quebec-1632-1633_48562[build to fail].
45 | 
46 | 
47 | == Conclusion
48 | Nevertheless, this has been an enlightening experiment and I am very hopeful that we can build PG html edition ebooks easily.
49 | 
50 | 
51 | This contributes to an overall point I would like to make:
52 | I like asciidoc, I think we have the best tools for asciidoc.
53 | I want to try everything and compare them.
54 | PG has ~400 books in reStructuredText (a format I have researched thoroughly and consider second only to Asciidoctor).  I would love to auto-build these books as part of the GITenberg infrastructure.
55 | 
56 | 
57 | == Last few points:
58 | There are currently tradeoffs and limitations to using Travis-ci: 
59 | 
60 | * the GITenberg organization is too large for Travis to list all of our repos (a common issue, but should be fixable)
61 | * using travis means including a .travis.yml file in every repo
62 | * we will have to enable each repo by hand on the travis-ci site (there may be an api for this)
63 | 
64 | 
65 | --Seth
66 | 


--------------------------------------------------------------------------------
/metadata/pgdata5.asciidoc:
--------------------------------------------------------------------------------
 1 | == GITenberg metadata
 2 | === Part 5. Implementation / Targets
 3 | 
 4 | Code implementing GITenberg metadata is in the https://github.com/gitenberg-dev/metadata[gitenberg-dev/metadata repo]. The generic metadata object is called https://github.com/gitenberg-dev/metadata/pandata.py[pandata], by analogy to pandoc. 
 5 | 
 6 | 
 7 | Property names are defined with the aim of bridging the chasm between Gutenberg/GITenberg and MARC and other metadata target formats. In the YAML file, properties are presented in alphabetic order to facilitate version control. 
 8 | 
 9 | We've used leading underscores to denote GITenberg-specific metadata; thus _repo denotes the string identifying the repo. A sample "pandata" file is https://github.com/gitenberg-dev/metadata/blob/master/samples/pandata.yaml[here]. (We'll use "metadata.yaml" as our standard file name when this is in a GITenberg repo.)
10 | 
11 | 
12 | ==== Bibframe/MARC
13 | 
14 | Code implementing a conversion from MARC is in the https://github.com/gitenberg-dev/metadata/marc.py[marc module].
15 | 
16 | ===== Creators and Contributors
17 | Creators and Contributors are handled with property names based on http://www.loc.gov/marc/relators/relaterm.htm[MARC relator codes], grouped under a creator and contributor top-level attribute. There are 25 types used in Project Gutenberg; we'll add "Book_producer" so we can add credits for the people who worked on producing the edition. See the link:pandata_attribute_dictionary.yaml[attribute dictionary] for details.
18 | 
19 | We choose to provide singular and plural versions of these 
20 | 
21 | It's expected that contributors will eventually be independent entities with their own metadata, linked with a url, but for now we keep everything in line with properties such as author_name and author_sortname formed by adding on to the marc relator name.
22 | 
23 | ===== Publisher
24 | 
25 | MARC regards "Project Gutenberg" as the publisher of these editions, so we're not going to rock the boat.
26 | 
27 | ===== Subjects
28 |  
29 | Here we use custom constructors to add type information to subject headings. Library of Congress headings and classes are supported.
30 | 
31 | ===== Rights
32 | 
33 | For GITenberg, we focus on standard rights statements that allow free use. The possible values are listed below, along with their defining URLs. (How do we manage versioning?)
34 | 
35 | - BY-NC-ND, Creative Commons Attribution-NonCommercial-NoDerivs, http://creativecommons.org/licenses/by-nc-nd/3.0/
36 | - BY-NC-SA, Creative Commons Attribution-NonCommercial-ShareAlike, http://creativecommons.org/licenses/by-nc-sa/3.0/
37 | - BY-NC, Creative Commons Attribution-NonCommercial, http://creativecommons.org/licenses/by-nc/3.0/
38 | - BY-ND, Creative Commons Attribution-NoDerivs, http://creativecommons.org/licenses/by-nd/3.0/ 
39 | - BY-SA, Creative Commons Attribution-ShareAlike, http://creativecommons.org/licenses/by-sa/3.0/
40 | - BY, Creative Commons Attribution, http://creativecommons.org/licenses/by/3.0/
41 | - CC0, No Rights Reserved, http://creativecommons.org/about/cc0
42 | - GFDL, GNU Free Documentation License, http://www.gnu.org/licenses/fdl-1.3-standalone.html
43 | - LAL, Licence Art Libre, http://artlibre.org/licence/lal/
44 | - PD-US, Public Domain US, http://creativecommons.org/about/pdm
45 | 
46 | ==== Covers
47 | 
48 | The 'covers' property is a list of covers. Each cover should have an 'image_path' giving its path from the root of the repo. If a cover's rights differ from those of the rest of the book, a 'rights' property is specified for the cover; otherwise rights are inherited from the book.
49 | 
50 | Covers can be of different types. To start, we define:
51 | 
52 | - archival: a scan or image of a print edition
53 | - archival-back: a scan or image of the back cover of a print edition
54 | - original: a newly designed cover image specifically designed for an ebook
55 | - generated: a cover produced automatically or semiautomatically using software.
56 | - titlepage_image: sometimes the best image for a cover is an image of the titlepage, which was often ornate.
57 | 
58 | ==== Schema.org
59 | 
60 | ==== TEI
61 | 
62 | ==== ONIX
63 | 
64 | ==== BIBO
65 | 
66 | 


--------------------------------------------------------------------------------
/README.asciidoc:
--------------------------------------------------------------------------------
  1 | = Welcome to the GITenberg documentation!
  2 | 
  3 | For more information about GITenberg, please see our https://gitenberg.github.io[website].
  4 | 
  5 | == How can you contribute?
  6 | 
  7 | We are currently working on converting and editing our first 100 books.
  8 | 
  9 | * Help us link:how_to.asciidoc[format books] into link:sectioning.asciidoc[asciidoc]
 10 | * Suggest books we should improve
 11 | * Join our announcement list
 12 | 
 13 | For now there are a few things you can do depending on your interest and
 14 | skill level. Firstly, if you find an error or typo in any of the books,
 15 | report it in the 'Issues' tab on that repo. If you would like to offer
 16 | changes: fork, edit and create a Pull Request. If you would like to make
 17 | suggestions, help in another way, or would like to get more involved,
 18 | you can join the project
 19 | https://groups.google.com/forum/#!forum/gitenberg-project[mailing list].
 20 | 
 21 | == Project Status
 22 | 
 23 | You can read the full text of the last project status report
 24 | https://groups.google.com/d/msg/gitenberg-project/i3gV2OjEeAQ/m8bC81tBhokJ[here].
 25 | 
 26 | == Documentation Index
 27 | 
 28 | * Infrastructure Documentation — Descriptions of the various systems (current and future) that run GITenberg.
 29 | ** [Github post-commit hooks] — how we know when to rebuild ebooks when changes are made to repos
 30 | ** [Book repo digestor] — Receives a book repo, decides what to do based on the content of the repo
 31 | * link:filetypes.asciidoc[Filetypes]
 32 | * link:metadata/README.asciidoc[Metadata Development] - How GITenberg describes the data in a repo.
 33 | 
 34 | == Active goals
 35 | 
 36 | Step 1 is to create a list of ebook repos we'll use as a testbed for the
 37 | GITenberg tool chain. The current list is maintained in
 38 | link:activerepos.csv[this file]. Instructions on including a new repo are
 39 | link:how_to.asciidoc[available].
 40 | 
 41 | == Open questions
 42 | 
 43 | === How to generate covers for the books?
 44 | 
 45 | === What will be the source format?
 46 | 
 47 | Discussion is still open, but asciidoc is the current best candidate.
 48 | 
 49 | === How are the repositories created?
 50 | 
 51 | === What is the reference format from PG?
 52 | 
 53 | === How to cope with punctuation?
 54 | 
 55 | There is consensus to convert ASCII punctuation, when possible, to its
 56 | more precise Unicode equivalents.
 57 | 
 58 | === What is the overall licensing scheme?
 59 | 
 60 | === What about Metadata?
 61 | 
 62 | ==== How to read metadata from PG?
 63 | 
 64 | Eric Hellman is working to understand metadata dumps provided by PG. It
 65 | is a work in progress, and preliminary results are available in
 66 | https://gist.github.com/eshellman/40d85be01acf1172a5c1[this document].
 67 | 
 68 | ==== How to archive metadata for gitenberg usage?
 69 | 
 70 | Ongoing discussion, current state
 71 | https://gist.github.com/eshellman/7a6d34c88e797b439938[here].
 72 | 
 73 | == Repositories
 74 | 
 75 | === Books
 76 | 
 77 | Every book on PG gets its own repository. All are listed on the
 78 | gitenberg organization github page, https://github.com/GITenberg/[here].
 79 | 
 80 | There's a tsv file with a full list of repo names at
 81 | https://github.com/gitenberg-dev/giten_site/blob/master/assets/GITenberg_repos_list_2.tsv
 82 | 
 83 | The main gitenberg page, which is for the moment a developer's page, is
 84 | hosted http://gitenberg.github.io/[here], and its source is
 85 | https://github.com/GITenberg/gitenberg.github.com/blob/master/index.html[here].
 86 | 
 87 | === Development
 88 | 
 89 | There is work under way in several repositories. The central point for
 90 | development is the gitenberg-dev organization, whose repos can be seen
 91 | on its GitHub https://github.com/gitenberg-dev[page].
 92 | 
 93 | There are other repositories in which some work for gitenberg happened:
 94 | https://github.com/sethwoodworth/GITenmake[gitenmake],
 95 | https://github.com/sethwoodworth/GITenberg[sethwoodworth/gitenberg]
 96 | 
 97 | == Readmes, introductions and FAQs
 98 | 
 99 | Several pieces of information are scattered around. This document is
100 | intended to replace all of them except for the main GITenberg page, which
101 | serves as first contact. All the others are being amended to point to this
102 | document.
103 | 
104 | http://gitenberg.github.io/[Main Gitenberg page]
105 | 


--------------------------------------------------------------------------------
/metadata/pgdata3.asciidoc:
--------------------------------------------------------------------------------
  1 | == GITenberg metadata
  2 | === Part 3. Other sources of metadata
  3 | 
  4 | In link:pgdata2.asciidoc[Part 2] we looked at the data already in Project Gutenberg. Now we're going to want to bring in metadata from other sources. OpenLibrary is a source of metadata with a reasonably well-designed API. It returns JSON, which can be readily converted to YAML.
  5 | 
  6 | The OpenLibrary metadata for one edition (manifestation) of Space Viking, converted to YAML, is here:
  7 | 
  8 | [source, yaml]
  9 | ----
 10 | olid:OL7526155M:
 11 |   bib_key: olid:OL7526155M
 12 |   details:
 13 |     authors:
 14 |     - key: /authors/OL42322A
 15 |       name: H. Beam Piper
 16 |     classifications: {}
 17 |     covers:
 18 |     - 284580
 19 |     created:
 20 |       type: /type/datetime
 21 |       value: '2008-04-29T13:35:46.876380'
 22 |     identifiers:
 23 |       goodreads:
 24 |       - '1440159'
 25 |       librarything:
 26 |       - '138032'
 27 |     isbn_10:
 28 |     - 0441602258
 29 |     isbn_13:
 30 |     - '9780441602254'
 31 |     key: /books/OL7526155M
 32 |     languages:
 33 |     - key: /languages/eng
 34 |     last_modified:
 35 |       type: /type/datetime
 36 |       value: '2012-06-18T00:43:42.710342'
 37 |     latest_revision: 7
 38 |     number_of_pages: 191
 39 |     publish_date: '1964'
 40 |     publishers:
 41 |     - Ace Books
 42 |     revision: 7
 43 |     series:
 44 |     - VIntage Ace SF, F-225
 45 |     title: Space Viking
 46 |     type:
 47 |       key: /type/edition
 48 |     works:
 49 |     - key: /works/OL579067W
 50 |   info_url: https://openlibrary.org/books/OL7526155M/Space_Viking
 51 |   preview: noview
 52 |   preview_url: https://openlibrary.org/books/OL7526155M/Space_Viking
 53 |   thumbnail_url: https://covers.openlibrary.org/b/id/284580-S.jpg
 54 | ----
 55 | 
 56 | Things to note about the OpenLibrary metadata:
 57 | 
 58 | * Some of what it says duplicates what PG says, but in a different format: "/languages/eng" instead of "en". The OpenLibrary value is technically a pointer to https://openlibrary.org/languages/eng, which gives the full name of the language, English, but not much else. Except for the sad fact that Aaron Swartz had to revert the name from a mistaken change to "Malayalam".
 59 | * The Goodreads and Librarything ids are very useful for linking, in both directions. Both of these services have difficulties linking to PG because their data models center around ISBN. If they could load using a local id, making links to our ebooks would be a lot easier for them. Same story for OpenLibrary itself.
 60 | * The main difficulty for use of this metadata in GITenberg is that it's edition-specific. Occasionally there might be an edition that coincides with the edition used for digitization, but that determination would have to be done by hand.
 61 | * ISBNs can be very useful to get more info. LibraryThing and OCLC have APIs to get related ISBNs, which is useful for libraries because it helps them group editions into works. This is especially useful for GITenberg because there aren't ISBNs for the editions. It's not impossible that we could get a block of ISBNs specifically for the PG text.
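
As a rough illustration of the linking points above, the sketch below pulls the cross-site identifiers out of an OpenLibrary record and checks that its ISBN-10 and ISBN-13 agree. The record is a trimmed copy of the one shown earlier. The function names and the one-entry language table are assumptions made for this example; they are not part of any GITenberg tool.

```python
# Trimmed copy of the OpenLibrary record above, as a Python dict.
record = {
    "details": {
        "identifiers": {"goodreads": ["1440159"], "librarything": ["138032"]},
        "isbn_10": ["0441602258"],
        "isbn_13": ["9780441602254"],
        "languages": [{"key": "/languages/eng"}],
    }
}

# Hypothetical lookup from OpenLibrary language keys to the codes PG uses.
OL_TO_PG_LANGUAGE = {"/languages/eng": "en"}


def isbn10_to_isbn13(isbn10):
    """Prefix 978 to the first nine digits and recompute the check digit."""
    core = "978" + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def link_ids(rec):
    """Collect the identifiers that are useful for linking to other services."""
    details = rec["details"]
    ids = {name: values[0] for name, values in details["identifiers"].items()}
    ids["isbn_13"] = details["isbn_13"][0]
    ids["language"] = OL_TO_PG_LANGUAGE[details["languages"][0]["key"]]
    return ids


ids = link_ids(record)
isbn_ok = isbn10_to_isbn13(record["details"]["isbn_10"][0]) == ids["isbn_13"]
print(ids, isbn_ok)
```

A real record may carry several values per identifier, or none at all, so production code would need to handle missing keys rather than index blindly as this sketch does.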
 62 | 
 63 | Another aggregation of metadata with a well-supported API is Google Books. You can obtain a Google Books ID with an ISBN. So, for example: https://www.googleapis.com/books/v1/volumes/j6NrcDkA2wYC
 64 | The YAML version is here: https://gist.github.com/eshellman/a4e6c3a671db98b56b07 Not much of it is useful to us, because of terms-of-service restrictions.
 65 | 
 66 | Moving into the library world, most everything in PG is cataloged in some way by Library of Congress: http://lccn.loc.gov/75000422 https://gist.github.com/eshellman/c2879061d753bcde63e1[Here are the files.]
 67 | 
 68 | The MARC-XML is what many libraries would use; the MODS is somewhat less mystifying to the uninitiated. The most useful bits here are the Dewey Decimal numbers and the subject headings.
 69 | [source,xml]
 70 | ----
 71 | <classification authority="lcc">PZ4.P666 Sp4</classification>
 72 | <classification authority="lcc">PS3566.I58</classification>
 73 | <classification authority="ddc">813/.5/4</classification>
 74 | ----
 75 | And, wouldn't you know it, it's a bit different from what's in the Gutenberg RDF, which uses PS as the top-level class, which LC has as an alternate. So which is right? A pull request to change it should include a reference, or a discussion of why it ought to be changed, so that an admin needn't research the matter.
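
To make the comparison concrete, here is a minimal sketch that reads the three classification elements quoted above and checks whether PG's class appears among the LC ones. The wrapping `<mods>` element without a namespace is an assumption added so the fragment parses on its own (real MODS records are namespaced), and the helper names are invented for this example.

```python
import xml.etree.ElementTree as ET

# The classification elements quoted above, wrapped in a bare root element.
MODS_FRAGMENT = """
<mods>
  <classification authority="lcc">PZ4.P666 Sp4</classification>
  <classification authority="lcc">PS3566.I58</classification>
  <classification authority="ddc">813/.5/4</classification>
</mods>
"""


def classifications(xml_text):
    """Group classification values by their authority attribute."""
    grouped = {}
    for element in ET.fromstring(xml_text).iter("classification"):
        grouped.setdefault(element.get("authority"), []).append(element.text)
    return grouped


def lcc_top_level(call_number):
    """Return the leading letters of an LCC call number, e.g. 'PS' for 'PS3566.I58'."""
    letters = ""
    for char in call_number:
        if not char.isalpha():
            break
        letters += char
    return letters


found = classifications(MODS_FRAGMENT)
lc_classes = [lcc_top_level(value) for value in found["lcc"]]
pg_class = "PS"  # the class given in the Gutenberg RDF
print(lc_classes, pg_class in lc_classes)
```

A check like this only shows that the two sources overlap; deciding which class should be primary still needs the reference or discussion asked for above.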
 76 | 
 77 | Finally, Wikipedia has a lot of good data in its "infoboxes". The infobox for _Space Viking_ includes info on the series the book is part of, as well as links (denoted by square brackets) to genre, and author. The image field provides a cover.
 78 | 
 79 | [source]
 80 | ----
 81 | {{Infobox book 
 82 | | name          = Space Viking
 83 | | title_orig    = 
 84 | | translator    = 
 85 | | image         = Image:Space Viking (pb cover 02ba).jpg
 86 | | caption       = First edition
 87 | | author        = [[H. Beam Piper]]
 88 | | illustrator   = 
 89 | | cover_artist  = 
 90 | | country       = United States
 91 | | language      = English
 92 | | series        = The Tanith Series
 93 | | genre         = [[Science fiction novel]]
 94 | | publisher     = [[Ace Books]]
 95 | | release_date  = [[1963 in literature|1963]]
 96 | | media_type    = Print ([[Paperback]])
 97 | | pages         = 191
 98 | | isbn          = 
 99 | | preceded_by   = 
100 | | followed_by   = Prince of Tanith
101 | }}
102 | ----
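
The infobox is regular enough that a few lines of code can turn it into key/value pairs. The following is only a sketch: it assumes one `| key = value` pair per line, as in the excerpt, and would not cope with nested templates or multi-line values. `parse_infobox` is a name made up for this example.

```python
import re

# A shortened copy of the infobox above.
INFOBOX = """{{Infobox book
| name          = Space Viking
| author        = [[H. Beam Piper]]
| series        = The Tanith Series
| genre         = [[Science fiction novel]]
| publisher     = [[Ace Books]]
| release_date  = [[1963 in literature|1963]]
| pages         = 191
| isbn          =
}}"""

# Matches [[Target]] or [[Target|Label]] and captures the visible text.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_infobox(wikitext):
    """Return the non-empty fields of a simple, one-pair-per-line infobox."""
    fields = {}
    for line in wikitext.splitlines():
        if not line.startswith("|") or "=" not in line:
            continue
        key, _, value = line[1:].partition("=")
        value = WIKILINK.sub(r"\1", value).strip()
        if value:
            fields[key.strip()] = value
    return fields


fields = parse_infobox(INFOBOX)
print(fields)
```

Dropping the link brackets loses the link targets, which are themselves useful metadata; a fuller version would keep both the label and the target.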
103 | 
104 | There's plenty of data in various places that could be added to a GITenberg record.
105 | 
106 | The link:pgdata4.asciidoc[next chapter] will discuss metadata that is needed to make GITenberg work for us all, but doesn't exist in the available metadata.


--------------------------------------------------------------------------------
/metadata/pgdata4.asciidoc:
--------------------------------------------------------------------------------
 1 | == GITenberg metadata
 2 | === Part 4. Metadata that's needed, but missing. Also, covers.
 3 | 
 4 | In link:pgdata3.asciidoc[Part 3], I looked at the metadata that's available via API from other metadata sources. But there's a bunch of metadata internal to GITenberg (and to a lesser extent, Project Gutenberg) that should probably be included in a GITenberg metadata file.
 5 | 
 6 | ==== Housekeeping data
 7 | 
 8 | * The repo URL. For Space Viking, that would be https://github.com/GITenberg/Space-Viking_20728. Unfortunately, because of Unicode and OS-level issues, it's not as simple as you might think to derive the URL from other metadata. And maybe for portability, we should just keep "Space-Viking_20728". This has been implemented as "_repo".
 9 | * Download URLs. Well, maybe not. It probably makes more sense to have a resolver service separate from the repo.
10 | * Version info. Again, maybe not. It probably makes sense to use GitHub's versioning to keep track of this. On the other hand, downstream sites will need to know this stuff. But a MARC record builder could grab version info directly from GitHub rather than from a version-controlled metadata file.
11 | * Process/toolchain info. This requirement is best understood with an example. We have a tool that changes ASCII quotes into typographic quotes. It might not be obvious from inspection of a file that this process has been completed. One way to record this would be to add something like "quotification_completed: March 23, 2015" to the metadata file. I'm not sure if this would be the best way to do this.
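
To show why deriving the repo name from other metadata is harder than it looks, here is a deliberately naive sketch. The rule it applies (fold to ASCII, hyphenate, append the PG number) is an assumption chosen to reproduce "Space-Viking_20728"; it is not the rule GITenberg actually uses, and the names it produces for other titles may not match the real repos.

```python
import re
import unicodedata


def naive_repo_name(title, pg_id):
    """Build a repo-style name such as 'Space-Viking_20728' (illustrative rule only)."""
    # Fold accented characters to ASCII. Characters with no ASCII form are
    # dropped, which is exactly the kind of lossy step that causes trouble.
    folded = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", folded).strip("-")
    return f"{slug}_{pg_id}"


print(naive_repo_name("Space Viking", 20728))
print(naive_repo_name("Les Misérables", 135))
# A title with no ASCII form loses everything but the number
# (the id 0 is a placeholder, not a real PG number).
print(naive_repo_name("論語", 0))
```

The last call collapses to a bare "_0", which is one reason to store the repo name explicitly in "_repo" instead of recomputing it.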
12 | 
13 | ==== Cover data
14 | 
15 | This could probably be a post on its own, but covers, and to a lesser extent, illustrations, present a set of new issues when combined with a version control system. The traditional view in the book world is that a book may have many editions (or manifestations), but each edition has a single cover associated with it. For public domain books, it's common for tens, hundreds, or even thousands of editions to exist, each with its own cover. The text inside might be the text from Gutenberg, perhaps with some editorial enhancement. If GITenberg succeeds, this will no doubt continue, but with improved quality all around.
16 | 
17 | What's changed with public domain digital books is that there's no need for a "manifestation" to fix a single cover. For some purposes, a scan of the original printed cover might be the best choice for a cover image. For other purposes, a branded, generated cover might do a better job of communicating to a potential reader what's "behind" the cover. Projects such as http://shop.thecreativeactionnetwork.com/collections/recovering-the-classics[Recovering the Classics] are creating new composite works by letting artists and designers breathe new life into covers for old books. And it's simple to swap in an alternate cover on-demand during a digital download.
18 | 
19 | The consensus in the GITenberg core team has been that we need to make room for different types of covers. Each cover included in a repo will need its own metadata: license, cover type, attribution, and perhaps other things. And every book needs to have available an attractive public domain or CC0 cover that works well in a digital display.
20 | 
21 | Because covers are used to identify editions in the marketplace, it's appropriate, in the GITenberg context, to assign ISBNs to editions on a cover-by-cover basis. Since GITenberg will offer multiple formats and won't need to track them, if we use ISBNs, we won't use a different one for each format, but instead fix the ISBN to the cover. In the http://www.bic-media.com/dmrn/codelists/onix-codelist-10.htm[ONIX file type code list], we would use '000', the Epublication "content package".
22 | 
23 | ==== Source data
24 | 
25 | Most PG texts I've seen have some indicators of the production process. For example, in the text of PG's Space Viking there is:
26 | 
27 | [source]
28 | ----
29 | Produced by Greg Weeks, William Woods and the Online
30 | Distributed Proofreading Team at http://www.pgdp.net
31 | 
32 | [Transcriber's note:
33 | This etext was produced from Analog Science Fact--Science Fiction
34 | November 1962, December 1962, January 1963, February 1963.
35 | Extensive research did not uncover any evidence that the copyright
36 | on this publication was renewed.]
37 | ----
38 | These notes and credits should be moved to the metadata, and the ebook builder should present these notes appropriately. And bibliographic data for the original edition should get recorded, if available. I can't imagine that there isn't a WorldCat record for 99% of the source editions in PG; these ought to be identified and added somehow.
39 | 
40 | * Much better provenance, including links to DP projects, scanned source files, Internet Archive mirrors, etc., would be useful metadata to add (from Tom Morris)
41 | 
42 | ==== Quality data
43 | 
44 | A library, bookseller or other sort of distributor should be able to select based on the quality of the ebooks available. A set of objective quality criteria would be useful. Here are some examples:
45 | 
46 | * passes epubcheck
47 | * includes a clickable toc/index
48 | * includes semantic markup
49 | 
50 | ==== Relations within PG
51 | 
52 | (from Tom Morris)
53 | 
54 | * PG has multiple editions of many works and often the later ones are of higher quality than the older "editionless" editions, yet the earlier ones get downloaded way more.  Enhancing the bibliographic information to help with this issue would be useful to readers.  For example, this early editionless Pride & Prejudice http://www.gutenberg.org/ebooks/1342#bibrec is downloaded over 30 times more often than this later high quality transcription of the 1932 R.W. Chapman edition http://www.gutenberg.org/ebooks/42671#bibrec donated by Distributed Proofreaders. #76 & #32325 are another example.  It would be good to be able to link the various editions together.
55 | 
56 | ==== External IDs
57 | 
58 | In prior chapters we talked about external sources of metadata. We should include ids for:
59 | 
60 | * OpenLibrary
61 | * OCLC
62 | * Google Books
63 | * Wikipedia
64 | * Goodreads
65 | * LibraryThing
66 | * Unglue.it
67 | * ManyBooks
68 | * FeedBooks
69 | 
70 | ==== More
71 | 
72 | I'm sure there are types of metadata I haven't thought of! Suggestions and comments are both welcome and needed.
73 | 
74 | In the link:pgdata5.asciidoc[next chapter], I'll discuss target vocabularies for the metadata.
75 | 


--------------------------------------------------------------------------------
/metadata/pgdata2.asciidoc:
--------------------------------------------------------------------------------
 1 | == GITenberg metadata
 2 | === Part 2. Combing through the Gutenberg metadata
 3 | 
 4 | To recap link:pgdata.asciidoc[Part 1], the Project Gutenberg metadata boils down to the following, expressed in YAML:
 5 | 
 6 | [source,yaml]
 7 | ----
 8 | # Project Gutenberg Metadata
 9 | pgterms:ebook: 
10 |     url: http://www.gutenberg.org/ebooks/20728
11 |     marcrel:ill: 
12 |     -   pgterms:agent: http://www.gutenberg.org/2009/agents/25396
13 |         pgterms:alias:
14 |           - Schoenherr, John (John Carl)
15 |           - Schoenherr, Jack
16 |         pgterms:birthdate: 1935
17 |         pgterms:deathdate: 2010
18 |         pgterms:name: Schoenherr, John
19 |         pgterms:webpage: 
20 |         -   url: http://en.wikipedia.org/wiki/John_Schoenherr
21 |             dcterms:description: Wikipedia
22 |     
23 |     dcterms:creator: 
24 |     -   pgterms:agent: http://www.gutenberg.org/2009/agents/8301
25 |         pgterms:alias: Piper, Henry Beam
26 |         pgterms:birthdate: 1904
27 |         pgterms:deathdate: 1964
28 |         pgterms:name: Piper, H. Beam
29 |         pgterms:webpage: 
30 |         -   url: http://en.wikipedia.org/wiki/H._Beam_Piper
31 |             dcterms:description: Wikipedia
32 | 
33 |     dcterms:issued: '2007-03-03'
34 |     dcterms:language: !dcterms:RFC4646 en
35 |     dcterms:license: http://www.gutenberg.org/license
36 |     dcterms:publisher: Project Gutenberg
37 |     dcterms:rights: Public domain in the USA.
38 |     dcterms:subject:
39 |     - !dcterms:LCSH Space warfare -- Fiction
40 |     - !dcterms:LCSH Revenge -- Fiction
41 |     - !dcterms:LCC PS
42 |     - !dcterms:LCSH Science fiction
43 |     dcterms:title: Space Viking
44 |     dcterms:type: !dcterms:DCMIType Text
45 |     pgterms:downloads: 299
46 | ----
47 | 
48 | From the top.
49 | 
50 | * "pgterms" is a vocabulary specific to Project Gutenberg. The url for this vocabulary given in the RDF, http://www.gutenberg.org/2009/pgterms/ is a 404. 
51 | * "pgterms:agent" refers to creator entities. http://www.gutenberg.org/2009/agents/8301 is the identifier given to H. Beam Piper; that's also a 404, but you might want to look at http://www.gutenberg.org/ebooks/author/8301 or http://www.gutenberg.org/ebooks/author/25396
52 | ** In a relational database, the metadata for authors and illustrators would be represented with a foreign key to an agent table.
53 | ** For GITenberg, it makes sense to maintain author metadata separately, on the theory that when an author dies, you shouldn't have to update the metadata for every single book the author has written.
54 | ** It also makes sense to link to "authority files" for agent metadata. So, for example, we could enter http://viaf.org/viaf/81055793/ into the H. Beam Piper agent field and pull back the other agent metadata as needed.
55 | ** The PG corpus has a number of duplicate entries for authors, despite the nominally canonical IDs.
56 | +
57 | Tom Morris did a quick recheck of this. Here is a sample of some of his corrections after using the clustering facility of OpenRefine. The (nominally) corrected version is in the first column.
58 | +
59 | |====
60 | |American Sunday-School Union |Union, American Sunday-School
61 | |Bakunin, Mikhail Aleksandrovic |Bakunin, Mikhail Aleksandrovich
62 | |Barine, Arvède |Barine, Arvede
63 | |Ditchfield, P. H. (Peter Hampson) |Ditchfield, P. H. Peter Hampson)
64 | |Gerhard, J. W. |Gerhard, J.W.
65 | |Haapanen-Tallgren, Tyyni |Haapanen-Tallgren, Tyyne
66 | |Knatchbull-Hugesson, Edward Hugessen |Knatchbull-Hugessen, Edward Hugessen
67 | |La Monte, Robert Rives |Monte, Robert Rives la
68 | |Levett-Yeats, S. (Sidney) |Levett Yeats, S. (Sidney)
69 | |Library of Congress. Copyright Office |Copyright Office. Library of Congress.
70 | |====
71 | +
72 | On the plus side, there are only a couple dozen of these records in the 20k+ authors, so it's a pretty small problem, but it does indicate that the PG author records can't be relied upon to be unique.
73 | 
74 | * The agents in the PG metadata use "relations". For illustrators, the relation used is "marcrel:ill", which comes from MARC's relators: http://www.loc.gov/marc/relators/relaterm.html, while for authors the dcterms:creator (Dublin Core) relation is used. MARC has the "aut" relator, which means the same thing.
75 | * dcterms:issued: and dcterms:publisher: refer to Project Gutenberg's publication of the ebook, not to the publication of the print original. Surprisingly, the metadata makes no attempt to identify or refer to the original print edition it's made from. When GITenberg starts making versions of the ebook, what should it be saying in these fields, and should it even be trying?
76 | * dcterms:subject: refers to Library of Congress subject headings and class codes. As Alex points out in the Google Group, the values aren't URIs, they're text. We should be able to do better at normalization.
77 | * pgterms:downloads: the downloads number refers to a prior week. There's no date context for this number, so it doesn't seem very useful. I told you RDF wasn't very good at representing dynamic state, and here's a good example. You can _do_ it, but it's more work than you really want.
78 | * dcterms:type: !dcterms:DCMIType Text. Apparently this is either audio or text. A verbose bit.
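
Returning to the duplicate author entries in the table above: many of them can be found with a simple key-collision approach. The sketch below is loosely modelled on the "fingerprint" idea behind OpenRefine's clustering, but it is not OpenRefine's exact algorithm, and the function names are invented for this example. The names are taken from the table, plus one non-duplicate for contrast.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Build a key that ignores case, accents, punctuation and word order."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    tokens = re.split(r"[^a-z0-9]+", folded.lower())
    return " ".join(sorted(set(token for token in tokens if token)))


def cluster(names):
    """Group names that share a fingerprint; keep only groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = [
    "La Monte, Robert Rives",
    "Monte, Robert Rives la",
    "Barine, Arvède",
    "Barine, Arvede",
    "Gerhard, J. W.",
    "Gerhard, J.W.",
    "Piper, H. Beam",
]
for group in cluster(names):
    print(group)
```

This kind of key catches reordering, accents and spacing, but not spelling variants such as "Aleksandrovic" versus "Aleksandrovich"; those need a fuzzier comparison and a human to confirm the match.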
79 | 
80 | Conspicuous by its absence is a dcterms:description element. To see how representative _Space Viking_ is, Raymond did a predicate analysis of the entire PG RDF corpus. It's here: https://gist.github.com/rdhyee/8f84142f808d36796fa3
81 | 
82 | You can see that the file manifests take up a big chunk of the metadata as there are 654,523 files in all.
83 | 
84 | Apparently 37,199 of the 48,538 ebooks have descriptions. Only 10 of them don't have titles. 3127 of them include a table of contents in the metadata. There are plenty more relators- editors and translators head the list.
85 | 
86 | The other category of metadata is taken up by a bunch of MARC-related fields, the most common being 9,238 appearances of marc901 and 3,067 appearances of marc010. MARC 901 is bizarre because it's a local data field, used by libraries for strictly local purposes. MARC 010 is the Library of Congress Control Number, or LCCN. Other information in MARC fields includes some publication info, uniform title, series info, production credits, and some edition notes.
87 | 
88 | In the link:pgdata3.asciidoc[next chapter], I'll look at other metadata sources that we could bring into the GITenberg metadata, including data from library catalogs.
89 | 


--------------------------------------------------------------------------------
/metadata/pgdata.asciidoc:
--------------------------------------------------------------------------------
  1 | == GITenberg metadata
  2 | === Part 1. Boiling down the Gutenberg RDF
  3 | 
  4 | One of the objectives of GITenberg is to provide a github-flavored pathway for the improvement of the metadata for Project Gutenberg ebooks. This runs in two directions:
  5 | . Improving the accessibility and usability of PG metadata
  6 | . Improving the quality and completeness of PG metadata
  7 | 
  8 | The first step in this effort is to figure out what metadata already exists in Project Gutenberg.
  9 | 
 10 | Project Gutenberg provides periodic dumps of its metadata in a set of RDF files. These are the metadata used to make the "bibrec" pages and also to make ebook files (an EPUB package, for example, stores this metadata in its "OPF" file). The dump consists of a zipped tarfile containing one RDF file per PG text. In the second tranche of repos moved to GitHub (roughly those above 10,000), Seth added the RDF file for each text to the corresponding archive. This saves us from having to deal with opening the archive and letting our operating systems deal with 50,000 directories in a directory.
 11 | 
 12 | The RDF file corresponding to #20728, "Space Viking", is here: https://gist.github.com/eshellman/90f1996b33e20e069040
 13 | 
 14 | It starts out:
 15 | [source,xml]
 16 | ----
 17 | <?xml version="1.0" encoding="utf-8"?>
 18 | <rdf:RDF xml:base="http://www.gutenberg.org/"
 19 |   xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
 20 |   xmlns:cc="http://web.resource.org/cc/"
 21 |   xmlns:dcam="http://purl.org/dc/dcam/"
 22 |   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 23 |   xmlns:marcrel="http://id.loc.gov/vocabulary/relators/"
 24 |   xmlns:dcterms="http://purl.org/dc/terms/"
 25 | >
 26 |   <cc:Work rdf:about="">
 27 |     <cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
 28 |   </cc:Work>
 29 |   <pgterms:ebook rdf:about="ebooks/20728">
 30 |     <dcterms:language>
 31 |       <rdf:Description rdf:nodeID="N724c8e78b9c84f9598157b0dd7c24cfb">
 32 |         <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
 33 |       </rdf:Description>
 34 |     </dcterms:language>
 35 | ----
 36 | 
 37 | So far, the translation of this into English is:
 38 | 
 39 | ==========================
 40 | The resource identified by http://www.gutenberg.org/ has a GPL License.
 41 | The resource identified by http://www.gutenberg.org/ebooks/20728 is an ebook in English.
 42 | ==========================
 43 | The goal of GITenberg is to present metadata in a way that humans can read and edit, and git can track changes. RDF is a poor answer to these requirements, at least in this xml serialization. 
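
For anyone who wants to check the second line of that translation mechanically, here is a minimal sketch using Python's standard XML parser. The fragment is a trimmed copy of the excerpt above with closing tags added so that it parses on its own; the real file continues well beyond this point. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Trimmed copy of the RDF excerpt, closed off so it is well-formed.
RDF_FRAGMENT = """<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/"
>
  <pgterms:ebook rdf:about="ebooks/20728">
    <dcterms:language>
      <rdf:Description rdf:nodeID="N724c8e78b9c84f9598157b0dd7c24cfb">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
      </rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>
"""

NS = {
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

root = ET.fromstring(RDF_FRAGMENT)
ebook = root.find("pgterms:ebook", NS)
# The rdf:about value is relative, so resolve it against xml:base.
ebook_url = urljoin(root.get(XML_BASE), ebook.get(RDF_ABOUT))
language = ebook.find("dcterms:language/rdf:Description/rdf:value", NS).text
print(f"The resource identified by {ebook_url} is an ebook in language '{language}'.")
```

Note how much namespace bookkeeping is needed just to read one value; that overhead is part of why this serialization is a poor fit for hand editing.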
 44 | 
 45 | We can take a step towards making this more usable by using the JSON-LD serialization for the RDF. I re-serialized the RDF XML file for Space Viking using the RDF translator at http://rdf-translator.appspot.com/. The result is available here: https://gist.github.com/eshellman/140258d9958c97580b42
 46 | 
 47 | It starts out:
 48 | [source,json]
 49 | ----
 50 | {
 51 |   "@context": {
 52 |     "cc": "http://web.resource.org/cc/",
 53 |     "dcam": "http://purl.org/dc/dcam/",
 54 |     "dcterms": "http://purl.org/dc/terms/",
 55 |     "marcrel": "http://id.loc.gov/vocabulary/relators/",
 56 |     "pgterms": "http://www.gutenberg.org/2009/pgterms/",
 57 |     "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
 58 |     "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
 59 |     "xsd": "http://www.w3.org/2001/XMLSchema#"
 60 |   },
 61 |   "@graph": [
 62 |     {
 63 |       "@id": "_:N2a40c4f7b6b8445b86d117253d79e17f",
 64 |       "dcam:memberOf": {
 65 |         "@id": "dcterms:IMT"
 66 |       },
 67 |       "rdf:value": {
 68 |         "@type": "dcterms:IMT",
 69 |         "@value": "application/x-mobipocket-ebook"
 70 |       }
 71 |     },
 72 | ----
 73 | 
 74 | The translation of this is:
 75 | 
 76 | ==========================
 77 | There is a mime-type with value "application/x-mobipocket-ebook"
 78 | ==========================
 79 | This is significantly better for use with git, but not much of an improvement for readability. The biggest difficulty for both these serializations is the explicit "blank nodes" with IDs that look like "_:N2a40c4f7b6b8445b86d117253d79e17f". Blank nodes are what the RDF model uses to convert hierarchical data structures into super-simple data triples. RDF processors easily ingest the triples, but humans find them inscrutable. There are also some deep problems with blank nodes, which I've blogged about: http://go-to-hellman.blogspot.com/2009/11/blank-node-bother-and-rdf-copymess.html
 80 | 
 81 | Fortunately, you can remove the blank nodes without changing the representation a bit. I wrote a little script to do that. https://gist.github.com/eshellman/b717d6b1b49498140218
 82 | 
 83 | The result is here: https://gist.github.com/eshellman/b8fbd310c5ec2d77d949
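
The script linked above is the one actually used. Purely as an illustration of the idea, and not a copy of that script, the sketch below inlines blank nodes in a tiny JSON-LD graph. The blank-node id `_:format0` is a placeholder made up for this example, and the code assumes blank nodes do not refer to each other in a cycle.

```python
import json

graph = [
    {
        "@id": "_:format0",  # placeholder blank-node id, invented for this sketch
        "dcam:memberOf": {"@id": "dcterms:IMT"},
        "rdf:value": {"@type": "dcterms:IMT", "@value": "text/plain; charset=us-ascii"},
    },
    {
        "@id": "http://www.gutenberg.org/files/20728/20728.txt",
        "@type": "pgterms:file",
        "dcterms:format": {"@id": "_:format0"},
    },
]


def inline_blank_nodes(nodes):
    """Replace references to blank nodes with the content of those nodes."""
    blank = {n["@id"]: n for n in nodes if n.get("@id", "").startswith("_:")}

    def resolve(value):
        if isinstance(value, list):
            return [resolve(item) for item in value]
        if isinstance(value, dict):
            ref = value.get("@id")
            if len(value) == 1 and ref in blank:
                # A bare reference: substitute the node's content, minus its id.
                value = {k: v for k, v in blank[ref].items() if k != "@id"}
            return {k: resolve(v) for k, v in value.items()}
        return value

    # Blank nodes no longer need their own top-level entries.
    return [resolve(n) for n in nodes if n.get("@id") not in blank]


flat = inline_blank_nodes(graph)
print(json.dumps(flat, indent=2))
```

The output has the same nested shape as the excerpt that follows: the format description sits inside the file node instead of being a separate entry with an opaque id.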
 84 | 
 85 | It starts:
 86 | [source,json]
 87 | ----
 88 | {
 89 |   "@context": {
 90 |     "cc": "http://web.resource.org/cc/",
 91 |     "dcam": "http://purl.org/dc/dcam/",
 92 |     "dcterms": "http://purl.org/dc/terms/",
 93 |     "marcrel": "http://id.loc.gov/vocabulary/relators/",
 94 |     "pgterms": "http://www.gutenberg.org/2009/pgterms/",
 95 |     "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
 96 |     "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
 97 |     "xsd": "http://www.w3.org/2001/XMLSchema#"
 98 |   },
 99 |   "@graph": [
100 |     {
101 |       "@id": "http://www.gutenberg.org/files/20728/20728.txt",
102 |       "@type": "pgterms:file",
103 |       "dcterms:extent": 414640,
104 |       "dcterms:format": {
105 |         "dcam:memberOf": {
106 |           "@id": "dcterms:IMT"
107 |         },
108 |         "rdf:value": {
109 |           "@type": "dcterms:IMT",
110 |           "@value": "text/plain; charset=us-ascii"
111 |         }
112 |       },
113 |       "dcterms:isFormatOf": {
114 |         "@id": "http://www.gutenberg.org/ebooks/20728"
115 |       },
116 |       "dcterms:modified": {
117 |         "@type": "xsd:dateTime",
118 |         "@value": "2012-07-02T13:51:54"
119 |       }
120 |     },
121 | ----
122 | 
123 | As you can see, it's starting to be slightly comprehensible.
124 | 
125 | Before I translate this, I'll go one more step, converting the JSON to YAML: https://gist.github.com/eshellman/0577a24406be521dfe78
126 | 
127 | YAML is a data serialization format designed to be reader-friendly. It pretty much nails our application; it's also used in applications such as Pandoc, the open-source Swiss Army knife of document converters.
128 | Take a look at it on GitHub; you'll see that GitHub does a lovely job presenting the YAML file.
129 | 
130 | Now we have, to start:
131 | [source,yaml]
132 | ----
133 | '@context':
134 |   cc: http://web.resource.org/cc/
135 |   dcam: http://purl.org/dc/dcam/
136 |   dcterms: http://purl.org/dc/terms/
137 |   marcrel: http://id.loc.gov/vocabulary/relators/
138 |   pgterms: http://www.gutenberg.org/2009/pgterms/
139 |   rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
140 |   rdfs: http://www.w3.org/2000/01/rdf-schema#
141 |   xsd: http://www.w3.org/2001/XMLSchema#
142 | '@graph':
143 | - '@id': http://www.gutenberg.org/files/20728/20728.txt
144 |   '@type': pgterms:file
145 |   dcterms:extent: 414640
146 |   dcterms:format:
147 |     dcam:memberOf:
148 |       '@id': dcterms:IMT
149 |     rdf:value:
150 |       '@type': dcterms:IMT
151 |       '@value': text/plain; charset=us-ascii
152 |   dcterms:isFormatOf:
153 |     '@id': http://www.gutenberg.org/ebooks/20728
154 |   dcterms:modified:
155 |     '@type': xsd:dateTime
156 |     '@value': '2012-07-02T13:51:54'
157 | ----
158 | This says:
159 | 
160 | ==========================
161 |     There is a file with id http://www.gutenberg.org/files/20728/20728.txt
162 |     which has size 414640
163 |     and mime-type text/plain; charset=us-ascii
164 |     which is a format of http://www.gutenberg.org/ebooks/20728
165 |     and was last modified on '2012-07-02T13:51:54'
166 | ==========================
167 | and if you are a metadata-head, it's reasonably straightforward to understand.
168 | 
169 | Now that we know what the metadata is saying, the obvious question is: _Why on earth is PG reporting file manifests in RDF???_
170 | 
171 | I think it's appropriate to forgive the very smart people who decided to put a file manifest in RDF, because it was a very different time and who could have foreseen back then which way technology would go.
172 | 
173 | RDF's strength is moving data from silo to silo. One of its weaknesses is representing dynamic state. Today we have file systems in the cloud that can report a file's size,  minute by minute. Trying to mirror the information in a file system is a set of bugs waiting to happen. 
174 | 
175 | Our next step is to toss out all that metadata about files so we can focus on the content metadata.
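
In code, tossing out the file metadata amounts to filtering the graph by node type. A minimal sketch, assuming the graph has already had its blank nodes inlined and that file entries are typed `pgterms:file` as in the excerpt above; the two-node graph here is a cut-down stand-in for the real one.

```python
graph = [
    {
        "@id": "http://www.gutenberg.org/files/20728/20728.txt",
        "@type": "pgterms:file",
        "dcterms:extent": 414640,
    },
    {
        "@id": "http://www.gutenberg.org/ebooks/20728",
        "@type": "pgterms:ebook",
        "dcterms:title": "Space Viking",
    },
]


def drop_file_nodes(nodes):
    """Keep every node except those typed as pgterms:file."""
    kept = []
    for node in nodes:
        types = node.get("@type", [])
        # JSON-LD allows @type to be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        if "pgterms:file" not in types:
            kept.append(node)
    return kept


content_only = drop_file_nodes(graph)
print([node["@id"] for node in content_only])
```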
176 | 
177 | Also on the chopping block is the assertion about http://www.gutenberg.org/ being GPL. First of all, I think the assertion is *trying* to say that the RDF file itself is GPL, because why would you put an assertion about PG as a whole in every RDF file? Even if it's trying to make an assertion about the file itself, it doesn't make much sense. (Metadata is not generally copyrightable, so GPL has no force.)
178 | 
179 | In addition, I swap in YAML's value typing machinery (see the !values below) for RDF's somewhat clumsy machinery. I factor out the context, as it's machine oriented, giving precise meaning to the prefixes used, and it'll be the same for every repo. The graph id, needed for RDF technicalities, does more harm than good in the YAML because GitHub gives us all the id machinery we need.
180 | 
181 | We're left with what I call simple PG YAML: https://gist.github.com/eshellman/2863fc5ffb129714f617
182 | 
183 | It's boiled down enough that I can quote it here in its entirety:
184 | 
185 | [source,yaml]
186 | ----
187 | # Project Gutenberg Metadata
188 | pgterms:ebook: 
189 |     url: http://www.gutenberg.org/ebooks/20728
190 |     marcrel:ill: 
191 |     -   pgterms:agent: http://www.gutenberg.org/2009/agents/25396
192 |         pgterms:alias:
193 |           - Schoenherr, John (John Carl)
194 |           - Schoenherr, Jack
195 |         pgterms:birthdate: 1935
196 |         pgterms:deathdate: 2010
197 |         pgterms:name: Schoenherr, John
198 |         pgterms:webpage: 
199 |         -   url: http://en.wikipedia.org/wiki/John_Schoenherr
200 |             dcterms:description: Wikipedia
201 |     
202 |     dcterms:creator: 
203 |     -   pgterms:agent: http://www.gutenberg.org/2009/agents/8301
204 |         pgterms:alias: Piper, Henry Beam
205 |         pgterms:birthdate: 1904
206 |         pgterms:deathdate: 1964
207 |         pgterms:name: Piper, H. Beam
208 |         pgterms:webpage: 
209 |         -   url: http://en.wikipedia.org/wiki/H._Beam_Piper
210 |             dcterms:description: Wikipedia
211 | 
212 |     dcterms:issued: '2007-03-03'
213 |     dcterms:language: !dcterms:RFC4646 en
214 |     dcterms:license: http://www.gutenberg.org/license
215 |     dcterms:publisher: Project Gutenberg
216 |     dcterms:rights: Public domain in the USA.
217 |     dcterms:subject:
218 |     - !dcterms:LCSH Space warfare -- Fiction
219 |     - !dcterms:LCSH Revenge -- Fiction
220 |     - !dcterms:LCC PS
221 |     - !dcterms:LCSH Science fiction
222 |     dcterms:title: Space Viking
223 |     dcterms:type: !dcterms:DCMIType Text
224 |     pgterms:downloads: 299
225 | ----
226 | 
227 | I think I'll stop here, because link:pgdata2.asciidoc[the next chapter] will be about the metadata rather than how it might be presented, which is a different topic. 
228 | 


--------------------------------------------------------------------------------
/metadata/pandata_attribute_dictionary.yaml:
--------------------------------------------------------------------------------
  1 | 
  2 | # pandata design notes
  3 | # "objects" are attribute sets. 
  4 | # plural names for plural creators used because it seemed more natural to have author and authors attributes, for example.
  5 | # it's a mistake to have both.
  6 | 
  7 | _edition: Provides the name for the edition, which should be a string that can be used for file naming and for identifying the edition.
  8 | _repo: the GITenberg repo name. 
  9 | _version: the version string. Should come from the repo tag.
 10 | agent_name: the name of the creator/contributor, for use in creator/contributor objects
 11 | agent_sortname: the agent_name, arranged for sorting, typically last name first
 12 | alias: A name or list of names for a creator/contributor
 13 | alternative_title: An alternative title.
 14 | attribution_url: for Creative Commons licenses, the url for attribution
 15 | birthdate: Value should be a date. For creator/contributor objects.
 16 | contributor: From Dublin Core, a contributor who is not the principal creator.  Value is an object (dict) with one or more contributors. In principle, could be any role from https://www.loc.gov/marc/relators/relaterm.html
 17 |     adapter: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code adp
 18 |     adapters: A type of creators/contributors. Value should be a list of adapters.
 19 |     annotator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ann
 20 |     annotators: A type of creators/contributors. Value should be a list of annotators.
 21 |     arranger: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code arr
 22 |     arrangers: A type of creators/contributors. Value should be a list of arrangers.
 23 |     artist: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code art
 24 |     artists: A type of creators/contributors. Value should be a list of artists.
 25 |     author_of_afterword: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aft
 26 |     author_of_afterwords: A type of creators/contributors. Value should be a list of author_of_afterwords.
 27 |     author_of_introduction: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aui
 28 |     author_of_introductions: A type of creators/contributors. Value should be a list of author_of_introductions.
 29 |     book_producer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code bkp
 30 |     book_producers: A type of creators/contributors. Value should be a list of book_producers.
 31 |     collaborator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code clb
 32 |     collaborators: A type of creators/contributors. Value should be a list of collaborators.
 33 |     commentator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code cmm
 34 |     commentators: A type of creators/contributors. Value should be a list of commentators.
 35 |     compiler: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code com
 36 |     compilers: A type of creators/contributors. Value should be a list of compilers.
 37 |     composer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code cmp
 38 |     composers: A type of creators/contributors. Value should be a list of composers.
 39 |     conductor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code cnd
 40 |     conductors: A type of creators/contributors. Value should be a list of conductors.
 41 |     contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ctb
 42 |     dubious_author: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code dub
 43 |     dubious_authors: A type of creators/contributors. Value should be a list of dubious_authors.
 44 |     editor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code edt
 45 |     editors: A type of creators/contributors. Value should be a list of editors.
 46 |     engineer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code egr
 47 |     engineers: A type of creators/contributors. Value should be a list of engineers.
 48 |     illustrator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code ill
 49 |     illustrators: A type of creators/contributors. Value should be a list of illustrators.
 50 |     librettist: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code lbt
 51 |     librettists: A type of creators/contributors. Value should be a list of librettists.
 52 |     other_contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code oth
 53 |     other_contributors: A type of creators/contributors. Value should be a list of other_contributors.
 54 |     performer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code prf
 55 |     performers: A type of creators/contributors. Value should be a list of performers.
 56 |     photographer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code pht
 57 |     photographers: A type of creators/contributors. Value should be a list of photographers.
 58 |     printer: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code prt
 59 |     printers: A type of creators/contributors. Value should be a list of printers.
 60 |     publisher: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code pbl
 61 |     publishers: A type of creators/contributors. Value should be a list of publishers.
 62 |     researcher: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code res
 63 |     researchers: A type of creators/contributors. Value should be a list of researchers.
 64 |     transcriber: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code trc
 65 |     transcribers: A type of creators/contributors. Value should be a list of transcribers.
 66 |     translator: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code trl
 67 |     translators: A type of creators/contributors. Value should be a list of translators.
 68 |     unknown_contributor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code unk
 69 |     unknown_contributors: A type of creators/contributors. Value should be a list of unknown_contributors.
 70 | covers: Value should be a list of cover objects.
 71 |     attribution_url: for Creative Commons licenses, the url for attribution
 72 |     cover_type: archival | archival-back | original | generated | titlepage_image
 73 |     image_path: a file path to the image file from the directory containing the metadata file
 74 |     image_url: a url for the cover image
 75 |     rights: rights pertaining to the cover
 76 |     rights_url: url defining the rights
 77 | creator: From Dublin Core, the principal creators, usually the author, sometimes editors.  Value is an object (dict) with one or more creators. In principle, could be any role from https://www.loc.gov/marc/relators/relaterm.html
 78 |     author: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code aut
 79 |     authors: A type of creators/contributors. Value should be a list of authors. 
 80 |     editor: A type of creator/contributor. Value should be a creator/contributor object. Defined by a MARC relation, code edt
 81 |     editors: A type of creators/contributors. Value should be a list of editors.
 82 | deathdate: Value should be a date. For creator/contributor objects.
 83 | description: Text description.
 84 | download_url_epub: url to download an EPUB file
 85 | download_url_mobi: url to download a MOBI (kindle) file
 86 | download_url_pdf: url to download a PDF file
 87 | edition_identifiers: An object (dict) with one or more identifiers. If there is more than one identifier of a given type, the value for that type should be a list of values.
 88 |     isbn: An identifier, International Standard Book Number. Always 13 digits. No dashes please.
 89 |     oclc: A number. Can link to worldcat with http://www.worldcat.org/oclc/{%oclc%}
 90 | edition_list: A list of objects (dicts), each with an "edition" entry and any other properties that are distinctive for the edition. These properties replace any properties specified at the top level of the pandata.
 91 | edition_note: Descriptive note about the edition. For MARC 250 field.
 92 | funding_info: Text describing the funding for this edition. Used for MARC 536 field.
 93 | gutenberg_agent_id: An agent identifier used in Project Gutenberg.
 94 | gutenberg_bookshelf: Project Gutenberg assigns some books to one or more "bookshelves".
 95 | gutenberg_issued: Project Gutenberg's issuance date.
 96 | gutenberg_transcribers_note: Project Gutenberg includes notes from the transcriber in text; this should be moved into metadata.
 97 | gutenberg_type: Within Project Gutenberg, the item type. Usually "Text"
 98 | identifiers: An object (dict) with one or more identifiers. If there is more than one identifier of a given type, the value for that type should be a list of values.
 99 |     goodreads: A number. Can link to goodreads with https://www.goodreads.com/book/show/{%goodreads%}
100 |     googlebooks: A case-sensitive base64 string. Can link to Google Books with https://books.google.com/books?id={%googlebooks%}
101 |     gutenberg: A number. Can link to Project Gutenberg with https://www.gutenberg.org/ebooks/{%gutenberg%}
102 |     isbns_related: A list of isbns for related editions.
103 |     isbn_electronic: ISBN for an electronic edition (various formats).
104 |     isbn_epub: ISBN for an EPUB edition.
105 |     isbn_hard: ISBN for a hardcover edition.
106 |     isbn_mobi: ISBN for a MOBI (kindle) edition.
107 |     isbn_paper: ISBN for a paperback edition.
108 |     isbn_pdf: ISBN for a PDF edition.
109 |     lccn: An identifier, Library of Congress Control Number. For MARC 010 field.
110 |     librarything: A number. Can link to librarything with https://www.librarything.com/work/{%librarything%}
111 |     openlibrary: A case-sensitive string like '/works/OL1892624W'. Can link to openlibrary with https://openlibrary.org{%openlibrary%}
112 |     unglueit: A number.  Can link to unglueit with https://unglue.it/work/{%unglueit%}
113 | language_note: Text for MARC 546 field. Usually used to note multiple languages, etc.
114 | language: An ISO code for the item's language, usually 2 letters, but could be 3, and might have a locale (zh-TW, for example)
115 | ocr_file_link: A url leading to the OCR files or OCRed image related to this book
116 | physical_description_note: Text for MARC 300 field.
117 | publication_date: The publication date of the edition.
118 | publication_date_original: The publication date of the edition from which the book is derived.
119 | production_note: Text for use in MARC 508 field.
120 | publication_note: Text for MARC 260 field.
121 | publisher_original: For gutenberg, the original publisher of the edition from which the book is derived.
122 | publisher: The name of the publisher of an edition.
123 | rights: A Dublin Core term, a string describing the usage rights. For example, "Public Domain, US"
124 | rights_url: A url defining the rights. For example, http://creativecommons.org/about/pdm
125 | series_note: Text for use in MARC 440 field.
126 | subjects: A list of subjects. These can be typed. !lcsh is Library of Congress Subject Headings, !lcc is Library of Congress Classification. For other sources, use the codes listed at http://www.loc.gov/standards/sourcelist/subject.html. If there's no authority, just leave the value untyped.
127 | summary: Text for use in MARC 520 field.
128 | tableOfContents: Value should be text.
129 | title: Plain text title. Use asciidoc for subscripts, etc.
130 | url: An identifying url for the object.
131 | wikipedia: A URL or list of URLs for a Wikipedia page.
132 | 
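133 | # Example (illustrative only, kept commented out so it adds no keys to this dictionary).
134 | # A minimal sketch of how a record might combine the attributes above, assuming one
135 | # reading of the nesting described for creator, identifiers, subjects and covers.
136 | # The repo name, version, gutenberg number and image path are hypothetical placeholders.
137 | #
138 | # _repo: Space-Viking_12345          # hypothetical repo name
139 | # _version: 0.1.0                    # hypothetical; would come from the repo tag
140 | # title: Space Viking
141 | # creator:
142 | #   author:                          # singular form only, not author and authors together
143 | #     agent_name: H. Beam Piper
144 | #     agent_sortname: Piper, H. Beam
145 | # language: en
146 | # identifiers:
147 | #   gutenberg: 12345                 # hypothetical number
148 | # subjects:
149 | #   - !lcsh Revenge -- Fiction
150 | #   - !lcsh Science fiction
151 | #   - !lcc PS
152 | # covers:
153 | #   - cover_type: generated
154 | #     image_path: cover.jpg          # hypothetical path, relative to the metadata file
155 | # rights: Public Domain, US
156 | # rights_url: http://creativecommons.org/about/pdm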


--------------------------------------------------------------------------------