├── .github
│   └── workflows
│       └── publish.yml
├── .gitignore
├── README.md
├── _publish.yml
├── _quarto.yml
├── about.qmd
├── contributing.qmd
├── guide-domain-specific
│   ├── ecocom.qmd
│   ├── hymet.qmd
│   ├── preface.qmd
│   └── soil-carbon.qmd
├── guide-eml-bp
│   ├── content-recommendations.qmd
│   ├── introduction.qmd
│   └── preface.qmd
├── guide-special-cases
│   ├── code.qmd
│   ├── images-and-docs.qmd
│   ├── img
│   │   └── guide-special-cases-model-fig1.png
│   ├── introduction.qmd
│   ├── large-offline.qmd
│   ├── model-based.qmd
│   ├── moving-platforms.qmd
│   ├── other-repositories.qmd
│   ├── preface.qmd
│   └── spatial-data.qmd
├── img
│   └── edi-logo.png
├── index.qmd
├── references.bib
└── references.qmd

--------------------------------------------------------------------------------
/.github/workflows/publish.yml:
--------------------------------------------------------------------------------
on:
  workflow_dispatch:
  push:
    branches:
      - main
      - prerelease
  workflow_call:
    inputs:
      ref:
        required: true
        description: 'The ref to trigger the workflow on'
        type: string
      prerelease:
        required: true
        description: 'Whether to deploy prerelease or production site'
        type: boolean
        default: false

name: Quarto Publish

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render and Publish
        # if: ${{ format('{0}', inputs.prerelease) == 'false' }}
        if: ${{ format('{0}', inputs.prerelease) == 'false' || (!inputs.prerelease && contains(fromJSON('["push", "workflow_dispatch"]'), github.event_name) && github.ref == 'refs/heads/main') }}
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Render and Publish to netlify
        # if: ${{ format('{0}', inputs.prerelease) != 'false' }}
        if: ${{ inputs.prerelease || (format('{0}', inputs.prerelease) != 'false' && contains(fromJSON('["push", "workflow_dispatch"]'), github.event_name) && github.ref == 'refs/heads/prerelease') }}
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: netlify
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata
.DS_Store

/.quarto/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Dataset Preparation Guides for the EDI Community (data-package-best-practices)

This repository contains a series of documents about preparing and publishing datasets in the environmental sciences and similar contexts. Topics include community-developed metadata standards, serialization and markup formatting guidelines, and best practices for the content of published research datasets. This documentation is maintained by the [Environmental Data Initiative](https://edirepository.org) (EDI) and all content has been developed and written in collaboration with EDI's community of scientists, data managers, and repository users.

Two versions of this content are published online:

* [The current production version](https://ediorg.github.io/data-package-best-practices/), derived from the `main` branch of the repository.
* [A prerelease version](https://prerelease-edi-docs.netlify.app) containing new and in-development documents, derived from the `prerelease` branch.

The guide documents are published as a collection of [Quarto books](https://quarto.org/docs/books). Revisions and additions to these guides will occur using this repository, with periodic release and distribution of copies in print-ready formats (PDF, MS Word). For more on the history of this effort, and archived earlier editions of the guides, see the [History](history.qmd) page.

# Contributing

The maintenance of this repository and the development of the included documents are coordinated by EDI, with major updates and new content developed and approved through a community-oriented process. If you would like to contribute to these documents, please contact the maintainers and working group leads listed on the [About](about.qmd) page, or reach out to EDI at [info@edirepository.org](mailto:info@edirepository.org). Some details on how contribution works are below.

## Branches

* **main**: The `main` branch holds the current production version of the documents. Documents in this branch have generally been edited and approved by the community. They are published in website format with GitHub Pages, under the "EDIorg" organization.
* **prerelease**: The `prerelease` branch contains documents that are under development for publication in an upcoming release. These documents contain new content and revisions to existing documents that may be under review by the community. Documents in this branch are published in website format to Netlify. After review and approval, changes may be merged into `main`.
* **feature** or **content** branches: These may exist during the early development of new features or drafting of content. They will first be merged into the `prerelease` branch for review. No feature or content branches are currently published in website form.

## Contributing changes

The guide documents and most website content are written in Quarto markdown, a variant of `pandoc` markdown, and saved as `.qmd` files. See the [Quarto guide](https://quarto.org/docs/guide/) for information on how to author `.qmd` files. New content, or edits to existing documents, can be contributed in several possible ways:

1. To suggest a change you may [file an issue](https://github.com/EDIorg/data-package-best-practices/issues/new/choose) in the GitHub repository outlining your proposed changes. This will begin a conversation with the maintainers and others in the community about whether and how to implement changes to the documents or website. You may be invited to draft the content changes (see the next item).
2. If you are ready to create a draft of the changes yourself, new `.qmd` files or edits to existing pages may be submitted as a pull request to the `prerelease` branch. See this [GitHub tutorial](https://github.blog/developer-skills/github/beginners-guide-to-github-creating-a-pull-request/) for some simple instructions and links to more resources. Maintainers will review the changes before merging them into the `prerelease` branch for further community input.
3. In some cases, particularly if you are involved in a community working group, you may request permission to push changes directly to the `prerelease` branch. If you think this would be the best way to contribute your changes, please contact the maintainers listed on the [About](about.qmd) page or <info@edirepository.org>.
4. Periodically, community working groups convene to develop and draft changes to these guides, and much of the writing, revising, and editing process takes place in formats outside this repository (Word or Google Docs). To get involved in one of these efforts and contribute changes that way, see the [About](about.qmd) page and contact working group leads, or propose your own working group to the community.

Once changes are moved into the `prerelease` branch, they will be reviewed by the larger community (EDI, LTER Network, EML users, repository communities, etc.) and approved (or not) for inclusion in the production documents (`main` branch).

## Publishing workflow

Both `main` and `prerelease` branches have GitHub Actions workflows configured to build and deploy their associated website any time new commits are pushed to that branch. The production site, derived from `main`, is published as a GitHub Pages site ([Quarto documentation](https://quarto.org/docs/publishing/github-pages.html#github-action)). The prerelease site, derived from the `prerelease` branch, is published to Netlify ([Quarto documentation](https://quarto.org/docs/publishing/netlify.html#github-action)). The GitHub Actions publishing workflow for both branches is specified in the [`.github/workflows/publish.yml`](.github/workflows/publish.yml) file, which was modeled in part on the [Quarto website version](https://github.com/quarto-dev/quarto-web/blob/main/.github/workflows/publish.yml). When changes are pushed to either branch, please verify that the GitHub Action completed and the website and all documents were built as expected.

--------------------------------------------------------------------------------
/_publish.yml:
--------------------------------------------------------------------------------
- source: project
  netlify:
    - id: 295af2f3-409a-48ef-a32c-a8f146d3cf7c
      url: 'https://prerelease-edi-docs.netlify.app'

--------------------------------------------------------------------------------
/_quarto.yml:
--------------------------------------------------------------------------------
project:
  type: book
  output-dir: _book

book:
  title: "EDI Dataset Preparation Guides"
  author: "Environmental Data Initiative and colleagues"
  description: "Community-developed guides for preparing and publishing datasets in the environmental sciences, and similar contexts, using the Ecological Metadata Language (EML)."
  date: "2022-05-25"
  favicon: img/edi-logo.png
  site-url: https://ediorg.github.io/data-package-best-practices
  repo-url: https://github.com/ediorg/data-package-best-practices
  repo-branch: main
  repo-actions: [edit, issue]
  navbar:
    search: true
    logo: img/edi-logo.png
  chapters:
    - index.qmd
    - about.qmd
    - contributing.qmd

    - part: guide-eml-bp/preface.qmd
      chapters:
        - guide-eml-bp/introduction.qmd
        - guide-eml-bp/content-recommendations.qmd

    - part: guide-special-cases/preface.qmd
      chapters:
        - guide-special-cases/code.qmd
        - guide-special-cases/model-based.qmd
        - guide-special-cases/images-and-docs.qmd
        - guide-special-cases/spatial-data.qmd
        - guide-special-cases/moving-platforms.qmd
        - guide-special-cases/other-repositories.qmd
        - guide-special-cases/large-offline.qmd

    - part: guide-domain-specific/preface.qmd
      chapters:
        - guide-domain-specific/ecocom.qmd
        - guide-domain-specific/hymet.qmd
        - guide-domain-specific/soil-carbon.qmd
    - references.qmd

execute:
  freeze: auto

bibliography: references.bib

format:
  html:
    theme: cosmo
#  pdf:
#    documentclass: scrreprt

--------------------------------------------------------------------------------
/about.qmd:
--------------------------------------------------------------------------------
# About {.unnumbered}

This website and the content it contains are a collaborative project of the Environmental Data Initiative (EDI) and numerous community partners and individuals. The intended audience is researchers, data managers, and repository users working in the environmental science and management fields.

## The website

::: {style='text-align:center;'}

This website is managed and maintained by the Environmental Data Initiative (EDI)

**<https://edirepository.org>**

For general inquiries, contact us at: <info@edirepository.org>

**Current maintainers**
Gregory Maurer ()
Colin Smith ()

:::

Content on this website is created and published using [Quarto](https://quarto.org/). All source files are managed in the [`EDIorg/data-package-best-practices`](https://github.com/EDIorg/data-package-best-practices) GitHub repository, and rendered web pages are hosted by [GitHub Pages](https://pages.github.com) or [Netlify](https://netlify.app). The `main` branch of the repository renders to the [current production website](https://ediorg.github.io/data-package-best-practices/), and the `prerelease` branch renders to a [prerelease website](https://prerelease-edi-docs.netlify.app/) for new content and documents under review or development. See the repository [README](https://github.com/EDIorg/data-package-best-practices) or [Contributing page](contributing.qmd) for more details.

## Contributors to the guides

All of the dataset preparation guides presented here have been developed and written by a collaborative community of scientists, data managers, and repository users. This community draws heavily from the Information Management Committee of the [U.S. Long-term Ecological Research Network (LTER)](https://lternet.edu), data specialists and curators at [EDI](https://edirepository.org), and numerous partners in the environmental research and data management communities.

Many individuals from the community described above have contributed to these guides over the years.
Individual contributors are listed in the title page, citation information, and/or document metadata for each guide. Please don't hesitate to reach out to the maintainers if anyone has been overlooked, or if information needs to be corrected.

**If you would like to contribute**, please read the [Contributing page](contributing.qmd) and contact the maintainers or representatives from the current working groups below.

## Community working groups

The ideas, words, and recommendations in these guides came to fruition as a result of many community working groups convening over periods of months to years. Below is a non-exhaustive list of these current and past working groups and their contacts. Don't hesitate to reach out to working group contacts if you would like to participate.

* **EML Best Practices Working Group**
  - In existence 2005-present
  - Contact Greg Maurer ()
  - Currently active members (2022-present): Corinna Gries (University of Wisconsin, Madison), Gabriel Kamener (Florida International University), Sage Lichtenwalner (Rutgers University), Mary Martin (University of New Hampshire), Gregory E. Maurer (New Mexico State University), Margaret O'Brien (University of California, Santa Barbara), John H. Porter (University of Virginia), Tim Whiteaker (University of Texas at Austin)
  - Past members: Dan Bahauddin (University of Minnesota), Barbara Benson, Emery Boose (Harvard University), James Brunt (University of New Mexico), Duane Costa (University of New Mexico), Corinna Gries (University of Wisconsin, Madison), Don Henshaw (Oregon State University), Margaret O'Brien (University of California, Santa Barbara), Ken Ramsey (New Mexico State University), Inigo San Gil (University of New Mexico), Mark Servilla (University of New Mexico), Wade Sheldon, Phillip Tarrant, Theresa Valentine, John Vande Castle, Kristin Vanderbilt (University of New Mexico), Jonathan Walsh, Yang Xia (Kansas State University)

* **Non-tabular Data Working Group**
  - Active from 2019-2021
  - Contact Corinna Gries ()
  - Members: Stace Beaulieu (Woods Hole Oceanographic Institute), Renée F. Brown (University of New Mexico), Jason Downing (University of Alaska, Fairbanks), Sarah Elmendorf (University of Colorado Boulder), Hap Garritt (Woods Hole Oceanographic Institute), Gastil Gastil-Buhl (University of California Santa Barbara), Corinna Gries (University of Wisconsin, Madison), Hsun-Yi Hseih (Michigan State University), Li Kui (University of California Santa Barbara), Mary Martin (University of New Hampshire), Gregory E. Maurer (New Mexico State University), An T. Nguyen (University of Texas at Austin), John H. Porter (University of Virginia), Adam Sapp (University of Georgia), Mark Servilla (University of New Mexico), Tim Whiteaker (University of Texas at Austin)

--------------------------------------------------------------------------------
/contributing.qmd:
--------------------------------------------------------------------------------
# Contributing {.unnumbered}

The maintenance of this repository and the development of the included documents are coordinated by EDI, with major updates and new content developed and approved through a community-oriented process. If you would like to help develop these documents, please contact the maintainers and working group leads listed on the [About](about.qmd) page, or reach out to EDI at [info@edirepository.org](mailto:info@edirepository.org).
Some details on how contribution works are below.

## Branches

* **main**: The `main` branch holds the current production version of the documents. Documents in this branch have generally been edited and approved by the community. They are published in website format with GitHub Pages, under the "EDIorg" organization.
* **prerelease**: The `prerelease` branch contains the in-development, "next version" of the documents. These documents contain new content or changes submitted by, or under review by, the community. Documents in this branch are published in website format to Netlify. After a review and approval process, changes may be merged into `main`.
* **feature** or **content** branches: These may exist during the early development of new features or drafting of content. They will first be merged into the `prerelease` branch, and they are not currently published in website form.

## Contributing changes

The guide documents and most website content are written in Quarto markdown, a variant of `pandoc` markdown, and saved as `.qmd` files. See the [Quarto guide](https://quarto.org/docs/guide/) for information on how to author `.qmd` files. New content, or edits to existing documents, can be contributed in several possible ways:

1. To suggest a change you may [file an issue](https://github.com/EDIorg/data-package-best-practices/issues/new/choose) in the GitHub repository outlining your proposed changes. This will begin a conversation with the maintainers and others in the community about whether and how to implement changes to the documents or website. You may be invited to draft the content changes (see the next item).
2. If you are ready to create a draft of the changes yourself, new `.qmd` files or edits to existing pages may be submitted as a pull request to the `prerelease` branch. See this [GitHub tutorial](https://github.blog/developer-skills/github/beginners-guide-to-github-creating-a-pull-request/) for some simple instructions and links to more resources. Maintainers will review the changes before merging them into the `prerelease` branch for further community input.
3. In some cases, particularly if you are involved in a community working group, you may request permission to push changes directly to the `prerelease` branch. If you think this would be the best way to contribute your changes, please contact the maintainers listed on the [About](about.qmd) page or <info@edirepository.org>.
4. Periodically, community working groups convene to develop and draft changes to these guides, and much of the writing, revising, and editing process takes place in formats outside this repository (Word or Google Docs). To get involved in one of these efforts and contribute changes that way, see the [About](about.qmd) page and contact working group leads, or propose your own working group to the community.

Once changes are moved into the `prerelease` branch, they will be reviewed by the larger community (EDI, LTER Network, EML users, repository communities, etc.) and approved (or not) for inclusion in the production documents (`main` branch).

## Publishing workflow

Both `main` and `prerelease` branches have GitHub Actions workflows configured to build and deploy their associated website any time new commits are pushed to that branch. The production site, derived from `main`, is published as a GitHub Pages site ([Quarto documentation](https://quarto.org/docs/publishing/github-pages.html#github-action)).
The prerelease site, derived from the `prerelease` branch, is published to Netlify ([Quarto documentation](https://quarto.org/docs/publishing/netlify.html#github-action)). The GitHub Actions publishing workflow for both branches is specified in the [`.github/workflows/publish.yml`](.github/workflows/publish.yml) file, which was modeled in part on the [Quarto website version](https://github.com/quarto-dev/quarto-web/blob/main/.github/workflows/publish.yml). When changes are pushed to either branch, please verify that the GitHub Action completed and the website and all documents were built as expected.

--------------------------------------------------------------------------------
/guide-domain-specific/ecocom.qmd:
--------------------------------------------------------------------------------
# Ecological community surveys

DRAFT DRAFT DRAFT

## Introduction

We know from experience that primary research data sets cannot be easily reused until all data are completely understood. Community survey data, in particular, can be highly complex, often with location-specific methods. Some general principles can be summarized from:

- synthesis scientists working with this type of data
- lessons learned from harmonization of primary data, e.g., EDI's `ecocomDP` project.

### Authors
- Margaret O'Brien (https://orcid.org/0000-0002-1693-8322)
- ecocomDP working group: https://edirepository.org/resources/thematic-standardization

## Anticipated use cases

### Synthesis
Synthesis requires that primary research data sets be understood and combined into a similar format. EDI has examined the needs of synthesis scientists using community survey data, and that feedback is incorporated here.

### Harmonization
Although a prescribed format would make data easily reused, the complexity of community surveys generally means that a prescribed format cannot be imposed on research studies. Therefore, EDI is harmonizing this data type in a workflow framework, and the lessons learned from examining raw data are included here. For more information on EDI's harmonization of community survey data, see:

- https://github.com/EDIorg/ecocomDP
- https://edirepository.org/resources/thematic-standardization#ecocomdp

### External applications
Community survey data are of great interest to the broader biodiversity community, particularly through their support for portals such as GBIF, and the use of the Darwin Core Vocabulary. Harmonization is the first step in this process.

## Definitions and conventions
The recommendations and examples below are organized according to how easily the data are reused. In all cases **Best** is preferred and recommended. **Marginal** and **not usable** may overlap in some situations.

- **Best**: Data are easily understood (do not require manual handling or further investigation)
- **OK**: Some manual, custom, or specialized handling required
- **Marginal**: Additional metadata is required, along with custom handling and probably investigation (e.g., questions answered via email)
- **Not usable**: The dataset has significant metadata missing, or has too many inconsistencies or layout challenges for it to be used even with manual handling.

## Recommendations for datasets

### Sampling methods
Methods are generally described in text metadata.

* **Best**
    * explanation of the sampling strategy
    * diagrams of sampling plots and their spatial relationships
* **OK**
    * reference included to a paper describing the above
* **Marginal/not usable**
    * no description of methods

### Dates
Temporal sampling regime is consistent.

* **Best**: A column for dateTime is in the entity, and its format is consistent throughout
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.6.56](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.6.56)
* **OK**: sampling regime changes over time (e.g., YYYY vs. YYYY-MM-DD)
* **Not usable**: date and time columns are not typed in EML as dateTimes (i.e., typed as strings)

Synthesis efforts may be able to circumvent the lack of true dates by dropping records (elevating a "not usable" dataset to "marginal").

*The ECC currently checks attributes typed as dateTime in two ways:*

- *format compared to a preferred list of formats*
- *if the format is in the preferred list, checks agreement of data values with that format*

### Locations
Locations should be complete, with latitude and longitude.

* **Best**: Columns for digital lat/lon
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=edi.5.3](https://portal.edirepository.org/nis/metadataviewer?packageid=edi.5.3)
* **OK** (need custom processing):
    * In metadata only
        * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33)
    * Deg-min-sec (strings)
    * Locations in a second table
* **Not usable**: site codes without lat/lon

### Site nesting
Sampling site nesting can be understood.

* **Best**: subsites are labeled
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=edi.5.3](https://portal.edirepository.org/nis/metadataviewer?packageid=edi.5.3)
* **OK**:
* **Not usable**:

### Taxa
Taxa can be resolved.

* **Best**: Taxon codes in the entity itself, assigned at source by those familiar with these taxonomic groups
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=edi.3.5](https://portal.edirepository.org/nis/metadataviewer?packageid=edi.3.5)
* **OK**: species binomials (IDs will have to be added later, by someone less familiar with these taxa)
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33)
* **Not usable**: local codes only
    * example (*if all they had included was the column called "sp_code"*): [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33)

### Table column names
Metadata can be matched to entity columns.

* **Best**: attributeName exactly matches column header
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=edi.3.5](https://portal.edirepository.org/nis/metadataviewer?packageid=edi.3.5)
* **OK**: can be matched by manual examination
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.1039.9](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.1039.9)
* **Marginal**: no header
    * example

This feature has come up in other discussions. The EML metadata asserts what the content of a column is; however, there is no explicit "key" into that column except for the column header in the data entity itself. If these do not match (or there is no table header), then there is nothing to go on but trust. That's fine if data are shared only within a tightly knit community, but is less reliable when data are reused outside the originating group.

*The ECC currently checks that the number of columns and their typing match (within the limits of an RDB). For attributeNames, it shows you the first line of the table for manual comparison (an info check).*

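For illustration, a minimal sketch of what that match looks like in practice; the file, column, and attribute names here are invented for the example:

```{.xml}
<!-- Hypothetical first line (header row) of the CSV entity:
     sample_date,site_id,taxon_code,count -->
<attribute>
  <attributeName>sample_date</attributeName>
  <attributeDefinition>Date the community sample was collected</attributeDefinition>
  <measurementScale>
    <dateTime>
      <formatString>YYYY-MM-DD</formatString>
    </dateTime>
  </measurementScale>
</attribute>
```
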
### Table linkages
Foreign key linkages are clear.

* **Best**: EML constraint included, with referential integrity
    * example: [https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.6.56](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-mcr.6.56)
* **OK**: FK detected manually, has referential integrity
    * url
* **Not usable**: FK detected manually, but no referential integrity
    * url

Table linkages are most important for harmonization; synthesis efforts may be able to circumvent issues (and elevate a "not usable" dataset to "marginal") by dropping records without referents in key fields.

--------------------------------------------------------------------------------
/guide-domain-specific/hymet.qmd:
--------------------------------------------------------------------------------
# Hydrology and Meteorology Data

## Introduction
This is a summary page. It should have links out for details.

### Synthesis use
Para here about CUAHSI, with a link to that project.

## Recommendations for datasets
Summary here.

--------------------------------------------------------------------------------
/guide-domain-specific/preface.qmd:
--------------------------------------------------------------------------------
---
title: Scientific Domain-Specific Metadata Guide
date: '2019'
author: Margaret O'Brien and others
title-block-banner: false
---

# Preface {.unnumbered}

_Very much a work in progress._

Recommendations for community-developed data products from specific scientific domains. Not all scientific domains are covered. The data packages are derived from raw data and reformatted to meet certain data harmonization standards. Many of these data products have extensive related code bases, which these recommendations can take into account and link to.

--------------------------------------------------------------------------------
/guide-domain-specific/soil-carbon.qmd:
--------------------------------------------------------------------------------
# Soil Carbon

## Introduction
Page describing BPs for data packages about soil carbon. It will probably have links out for details.

### Synthesis use
Para here about current synthesis use, e.g., the Weider WG.

## Recommendations for datasets
Summary here.

--------------------------------------------------------------------------------
/guide-eml-bp/introduction.qmd:
--------------------------------------------------------------------------------
# Introduction

This document contains current 'Best Practice' recommendations for EML content for metadata related to ecological and environmental data. It is intended to augment the EML schema documentation [@jones_ecological_2019] for a less-technical audience. The current version (v3, 2017) is one component of several resources available to EML preparers. These recommendations are directed towards the following goals:

- Provide guidance and clarification in the implementation of EML for datasets
- Minimize heterogeneity of EML documents to simplify development and re-use of software built to ingest it
- Maximize interoperability of EML documents to facilitate data synthesis

At the time of this document's publication (late 2017), the version of EML in production was EML 2.1.1. EML 2.2.0 is anticipated within the next year. Contact EDI for more information.

## History

EML Best Practice recommendations have evolved over time. The most active contributors have been members of the LTER Information Managers Committee in multiple working groups and workshops. EML has been widely used for several years, with multiple applications written against it, and the community has had the opportunity to observe the consequences of many content patterns. As much as possible, recommendations have been aligned with those experiences, as well as with the capability of data contributors.

Timeline and previous revisions:

- 2017 Best Practices for Dataset Metadata in EML v3 (this document)
- 2016 EDI inception, see http://edirepository.org
- 2011 EML Best Practices for LTER sites v2
- 2008 EML 2.1 release
- 2004 EML Best Practices for LTER sites
- 2003 LTER adopts EML as network exchange standard

Contributors, including LTER EML Best Practices Working Groups and workshops in 2003, 2004, 2010 (alphabetical order):

- Dan Bahauddin
- Barbara Benson
- Emery Boose
- James Brunt
- Duane Costa
- Corinna Gries
- Don Henshaw
- Margaret O'Brien
- Ken Ramsey
- Inigo San Gil
- Mark Servilla
- Wade Sheldon
- Philip Tarrant
- Theresa Valentine
- John Vande Castle
- Kristin Vanderbilt
- Jonathan Walsh
- Yang Xia

## General Recommendations

The following are general best practices for handling EML dataset metadata:

### Metadata Distribution

Do not publicly distribute EML documents containing elements with incorrect information, e.g., as a workaround for missing metadata or to meet validation requirements. Pre-publication drafts, or EML produced for demonstration or testing purposes, should be clearly identified as such and not contributed to public archives, because these are passed on to large-scale clearinghouses. For previews of drafts or handling test and demonstration data packages, consult your repository to learn about options.

### Data Package Identifiers

Metadata and data set versioning are controlled by the contributor, and so identifiers are tied to local systems. Many repository systems that accept EML-described data support principles of immutable metadata and data entity versioning. EML has elements to contain package identifiers, although these may also be assigned externally. It is the responsibility of the submitters to understand the practices of their intended repository when using identifiers.

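As a minimal sketch of where such an identifier appears, the **packageId** attribute sits on the root element; the identifier and system values below are placeholders, not a prescription:

```{.xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- packageId and system are placeholder values; your repository
     assigns or validates the actual identifier conventions -->
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="knb-lter-xyz.123.1"
         system="knb">
  <dataset>
    <!-- dataset content -->
  </dataset>
</eml:eml>
```
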
### High-priority Elements

- To support locating data by time, geographic location, and taxonomy, metadata should provide as much information as possible for the data package, in the three <**coverage**> elements:
    - <**temporalCoverage**> (when),
    - <**geographicCoverage**> (where), and
    - <**taxonomicCoverage**> (what).
- For a potential user to evaluate the relevance and usability of the data package for their research study or synthesis projects, metadata should include detailed descriptions in the
    - <**project**>,
    - <**methods**>,
    - <**protocols**>, and
    - <**intellectualRights**> elements.

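A minimal, illustrative sketch of the three coverage elements together (all values are placeholders):

```{.xml}
<coverage>
  <geographicCoverage>
    <geographicDescription>Placeholder description of the study area</geographicDescription>
    <boundingCoordinates>
      <westBoundingCoordinate>-119.95</westBoundingCoordinate>
      <eastBoundingCoordinate>-119.65</eastBoundingCoordinate>
      <northBoundingCoordinate>34.45</northBoundingCoordinate>
      <southBoundingCoordinate>34.35</southBoundingCoordinate>
    </boundingCoordinates>
  </geographicCoverage>
  <temporalCoverage>
    <rangeOfDates>
      <beginDate><calendarDate>2000-01-01</calendarDate></beginDate>
      <endDate><calendarDate>2017-12-31</calendarDate></endDate>
    </rangeOfDates>
  </temporalCoverage>
  <taxonomicCoverage>
    <taxonomicClassification>
      <taxonRankName>Genus</taxonRankName>
      <taxonRankValue>Zostera</taxonRankValue>
    </taxonomicClassification>
  </taxonomicCoverage>
</coverage>
```
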
Generally, the "EML preparer" works with or for the "Contributor." 56 | 57 | **Data package**: the EML metadata together with its entity or entities. This is generally the unit housed in repositories. We use this term to avoid confusion with the EML element "**dataset**". 58 | 59 | ### Other EML Resources 60 | 61 | Some sections refer to further information or tools. These can be found on the EDI website, under "Resources", at 62 | [https://edirepository.org](https://edirepository.org) -------------------------------------------------------------------------------- /guide-special-cases/code.qmd: -------------------------------------------------------------------------------- 1 | # Code 2 | 3 | Contributors: An T. Nguyen, Tim Whiteaker 4 | 5 | ## Introduction 6 | 7 | This document describes best practices for archiving software, code, or scripts, such as a simulation model, data visualization package, or data manipulation scripts. The intention of these recommendations is to make research based on modeling or software more transparent rather than achieve exact reproducibility, i.e., provide sufficient documentation so that a knowledgeable person can understand algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results. 8 | 9 | Examples of candidate archives for code include [CoMSES Net](https://www.comses.net/), which focuses on sharing models related to social and ecological sciences, and [Zenodo](https://zenodo.org/), a popular DOI-minting all-purpose repository, that can conveniently archive a specific version of [code in a GitHub repository](https://guides.github.com/activities/citable-code/). Alternatively, code may be archived in the EDI repository, either by itself or as part of a data package. The best practices in this document cover both archiving code in EDI and referencing code archived elsewhere. 10 | 11 | While metadata for software may be described in detail using the EML <**software**> tree, there exists a project called [CodeMeta](https://codemeta.github.io/index.html) which is specifically designed for software metadata. Therefore, one of the key recommendations in this document is to include a CodeMeta file when archiving software or code in EDI. 12 | 13 | ## Recommendations for data packages 14 | 15 | ### Considerations for archiving software or code 16 | 17 | * If it is a model and/or a model-based dataset, please see the [best practices for archiving model-based datasets](model-based-datasets.html). 18 | * How likely is it that the code will be well maintained into the future? For example, code packages submitted to established code repositories may stay there only while they comply with all testing requirements and may be removed if not well maintained (e.g., the R package repository CRAN). If that commitment to code maintenance is unlikely, such a package should be archived in a repository without maintenance requirements. 19 | * Should the code be archived as a separate package or with the data? 20 | * If the code is used to generate several independent datasets it should be archived as a separate package. 21 | * The software authors wishing to place it under a different license from that of the associated data, or to obtain a DOI for only the code, may be reasons to separate code and data packages. 22 | * If deciding to package code separately, it may be archived on EDI or another repository. 
      If archiving code outside of EDI, see the "Linking code and data" section below for instructions on how to reference that code from related data packages in EDI.
    * In most other cases, it is recommended to archive code and data together for context.
* Large community software packages are usually maintained and available elsewhere. However, they may undergo significant updates, and it may make sense to archive the code of a certain version with the data for transparency reasons. Consider whether prior versions of a software package are available wherever that software is distributed.
* When choosing a repository for the code, consider the ease of the archiving process and how well the code can be described. For example, Zenodo offers an easy pathway to archive code that is currently in GitHub, though metadata requirements are very light. Following the best practices described herein, you would create a CodeMeta file if you were going to archive with EDI. This is more rigorous than Zenodo, but then your code is better described, and in a machine-readable way.

### Documenting software/code

When describing the code with EML, include the code as an otherEntity in a data package. Although a well-documented, human-readable text format of the code is preferred, in the case of multiple scripts, and/or where directory structure is important, a zip archive may be used. For the formatName and entityType elements in EML, we recommend using format names from the [DataONE format list](https://cn.dataone.org/cn/v2/formats) when possible. Some format names are included in the examples below. Always check the list for the most up-to-date version of these names.

Example 1: EML otherEntity snippet for a script file.

```{.xml}
<otherEntity>
  <entityName>R script to process CTD data</entityName>
  <entityDescription>Annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>BLE_LTER_CTD_QAQC.Rmd</objectName>
    <size unit="bytes">9674</size>
    <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>R Markdown file</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>script</entityType>
</otherEntity>
```

#### Software License

It is important to include a use license to make it clear how others can use your work. We recommend the [Creative Commons "no copyright reserved" (CC0)](https://creativecommons.org/share-your-work/public-domain/cc0/) license, which places the software in the public domain and makes it easiest for end users to adapt and use your work. If a more restrictive license is required, we recommend the [Apache License, Version 2.0](https://directory.fsf.org/wiki/License:Apache-2.0), a permissive license that allows others to reuse, modify, and redistribute your software.

If a mix of data and code needs to be archived, and they each fall under different licenses, then separating them into different packages is advisable to eliminate ambiguity about which license applies to which portion of a data package. When a license other than a public domain dedication is used, then in addition to specifying the license in the metadata (see the "intellectualRights" element in EML), consider including a copy of the license at the beginning of the code files themselves so that the license is readily apparent to end users who peruse the code.

#### CodeMeta

Include a CodeMeta JSON file for all code that is archived in EDI. The CodeMeta file should be named "codemeta.json" and listed as an EML otherEntity.
The formatName should be "JavaScript Object Notation (JSON) file", the entityType should be "metadata", and the entityDescription should indicate that this is a CodeMeta file for a given software or script in the data package.

For unnamed projects, e.g., one-off scripts for data processing, analysis, and/or visualization, a CodeMeta file might appear to be overkill; however, CodeMeta files are simple to generate, and we recommend the bare minimum below. If there are multiple scripts, each in their own otherEntity tag, we recommend aggregating information about them into one codemeta.json.

Example 2: Minimum recommended codemeta.json example for unnamed projects.

```json
{
  "@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "description": "RMarkdown script to calibrate and flag raw CTD data.",
  "author": {
    "@type": "Person",
    "givenName": "Christina",
    "familyName": "Bonsell",
    "email": "cbonsell@utexas.edu",
    "@id": "https://orcid.org/0000-0002-8564-0618"
  },
  "keywords": ["calibration", "CTD", "RMarkdown"],
  "license": "https://unlicense.org/",
  "dateCreated": "2013-10-19",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "version": "3.6.2",
    "url": "https://r-project.org"
  }
}
```

Example 3: Sample otherEntity metadata for Example 2's codemeta.json.

```{.xml}
<otherEntity>
  <entityName>CodeMeta file for BLE_LTER_CTD_QAQC.Rmd</entityName>
  <entityDescription>CodeMeta file for annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>codemeta.json</objectName>
    <size unit="bytes">702</size>
    <authentication method="MD5">8547b7a63abc6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>application/json</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>CodeMeta</entityType>
</otherEntity>
```

For named projects, also include the software name, and the version if applicable. The example below shows some additional metadata you can include. See also the more complete [codemetar example](https://docs.ropensci.org/codemetar/articles/codemeta-intro.html) and the available [CodeMeta terms](https://codemeta.github.io/terms/).

Example 4: A more complete CodeMeta example for named projects. Example taken from the CodeMeta project GitHub with edits for brevity.
```json
{
  "@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate 'CodeMeta' Metadata for R Packages",
  "description": "A JSON-LD format for software metadata",
  "author": [{
    "@type": "Person",
    "givenName": "Carl",
    "familyName": "Boettiger",
    "email": "cboettig@gmail.com",
    "@id": "https://orcid.org/0000-0002-1642-628X"
  },
  {
    "@type": "Person",
    "givenName": "Maëlle",
    "familyName": "Salmon",
    "@id": "https://orcid.org/0000-0002-2815-0399"
  }
  ],
  "codeRepository": "https://github.com/ropensci/codemetar",
  "dateCreated": "2013-10-19",
  "license": "https://spdx.org/licenses/GPL-3.0",
  "version": "0.1.8",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "version": "3.5.3",
    "url": "https://r-project.org"
  },
  "softwareRequirements": [{
    "@type": "SoftwareApplication",
    "identifier": "R",
    "name": "R",
    "version": ">= 3.0.0"
  },
  {
    "@type": "SoftwareApplication",
    "identifier": "git2r",
    "name": "git2r",
    "provider": {
      "@id": "https://cran.r-project.org",
      "@type": "Organization",
      "name": "Comprehensive R Archive Network (CRAN)",
      "url": "https://cran.r-project.org"
    }
  }
  ],
  "keywords": ["metadata", "codemeta", "ropensci"]
}
```

#### Metadata to enable reproducibility

When archiving software, we strongly recommend including a user guide with installation and usage instructions if such would not already be apparent to the typical user. Take into account that the user might not have access to certain inputs that the software/scripts require. When feasible, include at least some example data, and configure the script so that it is ready to run with the example data.

Aside from the software/code itself and its dependencies, other pieces of information may be important should a user wish to reproduce results, such as the operating system and version and the system locale. Include this information in the data package's methods/methodStep/description. For certain tools, there are ways to easily generate this information, e.g., a call to sessionInfo() in the R console. If the system outputs this information in a standard, plain-text format, that file might be included as an otherEntity.

### Linking code and data

There are a few solutions for providing explicit machine-readable linkages between different entities/packages (the distinction between code/data doesn't matter too much here). For most cases we recommend the simplest approach, which is to use the methods/methodStep/description element of EML. More advanced users may wish to utilize the other solutions described herein.

#### Descriptive approach

In the dataset methods/methodStep/description element, include verbal descriptions such as "results.csv was derived from raw_data.csv using script.R" and repeat for all entities. If code and data reside in different packages, be sure to specify that.

#### The EML dataSource element

Nested under methods/methodStep, dataSource elements describe other data packages that serve as sources for the current package. dataSource looks like a mini-EML tree describing the source data.

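A sketch of what such a methodStep can look like; the title, names, and URL below are placeholders rather than a real source package:

```{.xml}
<methodStep>
  <description>
    <para>Aggregated and reformatted from the source data package listed below.</para>
  </description>
  <dataSource>
    <title>Placeholder title of the source dataset</title>
    <creator>
      <organizationName>Example LTER Site</organizationName>
    </creator>
    <distribution>
      <online>
        <url>https://doi.org/this-is/a-fake-doi</url>
      </online>
    </distribution>
    <contact>
      <organizationName>Example LTER Site</organizationName>
    </contact>
  </dataSource>
</methodStep>
```
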
Example: [ecocomDP packages](http://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=ecocomDP&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false) list the original packages under dataSource. dataSource does not describe relationships between entities in the same package, and as far as we know there is no explicit way in EML to do so.

#### ProvONE

[ProvONE](http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html) is a model developed by DataONE affiliates for provenance, or denoting relationships between data entities. Each package on DataONE is described by a science metadata document (e.g., EML, ISO, FGDC) and a resource map document following ProvONE. The resource map powers a nice display of data relationships (see [this package on the Arctic Data Center](https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2556Q)). This handles both relationships between entities in the same package and entities residing in different packages. However, note that EDI currently does not utilize this model.

### External software

Large community-backed tools or proprietary software such as ArcGIS Pro or Microsoft Excel do not need to be archived. However, if they have had any impact on the final data (e.g., ArcGIS Pro was used to modify spatial rasters), the EML methods section should describe the routines performed. Within the data package, indicate linkage to external software as follows.

* Briefly describe the software/code and its relationship to the data in EML's methods/methodStep/description element.
* Include the names of all software used, both the common acronym and the full spelling.
* Include the URL(s) to all models/software used. Stable, persistent URLs pointing to exact version(s) are preferable, rather than generic links such as a project homepage. If the archived model has a DOI, then include a full citation to the model in the methods/methodStep/description text. The exception to this is when referencing tools such as Excel that have achieved global household-name status.
* Broadly describe the system setup used, if relevant.
* Include information on exact versions of all code used (including dependencies). This is important, e.g., ArcGIS Pro 2.4.1 is very different from ArcGIS for Desktop 10.7.1. Different systems have methods to easily generate this information, e.g., a call to sessionInfo() in the R console.
* Consider, if applicable, archiving the "runfile" as its own data entity within the data package, i.e., the script(s) that sets parameters and/or calls on functions imported from external software.

Example 5: EML method description referring to external software.

```{.xml}
<methods>
  <methodStep>
    <description>
      <para>
        The seagrass coverage raster was created in ArcGIS Pro (version
        2.4.3, by Esri) using the IDW geoprocessing tool on
        sampling_points.csv with a power of 2 and the nearest
        12 points.
      </para>
      <para>
        The raster was then refined using the seagrass-refiner package
        with the auto-refine option checked (Smith, 2017).
      </para>
      <para>
        Smith, J. (2017). seagrass-refiner: a package that does the cool
        seagrass stuff, Version 1.2, Zenodo. https://doi.org/this-is/a-fake-doi,
        2017.
      </para>
    </description>
  </methodStep>
</methods>
```

## Resources

[CodeMeta](https://codemeta.github.io/) website

[CodeMeta generator](https://codemeta.github.io/codemeta-generator/) for creating CodeMeta

[CodeMeta crosswalks](https://codemeta.github.io/crosswalk/) for a number of popular software tools

[CodeMeta terms](https://codemeta.github.io/terms/) you can use for describing software

A description of some [software licenses](https://opensource.org/licenses)

[Best practices for archiving model-based datasets](model-based.qmd)

[ProvONE documentation](http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html)

[W3C PROV-O documentation](https://www.w3.org/TR/prov-o/)

[Licensing software as part of an EDI data package](https://docs.google.com/document/d/1JeznivTDubi0ZX_lsO50eCUl-8zxSiz_xq5SsBRwbuw/edit)

--------------------------------------------------------------------------------
/guide-special-cases/images-and-docs.qmd:
--------------------------------------------------------------------------------
# Images and Documents as Data

Contributors: Renee F. Brown (lead), Stace Beaulieu, Sarah Elmendorf, Gastil Gastil-Buhl, Corinna Gries, Li Kui, Mary Martin, Greg Maurer, John Porter, Tim Whiteaker

## Introduction

This chapter describes best practices for archiving images and other documents as data. The [Information Artifact Ontology (IAO)](http://purl.obolibrary.org/obo/IAO_0000310) defines a document as '_a collection of information content entities intended to be understood together as a whole_.' Common examples include still images, audio and/or video multimedia files, field notebooks, written interview notes or transcribed oral accounts, historical document collections, and 'paper' maps (non-digitized maps). For images that are already handled by specialized repositories (e.g., phenocam images, specimen images), refer to [Data in Other Repositories](other-repositories.qmd); for additional information on how to handle images from uncrewed (underwater or aerial) vehicles, refer to [Data Gathered with Small Moving Platforms](moving-platforms.qmd); and for geospatial imagery, refer to [Spatial Data](spatial-data.qmd).

## Recommendations for data packages

### Reasons to archive documents as data

* **Enhance the credibility of associated datasets.** Many document types (field notes, still images, etc.) often provide additional metadata that cannot easily be encapsulated in the associated dataset(s) or were not considered important at the time of transcription. As such, these documents may provide opportunities to rectify transcription errors, retrospectively provide explanations of unusual data, and/or include additional observational or measured data, such as opportunistic measurements or calibration parameters.
* **Provide opportunities for new analyses.** New analytical methods may be employed on archived documents (especially still images) or documents that were never archived previously because the cost-to-benefit ratio was considered too high (e.g., pilot projects).
* **Improve ease of access.** In distributed projects, access to original and/or 'hard-copy' documents may be limited to a particular institution or subset of people.
By digitally archiving these documents in a data repository, the data become more findable, accessible, interoperable, and reusable (FAIR). 16 | 17 | ### Considerations for data package structure 18 | 19 | * **Balance file size and number of files.** A data package may contain document files individually or bundled as a compressed archive (e.g., zip). The decision of how to bundle documents into compressed archives and then into data packages should be guided by the overall goal of making data usable for the intended purpose of the documents. In most cases, this involves finding specific documents by, for example, the date or location of acquisition, or some other aspect of interest. The effort required to document the files (individually vs. in groups) should also be taken into account. Also see [Large Data Sets](large-data-sets.html). 20 | * **Document grouping.** Data packages, or compressed archives within data packages, may be grouped spatially (e.g., by location) and/or temporally (e.g., by date, season, or year). For example, data outputs from a stationary camera may be archived in annual data packages, each containing monthly compressed archives if the number of images is large. While moving camera outputs may also be archived annually, these data packages may instead include compressed archives containing all still images for a single location. 21 | * **Document naming.** To maximize searchability, document names should be unique and meaningful for a data reuser. It is recommended that individual documents be named according to their content, and that compressed archives include date, location, and other relevant information in the filename. 22 | * **Data inventory table.** An inventory table describing the structure and organization of the included document entities or groups of documents (see Table 1) is recommended, especially for larger collections of documents within a data package. The inventory table serves as an additional source of metadata and may also be used to link specific documents to additional information. 23 | * **Archival frequency.** Because a large volume of documents must be handled repeatedly for each update, strive to archive a fully processed group of documents once no more updates are expected (e.g., after a field season or annually). 24 | * **Linking to related data packages.** Where the documents are useful for understanding another data package and vice versa (e.g., met station visitation logs and met station time series data), it is recommended to link the complementary data package in the methods section of both datasets. Alternatively, include the document(s) or compressed archive(s) in the existing dataset as otherEntity, as described in the next section. 25 | 26 | ### Documenting data packages 27 | 28 | #### Ecological Metadata Language 29 | 30 | All data packages require good discovery-level metadata in Ecological Metadata Language (EML), which should be assembled using standard documented best practices. Documents (including compressed archives) should be included as otherEntity in the data package (e.g., see Example 1). Refer to the most recent version of EML Best Practices ([currently v3](https://environmentaldatainitiative.files.wordpress.com/2017/11/emlbestpractices-v3.pdf)) for guidance regarding the formatName and entityType EML elements. If a format for your document type is not covered, it is recommended to use the appropriate [MIME type](https://github.com/DataONEorg/object-formats), if available.
31 | 32 | Example 1: EML otherEntity snippet for a pdf file 33 |

```{.xml}
<otherEntity>
  <entityName>site date</entityName>
  <entityDescription>Field notes at site and date.</entityDescription>
  <physical>
    <objectName>site_date.pdf</objectName>
    <size unit="byte">9674</size>
    <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>Portable Document Format</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>application/pdf</entityType>
</otherEntity>
```

51 | 52 | The EML metadata should also include appropriate keywords describing the general purpose of the document or compressed archive (e.g., ice phenology, community composition, stream hydrology, etc.). For example, for still images, it is recommended to include the keyword 'image' with the semantic annotation from the Information Artifact Ontology (IAO): 53 | 54 | **Term IRI:** [http://purl.obolibrary.org/obo/IAO_0000101](http://purl.obolibrary.org/obo/IAO_0000101) 55 | 56 | 57 | **Definition:** An image is an affine projection to a two dimensional surface, of measurements of some quality of an entity or entities repeated at regular intervals across a spatial range, where the measurements are represented as color and luminosity on the projected surface. 58 | 59 | Note that IAO includes at least one subcategory for image (e.g., [photograph](http://www.ontobee.org/ontology/IAO?iri=http://purl.obolibrary.org/obo/IAO_0000185)). It is recommended that the most specific applicable concept be used. 60 | 61 | #### Data Inventory Table 62 | 63 | We recommend that an additional level of metadata be provided through a data inventory table that effectively serves as a document catalog (see Table 1). The detail provided in this table should be guided by the same principle as stated above -- to enable optimal usability of the documents. For example, still images from a stationary camera require latitude and longitude only in the EML file, not for each individual image. However, images from a moving camera may need that information for every image, or at least for every location (e.g., site, quadrat, transect). Additionally, Exif metadata from photographic images may be programmatically extracted to supplement the inventory table (refer to the _Tips and Tricks_ section of [Data Gathered with Small Moving Platforms](data-gathered-with-small-moving-platforms.html)). 64 | 65 | The data inventory table should be structured such that each column represents a particular attribute, described in EML as a [dataTable](https://eml.ecoinformatics.org/schema/eml-dataTable_xsd.html#eml-dataTable.xsd) entity (see the sketch following Table 1 below), and each row represents an individual document or a compressed archive of a group of documents. At minimum, the table should include an attribute for the document/archive filename, as well as any other essential attributes that vary per document/archive. Additional attributes may include information on the date and/or time, but for this information to be useful, be consistent and use a controlled vocabulary for these fields so that a user can effectively search on them. 66 | 67 | Table 1: Data inventory table structure.

| Column | Attribute Description |
| --- | --- |
| Filename | Filename of each document or compressed archive, including file extension (e.g., 'site_date.jpg'). For compressed archives, include the relative path of the document, with respect to the uncompressed directory structure (e.g., '2018/SITE3/quadrat4.jpg'). |
| Link/URL/URI | Link to download a document if it is available on a different system (also see Data in Other Repositories). Persistent identifiers are recommended, if available. |
| Creator(s) | Name(s) of the creator(s) of the original document (e.g., photographer, field technician, interviewer). Multiple creators should be entered into a single cell using the pipe delimiter. |
| Datetime | Date (and time) associated with the document, in ISO 8601 format (e.g., 2007-04-05T12:30-02:00). |
| Project-specific datetime attributes | One or more appropriately labeled columns containing project-specific date and time information for easier search and retrieval of documents (e.g., year, season, campaign). |
| Location | One or more location columns as appropriate, such as latitude and longitude in decimal degrees, site name, transect name, altitude, depth, habitat, etc. |
| Document-specific attributes | One or more columns as appropriate to the document type, such as weather conditions, organism name, instrument type, etc. |
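
As noted above, the columns of the inventory table are described in EML as attributes of a dataTable entity. Below is a minimal sketch for the Filename column only; the entity name, file name, and attribute labels are hypothetical, and a complete dataTable element would also include physical and other required child elements:

```{.xml}
<dataTable>
  <entityName>document_inventory</entityName>
  <entityDescription>
    Inventory of the documents contained in the compressed archives
    of this data package.
  </entityDescription>
  <attributeList>
    <attribute>
      <attributeName>filename</attributeName>
      <attributeDefinition>
        Filename of each document, including the relative path within
        the uncompressed archive.
      </attributeDefinition>
      <measurementScale>
        <nominal>
          <nonNumericDomain>
            <textDomain>
              <definition>Relative file path of a document</definition>
            </textDomain>
          </nonNumericDomain>
        </nominal>
      </measurementScale>
    </attribute>
  </attributeList>
</dataTable>
```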
119 | 120 | ## Example data packages in EDI 121 | 122 | Each of the Environmental Data Initiative (EDI) data packages listed below includes images or other documents as data. Some of these packages contain data inventory tables (as dataTable entities) described in the EML metadata. 123 | 124 | Table 2: Data packages in EDI providing examples of best practices from this document.

| Dataset Title | Description | EDI Package ID |
| --- | --- | --- |
| Annual ground-based photographs taken at 15 net primary production (NPP) study sites at Jornada Basin LTER, 1996-ongoing | Compressed archives of images grouped by year. Includes data inventory file. | knb-lter-jrn.210011005.105 |
| McMurdo Dry Valleys LTER: Landscape Albedo in Taylor Valley, Antarctica from 2015 to 2019 | Compressed archives of aerial images, grouped by flight date, and associated reflectance data. | knb-lter-mcm.2016.2 |
| MCR LTER: Coral Reef: Computer Vision: Multi-annotator Comparison of Coral Photo Quadrat Analysis | 5090 coral reef survey images, and 251,988 random-point annotations by coral ecology experts. | knb-lter-mcr.5013.3 |
| Abundance and biovolume of taxonomically-resolved phytoplankton and microzooplankton imaged continuously underway with an Imaging FlowCytobot along the NES-LTER Transect in winter 2018 | 144,281 images from a plankton imaging system with annotations and extracted size data. | knb-lter-nes.9.1 |
| Calling activity of Birds in the White Mountain National Forest: Audio Recordings (2016 and 2018) | Compressed archive containing 410 audio files in wav format. Includes data inventory table. | knb-lter-hbr.268.1 |
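
For packages like those above, the keyword and semantic annotation recommendation from earlier in this chapter can be encoded in EML 2.2 or later using an annotation element. The sketch below shows one possible encoding for a dataset consisting of still images; the property URI shown (Dublin Core 'type') is an illustrative choice, not a prescribed one:

```{.xml}
<annotation>
  <propertyURI label="type">http://purl.org/dc/elements/1.1/type</propertyURI>
  <valueURI label="image">http://purl.obolibrary.org/obo/IAO_0000101</valueURI>
</annotation>
```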
176 | 177 | ## Resources 178 | 179 | ### Considerations for digitizing documents 180 | 181 | Following are some general considerations and recommendations for digitizing paper or other 'hard-copy' documents for archiving. This is not meant to be an exhaustive list. For further and more detailed information, please refer to the U.S. National Archives and Records Administration (NARA)’s _[Technical Guidelines for Digitizing Archival Materials for Electronic Access](https://www.archives.gov/files/preservation/technical/guidelines.pdf)_. 182 | 183 | * **Effort.** The decision to digitize documents, as well as the digitization method, involves trade-offs in the accessibility and ease of using particular hardware and/or software technologies, the quality of the digitization, and the overall effort spent. Digitization efforts may be significant, for example, when dealing with a large number of documents requiring meaningful file names, text recognition, and/or high resolution for improved accessibility. 184 | * **Equipment.** Instruments for digitizing hard-copy documents range from high-resolution scanners (less accessible, less user-friendly, more expensive, better quality) to smartphone cameras (ubiquitous, easy-to-use, lower quality). For example, a smartphone image taken in the field can provide quick and easy digitization of field notes. 185 | * **Document resolution and file size.** This is an important consideration that should be guided by the content and purpose of the document. Detailed paper maps should probably be scanned at high resolution and large file size, while field sheets may not need as much detail. 186 | * **Optical Character Recognition (OCR):** When digitizing documents that include text, we recommend using scanning or other software with OCR capabilities (e.g., Adobe, ABBYY, Tesseract) to convert the text into machine-readable characters so that the documents are searchable and thus more usable. OCR does not work well for handwritten text, older fonts, or documents with busy backgrounds (speckled, dirty, faded, etc.). 187 | * **Sensitive Information and Human Subjects:** Regardless of the digitization method, one should be mindful of sensitive information that should not be archived, or that should be redacted (e.g., photographs of human subjects, field notebooks containing personal messages, gate combinations, and/or telephone numbers). In all cases in which human subjects are involved, Institutional Review Board (IRB) restrictions must be heeded. A signed IRB consent form for the associated research project represents a contract between researcher and human subject. It is important to note that IRB restrictions can differ among research studies within the same project. For further information, see the [EDI Data Policy](https://edirepository.org/about/edi-policy#sensitive-data). 188 | 189 | While transcription is a digitization method that can be performed on certain types of documents (e.g., audio/video recordings, field notebooks) and can enhance search capabilities, transcript generation requires substantially more effort than other digitization methods and is prone to error. Moreover, where the original documents contain drawings, transcripts may be incomplete or otherwise inaccurate.
_Thus, we recommend digitizing documents by other means, using the considerations described above._ 190 | -------------------------------------------------------------------------------- /guide-special-cases/img/guide-special-cases-model-fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EDIorg/data-package-best-practices/1dbf0fc4cf3a70f23e267dce6ffc5253185a337d/guide-special-cases/img/guide-special-cases-model-fig1.png -------------------------------------------------------------------------------- /guide-special-cases/introduction.qmd: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | Members of the working group developing these documents: S. Beaulieu, R. Brown, J. Downing, S. Elmendorf, H. Garritt, G. Gastil-Buhl, C. Gries, J. Hollingsworth, H.-Y. Hsieh, L. Kui, M. Martin, G. Maurer, A. Nguyen, J. Porter, A. Sapp, M. Servilla, T. Whiteaker 4 | 5 | In these documents we consider special cases for archiving research data based on their data type, format, or acquisition method, and recommend practices that ensure optimal re-usability of the data. Most recommendations here are aimed at improving documentation of data acquisition and processing to avoid misinterpretation. This includes the recommendation to publish raw data and/or processing code along with the data products. Others are aimed at usability in terms of data size/volume, or connecting related data. Some recommendations involve including a metadata document formatted according to a new and emerging standard (e.g., codeMeta) or a data inventory table. Data inventory tables can cross the line between metadata and data and are intended to improve discoverability and navigation of archived data. 6 | 7 | The intended audience for these best practice recommendations is the ecological research information manager (IM) community, and they are applicable to anyone operating in the context of an ecological research program. We assume that the target data repository is designed to handle ecological data, and that a given archive package will include metadata encoded in a community standard. This document references elements of the [EML metadata standard](https://eml.ecoinformatics.org/), but many aspects would similarly apply to other metadata standards and these documents should be considered in the larger context of applicable metadata standard best practices. We refer to the Environmental Data Initiative ([EDI](https://portal.edirepository.org)) as an example data repository, though the same practices could be applied to other similar repositories. 8 | 9 | Throughout the chapters we use the term _data package_ to refer to a published unit of data and metadata together, which is the convention at the EDI repository. At other data repositories, equivalent terms for a data package, such as _dataset_, may be used. A data package may contain one or more _entities_, such as csv tables, spatial data, processing or modeling code, and other documents (pdf, jpg, zip). A basic discussion of data package design can be found as [EDI's first phase of data publishing documentation](https://edirepository.org/resources/designing-a-data-package) and in the [LTER Best Practices for Dataset Metadata in Ecological Metadata Language (EML)](https://environmentaldatainitiative.files.wordpress.com/2017/11/emlbestpractices-v3.pdf). 
10 | 11 | Generally, we recommend archiving entities using standard file formats that are likely to be machine-readable in the future. Exceptions to this may exist where the community standard for processing particular data types relies on specialized file formats (binary, closed specification, etc.) or proprietary software. In these cases, it may be appropriate to archive specialized file types and/or a copy that has been parsed into a format (e.g., ASCII) that does not require proprietary software. 12 | 13 | **Table of contents** 14 | 15 | * [Processing code](./code.qmd) 16 | * [Modeling datasets](./model-based.qmd) 17 | * [Images and documents](./images-and-docs.qmd) 18 | * [Spatial data](./spatial-data.qmd) 19 | * [Data gathered with small, moving platforms](./moving-platforms.qmd) 20 | * [Provenance and data in other repositories](./other-repositories.qmd) 21 | * [Very large datasets](./large-offline.qmd) 22 | 23 | -------------------------------------------------------------------------------- /guide-special-cases/large-offline.qmd: -------------------------------------------------------------------------------- 1 | # Large Data Sets 2 | 3 | Contributors: Margaret O’Brien, Corinna Gries, Mark Servilla 4 | 5 | ## Introduction 6 | 7 | Data entities are kept offline when they are too large to be handled easily by the HTTP protocol, are expected to be rarely requested, and can be mailed on an external drive. If you suspect your data fall into this category, contact EDI for advice (support@edirepository.org). Below are recommendations for the EDI repository's handling of data packages that have an offline component. 8 | 9 | ### Background 10 | 11 | Standard practice is to handle data entities (both upload and download) via the HTTP protocol, using a URL. However, for very large datasets HTTP can fail due to physical limits. The limit for “too large” is somewhat subjective; EDI's current limit for datasets that are “too large for HTTP” is 100 GB (all data and metadata). 12 | 13 | ## Recommendations for data packages 14 | 15 | ### Physical Storage 16 | 17 | * The use of a Solid-state Drive (SSD) is strongly recommended for all offline data storage. The SSD should be formatted using one of the following file systems: 1) exFAT, 2) NTFS, or 3) ext4. Each of these file systems can accommodate individual file sizes greater than 1 TB. 18 | 19 | * Add data to the external drive in native (non-compressed, non-tarred, non-zipped) format and deliver it to EDI (e.g., by physical mail). 20 | * EDI will store three copies: one external hard drive each in New Mexico and in Wisconsin, and one copy in general EDI backup cloud storage. 21 | * Please mail one copy each to: 22 | 23 | Attn: Mark Servilla 24 | UNM Biology, Castetter Hall 1480 25 | MSC03-2020, 219 Yale Blvd NE 26 | Albuquerque, NM 87131-0001 27 | 28 | Attn: Corinna Gries 29 | University of Wisconsin 30 | Center for Limnology 31 | 680 North Park Street 32 | Madison WI 53706-1413 33 | 34 | ### Access 35 | 36 | To receive offline data, send a request to support@edirepository.org listing the data package(s) of interest and we'll determine the best mode of delivery. If technically feasible, we will stage data on an internet-accessible host and will send you a URL when the data are ready for download. Otherwise, data will be sent on physical media.
To receive data on physical media, you will be required to send a storage device of adequate volume to: 37 | 38 | Attn: Mark Servilla 39 | UNM Biology, Castetter Hall 1480 MSC03-2020 40 | 219 Yale Blvd NE Albuquerque, NM 87131-0001 41 | 42 | We will utilize a delivery service of your choice; however, you will be responsible for all shipping arrangements and fees, including pre-arranging return delivery. 43 | 44 | ### Data package 45 | 46 | * The external hard drive should contain at least two entities: the data (which will be offline) and an inventory or manifest that describes the contents of the external hard drive. 47 | * The content of the manifest (inventory of holdings) is dictated by the type of data entity. The **manifest will be available as an online entity** (through the EDI Data Portal) so that potential requestors can evaluate the offline resource before requesting it. 48 | * Suggested columns are: 49 | * Filename(s) 50 | * Format (netCDF, tabular csv, etc.) 51 | * Start_datetime 52 | * End_datetime 53 | * Location_lat 54 | * Location_lon 55 | * (other parameters the PIs may feel are essential) 56 | * Checksum 57 | 58 | ### Package Metadata 59 | 60 | (in EDI metadata template and converted to EML; generally, as for any data package) 61 | 62 | * Abstract: describe the collection generally. If individual files require specific software to read, provide the name of that software. 63 | * Creators 64 | * Contact - The first listed contact is responsible for sending out copies as requested and is typically the EDI repository manager: 65 |

```{.xml}
<contact>
  <organizationName>Environmental Data Initiative</organizationName>
  <electronicMailAddress>support@edirepository.org</electronicMailAddress>
</contact>
```

71 | * Methods - detailed collection/generation methods for the offline data entities, with detailed information for re-using the data. (May instead be included in the manifest table if different for different offline files.) 72 | * Data Entities 73 | * Offline Entity: 74 | * Describe as you would for an online resource. Restate the software needed to read the individual files if this is important to a user. See [Table 1](#table_1) and [Sample XML](#sample_xml_offline). 75 | * Manifest (inventory of the offline holdings) 76 | * Column descriptions as for any data table 77 | 78 | ### EML 79 | 80 | In addition to basic resource-level metadata, at least two entities should be described: 81 | 82 | * Manifest (inventory): should be a tableEntity; it will be the online entity, described as any other data table 83 | * Offline entity: 84 | * Fill out high-level fields as for an online resource. Restate the software needed to read the individual files if this is important to a user. 85 | * Distribution node will be `offline` (see Table 1 and the code block below) 86 | 87 | #### Table 1. Three required fields for an offline distribution {#table_1}

| EML element | Description |
| --- | --- |
| physical/objectName | As for any entity, this is the name of the file or data object. |
| dataFormat/externallyDefinedFormat/formatName | The name of the format the data object is in. If a special compression has been applied, list it here. |
| distribution/offline/mediumName | Instead of a data URL, you will have an offline distribution node. The name of almost all offline media is “external drive”, because that is how you will deliver the data to a requestor. |
109 | 110 | #### Sample XML, offline entity {#sample_xml_offline} 111 |

```{.xml}
<otherEntity>
  <physical>
    <objectName>mainl_2005acc.zip</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>netCDF file</formatName>
      </externallyDefinedFormat>
    </dataFormat>
    <distribution>
      <offline>
        <mediumName>External drive</mediumName>
      </offline>
    </distribution>
  </physical>
</otherEntity>
```

129 | 130 | ## References 131 | 132 | ### EML documentation 133 | 134 | [https://eml.ecoinformatics.org/schema/index.html](https://eml.ecoinformatics.org/schema/index.html) 135 | 136 | Look for the PhysicalDistributionType. 137 | 138 | ## Potential Issues 139 | 140 | * SSD formatting (eventually, whatever file system we use will become unusable). 141 | * Even with cloud storage, a binary format will eventually become unusable. -------------------------------------------------------------------------------- /guide-special-cases/model-based.qmd: -------------------------------------------------------------------------------- 1 | # Model-Based Datasets 2 | 3 | Contributors: An T. Nguyen, Tim Whiteaker, Corinna Gries 4 | 5 | ## Introduction 6 | 7 | This document includes recommendations for archiving data packages composed of model-based datasets. These datasets may include the model code itself, input data, model parameter settings, and output data. 8 | 9 | The range of cases for model-based datasets extends from small one-off model code specific to one research question, through various code packages that are maintained in community repositories as long as they meet requirements (e.g., CRAN for R packages), to large community models maintained by groups of programmers and users. 10 | 11 | The intention of these recommendations is to make research based on modeling more transparent rather than to achieve exact reproducibility, i.e., to provide sufficient documentation so that a knowledgeable person can understand algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results. 12 | 13 | It is not always easy to determine who among project personnel (IMs, scientists, programmers) is responsible for the different components of a model-based dataset. This is best decided on a case-by-case basis. A common division is that the code authors annotate the code and the IM handles the archiving and linkage to data product(s), although this division may only partially apply in cases of large community models. 14 | 15 | ## Recommendations for data packages 16 | 17 | ![Figure 1: Flowchart for considering archival paths for various model components.](img/guide-special-cases-model-fig1.png "model data management flowchart") 18 | 19 | ### Referencing models in EML 20 | 21 | For data packages related to a model, whether the model is archived within the same data package or not, indicate linkage to the model in EML following the [best practices for archiving code](code.html) (see the section on linking code and data). 22 | 23 | Example 1: EML snippet relating data to models via the method description: 24 |

```{.xml}
<methods>
  <methodStep>
    <description>
      <para>This methodStep contains data provenance information as specified in the LTER EML Best Practices. Each dataSource element here lists entity-specific information and links to source data used in the creation of this derivative data package.</para>
    </description>
    <dataSource>
      <title>Source dataset title</title>
      <creator>
        <individualName>
          <givenName>first name</givenName>
          <surName>last name</surName>
        </individualName>
        <organizationName>organization name</organizationName>
        <electronicMailAddress>email@some.edu</electronicMailAddress>
      </creator>
      <distribution>
        <online>
          <onlineDescription>This is a link to an external online data resource (describe resource and repository location).</onlineDescription>
          <url>https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2</url>
        </online>
      </distribution>
      <contact>
        <positionName>Information Manager</positionName>
        <organizationName>organization name</organizationName>
        <electronicMailAddress>infomgr@some.edu</electronicMailAddress>
      </contact>
    </dataSource>
  </methodStep>
</methods>
```

54 | 55 | #### Model code 56 | 57 | The model used to produce certain data needs to be well documented and linked from the resulting data product(s). However, it is not always easy to decide where and how to archive the code, and whether to do so in conjunction with the data product(s). In the sections below we outline three common code archiving options. 58 | 59 | Note that these scenarios (model code archived with data, standalone in EDI, or elsewhere) are not mutually exclusive. Any project that involves code might make use of both established and custom software hosted on many different platforms, and might use some or all archiving options. 60 | 61 | To decide between archiving options, consider the questions listed in [best practices for publishing code](code.html). 62 | 63 | #### Model code and data in the same package 64 | 65 | The goal of this practice is to ensure transparency of the data. It applies to one-off models developed for the associated data, or occasionally to larger code bases for the reasons outlined in [best practices for archiving code](code.html). Include the code as a dataset/otherEntity. Additionally, it is recommended to include a CodeMeta file, which can also be handled and documented in EML as dataset/otherEntity. CodeMeta is a metadata standard for software and code compatible with schema.org. Refer to [best practices for archiving code](code.html) for how to document the code and create CodeMeta. 66 | 67 | #### Model code as standalone package 68 | 69 | If the model has been used to generate several datasets, i.e., is more widely applicable, it can be archived as its own package in EDI and assigned a DOI. Include the code as a dataset/otherEntity. Additionally, it is recommended to include a CodeMeta JSON-LD file, which can also be handled and documented in EML as dataset/otherEntity. CodeMeta is a metadata standard for software and code compatible with schema.org. Refer to [best practices for archiving code](code.html) for how to document the code and create CodeMeta. 70 | 71 | #### Model code archived/maintained elsewhere 72 | 73 | This might include complex community models/software maintained by many people, published and actively maintained R/Python packages, etc., or simply code archived in another repository such as [CoMSES Net](https://www.comses.net/). It may sometimes be advisable to archive a copy of the model code with the data, even if the code appears to be maintained elsewhere. See the recommendations above for referencing models in EML. 74 | 75 | ### Model input and output data 76 | 77 | These are considered data entities, which should be handled according to EML best practices for the corresponding data types. However, if the resulting datasets are very large, consider whether input/output from all individual model runs needs to be archived. Are there specific model run results that are more useful for non-modelers? For example: results from model runs leading to a journal publication. 78 | 79 | Very large model inputs/outputs may need to be archived offline. Refer to [best practices for offline data](large-data-sets.html). 80 | 81 | If the model requires a specific folder structure, you can zip model input files within the package to preserve that folder structure. A disadvantage of this approach is that you cannot elegantly describe each file with EML.
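
For example, a zipped set of model inputs can be documented as a single otherEntity, with the preserved directory structure noted in the entity description. This minimal sketch uses hypothetical file names and descriptions:

```{.xml}
<otherEntity>
  <entityName>model_inputs</entityName>
  <entityDescription>
    Zip archive of model input files. The archive preserves the
    directory structure required by the model; unzip before running.
  </entityDescription>
  <physical>
    <objectName>model_inputs.zip</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>application/zip</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>application/zip</entityType>
</otherEntity>
```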
82 | 83 | The [EarthCube Research Coordination Network, "What About Model Data?" group](https://modeldatarcn.github.io/#project-description) is working on a rubric to help you determine how much model output data to save, based on assorted criteria regarding the reproducibility and value of the data. Learn more about that group and their rubric on the [Model Data RCN website](https://modeldatarcn.github.io/). 84 | 85 | Researchers at the Department of Energy’s Environmental Systems Science program are also working on assessing model archiving needs. In [this preprint](https://eartharxiv.org/repository/view/260/), Simmonds et al. (2020) discuss feedback from communications with modelers and propose preliminary solutions. With regard to input/output data, their feedback indicates two opposite opinions: some feel the whole gamut of raw to aggregated outputs needs to be archived, while others advocate for archiving only high-level outputs corresponding to publication figures. They also found that spin-up simulations were not considered a high priority for archiving. See their section 2.3, "What is worth archiving and for how long does it remain useful?" 86 | 87 | ### Model parameters 88 | 89 | Include model parameters whenever applicable. If code/input/output from multiple model runs are archived, make sure to archive all corresponding sets of parameters, and be explicit in linking the different components together. 90 | 91 | Consider archiving model parameter files as their own data object(s) in both their native format and as a text (non-binary) version. If the 'runfile' will be archived, consider including the parameters within that file with appropriate annotations. 92 | 93 | ## Example data packages in EDI

| Dataset Title | Description | EDI Package ID |
| --- | --- | --- |
| North Temperate Lakes LTER General Lake Model Parameter Set for Lake Mendota, Summer 2016 Calibration | Parameters for specific GLM runs. GLM is a large community model, not managed and archived in EDI. | knb-lter-ntl.348.2 |
| SBC LTER: Regional Oceanic Modeling System (ROMS) Setup Files, Code, and Lagrangian Model Setup Files | All the necessary code, grid, forcing, initial, and boundary condition files for running the UCLA version of the Regional Oceanic Modeling System (ROMS) for the Santa Barbara Channel. | knb-lter-sbc.126.1 |
| Lake thermal structure drives inter-annual variability in summer anoxia dynamics in a eutrophic lake over 37 years | Dataset to run a 37-year simulation (1979-2015) of the Lake Mendota lake ecosystem using the vertical 1D GLM-AED2 model. | knb-lter-ntl.396.1 |
129 | 130 | ## Resources 131 | 132 | Janssen, Marco A., Lilian Na'ia Alessa, Michael Barton, Sean Bergin, and Allen Lee (2008). ‘Towards a Community Framework for Agent-Based Modelling’. Journal of Artificial Societies and Social Simulation 11(2): 6. [http://jasss.soc.surrey.ac.uk/11/2/6.html](http://jasss.soc.surrey.ac.uk/11/2/6.html). 133 | 134 | Simmonds, Maegen, William J. Riley, Shreyas Cholia, and Charuleka Varadharajan (2020). 'Addressing Model Data Archiving Needs for the Department of Energy’s Environmental Systems Science Community'. EarthArXiv (preprint). [https://doi.org/10.31223/osf.io/acdk4](https://doi.org/10.31223/osf.io/acdk4). 135 | 136 | See sections 2.3, "What is worth archiving and for how long does it remain useful?" (discussed above), and 2.4, "Model data archiving protocol," where the authors argue for a better standardized reporting format for model data, e.g., top-level metadata and directory structure at a minimum. Section 4.1, "Developing Model Data Archiving Guidelines," proposes an organization scheme for model data. 137 | -------------------------------------------------------------------------------- /guide-special-cases/moving-platforms.qmd: -------------------------------------------------------------------------------- 1 | # Data Gathered with Small Moving Platforms 2 | 3 | Contributors: Sarah Elmendorf (lead), Tim Whiteaker, Lindsay Barbieri, Jane Wyngaard, Greg Maurer, Hap Garritt, Adam Sapp, Corinna Gries, Stace Beaulieu 4 | 5 | ## Introduction 6 | 7 | Modern advances in technology have increasingly allowed the collection of ecological data using small, often uncrewed, moving platforms. Systems variously known as small Uncrewed Aircraft Systems (sUAS), Uncrewed Surface Vehicles (USV), Autonomous or Uncrewed Underwater Vehicles (AUV or UUV), or "drones" more generally, now frequently serve as sensor-carrying platforms. Moving platforms may also include gliders or animals with sensors affixed. Depending on the sensor(s) installed on the moving platform, data collected may include environmental measurements (temperature, concentration of chemicals), imagery (digital photos, multi- or hyperspectral sensors), or other remote-sensing acquisitions (ranging data, ground-penetrating radar). Example research applications include studies of vegetation cover and phenology, snowpack cover and depth, ground surface temperature, terrain elevation, bathymetry, species distribution or abundance, and many others. 8 | 9 | Raw drone data can be voluminous and challenging to archive, but after processing, derived drone datasets typically resemble the more conventional spatial datasets that are regularly used in ecological research. In this document we focus on best practices for archiving raw and derived drone data, with particular attention to metadata and processing code that is specific to drone datasets. Note that this chapter does not specifically address data collected by large moving platforms like airplanes and satellites, or by human and animal platforms. 10 | 11 | ## Recommendations for data packages 12 | 13 | General considerations for archiving data from moving sensor platforms: 14 | 15 | * **Repository**: We are currently unaware of many specialized repositories for these data, and therefore EDI is used as the representative data repository for many cases presented here. Repositories other than EDI may have specific metadata formatting requirements, but the general recommendations with regard to content could presumably be applied.
For LiDAR-based UAV data, consider contributing to Open Topography ([https://opentopography.org/](https://opentopography.org/)); for AUV data, the U.S. Marine Geoscience Data System (MGDS) ([http://www.marine-geo.org/index.php](http://www.marine-geo.org/index.php)), which serves the IEDA "MGDL" node in DataONE, is a good option. Glider data may be contributed to the U.S. IOOS Glider DAC ([https://gliders.ioos.us/data/](https://gliders.ioos.us/data/)) and archived at the National Centers for Environmental Information (NCEI), thus fulfilling the NSF OCE Data Policy. If a decision is made to archive an LTER drone dataset in an external (i.e., non-EDI) repository but links to EDI data packages are desired, recommendations in the [Data in Other Repositories](data-in-other-repositories.html) chapter may apply. 16 | * **Size of data set**: The file size of raw data from drone imagery can be substantial. If large volumes of raw data (>100 GB in total) are to be archived on EDI, please coordinate with EDI and follow the best practices for [Large Data Sets](large-data-sets.html). Even if raw data are in a proprietary binary format and specific software is required for processing, publishing them may be important for reprocessing when software improves. 17 | * **Designing a data package**: In many applications of moving sensor data, raw images/measurements must be processed to arrive at data products that can be analyzed to answer research questions. To enable a fully reproducible analysis pipeline, we recommend archiving three components: the raw data, any key derived data products (e.g., orthomosaic images, DEMs, DSMs, NDVI, landcover, snow depth, or surface temperature maps), and the processing code. These three components may be archived in separate data packages or together, and each should follow accepted best practices for its data type. To archive raw image collections, for example, see the considerations on grouping images into compressed archives (.zip, .tar) and creating an inventory file, as described in the [Images and Documents as Data](images-and-documents-as-data.html) chapter. For derived geospatial files, such as DEMs, refer to the [Spatial Data](spatial-data.html) chapter. Custom processing code should be archived with the data following recommendations in the [Code in EDI](code.html) chapter. If a standalone program is used to process data, reference the program in the methods metadata with adequate details to ensure reproducibility (name, version, date, configuration, etc.). 18 | 19 | ### Metadata for moving platform data packages at EDI 20 | 21 | The data package should include metadata elements that, at a minimum, (a) identify it as being collected by a moving platform, (b) deliver basic information about the data collection platform, instrument payload (camera, sensors), and procedure (flight information or similar), and (c) deliver necessary information about post-processing of the raw camera or sensor data, if any. Accordingly, these recommendations vary based on whether the data package contains raw or derived data. 22 | 23 | High-level metadata pertaining to the entire data package are easily provided in the EML file (e.g., a geographic bounding box). Data packages from drones or other moving platforms commonly include numerous point measurements, images, or other granular data entities, either separately or inside a compressed archive file.
Detailed metadata pertaining to these data entities may be included as additional files in the data package. Inventory tables, usually a simple CSV file, are one such additional metadata format. For example, an inventory table could be used to list individual data files in the data package (e.g., images from one drone flight) and provide metadata (e.g., point location) about each. In addition to inventory tables, files that enable or supplement common processing pipelines, such as flight or mission logs, may be included. A flight/mission log may be provided in a proprietary binary format, but because software for parsing these formats may become obsolete, we recommend archiving the log in the format most useful for contemporary analysis software, and extracting and appending the information to the inventory file where appropriate. Exif (Exchangeable image file format) metadata in images may also be programmatically extracted to supplement the inventory file. 24 | 25 | Clearly, there are many possible ways to combine raw data, derived data, and metadata files into a moving platform data package. No matter the combination, the critical metadata categories and the recommended contents below should be considered and included where possible. The decision on whether to provide the metadata in EML or at a more granular level, such as an inventory table, will depend on the given dataset. 26 | 27 | * **Methods:** unique identifier for a given flight or mission; summary information from a flight log; weather conditions; accuracy of sensor and geographic location information; data processing method; ground sample distance; image overlap; flight height; whether the UAS followed terrain elevation vs fixed-height flight; location of UAS launch (since some image metadata are derived from this); general description of software used and for what purpose; sensor calibration date and procedure; general description of payload type, such as multispectral camera and spectral bands. 28 | * **Instrumentation:** make and model of platform, sensor, and camera, including manufacturer and specific model names and numbers. Include make and model of any interchangeable lenses in cameras, and specifics like spectral bands, temperature range, sensor accuracy, etc. 29 | * **Software:** (see also the [Code in EDI](code.html) chapter) list of software used. Especially when code is proprietary or archived elsewhere, the name, version, and configuration of any software used are advisable, as are corrections applied (e.g., correction for sensor angle or heat/air flow). Ideally, a .pdf report generated by processing software can be archived as an otherEntity together with the imagery itself to convey much of the necessary information. Also include data used as ground truth or calibration points for post-processing (e.g., spectral calibration image, biomass sample, wind speed, etc.) and their date of collection. 30 | * **People with specified role:** drone operator, image processor. 31 | * **Geographic Information:** (see also the [Spatial Data chapter](spatial-data.html)) a general bounding box should be included in EML (a minimal coverage sketch appears after Table 1 below), while the individual location of images or point measurements should be handled in the inventory table, or directly in the included data files. Also include the coordinate reference system (e.g., WGS 84) used for images and (if different) ground control points, projection type if needed, altitude of image/measurement acquisition, spatio-temporal coordinates, and pitch, roll, and yaw from flight log or image data points.
It should be noted that there are special considerations for underwater vehicles, especially with regard to metadata to explain how geographic positions were obtained. With autonomous underwater vehicles (AUVs), there can be error sources in the topside GPS localization and the underwater acoustic positioning system (e.g., Long Baseline, Ultrashort Baseline), as well as in any sensors used for dead reckoning (e.g., accelerometer, Doppler Velocity Log). At a minimum, it would be useful to know which sensors were used to produce the localization data and whether the navigation tracklines were post-processed with benchmarks. 32 | * **Temporal Information:** may also be provided at the EML level or as timestamps for individual data/image points, either in inventory tables or in the data files themselves. Time of day critically affects usability for image-based datasets, so ensure that the time of day is clear from the metadata available prior to download, either in the EML temporal coverage or via the methods. 33 | * **Keywords:** Use of appropriate keywords aids in data discovery. Keywords that identify datasets as drone-related are therefore recommended (e.g., drone, UAV, UAS). Keywords describing the type of data collected are also recommended (e.g., image collection, aerial imagery, thermal imagery, NDVI, digital elevation map). For drone mapping data products, keyword recommendations from the [Spatial Data chapter](spatial-data.html) are largely applicable. 34 | 35 | ### Examples and additional metadata guidance 36 | 37 | Several EDI data packages for data from moving platforms are presented as examples in Table 1. Many more detailed, "drone-specific" metadata terms and values can be included in data packages for drones and other moving platforms. For completeness we have developed a comprehensive list of recommended and optional metadata terms based on the work of Wyngaard et al. (2019) and Thomer et al. (2020), with mappings to select relevant ontologies, viewable [here](https://docs.google.com/spreadsheets/d/1PQ0SUEQLgQXdz2PUNDy2jGry3o9veAedcz8l-5ubpwU). For each metadata element, we assessed its utility in terms of data discovery, evaluating fitness for use, and actual data reuse. The minimum recommended subsets of metadata that are included in the section above were derived from this table. 38 | 39 | Table 1. Example packages at EDI and other repositories

| Title | Description | EDI packageID |
| --- | --- | --- |
| Orthophoto and elevation models from UAV overflights at the G-IBPE study site at Jornada Basin LTER in 2019 | Approximately 599 RGB images and data derived from uncrewed aerial vehicle (UAV) overflights of the G-IBPE study site at the Jornada Basin LTER in southern New Mexico, USA. | knb-lter-jrn.210543001 |
| Aerial imagery from unmanned aerial systems (UAS) flights and ground control points: Plum Island Estuary and Parker River NWR (PRNWR), February 27th, 2018. | USGS aerial imagery from UAS flights at the Parker River National Wildlife Refuge, Massachusetts, USA. Includes ground control, multispectral, and true color child items, each with data entities that include ground control or a file catalog of images. | ScienceBase |
| Spatial variability in water chemistry of four Wisconsin aquatic ecosystems - High speed limnology Environmental Science and Technology datasets | Water chemistry sensors embedded in a high-speed water intake system to document spatial variability. | knb-lter-ntl.337.4 |
| Thermal infrared, multispectral, and photogrammetric data collected by drone for hydrogeologic analysis of the East River beaver-impacted corridor near Crested Butte, Colorado | Infrared, multispectral, and visual image data, and derivative products (orthomosaic and digital surface model), collected along a beaver-impacted section of the East River from August 12-17, 2017 and July 28-August 2, 2018. | ScienceBase |
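
To illustrate the minimum geographic and temporal elements recommended in the bullets above, here is a minimal EML coverage sketch; all coordinates and dates are hypothetical placeholders:

```{.xml}
<coverage>
  <geographicCoverage>
    <geographicDescription>Area overflown by the UAV mission (hypothetical).</geographicDescription>
    <boundingCoordinates>
      <westBoundingCoordinate>-106.85</westBoundingCoordinate>
      <eastBoundingCoordinate>-106.82</eastBoundingCoordinate>
      <northBoundingCoordinate>32.62</northBoundingCoordinate>
      <southBoundingCoordinate>32.59</southBoundingCoordinate>
    </boundingCoordinates>
  </geographicCoverage>
  <temporalCoverage>
    <rangeOfDates>
      <beginDate>
        <calendarDate>2019-06-01</calendarDate>
      </beginDate>
      <endDate>
        <calendarDate>2019-06-03</calendarDate>
      </endDate>
    </rangeOfDates>
  </temporalCoverage>
</coverage>
```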
83 | 84 | ## Resources 85 | 86 | ### Tips and Tricks 87 | 88 | For making an image catalog (.csv) from a directory of images, consider using ExifTool ([https://exiftool.org/](https://exiftool.org/)). For example, the command `exiftool.exe -csv -r mydirectory > image_catalog.csv` will extract the entirety of the Exif tags from all files stored under mydirectory into a comma-delimited table and write it to the file image_catalog.csv. 89 | 90 | ### Semantic Annotation 91 | 92 | Semantic annotation of drone imagery is a rapidly developing field. Ontologies that provide relevant terms include: [dronetology](http://www.dronetology.net/); [sensorML](http://www.sensorml.com/ontologies.html); the [FGDC content standard for digital geospatial metadata](https://www.fgdc.gov/metadata/csdgm/) (not officially an ontology, but a structured metadata format with defined terms); the [Semantic Sensor Network ontology](https://www.w3.org/TR/vocab-ssn/) (SSN, including the SOSA core); the [Semantic Web for Earth and Environment Technology ontology](https://bioportal.bioontology.org/ontologies/SWEET) (SWEET); and the [Environment Ontology](http://www.obofoundry.org/ontology/envo.html). 93 | 94 | ### References 95 | 96 | Thomer, Andrea K., Swanz, Sarah, Barbieri, Lindsay, and Wyngaard, Jane (2020). A minimum information framework for the FAIR collection of earth and environmental science data with drones. DOI: 10.5281/zenodo.4017647 97 | 98 | Wyngaard, J.; Barbieri, L.; Thomer, A.; Adams, J.; Sullivan, D.; Crosby, C.; Parr, C.; Klump, J.; Raj Shrestha, S.; Bell, T. Emergent Challenges for Science sUAS Data Management: Fairness through Community Engagement and Best Practices Development. _Remote Sens._ **2019**, _11_, 1797. -------------------------------------------------------------------------------- /guide-special-cases/other-repositories.qmd: -------------------------------------------------------------------------------- 1 | # Data in Other Repositories 2 | 3 | Contributors: Greg Maurer (lead), Stace Beaulieu, Renée Brown, Sarah Elmendorf, Hap Garritt, Gastil Gastil-Buhl, Corinna Gries, Li Kui, An Nguyen, John Porter, Margaret O'Brien, Tim Whiteaker 4 | 5 | ## Introduction 6 | 7 | A wide variety of data repositories are available for publishing biological, environmental, and Earth observation data, and the choice of where to publish a particular dataset is determined by many competing factors. For example, a funding agency or journal may require a certain repository (e.g., NSF BCO-DMO, NSF ADC, USDA ADC, DOE ESS-DIVE); the research subject or data type may be best served by a specialized repository (e.g., AmeriFlux, GenBank); or datasets may be submitted to a general purpose repository with minimal metadata requirements to simplify and speed data publishing (e.g., DRYAD, Figshare, Zenodo). For these and other reasons, related datasets are sometimes published in disparate data repositories, the same data needs to be discoverable in more than one repository, or multiple datasets from one or more repositories may be used to create a new, derived dataset. In such cases, it can be advantageous to establish links between datasets in different repositories such that provenance, supplementation, duplication, or other relationships are explicit. Clearly, this subject goes well beyond the single repository, and better standards and approaches for linking resources and documenting data provenance are being developed elsewhere (e.g.,
[DataONE](https://www.dataone.org/network/), [ProvONE](https://purl.dataone.org/provone-v1-dev), [WholeTale](https://wholetale.org/)). Here we concentrate on specific cases in the context of large and multidisciplinary projects, such as LTER sites, that wish to enhance data discovery and preserve data relationships across multiple repositories. 8 | 9 | ## Recommendations for data packages 10 | 11 | ### Considerations for creating linked data 12 | 13 | In practice, links to data in other repositories can be achieved using metadata only, by including a data inventory file, or, although not recommended, by duplicating the data in the new repository record. **Generally, duplicating data in multiple repositories is not recommended because it creates two problems.** First, it is a burden to maintain multiple copies of a dataset and avoid divergence between them. Second, it can create confusion for data re-users who may download or cite the same data multiple times. Care must be taken to clearly identify such duplications for data users when they are created. Whenever linked datasets are created, it is strongly recommended that both repositories are aligned with FAIR data principles, [outlined here](https://doi.org/10.1038/sdata.2016.18), so that users have unfettered access to all data and metadata. 14 | 15 | In addition to these considerations, there are a number of reasons to create a new repository record that is linked to data in other repositories. Each of these reasons, which are outlined below, has pros and cons that will need to be weighed from the different perspectives of the data user, data provider and research project management requirements. 16 | 17 | * **Requirements dictate multiple repositories:** Large research projects or sites are frequently funded by different agencies and programs. Data collection may be supported by several such funding streams and, hence, fall in the purview of more than one requirement to archive data in a particular repository. In some cases, data repositories already accommodate such requirements by linking or replicating data appropriately. Examples of this are LTER data in EDI, NSF BCO-DMO and NSF ADC. 18 | * **Adding important metadata**: If data were originally submitted to a general purpose repository with minimal metadata requirements (e.g., DRYAD, figshare) additional metadata (e.g., EML) may be needed for discoverability, reusability, and integration. By creating a new repository record that identifies and is linked to the original published dataset, richer and more useful metadata can be added to the new record and utilized. 19 | * **Use of specialist repositories for related data:** There are sometimes advantages to publishing particular data types in specialized repositories. Specialized data repositories (e.g. GenBank, AmeriFlux) usually enforce strict data formatting, provide quality standards, enhanced search, discovery and reuse of particular types of data across projects in a way that is not possible using a generalized metadata format (EML) and repository (EDI). However, these data may not be discoverable with other, related project data taken at the same location and time. Creating links between related datasets held in specialist and generalist repositories helps preserve this context. 
* **A derived data product is archived in a different repository than the source (raw) data:** A wide range of cases fall into this category, from a direct one-to-one relationship (e.g., a gene sequence and its OTU identification, or a metagenome analysis and its community diversity metrics) to several datasets being combined in synthesis or meta-analysis studies. In these cases, links between source data and derived data products that are published in separate repositories need to be created and clearly documented. 21 | * **Linking to site- or project-relevant data from other research groups or agencies:** Although it may help with some aspects of data discovery, it is generally not recommended to create records in EDI for data collected and managed by entirely different research groups or agencies. **_In these cases, however, it is recommended to place a pointer to such repositories on a project website or develop other means for data users to discover relevant resources._** 22 | 23 | ### General metadata for linked data packages in EDI 24 | 25 | In EDI, the linked data package can be assembled using standard practices and EML metadata elements, but the included metadata and data entities must clearly lead the data user to files held in outside repositories. In addition, the package metadata should communicate the essential elements needed for data discovery (subject matter, authors, location, time-frame, etc.) and a brief description of how the data may be accessed, re-used, and cited via the outside repository as needed. General guidance on the content and structure of key metadata elements in an EDI data package linked to data in other repositories is described below. 26 | 27 | * **Abstract:** Describe the key features of the data package. If the data package contains only links to data held in other repositories, or data duplicated from another repository, clearly state that the original data are located in a different repository and direct the user to the correct data citation. Describe the target data in sufficient detail that users can determine whether these data are fit for their use, and instruct them on how to find and re-use the data. 28 | * **Methods:** Collection/generation methods for any data entities included or linked to. If the methods are well-described in the metadata at another repository, this element can simply refer users there. If the new data package includes ancillary data or derived data, describe how those data were collected or derived. 29 | * **Geographic description** and **coordinates:** At a minimum these elements should define a bounding box that will make the data package discoverable through EDI, DataONE, or other geographic search interfaces. Additional, more detailed coordinates may be given in the inventory file entity as described below. 30 | * **Keywords:** Since some linked data packages include an inventory of data held at a different repository, include the keyword "data inventory" and thematic keywords that describe the data entities in the other repository. 31 | 32 | ### Common use cases and their structure in EDI 33 | 34 | There are several common use cases for creating a new linked data package in EDI. The new package may establish either a one-to-one link from EDI to a dataset in another repository, or a one-to-many relationship that is more complex. Three possible cases are described below in terms of what entities to publish, where to publish them, metadata elements to be created in EML, and the contents of included data entities.
32 | ### Common use cases and their structure in EDI
33 | 
34 | There are several common use cases for creating a new linked data package in EDI. The new package may establish either a one-to-one link from EDI to a dataset in another repository, or a more complex one-to-many relationship. Three possible cases are described below in terms of what entities to publish, where to publish them, the metadata elements to be created in EML, and the contents of included data entities. There are likely to be other use cases for linking EDI data packages to other repositories.
35 | 
36 | **Case 1:** One dataset needs to be discoverable in more than one data repository. The data remain the same, but the metadata in the new data package at EDI may be upgraded beyond what exists in the other repository.
37 | 
38 | * The metadata in EDI must clearly state, preferably in the abstract or another obvious location, that this data package is already published in another repository. Include the original unique identifier and instruct users to cite that original data, if appropriate.
39 | * Include instructions on how to access and cite the original data if the original repository is lacking in such guidance.
40 | * If data are duplicated (which is not recommended), metadata should include information on how versions in different repositories are kept synchronized. If such synchronization is not feasible, users should be warned to inspect both sources for the latest data.
41 | * In EML, the <alternateIdentifier> field may be used to store the persistent identifier (DOI), or a link (URL) that refers to the data held in another repository, to make the link machine-readable (see the sketch following Table 3). Where an external repository supplies both a URL and a DOI, use the DOI, as URLs may not be maintained through time.
42 | 
43 | **Case 2:** A list of data records held in a specialized repository needs to be linked to ancillary or supporting data that are being published in EDI (for derived data, see Case 3).
44 | 
45 | * This case applies when a collection of datasets, or similar scientific resources, is held in a specialized repository and closely related ancillary or supporting data and metadata need to be archived in a more generalist data repository like EDI. For example, ancillary environmental data or laboratory analyses held in EDI could be linked to collections of sequence reads held in NCBI GenBank or museum voucher specimens archived with Darwin Core metadata. See complete examples in Table 3.
46 | * The new EDI data package should include a 'data inventory' (or manifest of holdings) file as a data entity. This is most likely a simple tabular data file, such as a CSV, that lists and describes the repository records held in the specialized data repository and has its column attributes described in EML as a [dataTable](https://eml.ecoinformatics.org/schema/eml-dataTable_xsd.html#eml-dataTable.xsd) entity (see the EML sketch after Table 2).
47 | * The inventory table must have a row for each outside repository record (or some meaningful grouping of records, e.g., a project in NCBI) being linked to, with columns that include persistent unique identifiers of the data in the other repository, and relevant descriptors of the data. The complete content of the inventory will be dictated by the structure of the other repository and the data entities and metadata held there. Suggested columns are presented in Table 1.
48 | * The inventory table may also provide additional contextual information for each individual data resource in another repository. Table 2 presents examples of these contextual columns. They are, however, subject dependent and may vary for different projects. For more examples, see the discussion on sequencing and genomic data later in this document.
49 | 
50 | Table 1: Suggested columns for identifying the external data in the data inventory table.
51 | 
52 | <table>
53 |  <tr>
54 |   <td>Column
55 |   </td>
56 |   <td>Description
57 |   </td>
58 |  </tr>
59 |  <tr>
60 |   <td>External unique ID
61 |   </td>
62 |   <td>Unique identifier for the data resource in the other repository, e.g., an accession number
63 |   </td>
64 |  </tr>
65 |  <tr>
66 |   <td>External access URL
67 |   </td>
68 |   <td>A unique, persistent link to the data resource in the other repository
69 |   </td>
70 |  </tr>
71 |  <tr>
72 |   <td>Title/description
73 |   </td>
74 |   <td>Title and/or brief description of the data resource
75 |   </td>
76 |  </tr>
77 |  <tr>
78 |   <td>Filename(s)
79 |   </td>
80 |   <td>Dataset or file name at the other repository
81 |   </td>
82 |  </tr>
83 |  <tr>
84 |   <td>Format
85 |   </td>
86 |   <td>File format of the file(s) above
87 |   </td>
88 |  </tr>
89 |  <tr>
90 |   <td>Repository URL
91 |   </td>
92 |   <td>URL of the repository being linked to
93 |   </td>
94 |  </tr>
95 | </table>
96 | 
97 | Table 2: Examples of additional contextual columns in the data inventory table.
98 | 
99 | <table>
100 |  <tr>
101 |   <td>Column
102 |   </td>
103 |   <td>Description
104 |   </td>
105 |  </tr>
106 |  <tr>
107 |   <td>Latitude/Longitude
108 |   </td>
109 |   <td>Latitude and longitude in standard format for each data resource in the other repository
110 |   </td>
111 |  </tr>
112 |  <tr>
113 |   <td>Location name
114 |   </td>
115 |   <td>Locally used name of collection site
116 |   </td>
117 |  </tr>
118 |  <tr>
119 |   <td>Treatment level
120 |   </td>
121 |   <td>Experimental treatment applied to the outside dataset
122 |   </td>
123 |  </tr>
124 |  <tr>
125 |   <td>Start/End datetime
126 |   </td>
127 |   <td>Starting/ending datetime of the data resource (NA for End if data collection is ongoing)
128 |   </td>
129 |  </tr>
130 |  <tr>
131 |   <td>Reference publication
132 |   </td>
133 |   <td>DOI of publication providing in-depth context for data
134 |   </td>
135 |  </tr>
136 | </table>
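As referenced in Case 2 above, the inventory table's columns should themselves be documented as EML attributes of the dataTable entity. The following is a minimal sketch for the two identifier columns from Table 1; the attribute names and definitions are hypothetical and should be adapted to the actual table:

```{.xml}
<attributeList>
  <attribute>
    <attributeName>external_unique_id</attributeName>
    <attributeDefinition>External unique ID: accession number of the data resource in the outside repository (hypothetical)</attributeDefinition>
    <storageType>string</storageType>
    <measurementScale>
      <nominal>
        <nonNumericDomain>
          <textDomain>
            <definition>Accession number</definition>
          </textDomain>
        </nonNumericDomain>
      </nominal>
    </measurementScale>
  </attribute>
  <attribute>
    <attributeName>external_access_url</attributeName>
    <attributeDefinition>External access URL: persistent link to the data resource in the outside repository (hypothetical)</attributeDefinition>
    <storageType>string</storageType>
    <measurementScale>
      <nominal>
        <nonNumericDomain>
          <textDomain>
            <definition>URL</definition>
          </textDomain>
        </nonNumericDomain>
      </nominal>
    </measurementScale>
  </attribute>
</attributeList>
```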
137 | 
138 | **Case 3:** One or more datasets in other repositories are used to create derived data products that need to be archived in EDI.
139 | 
140 | * In this case the new dataset is directly or indirectly derived from the 'source' dataset(s) in other repositories. Such derived data may serve a wide range of research purposes, including use in cross-site synthesis, re-analysis, or meta-analysis studies.
141 | * Provenance metadata should be used to describe the relationship between the source and derived datasets, which supports reproducibility and preserves data lineage. In a new EDI data package that archives derived data, the provenance metadata should be inserted in the EML file using <dataSource> elements. The <dataSource> elements should be nested within a <methodStep> element and establish the links to any source datasets located in another repository. An example snippet of provenance EML is shown in Example 1.
142 | * Other cross-repository standards for provenance metadata, e.g., [ProvONE](http://homepages.cs.ncl.ac.uk/paolo.missier/doc/dataone-prov-3-years-later.pdf), are still being developed and are not yet widely adopted.
143 | * The EDI portal interface provides [automatic generation of provenance metadata](https://portal.edirepository.org/nis/provenanceGenerator.jsp) EML snippets for datasets in EDI. The [EMLassemblyline](https://github.com/EDIorg/emlAssemblyLine) and [MetaEgress](https://github.com/BLE-LTER/MetaEgress) (in connection with [LTER-core-metabase](https://github.com/lter/LTER-core-metabase)) R packages for EML creation will also generate provenance metadata.
144 | 
145 | Example 1: EML snippet with a data provenance methodStep:
146 | 
147 | ```{.xml}
148 | <methodStep>
149 |   <description>
150 |     <para>This methodStep contains data provenance information as specified in the LTER EML Best Practices. Each dataSource element here lists entity-specific information and links to source data used in the creation of this derivative data package.</para>
151 |   </description>
152 |   <dataSource>
153 |     <title>Source dataset title</title>
154 |     <creator>
155 |       <individualName>
156 |         <givenName>first name</givenName>
157 |         <surName>last name</surName>
158 |       </individualName>
159 |       <organizationName>organization name</organizationName>
160 |       <electronicMailAddress>email@some.edu</electronicMailAddress>
161 |     </creator>
162 |     <distribution>
163 |       <online>
164 |         <onlineDescription>This is a link to an external online data resource (describe resource and repository location).</onlineDescription>
165 |         <url>https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2</url>
166 |       </online>
167 |     </distribution>
168 |     <contact>
169 |       <positionName>Information Manager</positionName>
170 |       <organizationName>organization name</organizationName>
171 |       <electronicMailAddress>infomgr@some.edu</electronicMailAddress>
172 |     </contact>
173 |   </dataSource>
174 | </methodStep>
175 | ```
176 | 
177 | ## Nucleotide sequence and genomic data
178 | 
179 | Nucleotide sequence data consist of the order and arrangement of DNA or RNA bases extracted from individual organisms or environmental samples. Similarly, genomic data refer to the complete genetic information (either DNA or RNA) of an organism, while metagenomic data comprise genetic material recovered directly from environmental samples. Sequencing, genomic, and metagenomic datasets can be very large and complex, and researchers in these fields benefit from particular methods of data access, analysis, and collaboration. Therefore, these data have specialized requirements for data archiving.
180 | 
181 | Archiving nucleotide sequence and genomic (or other '[omics](https://en.wikipedia.org/wiki/Omics)') data is a common use case for creating linked datasets. Data that originate from nucleotide sequencing techniques are most often stored in specialized repositories such as the National Center for Biotechnology Information (NCBI) GenBank and the European Nucleotide Archive.
However, while sequences or assembled genomes constitute important raw data, ancillary and derived data products related to these raw data are frequently published in repositories specializing in ecological data. For example, data derived from sequence data, such as operational taxonomic units (OTUs) or functional assignments, and ancillary data that describe the environmental, biochemical, or experimental context of the sequencing data, are often included in scientific publications and do not always fit within the scope of a specialized sequence or genome data repository.
182 | 
183 | ### Recommendations for sequencing or genomic datasets
184 | 
185 | Linking to genomics data is an example of Case 2 described above. Summaries or inventories of data records held in a repository like NCBI GenBank are linked to their derived products or additional measurements published in a more generalist repository such as EDI.
186 | 
187 | In addition to the metadata typically included with any data package published by the site or research group, include metadata that specifically describes sequencing and genomics datasets. It is recommended to refer to the [MIxS templates](https://press3.mcs.anl.gov/gensc/mixs/) for standard terminology, especially in the keyword section.
188 | 
189 | **Keywords** that can help users discover the sequencing or genomic dataset include the following (see the keywordSet sketch at the end of this section):
190 | 
191 | 1. General data type descriptions ('nucleotide sequence', 'genomics', 'metagenomics')
192 | 2. Names of target genes or subfragments ('16S rRNA', '18S rRNA', 'nif', 'amoA', 'rpo', 'ITS')
193 | 3. Names of the sequencing technique ('Sanger', 'pyrosequencing', 'ABI-solid')
194 | 4. Names of the linked repository ('SRA', 'EMBL', 'Ensembl')
195 | 5. Descriptors of included ancillary data ('nitrogen', 'soil', 'drought')
196 | 6. Descriptors of derived data products ('OTU', 'functional annotation', 'population')
197 | 
198 | **Inventory tables** are of central importance to datasets that index data resources in a sequencing or genomics repository. It is recommended that this inventory have the columns described in Table 1. Note that the unique identifiers included will depend on the granularity of the links to the outside repository. For example, in NCBI there are accession numbers and URLs for a project, for samples within the project, and for sequence datasets from a given sample.
199 | 
200 | **External unique ID and URL:** For NCBI GenBank this would be the accession number for a collection. For most sequence and genomic datasets an access URL would include an accession number (e.g., [https://www.ncbi.nlm.nih.gov/nuccore/AY741555](https://www.ncbi.nlm.nih.gov/nuccore/AY741555)). Referring to a range of accession numbers may involve providing a search URL that will return the desired list (e.g., [https://www.ncbi.nlm.nih.gov/popset/?term=AY741555](https://www.ncbi.nlm.nih.gov/popset/?term=AY741555)). The recommendation is to link at the widest level of sequence or genomic granularity that is useful for interpreting the data being archived in the new dataset.
201 | The following are suggestions for additional contextual columns in the inventory table. This information is generally associated with the data in the genomics repository and should only be duplicated if deemed useful for reuse, or if missing in the original data.
202 | 
203 | * **Sequencing method**: the name of the sequencing method used, e.g., Sanger, pyrosequencing, ABI-solid. This attribute is used in MIxS templates, where it is called seq_meth.
204 | * **Environment (biome, feature, or material) descriptors**: These are descriptors of the environmental context and are standardized by the genomics community in the [MIxS templates](https://press3.mcs.anl.gov/gensc/mixs/) and [EnvO](https://www.environmentontology.org/Browse-EnvO).
205 | * **Taxon description**: If applicable, e.g., binomial name or taxonomic group.
206 | 
207 | Data packages of metadata and inventory tables will aid both in discovering genomic data from within an ecological data repository such as EDI and in clarifying the context in which those data were collected. Most use cases, however, employ the inventory table to link specific genetic data to derived data. Such products are frequently community or population metrics in which species, OTUs, or traits have been determined from the sequence data.
208 | 
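Returning to the keyword recommendations above, here is a concrete sketch of a keywordSet for a hypothetical 16S rRNA inventory package; all keyword values are illustrative and should be tailored to the dataset:

```{.xml}
<keywordSet>
  <keyword>data inventory</keyword>
  <keyword>nucleotide sequence</keyword>
  <keyword>16S rRNA</keyword>
  <keyword>pyrosequencing</keyword>
  <keyword>SRA</keyword>
  <keyword>soil</keyword>
  <keyword>OTU</keyword>
</keywordSet>
```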
209 | ## Example data packages in EDI
210 | 
211 | Each of the EDI data packages below is linked to data in outside repositories. Some contain data inventory tables (as dataTable entities) that link to the datasets held in outside repositories and are described in the EML metadata. The EML abstract and methods elements in each give detailed access and citation instructions.
212 | 
213 | Table 3: Linked data packages at EDI that provide examples of the best practices in this document.
214 | 
215 | <table>
216 |  <tr>
217 |   <td>Title
218 |   </td>
219 |   <td>Description
220 |   </td>
221 |   <td>EDI packageID
222 |   </td>
223 |  </tr>
224 |  <tr>
225 |   <td>Mass and energy fluxes from the US-Jo2 AmeriFlux eddy covariance tower in Tromble Weir experimental watershed at the Jornada Basin LTER site, 2010-ongoing
226 |   </td>
227 |   <td>This data package links to eddy covariance data from a Jornada Basin LTER tower. The data are held at the AmeriFlux data repository (https://ameriflux.lbl.gov).
228 |   </td>
229 |   <td>knb-lter-jrn.210338005
230 |   </td>
231 |  </tr>
232 |  <tr>
233 |   <td>Catalog of GenBank sequence read archive (SRA) entries of 16S and 18S rRNA genes from bacterial and protistan planktonic communities along the Eastern Beaufort Sea coast, North Slope, Alaska, 2011-2013
234 |   </td>
235 |   <td>Data inventory of runs, samples, and experiments held at GenBank.
236 |   </td>
237 |   <td>knb-lter-ble.10
238 |   </td>
239 |  </tr>
240 |  <tr>
241 |   <td>Correlation of native and exotic species richness: a global meta-analysis finds no invasion paradox across scales
242 |   </td>
243 |   <td>This data package re-publishes data held in a package in Dryad. The metadata has been substantially enriched relative to the original dataset.
244 |   </td>
245 |   <td>edi.548.1
246 |   </td>
247 |  </tr>
248 |  <tr>
249 |   <td>Vascular Flora of the Harvard Farm at Harvard Forest since 2014
250 |   </td>
251 |   <td>This data package includes an inventory table with information on voucher specimens held in the Harvard Herbarium.
252 |   </td>
253 |   <td>knb-lter-hfr.236.3
254 |   </td>
255 |  </tr>
256 |  <tr>
257 |   <td>Biological responses to landscape change in the McMurdo Dry Valleys, Antarctica
258 |   </td>
259 |   <td>This data package links to genomic data in NCBI, and includes additional data from biogeochemical analyses performed on each sample.
260 |   </td>
261 |   <td>knb-lter-mcm.262.1
262 |   </td>
263 |  </tr>
264 | </table>
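For a Case 1 package such as the Dryad-derived example in Table 3, the machine-readable link back to the original data might be recorded with an element like the following minimal sketch; the DOI value shown is purely illustrative, not the actual identifier of that package:

```{.xml}
<alternateIdentifier system="https://doi.org">doi:10.5061/dryad.example123</alternateIdentifier>
```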
265 | 
266 | ## Appendix: Tips and repository information
267 | 
268 | This section aggregates information that was helpful at the time this document was written, particularly regarding nucleotide sequence and genomic data repositories in widespread use. Given the rapid rate of change in the field, this information may fall out of date quickly.
269 | 
270 | ### Sequence and genomic repository information
271 | 
272 | It is generally preferable that sequencing and genomic data are archived in community repositories that are specialized for their data type, rather than in a generalist repository such as the [Environmental Data Initiative](https://edirepository.org/) (EDI). There are many such specialized repositories; a fairly comprehensive listing is provided by the journal _[Nucleic Acids Research](http://nar.oxfordjournals.org/)_ (summarized on [this page](https://en.wikipedia.org/wiki/List_of_biological_databases)). Metadata standards and collaborative structures among these repositories are governed by the [International Nucleotide Sequence Database Collaboration](http://www.insdc.org/) (INSDC; more guidance [here](https://www.nature.com/sdata/policies/repositories#nuc)). Often these repositories provide, or are accessible to, specialized tools for searching, accessing, and analyzing the data (e.g., BLAST, MG-RAST). Furthermore, some products derived from sequence or genomic data are best archived in another specialized repository (e.g., metagenome-assembled genomes, or MAGs). As a general rule, these specialized repositories assign unique identifiers to projects, samples, and/or single sequences (often referred to as accession numbers) that can be used to locate sequences or genomic data. Note that each repository may have its own mechanism for reverse linking to related data held in another repository (such as EDI); these mechanisms are beyond the scope of this document.
273 | 
274 | * [NCBI Databases](https://www.ncbi.nlm.nih.gov/search/) - list of various databases with search capabilities. See also [How to submit data to GenBank](https://www.ncbi.nlm.nih.gov/genbank/submit/).
275 | * [NCBI Accession Number prefixes](https://www.ncbi.nlm.nih.gov/Sequin/acc.html) - explanation of accession number prefix codes.
276 | * [DNA DataBank of Japan (DDBJ)](http://www.ddbj.nig.ac.jp/) - list of various databases with search capabilities. See also [Submissions](https://www.ddbj.nig.ac.jp/submission-e.html).
277 | * [European Nucleotide Archive (ENA)](http://www.ebi.ac.uk/ena/) - list of various databases with search capabilities. See also [Submit and update](https://www.ebi.ac.uk/ena/submit).
278 | * [Integrated Microbial Genomes & Microbiomes (IMG/M) system](https://img.jgi.doe.gov/) from the Joint Genome Institute.
279 | * [MG-RAST](http://www.mg-rast.org/mgmain.html?mgpage=downloadintro) - technically an analysis pipeline rather than a primary repository. MG-RAST replicates to the European Bioinformatics Institute (EMBL-EBI), which in turn replicates to the NCBI Sequence Read Archive, so data submitted to MG-RAST automatically appear in all three.
280 | * [Barcode of Life Data Systems (BOLD)](http://www.barcodinglife.com/) - DNA barcoding is a taxonomic method that uses one or more standardized short genetic markers in an organism's DNA to identify it as belonging to a particular species. Through this method, unknown DNA samples are identified to registered species based on comparison to a reference library.
The Centre for Biodiversity Genomics in Canada maintains the BOLD public data portal, a cloud-based data storage and analysis platform.
281 | 
282 | ### Tips for locating metadata in sequence and genomic data repositories
283 | 
284 | Where information for populating metadata in EML has not been supplied directly to the information manager (IM) by the research group, the metadata that investigators provided when submitting data may be found in the genomics repository.
285 | 
286 | * For data in NCBI, go to the [NCBI website](https://www.ncbi.nlm.nih.gov/search/) and search using the accession number, or search by accession number in a specific NCBI database, for example the PopSet database (a collection of related DNA sequences derived from population, phylogenetic, mutation, and ecosystem studies that have been submitted to NCBI).
287 | * For sequences submitted to the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), there are some easily accessible online tools for generating tables of linked sequence data and their metadata. For an example, go to the example dataset at [https://www.ncbi.nlm.nih.gov/bioproject/305753](https://www.ncbi.nlm.nih.gov/bioproject/305753), and click the number next to **SRA Experiments** to see a list of all experiments. Then click **Send results to Run selector** to see a table summarizing geolocations and associated metadata, which could be archived at EDI or used to extract metadata for EML preparation.
288 | * A full Data Carpentry tutorial on accessing data in the NCBI SRA database can be found here: [Examining Data on the NCBI SRA Database](https://datacarpentry.org/organization-genomics/03-ncbi-sra/index.html).
289 | * [BCO-DMO examples](https://www.bco-dmo.org/contributing-sequence-accession-numbers) for contributing sequence accession numbers.
290 | 
291 | ### Darwin Core standard for sequence data
292 | 
293 | For sequence data to conform to the Darwin Core standard, the column header 'associatedSequences' ([https://dwc.tdwg.org/terms/#dwc:associatedSequences](https://dwc.tdwg.org/terms/#dwc:associatedSequences)) may be used in the inventory table, populated with a unique identifier (or list of identifiers) for the sequence data (e.g., SNLBE002-17, a sequence in the Barcode of Life Data Systems, aka BOLD) or a full URL (e.g., [http://www.boldsystems.org/index.php/Public_RecordView?processid=SNLBE002-17](http://www.boldsystems.org/index.php/Public_RecordView?processid=SNLBE002-17)).
294 | 
--------------------------------------------------------------------------------
/guide-special-cases/preface.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Data Package Design for Special Cases
3 | date: '2021'
4 | subtitle: Version 1
5 | author: LTER/EDI Non-tabular Data Working Group
6 | ---
7 | 
8 | # Preface {.unnumbered}
9 | 
10 | Considerations for a well-designed data package, including special cases based on data type, format, or acquisition method. Examples are images, documents, and raw data stored in other repositories.
11 | 
12 | **Cite this document as:**
13 | 
14 | Gries, Corinna, Stace Beaulieu, Renee F Brown, Sarah Elmendorf, Hap Garritt, Gastil Gastil-Buhl, Hsun-Yi Hsieh, et al. 2021. “Data Package Design for Special Cases.” Environmental Data Initiative. https://doi.org/10.6073/PASTA/9D4C803578C3FBCB45FC23F13124D052.
15 | 
--------------------------------------------------------------------------------
/guide-special-cases/spatial-data.qmd:
--------------------------------------------------------------------------------
1 | # Spatial Data
2 | 
3 | Contributors: Tim Whiteaker, John Porter, Mary Martin
4 | 
5 | ## Introduction
6 | 
7 | This chapter contains recommendations on data package structure and metadata for spatial datasets. Over the timeline of the Long Term Ecological Research (LTER) Network's use of the Ecological Metadata Language (EML), both spatial data formats and data curation options have evolved. In this document, we focus on best practices that can be widely adopted, with the goal of enhancing data discoverability and usability, and with the understanding that there are multiple solutions to creating these data packages.
8 | 
9 | ## Recommendations for data packages
10 | 
11 | ### Considerations for archiving spatial data
12 | 
13 | #### Data formats
14 | 
15 | To maximize reuse, avoid proprietary formats. The formats listed below can be read or imported by most mainstream GIS programs or with code using libraries such as GDAL.
16 | 
17 | Strongly recommended formats:
18 | 
19 | * **GeoTIFF** - An open format for storing spatial raster data and metadata in a TIFF file.
20 | * **GeoPackage** - A standard format from the Open Geospatial Consortium (OGC) for storing vector and raster data in a SQLite database file.
21 | 
22 | Some other formats to consider are listed below.
23 | 
24 | * **KML/KMZ** - Keyhole Markup Language (KML) file and its zipped version for storing vector data. This format was popularized by Google Earth and is now an OGC standard. KML is best visualized in Google software and may not render as well in other GIS software.
25 | * **GeoJSON** - A format for storing vector data as text in JavaScript Object Notation (JSON). GeoJSON data are limited to the WGS84 coordinate system.
26 | * **netCDF/HDF5** - Binary formats originally designed for storing multidimensional arrays of spatial data typically organized onto a grid, but which can now accommodate vector data following the NetCDF Climate and Forecast Conventions (version 1.8 or higher).
27 | 
28 | A couple of Esri formats are worth mentioning and are listed below.
29 | 
30 | * **File geodatabase** - One of Esri's formats for storing vector and raster information. Several feature classes and rasters can be stored in this folder- and file-based structure. GDAL's OpenFileGDB driver enables non-Esri software to view at least the data layers in a file geodatabase, but usage of more advanced file geodatabase components such as topology rules or geometric networks may not be available outside of Esri software, and field types may not be imported correctly either. Export to GeoPackage instead, unless geodatabase is the only format that supports the advanced representation of your GIS data. Just know that you limit potential reuse of your data if you use this format.
31 | * **Shapefile** - A legacy format for vector data which is widely supported. Be aware of [shapefile limitations](https://en.wikipedia.org/wiki/Shapefile#Limitations) when considering this format. A shapefile consists of several individual files; include them as a single zip file in the data package. If the package has more than one shapefile, create a separate zip file for each shapefile.
32 | 
33 | Although other open formats exist, their implementation in popular GIS software may be less common.
If a proprietary format must be used to capture the full meaning of the data, consider also including a version of the data in an open format, such as a simple data table, along with metadata explaining the limitations of the data in that format, or instructions on how to utilize the proprietary format. For example, an Esri layer package could be used when it is desirable to include recommended symbols for drawing vector features in a GIS, in which case one could note that the vector data can be extracted by treating the layer package as a zip file.
34 | 
35 | Formats that are composed of more than one file, such as shapefiles, should be zipped. Include one dataset per zip file. For example, if you have 10 shapefiles, you would create 10 zip files.
36 | 
37 | ### Documenting spatial data packages
38 | 
39 | #### Document as spatial[Raster, Vector] vs. otherEntity in EML
40 | 
41 | There is a noticeable divergence in EDI spatial data packages, specifically in the use of otherEntity vs. spatial[Raster,Vector]. Here we discuss the pros and cons of documenting spatial data with one type of EML entity over another. Either method is acceptable, and we recommend using spatial[Raster,Vector] when feasible. The documentation that follows provides best practices that will maximize discoverability and usability of spatial data, regardless of the entity type used.
42 | 
43 | #### [otherEntity](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_otherEntity)
44 | 
45 | * Pros
46 |     * EML preparation is simpler than with the spatial EML types
47 |     * Allows aggregated data structures (e.g., file geodatabases)
48 | * Cons
49 |     * Spatial data stored as <otherEntity> might be harder to discover because it may be difficult to determine if data in an <otherEntity> is spatial data or some other type when searching or browsing
50 |     * There is currently no controlled keywording to identify spatial data files that are included as otherEntity in EML
51 |     * Tabular attributes of geometric entities may not be described in detail
52 |     * Units (latitude/longitude vs. meters vs. feet) and projections may not be identified
53 | 
54 | #### spatial[[Raster](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_spatialRaster),[Vector](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_spatialVector)]
55 | 
56 | * Pros
57 |     * EML more fully describes vector attributes
58 |     * There is a well-documented path from Esri metadata to EML
59 |     * An EML metadata search (on EDI or elsewhere) clearly identifies these as spatial datasets through the use of spatialRaster or spatialVector entities
60 |     * LTER has built applications based on spatial[Raster,Vector] entities
61 | * Cons
62 |     * Data may not originate in ArcGIS, requiring a custom workflow to generate spatial entity EML
63 |     * spatial[Raster,Vector] can't describe multi-layer aggregates of GIS data (e.g., geodatabases containing multiple feature classes)
64 | 
65 | #### Keywords
66 | 
67 | Clearly identifying a dataset as spatial in nature is important to discoverability. This can be achieved by the use of keywords in the EML keyword elements, as well as in the title, abstract, and methods where appropriate. Frequently searched keywords include GIS, geographic information system, and spatial data, plus more specific format names like shapefile, GeoTIFF, etc. Consider including these as appropriate.
68 | 
69 | Do include the keywords **spatial vector** and **spatial raster** as appropriate for your data.
These keywords should be used especially if the data are archived as otherEntity.
70 | 
71 | You may also include keywords that describe broad spatial data layers, e.g., digital elevation model, elevation, boundary, land use, land cover, census, parcel, imagery, as well as keywords that describe the specifics associated with a broad spatial data layer, e.g., land cover types such as water and vegetation types, land use types such as urban and forest, and so on.
72 | 
73 | #### GIS software compatible metadata
74 | 
75 | GIS platforms will not ingest EML metadata. If your GIS software creates its own metadata file specific to that software, then that file may be included as an otherEntity. Be sure to populate this metadata, for example with descriptions and units for attributes in vector data or raster attribute tables. However, metadata in a standard format such as ISO 19115 or CSDGM is more useful because it can be read by other GIS software.
76 | 
77 | #### Attribute and coordinate system detail for otherEntity
78 | 
79 | While the GIS software compatible metadata included in the package typically describes attributes and coordinate systems of the data, such descriptions should also be included in the EML metadata to help users determine fitness for use prior to data download. The EML spatialVector and spatialRaster types include elements for this purpose. EML otherEntity can also include attribute descriptions; however, inclusion of attributes in this more generic element may not be as common, and the element does not formally support a description of coordinate systems.
80 | 
81 | When using [otherEntity](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_otherEntity) instead of spatialVector or spatialRaster, include coordinate system details in the otherEntity/entityDescription element. If not including a description of attributes in the otherEntity/attributeList element, at least include a summary description of attributes in otherEntity/entityDescription. If the spatial dataset and its associated metadata files are the only items in the data package, then you can include these descriptions in higher-level EML elements, such as the dataset abstract, in addition to or in place of descriptions at the entity level.
82 | 
83 | #### Standardized content for formats and entity types
84 | 
85 | In EML physical/dataFormat/externallyDefinedFormat, include a **formatName** indicating the spatial data file format. We recommend using format names from the [DataONE format list](https://cn.dataone.org/cn/v2/formats) when possible. Some spatial items from that list are shown below. Always check the list for the most up-to-date version of these names.
86 | 
87 | * Esri Shapefile (zipped)
88 | * Google Earth Keyhole Markup Language (KML)
89 | * Google Earth Keyhole Markup Language (KML) Compressed archive
90 | * Network Common Data Format, version 4
91 | * Hierarchical Data Format version 5 (HDF5)
92 | * GeoTIFF
93 | * GeoPackage Encoding Standard (OGC) Format Family
94 | * Esri File Geodatabase (zipped)
95 | * GeoJSON, version RFC 7946
96 | 
97 | If your format is not included in the DataONE list, consider submitting an issue to that GitHub repository's issue tracker so that the format can be added.
98 | 
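Tying these recommendations together, below is a minimal sketch of an otherEntity for a zipped shapefile, with coordinate system and attribute summaries in entityDescription and a DataONE format name. The entity name, description, and coordinate system shown are hypothetical:

```{.xml}
<otherEntity>
  <entityName>study_plots.zip</entityName>
  <entityDescription>Study plot boundaries as polygon vector data.
    Coordinate system: WGS 84 / UTM zone 13N (EPSG:32613).
    Attributes include plot ID and vegetation type.</entityDescription>
  <physical>
    <objectName>study_plots.zip</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>Esri Shapefile (zipped)</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>spatial vector</entityType>
</otherEntity>
```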
99 | EML **formatVersion**, a sibling of formatName, can be used to indicate the format version, as in the example EML snippet below.
100 | 
101 | ```{.xml}
102 | <externallyDefinedFormat>
103 |   <formatName>Network Common Data Format, version 4</formatName>
104 |   <formatVersion>netCDF-4 classic</formatVersion>
105 | </externallyDefinedFormat>
106 | ```
107 | 
108 | For otherEntity, when populating the **entityType** element, use **spatial raster** or **spatial vector** as appropriate.
109 | 
--------------------------------------------------------------------------------
/img/edi-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EDIorg/data-package-best-practices/1dbf0fc4cf3a70f23e267dce6ffc5253185a337d/img/edi-logo.png
--------------------------------------------------------------------------------
/index.qmd:
--------------------------------------------------------------------------------
1 | # Overview {.unnumbered}
2 | 
3 | ::: {.callout-note}
4 | 
5 | **These guides and the website are being updated right now.**
6 | 
7 | Most content here is from the most recent "official" update of the guides, which happened around May of 2022. Newer, community-led content is available for review in the [prerelease version](https://prerelease-edi-docs.netlify.app).
8 | 
9 | :::
10 | 
11 | This website contains a series of documents about preparing and publishing datasets in the environmental sciences and similar contexts. Topics include community-developed metadata standards, serialization and markup formatting guidelines, best practices for content in ecological synthesis datasets, and more. This documentation is maintained by the [Environmental Data Initiative](https://edirepository.org) (EDI) and all content has been developed and written in coordination with EDI's community of scientists, data managers, and repository users.
12 | 
13 | Guides published here are directed towards the following goals:
14 | 
15 | - Minimize heterogeneity of EML-described data packages to simplify development and re-use of software
16 | - Maximize interoperability to facilitate data synthesis
17 | - Provide guidance and clarification on
18 |   - the use of Ecological Metadata Language (EML)
19 |   - the design of a data package
20 |   - the preparation of a data product for synthesis
21 | 
22 | To contribute to these documents or participate in the associated working groups, see the "[About this site](about.qmd)" page or the [repository README](https://github.com/EDIorg/data-package-best-practices). This website and all documents are rendered as a [Quarto book](https://quarto.org/docs/books).
23 | 
24 | ## Books
25 | 
26 | ### [Best Practices for Dataset Metadata in Ecological Metadata Language (EML)](guide-eml-bp/preface.qmd)
27 | The recommendations for EML metadata apply to all data packages. This book is a reproduction of V3 of the static PDF document "Best Practices for Dataset Metadata in Ecological Metadata Language (EML)," last updated in 2017. The entire most recent (versioned, citable) release will be made available as a PDF.
28 | 
29 | ### [Data Package Design for Special Cases](guide-special-cases/preface.qmd)
30 | Considerations for a well-designed data package, including special cases based on data type, format, or acquisition method. Examples are images, documents, and raw data stored in other repositories.
31 | 
32 | ### [Scientific Domain-Specific Dataset Guidelines](guide-domain-specific/preface.qmd)
33 | _Very much a work in progress._ Recommendations for community-developed data products from specific scientific domains. Not all scientific domains are covered. The data packages are derived from raw data and reformatted to meet certain data harmonization standards, often with extensive related code bases.
-------------------------------------------------------------------------------- /references.bib: -------------------------------------------------------------------------------- 1 | 2 | @article{wilkinson_fair_2016, 3 | title = {The {FAIR} {Guiding} {Principles} for scientific data management and stewardship}, 4 | volume = {3}, 5 | copyright = {2016 The Author(s)}, 6 | issn = {2052-4463}, 7 | url = {https://www.nature.com/articles/sdata201618}, 8 | doi = {10.1038/sdata.2016.18}, 9 | abstract = {There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.}, 10 | language = {en}, 11 | number = {1}, 12 | urldate = {2024-11-12}, 13 | journal = {Scientific Data}, 14 | author = {Wilkinson, Mark D. and Dumontier, Michel and Aalbersberg, IJsbrand Jan and Appleton, Gabrielle and Axton, Myles and Baak, Arie and Blomberg, Niklas and Boiten, Jan-Willem and da Silva Santos, Luiz Bonino and Bourne, Philip E. and Bouwman, Jildau and Brookes, Anthony J. and Clark, Tim and Crosas, Mercè and Dillo, Ingrid and Dumon, Olivier and Edmunds, Scott and Evelo, Chris T. and Finkers, Richard and Gonzalez-Beltran, Alejandra and Gray, Alasdair J. G. and Groth, Paul and Goble, Carole and Grethe, Jeffrey S. and Heringa, Jaap and ’t Hoen, Peter A. C. and Hooft, Rob and Kuhn, Tobias and Kok, Ruben and Kok, Joost and Lusher, Scott J. and Martone, Maryann E. and Mons, Albert and Packer, Abel L. and Persson, Bengt and Rocca-Serra, Philippe and Roos, Marco and van Schaik, Rene and Sansone, Susanna-Assunta and Schultes, Erik and Sengstag, Thierry and Slater, Ted and Strawn, George and Swertz, Morris A. and Thompson, Mark and van der Lei, Johan and van Mulligen, Erik and Velterop, Jan and Waagmeester, Andra and Wittenburg, Peter and Wolstencroft, Katherine and Zhao, Jun and Mons, Barend}, 15 | month = mar, 16 | year = {2016}, 17 | note = {Publisher: Nature Publishing Group}, 18 | keywords = {Publication characteristics, Research data}, 19 | pages = {160018}, 20 | } 21 | 22 | @article{bahim_fair_2020, 23 | title = {The {FAIR} {Data} {Maturity} {Model}: {An} {Approach} to {Harmonise} {FAIR} {Assessments}}, 24 | volume = {19}, 25 | issn = {1683-1470}, 26 | shorttitle = {The {FAIR} {Data} {Maturity} {Model}}, 27 | url = {https://datascience.codata.org/articles/10.5334/dsj-2020-041}, 28 | doi = {10.5334/dsj-2020-041}, 29 | abstract = {The CODATA Data Science Journal is a peer-reviewed, open access, electronic journal, publishing papers on the management, dissemination, use and reuse of research data and databases across all research domains, including science, technology, the humanities and the arts. 
The scope of the journal includes descriptions of data systems, their implementations and their publication, applications, infrastructures, software, legal, reproducibility and transparency issues, the availability and usability of complex datasets, and with a particular focus on the principles, policies and practices for open data. All data is in scope, whether born digital or converted from other sources.}, 30 | language = {en-US}, 31 | number = {1}, 32 | urldate = {2024-11-12}, 33 | journal = {Data Science Journal}, 34 | author = {Bahim, Christophe and Casorrán-Amilburu, Carlos and Dekkers, Makx and Herczog, Edit and Loozen, Nicolas and Repanas, Konstantinos and Russell, Keith and Stall, Shelley}, 35 | month = oct, 36 | year = {2020}, 37 | } 38 | 39 | @misc{esip_checklist_2022, 40 | type = {online resource}, 41 | title = {Checklist to {Examine} {AI}-readiness for {Open} {Environmental} {Datasets}}, 42 | url = {https://esip.figshare.com/articles/online_resource/Checklist_to_Examine_AI-readiness_for_Open_Environmental_Datasets/19983722/1}, 43 | abstract = {This checklist is developed to help data users, producers, and managers to assess the readiness for level for AI applications and target data improvement in the future. 44 | Note that some of these factors came from the draft readiness matrix developed by the OSTP Subcommittee on Open Science and some have been added based on further research. Definitions for some concepts are listed at the end of this document. This checklist is developed through a collaboration of ESIP Data Readiness Cluster members including representatives from NOAA, NASA, USGS, and other organizations. The checklist will be updated periodically to reflect community feedback.  45 | 46 | The current version for the checklist is v.0.2.}, 47 | language = {en}, 48 | urldate = {2024-11-12}, 49 | journal = {figshare}, 50 | author = {ESIP, Data Readiness Cluster}, 51 | month = jun, 52 | year = {2022}, 53 | doi = {10.6084/m9.figshare.19983722.v1}, 54 | note = {Publisher: ESIP}, 55 | } 56 | 57 | @article{jones_quantifying_2019, 58 | title = {Quantifying {FAIR}: metadata improvement and guidance in the {DataONE} repository network}, 59 | shorttitle = {Quantifying {FAIR}}, 60 | url = {https://knb.ecoinformatics.org/view/doi:10.5063/F14T6GP0}, 61 | doi = {10.5063/F14T6GP0}, 62 | abstract = {DataONE has consistently focused on interoperability among data repositories to enable seamless access to well-described data on the Earth and the environment. Our existing services promote data discovery and access through harmonization of the diverse metadata specifications used across communities, and through our integrated data search portal and services. In terms of the FAIR principles, we have done a good job at Findable and Accessible, while as a community we have placed less emphasis on Interoperable and Reusable. We present new DataONE services for quantitatively providing guidance on metadata completeness and effectiveness relative to the FAIR principles. The services produce guidance for FAIRness at both the level of an individual data set and trends through time for repository, user, and funder data collections. These analytical results regarding conformance to FAIR principles are preliminary and based on proposed quantitative assessment metrics for FAIR which will be changed with input from the community. The current statistics are based on version 0.2.0 of the DataONE FAIR suite. 
Thus, these results should not be viewed as conclusive about the data sets presented, but rather illustrate the types of quantitative comparisons that will be able to be made when the FAIR metrics at DataONE have been finalized.}, 63 | language = {en}, 64 | urldate = {2024-11-12}, 65 | author = {Jones, Matthew and Slaughter, Peter and Habermann, Ted}, 66 | year = {2019}, 67 | note = {Publisher: urn:node:KNB}, 68 | } 69 | 70 | @misc{jones_ecological_2019, 71 | title = {Ecological {Metadata} {Language} version 2.2.0}, 72 | url = {https://eml.ecoinformatics.org}, 73 | abstract = {The Ecological Metadata Language (EML) defines a comprehensive vocabulary and a readable XML markup syntax for documenting research data. It is in widespread use in the earth and environmental sciences, and increasingly in other research disciplines as well. EML is a community-maintained specification, and evolves to meet the data documentation needs of researchers who want to openly document, preserve, and share data and outputs. EML includes modules for identifying and citing data packages, for describing the spatial, temporal, taxonomic, and thematic extent of data, for describing research methods and protocols, for describing the structure and content of data within sometimes complex packages of data, and for precisely annotating data with semantic vocabularies. EML includes metadata fields to fully detail data papers that are published in journals specializing in scientific data sharing and preservation.}, 74 | urldate = {2024-11-12}, 75 | publisher = {KNB Data Repository}, 76 | author = {Jones, Matthew and O'Brien, Margaret and Mecum, Bryce and Boettiger, Carl and Schildhauer, Mark and Maier, Mitchell and Whiteaker, Timothy and Earl, Stevan and Chong, Steven}, 77 | year = {2019}, 78 | doi = {10.5063/F11834T2}, 79 | keywords = {metadata}, 80 | } 81 | 82 | @article{wyngaard_emergent_2019, 83 | title = {Emergent {Challenges} for {Science} {sUAS} {Data} {Management}: {Fairness} through {Community} {Engagement} and {Best} {Practices} {Development}}, 84 | volume = {11}, 85 | copyright = {http://creativecommons.org/licenses/by/3.0/}, 86 | issn = {2072-4292}, 87 | shorttitle = {Emergent {Challenges} for {Science} {sUAS} {Data} {Management}}, 88 | url = {https://www.mdpi.com/2072-4292/11/15/1797}, 89 | doi = {10.3390/rs11151797}, 90 | abstract = {The use of small Unmanned Aircraft Systems (sUAS) as platforms for data capture has rapidly increased in recent years. However, while there has been significant investment in improving the aircraft, sensors, operations, and legislation infrastructure for such, little attention has been paid to supporting the management of the complex data capture pipeline sUAS involve. This paper reports on a four-year, community-based investigation into the tools, data practices, and challenges that currently exist for particularly researchers using sUAS as data capture platforms. 
The key results of this effort are: (1) sUAS captured data—as a set that is rapidly growing to include data in a wide range of Physical and Environmental Sciences, Engineering Disciplines, and many civil and commercial use cases—is characterized as both sharing many traits with traditional remote sensing data and also as exhibiting—as common across the spectrum of disciplines and use cases—novel characteristics that require novel data support infrastructure; and (2), given this characterization of sUAS data and its potential value in the identified wide variety of use case, we outline eight challenges that need to be addressed in order for the full value of sUAS captured data to be realized. We conclude that there would be significant value gained and costs saved across both commercial and academic sectors if the global sUAS user and data management communities were to address these challenges in the immediate to near future, so as to extract the maximal value of sUAS captured data for the lowest long-term effort and monetary cost.}, 91 | language = {en}, 92 | number = {15}, 93 | urldate = {2024-11-15}, 94 | journal = {Remote Sensing}, 95 | author = {Wyngaard, Jane and Barbieri, Lindsay and Thomer, Andrea and Adams, Josip and Sullivan, Don and Crosby, Christopher and Parr, Cynthia and Klump, Jens and Raj Shrestha, Sudhir and Bell, Tom}, 96 | month = jan, 97 | year = {2019}, 98 | note = {Number: 15 99 | Publisher: Multidisciplinary Digital Publishing Institute}, 100 | keywords = {community, data, drone, FAIR, management, practices, RPAS, standards, sUAS, UAV}, 101 | pages = {1797}, 102 | } 103 | 104 | @misc{thomer_minimum_2020, 105 | title = {A minimum information framework the {FAIR} collection of earth and environmental science data with drones}, 106 | url = {https://zenodo.org/records/4124167}, 107 | abstract = {This repository contains a minimum information framework (MIF) for data collected by small unmanned aerial systems (AKA sUAS AKA RPAs AKA drones). A MIF is essentially a framework for the development for further data standards; it enumerates the metadata needed for the collection of FAIR (Findable Accessible Interoperable and Reusable) scientific data with drones/sUAS/RPAs. 108 | 109 | 110 | The MIF was drafted through examination of 3 case studies of data collection with drones, and then refined through iterative rounds of community feedback and reflection on the authors' own work with drone-based data collection. We are currently writing a short paper further describing the development of the standard. 111 | 112 | 113 | This project was funded as an ESIP Lab and we thank ESIP for their support. 114 | 115 | 116 | Please cite as: Thomer, Andrea K., Swanz, Sarah, Barbieri, Lindsay, Wyngaard, Jane. (2020). A minimum information framework the FAIR collection of earth and environmental science data with drones. 
DOI: 10.5281/zenodo.4017647},
117 | urldate = {2024-11-15},
118 | publisher = {Zenodo},
119 | author = {Thomer, Andrea and Swanz, Sarah and Barbieri, Lindsay and Wyngaard, Jane},
120 | month = oct,
121 | year = {2020},
122 | doi = {10.5281/zenodo.4124167},
123 | }
124 | 
125 | @misc{marco_a_janssen_towards_2008,
126 | type = {Text.{Article}},
127 | title = {Towards a {Community} {Framework} for {Agent}-{Based} {Modelling}},
128 | copyright = {JASSS@soc.surrey.ac.uk},
129 | url = {https://www.jasss.org/11/2/6.html},
130 | abstract = {Agent-based modelling has become an increasingly important tool for scholars studying social and social-ecological systems, but there are no community standards on describing, implementing, testing and teaching these tools. This paper reports on the establishment of the Open Agent-Based Modelling Consortium, www.openabm.org, a community effort to foster the agent-based modelling development, communication, and dissemination for research, practice and education.},
131 | language = {en},
132 | urldate = {2024-11-15},
133 | author = {Janssen, Marco A. and Alessa, Lilian Na'ia},
134 | month = mar,
135 | year = {2008},
136 | note = {Publisher: JASSS},
137 | }
138 | 
139 | @article{simmonds_addressing_2020,
140 | title = {Addressing {Model} {Data} {Archiving} {Needs} for the {Department} of {Energy}’s {Environmental} {Systems} {Science} {Community}},
141 | copyright = {CC BY Attribution 4.0 International},
142 | url = {https://eartharxiv.org/repository/view/260/},
143 | abstract = {Researchers in the Department of Energy’s ESS program use a variety of models to advance robust, scale-aware predictions of terrestrial and subsurface ecosystems. ESS projects typically conduct field observations and experiments coupled with modeling exercises using a model-experimental (ModEx) approach that enables iterative co-development of experiments and models, and ensures that experimental data needed to parameterize and test models are collected. Thus preserving “model data” comprising the outputs from simulations, as well as driving, parameterization and validation data with associated codes is becoming increasingly important. The ESS-DIVE repository stores data associated with the ESS programs and conducted a months long survey of the ESS community to identify needs for archiving, sharing, and utilizing model data. Here, we present the results of the community survey, and the proposed ESS-DIVE approach over the short-term (next 3 years) and long-term (4-10 years) to support the needs of the ESS modeling community. In the short-term ESS-DIVE proposes to work on functionality that supports archiving of model data associated with publications, with an emphasis on developing community guidelines and standards that make the data more discoverable, accessible and usable. The long-term vision is to broadly enable data-model integration, and knowledge generation from model and observational data. This vision will be achieved through close partnerships with the ESS community.},
144 | language = {en},
145 | urldate = {2024-11-15},
146 | author = {Simmonds, Maegen and Riley, William J.
and Cholia, Shreyas and Varadharajan, Charuleka}, 147 | month = may, 148 | year = {2020}, 149 | note = {Publisher: EarthArXiv}, 150 | } 151 | 152 | -------------------------------------------------------------------------------- /references.qmd: -------------------------------------------------------------------------------- 1 | # References {.unnumbered} 2 | 3 | ::: {#refs} 4 | ::: 5 | --------------------------------------------------------------------------------