├── .gitignore ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── custom_theme ├── footer.html └── toc.html ├── docs ├── bagging.md ├── checking.md ├── coordinator-work.md ├── css │ └── extra.css ├── describing.md ├── faq.md ├── harvesting.md ├── img │ ├── button-checkin.png │ ├── button-checkout.png │ ├── button-save.png │ ├── button-upload-token.png │ ├── button-upload.png │ ├── button-zipstarter.png │ ├── harvest-00-overview-homepage.png │ ├── harvest-01-UUID.png │ ├── harvest-01-zipstarter.png │ ├── harvest-02-toolbar.png │ ├── harvest-02-upload.png │ ├── harvest-03-upload-awstoken.png │ ├── harvest-04-add-details.png │ ├── harvest-05-notes.png │ ├── research-00-overview-homepage.png │ ├── research-01-title.png │ ├── research-02-dnh-section.png │ ├── research-03-recommended-approach.png │ ├── research-04-formats.png │ ├── research-05-estimated-size.png │ ├── research-06-linkURL.png │ └── research-07-check-as-done.png ├── index.md ├── organizing │ ├── post-event.md │ └── pre-event.md ├── researching.md ├── seeding.md └── surveying.md └── mkdocs.yml /.gitignore: -------------------------------------------------------------------------------- 1 | site/ 2 | env/ 3 | dist/ 4 | htmlcov/ 5 | .tox/ 6 | mkdocs.egg-info/ 7 | *.pyc 8 | .coverage 9 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | We love improvements to our documentation! 4 | 5 | ## Submitting Issues 6 | 7 | The simplest way to tell us about a possible improvement is to open a new issue in the [Workflow repo](https://github.com/datarefuge/workflow/issues). 8 | 9 | ## Submitting Changes Through Pull Requests 10 | 11 | Our process for accepting changes has a few steps. 12 | 13 | 1. If you haven't submitted anything before, and you aren't (yet!) a member of our organization, **fork and clone** the repo: 14 | 15 | $ git clone git@github.com:/.git 16 | 17 | Organization members should clone the upsteam repo, instead of working from a personal fork: 18 | 19 | $ git clone git@github.com:datarefuge/workflow.git 20 | 21 | 2. Create a **new branch** for the changes you want to work on. Choose a topic for your branch name that reflects the change: 22 | 23 | $ git checkout -b 24 | 25 | 3. **Create or modify the files** with your changes. If you want to show other people work that isn't ready to merge in, commit your changes then create a pull request (PR) with _WIP_ or _Work In Progress_ in the title. 26 | 27 | https://github.com/datarefuge/workflow/pull/new/master 28 | 29 | 4. Once your changes are ready for final review, commit your changes then modify or **create your pull request (PR)**, assign a reviewer or ping (using "`@datarefuge/workflow`") those able to review and merge in PRs: @khdelphine, @dcwalk, @liblaurie or @titaniumbones. 30 | 31 | 5. Allow others sufficient **time for review and comments** before merging. We make use of GitHub's review feature to to comment in-line one PRs when possible. There may be some fixes or adjustments you'll have to make based on feedback. 32 | 33 | 6. Once you have integrated comments, or waited for feedback, your changes should get merged in! 34 | 35 | ## Incremental Changes 36 | Note that it is better to submit incremental bite-size changes that are easier to review. 
37 | 38 | If you have in mind heavy changes, especially if they will affect the overall structure of the documentation, please discuss your plans with the other editors first. 39 | 40 | ## Viewing your Changes 41 | 42 | Documentation is built with [MkDocs](http://www.mkdocs.org/), a static site generator tailored to writing documentation. 43 | 44 | [Install Mkdocs](http://www.mkdocs.org/#installation) with a package manager or python/pip: 45 | 46 | ```sh 47 | $ brew install mkdocs 48 | ``` 49 | or 50 | ```sh 51 | $ pip install mkdocs 52 | ``` 53 | 54 | Clone this repo and navigate to it, make changes to the `.md` files. 55 | 56 | You can view changes via your browser at `http://127.0.0.1:8000`, by running the command: 57 | 58 | ```sh 59 | $ mkdocs serve 60 | ``` 61 | 62 | **Note: Mkdocs enforces Markdown syntax strictly, please refer to [Github's Markdown guide](https://guides.github.com/features/mastering-markdown/) and [MkDocs guide](http://www.mkdocs.org/user-guide/writing-your-docs/#markdown-extensions) for details.** 63 | 64 | Once a pull request has been merged into master, the gh-pages need to be regenerated by a reviewer. They do that from a up-to-date local master branch at the command line: 65 | ``` 66 | $ mkdocs gh-deploy 67 | ``` 68 | 69 | _These guidelines are based on [Toronto Mesh](https://github.com/tomeshnet) and [EDGI's](https://github.com/edgi-govdata-archiving)._ 70 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | ## creative commons 2 | 3 | # Attribution-ShareAlike 4.0 International 4 | 5 | Creative Commons Corporation (“Creative Commons”) is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an “as-is” basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible. 6 | 7 | ### Using Creative Commons Public Licenses 8 | 9 | Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses. 10 | 11 | * __Considerations for licensors:__ Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. [More considerations for licensors](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensors). 
12 | 13 | * __Considerations for the public:__ By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor’s permission is not necessary for any reason–for example, because of any applicable exception or limitation to copyright–then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. Although not required by our licenses, you are encouraged to respect those requests where reasonable. [More considerations for the public](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensees). 14 | 15 | ## Creative Commons Attribution-ShareAlike 4.0 International Public License 16 | 17 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. 18 | 19 | ### Section 1 – Definitions. 20 | 21 | a. __Adapted Material__ means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. 22 | 23 | b. __Adapter's License__ means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. 24 | 25 | c. __BY-SA Compatible License__ means a license listed at [creativecommons.org/compatiblelicenses](http://creativecommons.org/compatiblelicenses), approved by Creative Commons as essentially the equivalent of this Public License. 26 | 27 | d. __Copyright and Similar Rights__ means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. 28 | 29 | e. __Effective Technological Measures__ means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. 30 | 31 | f. 
__Exceptions and Limitations__ means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. 32 | 33 | g. __License Elements__ means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike. 34 | 35 | h. __Licensed Material__ means the artistic or literary work, database, or other material to which the Licensor applied this Public License. 36 | 37 | i. __Licensed Rights__ means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. 38 | 39 | j. __Licensor__ means the individual(s) or entity(ies) granting rights under this Public License. 40 | 41 | k. __Share__ means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. 42 | 43 | l. __Sui Generis Database Rights__ means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. 44 | 45 | m. __You__ means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. 46 | 47 | ### Section 2 – Scope. 48 | 49 | a. ___License grant.___ 50 | 51 | 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: 52 | 53 | A. reproduce and Share the Licensed Material, in whole or in part; and 54 | 55 | B. produce, reproduce, and Share Adapted Material. 56 | 57 | 2. __Exceptions and Limitations.__ For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 58 | 59 | 3. __Term.__ The term of this Public License is specified in Section 6(a). 60 | 61 | 4. __Media and formats; technical modifications allowed.__ The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material. 62 | 63 | 5. __Downstream recipients.__ 64 | 65 | A. __Offer from the Licensor – Licensed Material.__ Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. 66 | 67 | B. __Additional offer from the Licensor – Adapted Material. 
Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply. 68 | 69 | C. __No downstream restrictions.__ You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 70 | 71 | 6. __No endorsement.__ Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). 72 | 73 | b. ___Other rights.___ 74 | 75 | 1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 76 | 77 | 2. Patent and trademark rights are not licensed under this Public License. 78 | 79 | 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. 80 | 81 | ### Section 3 – License Conditions. 82 | 83 | Your exercise of the Licensed Rights is expressly made subject to the following conditions. 84 | 85 | a. ___Attribution.___ 86 | 87 | 1. If You Share the Licensed Material (including in modified form), You must: 88 | 89 | A. retain the following if it is supplied by the Licensor with the Licensed Material: 90 | 91 | i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); 92 | 93 | ii. a copyright notice; 94 | 95 | iii. a notice that refers to this Public License; 96 | 97 | iv. a notice that refers to the disclaimer of warranties; 98 | 99 | v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable; 100 | 101 | B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and 102 | 103 | C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 104 | 105 | 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 106 | 107 | 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. 108 | 109 | b. ___ShareAlike.___ 110 | 111 | In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply. 112 | 113 | 1. 
The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License. 114 | 115 | 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material. 116 | 117 | 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply. 118 | 119 | ### Section 4 – Sui Generis Database Rights. 120 | 121 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: 122 | 123 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; 124 | 125 | b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and 126 | 127 | c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. 128 | 129 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. 130 | 131 | ### Section 5 – Disclaimer of Warranties and Limitation of Liability. 132 | 133 | a. __Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.__ 134 | 135 | b. __To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.__ 136 | 137 | c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. 138 | 139 | ### Section 6 – Term and Termination. 140 | 141 | a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. 142 | 143 | b. 
Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 144 | 145 | 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 146 | 147 | 2. upon express reinstatement by the Licensor. 148 | 149 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. 150 | 151 | c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. 152 | 153 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. 154 | 155 | ### Section 7 – Other Terms and Conditions. 156 | 157 | a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. 158 | 159 | b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.t stated herein are separate from and independent of the terms and conditions of this Public License. 160 | 161 | ### Section 8 – Interpretation. 162 | 163 | a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License. 164 | 165 | b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions. 166 | 167 | c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor. 168 | 169 | d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority. 170 | 171 | ``` 172 | Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at [creativecommons.org/policies](http://creativecommons.org/policies), Creative Commons does not authorize the use of the trademark “Creative Commons” or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses. 
173 | 174 | Creative Commons may be contacted at creativecommons.org 175 | ``` 176 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DataRescue Workflow 2 | 3 | This guide describes the [DataRescue workflow](https://datarefuge.github.io/workflow/) we use for DataRescue activities as developed by the [DataRefuge project](http://www.ppehlab.org/) and [EDGI](https://envirodatagov.org/), both at in-person events and when people work remotely. It explains the process that a URL/dataset goes through from the time it has been identified, either by a [Seeder](https://datarefuge.github.io/workflow/seeding/) as "uncrawlable," or by other means, until it is made available as a record in the [datarefuge.org](http://www.datarefuge.org) CKAN data catalog. The process involves several stages, and is designed to maximize smooth hand-offs so that each phase is handled by someone with distinct expertise in the area they're tackling, while the data is always being tracked for security. 4 | 5 | # Note: This workflow is no longer supported as of May 21, 2017 6 | 7 | ## Are you looking for the actual documentation? 8 | We have moved the documentation to a more user-friendly format. You can now find the guide at [datarefuge.github.io/workflow](https://datarefuge.github.io/workflow/). 9 | 10 | Note that we are still working on it, and will shortly add screenshots, etc. 11 | 12 | ## Contributing to this guide 13 | 14 | Suggestions and improvements are welcome! All changes to the guide are managed through this GitHub repository. 15 | Please check our [contribution guidelines](CONTRIBUTING.md) for details. 16 | 17 | ********************** 18 | ## Partners 19 | DataRescue is a broad, grassroots effort with support from numerous local and nationwide networks. [DataRefuge](http://www.ppehlab.org/datarefuge/) and [EDGI](https://envirodatagov.org/) partner with local organizers in supporting these events. See more of our institutional partners on the [DataRefuge home page](http://www.ppehlab.org/datarefuge#partners). 20 | -------------------------------------------------------------------------------- /custom_theme/footer.html: -------------------------------------------------------------------------------- 1 | 24 | -------------------------------------------------------------------------------- /custom_theme/toc.html: -------------------------------------------------------------------------------- 1 | {% if nav_item.children %} 2 | 17 | {% else %} 18 | 31 | {% endif %} 32 | -------------------------------------------------------------------------------- /docs/bagging.md: -------------------------------------------------------------------------------- 1 | ## What Do Baggers Do? 2 | 3 | Baggers do some quality assurance on the dataset to make sure the content is correct and corresponds to what was described in the spreadsheet. Then they package the data into a bagit file (or "bag"), which includes basic technical metadata, and upload it to the final DataRefuge destination. 4 | 5 | **Note: Checking is currently performed by Baggers and does not exist as a separate stage in the Archivers app.** 6 | 7 |
8 | **Recommended Skills**
9 | Consider this path if you have data or web archiving experience, or have strong tech skills and an attention to detail. 10 |
11 | 12 | ## Getting Set up as a Bagger 13 | 14 | - Apply to become a Bagger by filling out [this form](https://docs.google.com/a/temple.edu/forms/d/e/1FAIpQLSfh9YIFnDrc-Cuc0hTd-U37J3D8xw8K7VXmzWkPs6Y5Q0wfVg/viewform) 15 | - Note that an email address is required to apply. 16 | - Note also that you should be willing to have your real name be associated with the datasets, to follow archival best practices (see [guidelines on archival best practices for Data Refuge](http://www.ppehlab.org/blogposts/2017/2/1/data-refuge-rests-on-a-clear-chain-of-custody) for more information). 17 | - The organizers of the event (in-person or remote) will send you an invite to the [Archivers app](http://www.archivers.space/), which helps us coordinate all the data archiving work we do. 18 | - Click the invite link, and choose a user name and a password. 19 | - Verify that you have bagging permissions by going to the [Archivers app](http://www.archivers.space/), clicking `URLS`, and confirming that you can see a section called "Bag". 20 | - Create an account on the DataRefuge Slack using this [slack-in](https://rauchg-slackin-qonsfhhvxs.now.sh/) (or use the Slack team recommended by your event organizers). This is where people share expertise and answer each other's questions. 21 | - Get set up with Python and the [`bagit-python`](https://github.com/LibraryOfCongress/bagit-python) script to make a bag at the command line 22 | - If you need any assistance: 23 | - Talk to your DataRescue guide if you are at an in-person event. 24 | - Or post questions in the DataRefuge Slack `#baggers` channel (or other channel recommended by your event organizers). 25 | 26 | ## Claiming a Dataset for Bagging 27 | 28 | - You will work on datasets that were harvested by Harvesters. 29 | - Go to the [Archivers app](http://www.archivers.space/), click `URLS` and then `BAG`: all the URLs listed are ready to be bagged. 30 | - Available URLs are the ones that have not been checked out by someone else, i.e. that do not have someone's name in the User column. 31 | - Select an available URL and click its UUID to get to the detailed view, then click `Checkout this URL`. It is now ready for you to work on, and no one else can do anything to it while you have it checked out. 32 | - While you go through the bagging process, make sure to report as much information as possible in the Archivers app, as this is the place where we collectively keep track of all the work done. 33 | 34 |
35 | **Note: URL vs UUID**
36 | The URL is the link to examine and harvest, and the UUID is a canonical ID we use to connect the URL with the data in question. The UUID will have been generated earlier in the process. UUID stands for Universal Unique Identifier. 37 |
38 | 39 | ## Downloading & Opening the Dataset 40 | 41 | - The zipped dataset that is ready to be bagged is under `Harvest Url / Location` in the the Archivers app. Download it to your laptop and unzip it. 42 | - Extra check: Is this URL truly ready to bag? 43 | - While everybody is doing their best to provide accurate information, occasionally a URL will be presented as "ready to bag" but, in fact, is not. Symptoms include: 44 | - There is no value in the "Harvest Url / Location" field. 45 | - If you don't see a "Harvest Url / Location" field at all, confirm that you have bagging privileges. 46 | - _Please note that even though the Archivers app field is not populated, in some case you might still be able to locate the file on our cloud storage. Check for the file presence by using the following URL structure: `https://drp-upload.s3.amazonaws.com/remote/ + [UUID] + .zip`, so for instance: `https://drp-upload.s3.amazonaws.com/remote/13E0A60E-2324-4321-927D-8496F136B2B5.zip` 47 | - There is a note in the Harvest section that seems to indicate that the harvest was only partially performed. 48 | - In either case, uncheck the "Harvest" checkbox, and add a note in the `Notes From Harvest` field indicating that the URL does not seem ready for bagging and needs to be reviewed by a Harvester. 49 | 50 | ## Quality Assurance 51 | 52 | - Confirm the harvested files: 53 | - Go to the original URL and check that the dataset is complete and accurate. 54 | - You also need to check that the dataset is meaningful, that is: "will the bag make sense to a scientist"? 55 | For instance, if a dataset is composed of a spreadsheet without any accompanying key or explanation of what the data represents, it might be completely impossible for a scientist to use it. 56 | - Spot-check to make sure the files open properly and are not faulty in any way. 57 | - Confirm contents of JSON file: 58 | - The JSON should match the information from the Researcher and use the following format: 59 | ``` 60 | { 61 | "Date of capture": "Fri Feb 24 2017 21:44:07 GMT-0800 (PST)", 62 | "File formats contained in package": ".xls, .zip", 63 | "Free text description of capture process": "Metadata was generated by viewing page, data was bulk downloaded using download_ftp_tree.py and then bagged.", 64 | "Individual source or seed URL": "ftp://podaac-ftp.jpl.nasa.gov/allData/nimbus7", 65 | "Institution facilitating the data capture creation and packaging": "DataRescue SF Bay", 66 | "Name of package creator": "JohnDoe", 67 | "Name of resource": "Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984", 68 | "Type(s) of content in package": "These files are the Wentz Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984. There are 72 files which breaks down to 1 file per month. So, NIMBUS7-SMMR00000.dat is January 1979, NIMBUS7-SMMR00001.dat is February 1979, etc.", 69 | "UUID": "F3499E3B-7517-4C06-A661-72B4DA13A2A2", 70 | "recommended_approach": "", 71 | "significance": "These files are the Wentz Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984. There are 72 files which breaks down to 1 file per month. So, NIMBUS7-SMMR00000.dat is January 1979, NIMBUS7-SMMR00001.dat is February 1979, etc.", 72 | "title": "Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984", 73 | "url": "ftp://podaac-ftp.jpl.nasa.gov/allData/nimbus7" 74 | } 75 | ``` 76 | 77 | - If you make any changes, make sure to save this as a .json file. 78 | - Confirm that the JSON file is within the package with the dataset(s). 
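If you prefer to script this check, the optional sketch below loads the metadata file and flags any of the expected fields (taken from the example record above) that are missing or empty. The filename is only an example; use the actual `.json` file found in the downloaded dataset.

```python
"""Optional helper for the JSON check above: confirm the metadata file parses
and that the expected fields (from the example record) are present and
non-empty. The filename is only an example."""
import json

EXPECTED_FIELDS = [
    "Date of capture",
    "File formats contained in package",
    "Free text description of capture process",
    "Individual source or seed URL",
    "Institution facilitating the data capture creation and packaging",
    "Name of package creator",
    "Name of resource",
    "Type(s) of content in package",
    "UUID",
    "title",
    "url",
]

with open("13E0A60E-2324-4321-927D-8496F136B2B5.json") as fh:  # example filename
    record = json.load(fh)  # raises an error if the JSON is malformed

for field in EXPECTED_FIELDS:
    if not record.get(field):  # missing key or empty string
        print("Missing or empty field:", field)
```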
79 | 80 | ## Creating the Bag 81 | 82 | - Run the python command line script that creates the bag: 83 | ``` 84 | bagit.py --contact-name '[your name]' /directory/to/bag 85 | ``` 86 | - You should be left with a 'data' folder (which contains the downloaded content and metadata file) and four separate bagit files: 87 | - bag-info.txt 88 | - bagit.txt 89 | - manifest-md5.txt 90 | - tagmanifest-md5.txt 91 | - **IMPORTANT: It's crucial that you do not move or open the bag once you have created it. This may create hidden files that could make the bag invalid later.** 92 | - Run the following python command line script to do an initial validation of a bag: 93 | ``` 94 | bagit.py --validate [directory/of/bag/to/validate] 95 | ``` 96 | 97 | - If it comes back as valid, proceed to the next step of creating the zip file and uploading it. If it does not, make a note of the error, review your steps, and re-bag the file. If you continue to get invalid bags, please see a DataRescue guide or reach out in the Baggers Slack channel. 98 | 99 | ## Creating the Zip File and Uploading It 100 | 101 | - Zip this entire collection (data folder and bagit files) and confirm that it is named with the row's UUID. 102 | - **Without moving the file**, upload the zipped bag using the application http://drp-upload-bagger.herokuapp.com/ using the user ID and password provided in the Archivers App 103 | - Make sure to select the name of your event in the dropdown (and "remote" if you are working remotely) 104 | - The application will return the location URL for your zip file. 105 | - The syntax will be `[UrlStub]/[UUID].zip` 106 | - Copy and paste that URL to the `Bag URL` field in the Archivers app. 107 | - Note that files beyond 5 Gigs must be uploaded through the more advanced `Generate Upload Token` option. This will require using the aws command line interface. 108 | - Please talk to your DataRescue guide or post on Slack in the Baggers channel, if you are having issues with this more advanced method. 109 | 110 | ## Quality Assurance and Finishing Up 111 | 112 | - Once the zip file has been fully uploaded, download the bag back to your computer (use the URL provided by the Archiver App) and run the following python command line script for validation: 113 | ``` 114 | bagit.py --validate [directory/of/bag/to/validate] 115 | ``` 116 | 117 | - If it comes back as valid, open the bag and spot-check to make sure everything looks the same as when you uploaded it (this will not affect the validity of the bags already uploaded). If all seems right, proceed to the rest of the quality assurance steps. If it does not come back as valid, make a note of the error, review your steps, and re-bag the file. If you continue to get invalid bags, please see a DataRescue guide or reach out in the Baggers Slack channel. 118 | - Fill out as much information as possible in the `Notes From Bagging` field in the Archivers app to document your work. 119 | - Check the checkbox that certifies this is a "well-checked bag". 120 | - Check the Bag checkbox (far right on the same line as the "Bag" section heading) to mark that step as completed. 121 | - Click `Save`. 122 | - Click `Checkin this URL` to release it and allow someone else to work on the next step. 
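If you prefer to script the download-and-revalidate step above, here is a minimal sketch that uses the `bagit` Python library directly (the same library behind `bagit.py`). The URL and folder names are placeholders for the values reported by the upload app.

```python
"""Optional companion to the re-validation step above: fetch the uploaded zip,
unpack it, and validate the bag with the bagit library. The URL and folder
names are placeholders."""
import urllib.request
import zipfile

import bagit  # pip install bagit (LibraryOfCongress/bagit-python)

BAG_ZIP_URL = "https://drp-upload-bagger.s3.amazonaws.com/remote/77DD634E-EBCE-412E-88B5-A02B0EF12AF6_2.zip"
LOCAL_ZIP = "77DD634E-EBCE-412E-88B5-A02B0EF12AF6_2.zip"
UNPACK_DIR = "bag-check"

urllib.request.urlretrieve(BAG_ZIP_URL, LOCAL_ZIP)  # download the zip back
with zipfile.ZipFile(LOCAL_ZIP) as zf:
    zf.extractall(UNPACK_DIR)                       # unzip into a scratch folder

# Point bagit at the folder that contains bagit.txt (adjust if the zip wraps
# the bag in an extra top-level directory).
bag = bagit.Bag(UNPACK_DIR)
if bag.is_valid():
    print("Bag is valid")
else:
    bag.validate()  # raises BagValidationError explaining what failed
```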
123 | -------------------------------------------------------------------------------- /docs/checking.md: -------------------------------------------------------------------------------- 1 | **This document is currently unused, as the Archivers app does not have a separate Checking phase at this time. Instead we have added Checking instructions in the Bagging documentation.** 2 | 3 | ## What Do Checkers Do? 4 | 5 | Checkers inspect a harvested dataset and make sure that it is complete. The main question Checkers need to answer is "will the bag make sense to a scientist"? 6 | 7 |
8 | **Recommended Skills**
9 | Consider this path if you have data or web archiving experience, or have strong tech skills and an attention to detail. 10 |
11 | 12 | ## Getting Set up as a Checker 13 | 14 | - Apply to become a Checker by filling out [this form](https://docs.google.com/a/temple.edu/forms/d/e/1FAIpQLSfh9YIFnDrc-Cuc0hTd-U37J3D8xw8K7VXmzWkPs6Y5Q0wfVg/viewform). 15 | - Note that an email address is required to apply. 16 | - Note also that you should be willing to have your real name be associated with the datasets, to follow archival best practices (see [guidelines on archival best practices for Data Refuge](http://www.ppehlab.org/blogposts/2017/2/1/data-refuge-rests-on-a-clear-chain-of-custody) for more information). 17 | - The organizers of the event (in-person or remote) will send you an invite to the [Archivers app](http://www.archivers.space/), which helps us coordinate all the data archiving work we do. 18 | - Click the invite link, and choose a user name and a password. 19 | - Make sure you have an account on the DataRefuge Slack where people share expertise and answer each other's questions. 20 | - Ask your event organizer to send you an invite. 21 | - You may also need some other software and utilities set up on your computer, depending on the methods you will use, if you need to harvest supplemental materials to add to a dataset. 22 | - If you need any assistance: 23 | - Talk to your DataRescue guide if you are at an in-person event. 24 | - Or post questions in the DataRefuge Slack `#checkers` channel (or other channel recommended by your event organizers). 25 | 26 | 27 | ## Claiming a Dataset for the Checking Step 28 | 29 | - You will work on datasets that were harvested by Harvesters. 30 | - Go to the [Archivers app](http://www.archivers.space/), click `URLS` and then `FINALIZE`: all the URLs listed are ready to be checked. 31 | - Available URLs are ones that have not been checked out by someone else, i.e. that do not have someone's name in the User column. 32 | - Select an available URL and click its UUID to get to the detailed view, then click `Check out this URL`. It is now ready for you to work on, and no one else can do anything to it while you have it checked out. 33 | - While you go through the checking process, make sure to report as much information as possible in the Archivers app, as this is the place where we collectively keep track of all the work done. 34 | 35 |
36 | **Note: URL vs UUID**
37 | The URL is the link to examine and harvest, and the UUID is a canonical ID we use to connect the URL with the data in question. The UUID will have been generated earlier in the process. UUID stands for Universal Unique Identifier. 38 |
39 | 40 | **Note: the next few steps below need to be reviewed in light of the new app-driven workflow** 41 | 42 | ## Downloading & opening the dataset 43 | 44 | - Go to the URL containing the zipped dataset (provided in cell "URL from upload of zip") 45 | - Download the zip file to your laptop, and unzip it. 46 | 47 | ## Checking for completeness and meaningfulness 48 | 49 | - Your role is to inspect the dataset and make sure that it is complete. 50 | - You also need to check that the dataset is *meaningful*, that is: "will the bag make sense to a scientist"? 51 | - For instance, if a dataset is composed of a spreadsheet without any accompanying key or explanation of what the data represents, it might be completely impossible for a scientist to use it. 52 | 53 | ## Adding missing items 54 | 55 | - You should add any missing file or metadata information to the dataset 56 | - Please refer to the [Harvesting Tookit](https://github.com/datarefugephilly/workflow/tree/FinalizeRemote-Delphine/harvesting-toolkit) for more details 57 | 58 | ## Re-uploading 59 | 60 | - If you have made any changes to the dataset, zip the all the files and upload the new resulting zip file, using the application http://drp-upload.herokuapp.com/ 61 | - Make sure to select the name of your event in the dropdown (and "remote" if you are working remotely) 62 | - Note that files beyond 5 Gigs cannot be uploaded through this method 63 | - Please talk to your DataRescue guide/post on Slack in Checkers channel, if you have a larger file 64 | - The file you uploaded has now replaced the old version, and it is available at the same url (in cell "URL from upload of zip") 65 | - Quality assurance: 66 | - To ensure that the zip file was uploaded successfully, go to the URL and download it back to your laptop. 67 | - Unzip it, open it and spot check to make sure that all the files are there and seem valid. 68 | 69 | ## Finishing up 70 | 71 | - In the Archivers app, make sure to fill out as much information as possible to document your work. 72 | - Check the Finalize checkbox (on the right-hand side) to mark that step as completed. 73 | - Click `Save`. 74 | - Click `Check in URL`, to release it and allow someone else to work on the next step. 75 | - You're done! Move on to the next URL! 76 | 77 | 78 | -------------------------------------------------------------------------------- /docs/coordinator-work.md: -------------------------------------------------------------------------------- 1 | While DataRefuge/DataRescue strives to be a distributed effort, there is need for a certain level of coordination to make sure there is no effort duplication between events. We are not going to cover all the tasks that Coordinators tackle in this document; instead we will focus on specific tasks that will help all project participants better understand the overall workflow. 2 | 3 | # EDGI Coordinators 4 | 5 | [Add a paragraph on EDGI and links to other relevant documents/sites --> Could someone from EDGI please do that?] 6 | EDGI focuses particularly on helping Seeders teams get set up. They also provide advanced recommendations and know-how on the harvesting process. 7 | 8 | # DataRefuge Coordinators 9 | 10 | DataRefuge Coordinators help facilitate the DataRefuge project and the development of the [DataRefuge repository](https://www.datarefuge.org/). In particular they help each DataRescue event get set up with their list of Uncrawlable URLS. 
11 | 12 | ## Uncrawlable Spreadsheet structure 13 | 14 | - While we eventually hope to develop an integrated web-based application that will facilitate the entire workflow, for now we rely on a system of Google Spreadsheets to manage harvesting activities. 15 | 16 | - Uncrawlable Action spreadsheets 17 | - An Uncrawlable Action spreadsheet contains no more than 500 URLs 18 | - It is divided into tabs (one per role) 19 | - with formulas that help populate rows from one tab to the next 20 | - Each tab contains a number of columns to support various aspects of the role's work 21 | - Each DataRescue event is assigned its own separate Action spreadsheet 22 | - The event participants can continue working on it remotely after the event until it is completed 23 | - Alternatively, they can "give it back" to the organizers at the end of the event 24 | 25 | - An Uncrawlable Index spreadsheet 26 | - A single Uncrawlable Index spreadsheet is used to keep track of all URLs being managed in the project 27 | - Using formulas, it automatically compiles a list of all URLs listed in all Uncrawlable Action spreadsheets. 28 | - This spreadsheet contains only minimal information about each URL (URL, UUID, Done/Not Done) 29 | - It is not meant to be edited in any way. 30 | - However, it can be downloaded and imported into a tool like OpenRefine to help with deduplication. 31 | 32 | - Seeders spreadsheet 33 | - The Chrome extension used by the Seeders automatically populates a separate spreadsheet. 34 | - We have found that keeping that spreadsheet separate helps with the workflow 35 | 36 | ## How each Uncrawlable Action spreadsheet is populated 37 | 38 | It includes URLs coming from two main sources: 39 | 40 | - URLs that were nominated by Seeders at a previous DataRescue event 41 | - URLs that were identified through the Union of Concerned Scientists survey, which asked the scientific community to list the most vulnerable and important data currently accessible through federal websites. 42 | 43 | ## Spreadsheet Minder 44 | 45 | - A Spreadsheet Minder (or Spreadsheet Minder Team) is in charge of managing the Uncrawlable Index and Action spreadsheets 46 | - Here are the tasks they are responsible for: 47 | - Prepare new Uncrawlable Action spreadsheets by moving URLs from the Seeders spreadsheet and the Survey into new Uncrawlable Action spreadsheets. 48 | - Generate UUIDs (see "UUID generation" below) and add them to the Uncrawlable Action spreadsheets, making sure that each URL has a UUID 49 | - Add a rough importance rating (in the "Importance" cell) 50 | - E.g., a URL coming from the Survey would automatically get a high importance rating 51 | - Add the name of the event (in the "Event Name" cell) 52 | - Keep an eye on all spreadsheets, making sure that everything is in order. 53 | - Regularly take snapshots of each spreadsheet in case a problem occurs and the content of a spreadsheet needs to be recovered 54 | - Use the Index spreadsheet to track overall progress (the ratio of Done to Not Done) and to help with deduplication efforts 55 | 56 | ## UUID generation 57 | 58 | - UUIDs are "universal unique IDs" 59 | - Each URL listed in an Uncrawlable spreadsheet is assigned one UUID (in the "UUID" cell). 60 | - Generate enough UUIDs ahead of time and paste them into the UUID column (in the spreadsheet's empty rows). 61 | - The web-based tool [UUID generator](https://www.browserling.com/tools/random-uuid) can generate individual or multiple UUIDs; the short script below does the same locally. 
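A minimal sketch for bulk generation using Python's standard library (the script name and default count are arbitrary):

```python
"""Generate a batch of version 4 UUIDs to paste into the UUID column.
Usage (script name and count are arbitrary): python make_uuids.py 500"""
import sys
import uuid

count = int(sys.argv[1]) if len(sys.argv) > 1 else 10  # default to 10 IDs
for _ in range(count):
    print(str(uuid.uuid4()).upper())  # uppercase to match the IDs used in this guide
```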
62 | -------------------------------------------------------------------------------- /docs/css/extra.css: -------------------------------------------------------------------------------- 1 | li { 2 | font-size: 12pt; 3 | } 4 | 5 | div.note { 6 | font-size: 12pt; 7 | } 8 | -------------------------------------------------------------------------------- /docs/describing.md: -------------------------------------------------------------------------------- 1 | ## What Do Describers Do? 2 | 3 | Describers create a descriptive record in the DataRefuge CKAN repository for each bag. Then they link the record to the bag and make the record public. 4 | 5 |
6 | **Recommended Skills**
7 | Consider this path if you have experience working with scientific data (particularly climate or environmental data) or with metadata practices. 8 |
9 | 10 | ## Getting Set up as a Describer 11 | 12 | - Apply to become a Describer by asking your DataRescue guide or by filling out [this form](https://docs.google.com/a/temple.edu/forms/d/e/1FAIpQLSfh9YIFnDrc-Cuc0hTd-U37J3D8xw8K7VXmzWkPs6Y5Q0wfVg/viewform). 13 | - Note that an email address is required to apply. 14 | - Note also that you should be willing to have your real name be associated with the datasets, to follow archival best practices (see [guidelines on archival best practices for DataRefuge](http://www.ppehlab.org/blogposts/2017/2/1/data-refuge-rests-on-a-clear-chain-of-custody) for more information). 15 | - The organizers of the event (in-person or remote) will send you an invite to the [Archivers app](http://www.archivers.space/), which helps us coordinate all the data archiving work we do. 16 | - Click the invite link, and choose a user name and a password. 17 | - Create an account on the DataRefuge Slack using this [slack-in](https://rauchg-slackin-qonsfhhvxs.now.sh/) (or use the Slack team recommended by your event organizers). This is where people share expertise and answer each other's questions. 18 | - Ask your event organizer to send you an invite. 19 | - The organizers will also create an account for you in the [datarefuge.org](https://www.datarefuge.org/) CKAN instance. 20 | - Test that you can log in successfully. 21 | - Get set up with Python and the [`bagit-python`](https://github.com/LibraryOfCongress/bagit-python) script to make a bag at the command line 22 | - If you need any assistance: 23 | - Talk to your DataRescue guide if you are at an in-person event. 24 | - Or post questions in the DataRefuge Slack `#describers` channel (or other channel recommended by your event organizers). 25 | 26 | ## Claiming a Bag 27 | 28 | - You will work on datasets that were bagged by Baggers. 29 | - Go to the [Archivers app](http://www.archivers.space/), click `URLS` and then `DESCRIBE`: all the URLs listed are ready to be added to the CKAN instance. 30 | - Available URLs are ones that have not been checked out by someone else, i.e. that do not have someone's name in the User column. 31 | - Select an available URL and click its UUID to get to the detailed view, then click `Checkout this URL`. It is now ready for you to work on, and no one else can do anything to it while you have it checked out. 32 | 33 |
34 | **Note: URL vs UUID**
35 | The URL is the link to examine and harvest, and the UUID is a canonical ID we use to connect the URL with the data in question. The UUID will have been generated earlier in the process. UUID stands for Universal Unique Identifier. 36 |
37 | 38 | ## QA Step 39 | 40 | - In the Archivers app, scroll down to the `Describe` section. 41 | - The URL of the zipped bag is in the `Bag Url / Location` field. 42 | - Cut and paste that URL into your browser and download it. 43 | - After downloading, unzip it. 44 | - Spot-check some of the files (make sure they open and look normal, i.e., not garbled). 45 | - If the file fails QA: 46 | - Uncheck the Bagging checkbox. 47 | - Make a note in the `Notes From Bagging` field, explaining in what way the bag failed QA and asking a bagger to please fix the issue. 48 | 49 | ## Create New Record in CKAN 50 | 51 | - Go to [CKAN](https://www.datarefuge.org/) and click Organizations in the top menu. 52 | - Choose the organization (i.e., federal agency) that your dataset belongs to, e.g. `NOAA`, and click it. 53 | - If the Organization you need does not exist yet, create it by clicking `Add Organization`. 54 | - Click "Add Dataset". 55 | - Start entering metadata in the new record, following the metadata template below: 56 | - **Title:** Title of dataset, e.g., "Form EIA-411 Data". 57 | - __Custom Text: DO NOT Fill OUT (this field does not function properly at this time)__ 58 | - **Description:** Usually copied and pasted description found on webpage. 59 | - **Tags:** Basic descriptive keywords, e.g., "electric reliability", "electricity", "power systems". 60 | - **License:** Choose value in dropdown. If there is no indicated license, select "Other - Public Domain". 61 | - **Organization:** Choose value in dropdown, e.g., "United States Department of Energy". 62 | - **Visibility:** Select "Public". 63 | - **Source:** URL where site is live, also in JSON, e.g. "http://www.eia.gov/electricity/data/eia411/". 64 | - To decide what value to enter in each field: 65 | - Open the JSON file that is in the bag you have downloaded; it contains some of the metadata you need. 66 | - Go to the original location of the item on the federal agency website (found in the JSON file), to find more facts about the item such as description, title of the dataset, etc. 67 | - Alternatively, you can also open the HTML file that should be included in the bag and is a copy of that original main page. 68 | 69 | ## Enhancing Existing Metadata 70 | 71 | These sites have federally-sourced metadata that can be added to the CKAN record for more accurate metadata: 72 | 73 | - EPA: 74 | - [https://www.epa.gov/enviro/facility-registry-service-frs](https://www.epa.gov/enviro/facility-registry-service-frs) 75 | - [https://edg.epa.gov/metadata/catalog/main/home.page](https://edg.epa.gov/metadata/catalog/main/home.page) 76 | 77 | These sites are sources of scientific metadata standards to review when choosing keywords: 78 | 79 | - GCMD Keywords, downloadable CSV files of the GCMD taxonomies: 80 | - [https://wiki.earthdata.nasa.gov/display/cmr/gcmd+keyword+access](https://wiki.earthdata.nasa.gov/display/cmr/gcmd+keyword+access) 81 | - ATRAC, a free tool for accessing geographic metadata standards including auto-populating thesauri (GCMD and others commonly used with climate data): 82 | - [https://www.ncdc.noaa.gov/atrac/index.html](https://www.ncdc.noaa.gov/atrac/index.html) 83 | 84 | ## Linking the CKAN Record to the Bag 85 | 86 | - Click "Next: Add Data" at the bottom of the CKAN form. 87 | - Enter the following information: 88 | - **Link:** Bag URL, e.g., `https://drp-upload-bagger.s3.amazonaws.com/remote/77DD634E-EBCE-412E-88B5-A02B0EF12AF6_2.zip`. 89 | - **Name:** filename, e.g., `77DD634E-EBCE-412E-88B5-A02B0EF12AF6_2.zip`. 
90 | - **Format:** select "Zip". 91 | - Click "Finish". 92 | - Test that the link you just created works by clicking it, and verifying that the file begins to download. 93 | - Note that you don't need to finish downloading it again. 94 | - Alternatively, use WGET to test without downloading: `wget --spider [BAG URL]` 95 | 96 | ## Adding the CKAN record to the "Data Rescue Events" group 97 | 98 | - Once the record is created, click the tab `Groups` 99 | - Select `Data Rescue Events` in the dropdown and click `Add to Group`. 100 | - In the future, it will be useful to be able to differentiate that among different groups of records based on how they were generated. 101 | 102 | ## Finishing Up 103 | 104 | - In the Archivers app, add the URL to the CKAN record in the `CKAN URL` field. 105 | - The syntax will be: 106 | `https://www.datarefuge.org//dataset/[datasetNameGeneratedByCkan]` 107 | - Add any useful notes to document your work. 108 | - Check the Describe checkbox (far right on the same line as the "Describe" section heading) to mark that step as completed. 109 | - Click `Save`. 110 | - Click `Checkin this URL`, to release it. 111 | 112 | ## Possible Tools: JSON Viewers 113 | 114 | - [jsoneditoronline.org](http://www.jsoneditoronline.org/) 115 | - [jsonviewer.stack.hu](http://jsonviewer.stack.hu/) 116 | -------------------------------------------------------------------------------- /docs/faq.md: -------------------------------------------------------------------------------- 1 | The [Archivers.space](https://www.archivers.space/) application is extremely fresh, we have some known issues and workarounds documented below. 2 | 3 | ## 1) I'm looking at a URL, but I can't edit anything! 4 | 5 | Make sure you have clicked the big blue button "**Checkout this URL**" near the top. None of the fields can be edited until the URL is checked out. 6 | 7 | ## 2) Why are URLs that may have already been archived by Ann Arbor available to research? 8 | 9 | When selecting a URL to review "`0`" is the *default* priority; it generally means that no-one has reviewed it *UNLESS* it says `MAY HAVE BEEN HARVESTED AT ANN ARBOR`. 10 | 11 | In those cases, assign the priority to "`1`", so the URL drops down in the queue and then *SKIP IT*. 12 | 13 | ## 3) What does it mean if it says _Crawled by Internet Archive_: Yes? 14 | 15 | "_Crawled by Internet Archive_" means the *page itself* was crawled; it may or may not mean the *dataset* was crawled. 16 | 17 | Based on what Heretrix [Can and Can't Crawl](https://edgi-govdata-archiving.github.io/guides/internet-archive-crawler/), you will need to judge whether the dataset will be captured by the Internet Archive crawl and use your best judgement about whether to mark as `Do not harvest`. 18 | 19 | ## 4) How should I handle overly broad sites with just a search form, e.g. noaa.gov? 20 | 21 | In cases like [**noaa.gov**](http://www.noaa.gov/), you have to investigate and try to find the data source a page is referencing and whether or not there is some way to query that data. 22 | In many cases, it might be difficult or near impossible to isolate and query, depending on the kind of database. 23 | 24 | Complete the **Research** section to the best of your abilities, especially the _Recommended Approach for Harvesting Data_. 25 | 26 | ## 5) Do we have a scripting system set up preserving data or data endpoints that are updated regularly? 27 | 28 | Not yet; addressing these datasets is a goal going-forward. 
29 | 30 | Currently, indicate in the notes in both the **Research** and **Harvest** sections that the dataset is updated regularly, and mark it complete anyway (note decision per @mattprice/this FAQ). 31 | 32 | ## 6) What if I have a site and want to know if it has been crawled already? 33 | 34 | Internet Archive has both a [Wayback Machine Chrome Extension](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak) and [APIs](https://archive.org/help/wayback_api.php) you can use to check if something has been archived: 35 | 36 | - [**Wayback Machine Chrome Extension**](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak) 37 | - You can also check on the Internet Archive site directly at [archive.org/web/](https://archive.org/web/) 38 | - [**Wayback CDX Server API**](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) 39 | 40 | There is a `check-ia` script in the [harvesting tools](https://github.com/edgi-govdata-archiving/harvesting-tools/tree/master/url-check) for batch URLs. 41 | 42 | ## 7) What if the site has in fact been crawled well by the Internet Archive? 43 | 44 | If the site includes only crawlable data, then there is no need to harvest it. These should be marked `Do not harvest` in the **Research** phase. 45 | 46 | If the site includes one of the forms of uncrawlable content: 47 | 1) FTP 48 | 2) Many Files 49 | 3) Database 50 | 4) Visualization/Interactive 51 | 52 | Then mark accordingly in and harvest the datasets. 53 | 54 | ## 8) What does it mean when it says "checking out this url will unlock url: xxx"? 55 | 56 | That means you have another URL checked out. In order to avoid an overlap in efforts, when you check out a URL only you can work on it. By checking out a new URL the previous one will be unlocked. 57 | 58 | ## 9) What do I do when there is stuff listed in the Link URL section? 59 | 60 | If there are a bunch of sub-sites listed that are **not** links, then you are on the master entry; the child entries are therefore just advisory and you should try to make sure that your harvesting includes all of the datasets contained across them, but otherwise keep going. 61 | 62 | If the Link URL section has a single URL listed and it's a link, you are on a child item, which is the wrong place. Click the link and work on the master record. 63 | 64 | ## 10) How do I partition large "parent" URLs (e.g., to reduce the size of the download < 5 GB)? 65 | 66 | From the [overview pane](https://www.archivers.space/urls?phase=research), click `Add Url` on the top right side of app. Add a URL for each child and enter a description indicating these new URLs are children to the "parent URL". Make sure the priority of each child is the same as the parent. 67 | 68 | Check out the parent URL, and under **Research** use "Link Url" link it to all of its children and add a description. Make sure the priority of each child is the same as the parent. Start harvesting each child. 69 | 70 | ## 11) Wifi is kind of Slow, are there workarounds for a faster connection? 71 | 72 | 1. Do as much of the work as possible remotely: spin up a VM (e.g., AWS EC2, Digital Ocean droplet) or something, `ssh` to those machines and do the downloading to there. The fewer people that are using the bandwidth onsite for big things, the less congestion this network will have. 73 | 74 | 2. Tether your phone :), thought if you do be mindful of bandwidth caps and don't forget to plug in your charger! 
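As an illustration of the first workaround above, here is a minimal sketch of doing the heavy downloading on a remote machine over `ssh` — the hostname, paths, and URL are placeholders for a VM you control, not part of any official tooling:

```sh
# Run the large download on a remote VM instead of the venue Wi-Fi ("datarescue-vm" is a placeholder host).
ssh datarescue-vm 'mkdir -p ~/harvest/data && cd ~/harvest/data && wget --recursive --no-parent --continue "https://example.gov/dataset/"'

# Later, pull the results back down (or upload them to S3 directly from the VM).
scp -r datarescue-vm:harvest/data ./
```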
75 | 76 | ## 12) Why can't I edit the harvesting section? 77 | 78 | Archivers is set up such that each URL moves through the stages of the workflow in sequence. In order to edit the **Harvesting** section, you will first need to mark **Research** as complete. Look for the checkbox on the right-hand side at the top of the **Research** section. Once you've checked it, make sure to hit `Save`. 79 | 80 | ## 13) When harvesting, why doesn't clicking on the `Download Zipstarter` button work? 81 | 82 | Unfortunately this is a known issue. Make sure you've marked **Research** complete. Try reloading the page, or switching browsers if you can. 83 | 84 | The App is not compatible with Safari. 85 | 86 | ## 14) In the **Research** section, what are all the checkboxes for? 87 | 88 | Please read the DataRescue Workflow documentation for more info! 89 | 90 | ## 15) I have a process improvement that would make this go better! 91 | 92 | Great! Open an issue in the [archivers.space GitHub repository](https://github.com/edgi-govdata-archiving/archivers.space/), or report it in the appropriate channel in your Slack team. 93 | 94 | ## 16) How do I add a new event? 95 | 96 | Admins can add events under the "Events" tab. Regular users will have to ask an admin for help! 97 | 98 | ## 17) What is the difference between Crawlable and Harvested? 99 | 100 | [Researchers](researching.md) investigate whether URLs listed in the [Archivers app](http://www.archivers.space/urls) need to be manually downloaded (harvested) or if they can be automatically saved (crawled) by the Internet Archive. The URLs listed as [Crawlable](http://www.archivers.space/urls?phase=crawlable) were determined as such during that research phase and are submitted to the Internet Archive, they do not need to be harvested. 101 | 102 | These URLs represent only a small portion of those submitted to the Internet Archive from DataRescue events. Most crawlable URLs are identified by [Seeders](seeding.md) at the beginning of the workflow and completely bypass the Archivers app. 103 | -------------------------------------------------------------------------------- /docs/harvesting.md: -------------------------------------------------------------------------------- 1 | ## What Do Harvesters Do? 2 | 3 | Harvesters take the "uncrawlable" data and try to figure out how to actually capture it based on the recommendations of the Researchers. This is a complex task which can require substantial technical expertise, and which requires different techniques for different tasks. 4 | 5 |
6 | **Recommended Skills**
7 | Consider this path if you're a skilled technologist: you know a programming language of your choice (e.g., Python, JavaScript, C), are comfortable with the command line (bash, shell, PowerShell), or have experience working with structured data. Experience in front-end web development is a plus. 8 |
9 | 10 | ## Getting Set up as a Harvester 11 | 12 | - The organizers of the event (in-person or remote) will tell you how to volunteer for the Harvester role, either through Slack or a form. 13 | - They will send you an invite to the [Archivers app](http://www.archivers.space/), which helps us coordinate all the data archiving work we do. 14 | - Click the invite link, and choose a username and a password. It is helpful to use the same username on the app and Slack. 15 | - Create an account on the DataRefuge Slack using this [slack-in](https://rauchg-slackin-qonsfhhvxs.now.sh/) (or use the Slack team recommended by your event organizers). This is where people share expertise and answer each other's questions. 16 | - You might also need other software and utilities set up on your computer, depending on the harvesting methods you use. 17 | - Harvesters should start by reading this document, which outlines the steps for constructing a proper data archive of the highest possible integrity. The primary focus of this document is on _semi-automated harvesting as part of a team_, and the workflow described is best-suited for volunteers working to preserve small and medium-sized collections. Where possible, we try to link out to other options appropriate to other circumstances. 18 | - If you need any assistance: 19 | - Talk to your DataRescue guide if you are at an in-person event 20 | - Or post questions in the DataRefuge Slack `#harvesters` channel (or other channel recommended by your event organizers). 21 | 22 |
23 | **Researchers and Harvesters**
24 |
25 | - Researchers and Harvesters should coordinate, as their work is closely related and benefits from close communication.
26 | - It may be most effective to work together in pairs or small groups, or for a single person to both research and harvest.
27 | - As a Harvester, make sure to check out the Researching documentation to familiarize yourself with that role.
28 |
29 |
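The setup checklist above notes that you may need additional software and utilities depending on the harvesting methods you use. As one possible starting point — a sketch, not a requirement, assuming Homebrew on macOS or apt on Debian/Ubuntu — a minimal command-line toolkit might look like:

```sh
# macOS (Homebrew)
brew install wget jq awscli
# Debian/Ubuntu
sudo apt-get install wget jq awscli
# Python helpers used by several of the harvesting tools
pip3 install requests beautifulsoup4
```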
30 | 31 | ## Harvesting Tools 32 | 33 | For in-depth information on tools and techniques to harvest open data, please check EDGI's extensive [harvesting tools](https://github.com/edgi-govdata-archiving/harvesting-tools). 34 | 35 | ## 1. Claiming a Dataset to Harvest 36 | 37 |
38 | **Using Archivers App**
39 | Review our walkthrough video and refer to the FAQ for any additional questions about the Archivers app.
40 |
41 | *(walkthrough video embedded here in the rendered documentation)*
42 |
43 | 44 | - You will work on datasets that were confirmed as uncrawlable by Researchers. 45 | - Go to the [Archivers app](http://www.archivers.space/), click `URLS` and then `HARVEST`: all the URLs listed are ready to be harvested. 46 | - Available URLs are the ones that have not been checked out by someone else, i.e. that do not have someone's name in the User column. 47 | - Select an available URL and click its UUID to get to the detailed view, then click `Checkout this URL`. It is now ready for you to work on, and no one else can do anything to it while you have it checked out. 48 | - While you go through the harvesting process, make sure to report as much information as possible in the Archivers app, as this is the place where we collectively keep track of all the work done. 49 | 50 |
51 | **URL vs UUID**
52 | The URL is the link to examine and harvest; the UUID is a canonical ID we use to connect the URL with the data in question. The UUID will have been generated earlier in the process. UUID stands for Universally Unique Identifier. 53 |
54 | 55 | ## 2. Investigate the Dataset 56 | 57 |
58 | **A "Meaningful Dataset"**
59 | Your role is to harvest datasets that are complete and meaningful, by which we mean: "Will the dataset make sense to a scientist?"
60 | For instance, if a dataset is composed of a spreadsheet without any accompanying key or explanation of what the data represents, it might be completely impossible for a scientist to use it. 61 |
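Subsection 2a below recommends checking what the crawler itself would see (e.g., via "View source"). As a quick command-line preview — a sketch only, with placeholder URLs — you can do the same check with `curl`:

```sh
# Placeholder URLs; substitute the URL you have checked out.
curl -sI "https://example.gov/data/report.pdf" | grep -i '^content-type'   # what mimetype is served?
curl -s "https://example.gov/downloads/" | grep -c '<a href'               # are links present in the raw HTML?
```

If the links only appear after JavaScript runs in a browser, the crawler will not see them, which is a strong hint the content needs manual harvesting.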
62 | 63 | ### 2a. Classify Source Type & Archivability 64 | 65 | Before doing anything, take a minute to understand what you're looking at. It's usually best to do a quick check of the URL to confirm that this data in fact not crawlable. Often as part of the harvesting team, you'll be the first person with a higher level of technical knowledge to review the URL in question. 66 | 67 | #### Check for False-Positives (Content That Is in Fact Crawlable) 68 | 69 | Generally, any URL that returns standard HTML, links to more [HTML mimetype pages](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types), and contains little-to-no non-HTML content, is crawlable. "View source" from your browser of choice will help see what the crawler itself is seeing. If in fact the data can be crawled, nominate it to the Internet Archive using the [EDGI Nomination Chrome Extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok), click the `Do not harvest` checkbox in the Research section of the Archivers app, click `Checkin this URL`, and move on to another URL. 70 | 71 | A written guide on using the Chrome Nomination tool, the EDGI Primer Database, and a video tutorial are available [in Seeders' Documentation](seeding/#crawlable-urls). 72 | 73 | #### Some Things to Think About While Reviewing a URL 74 | 75 | - Does this page use JavaScript to render its content, especially to _generate links_ or _dynamically pull up images and PDF content_? Crawlers generally cannot parse dynamically generated content. 76 | - Does this URL contain links to non-HTML content? (For example, zip files, PDFs, Excel files, etc...) 77 | - Is this URL some sort of interface for a large database or service? (For example, an interactive map, API gateway, etc.) 78 | - Does this URL contain instructions for connecting to a server, database, or other special source of data? 79 | 80 | #### Check the Terms of Service! 81 | 82 | Before you go any further, it is _always_ worth confirming that the data in question is in fact open for archiving. If the terms of service explicitly prohibit archiving, _make a note of it_. Generally archive-a-thons are purposely only aimed at publically available data, but it is easy to follow a link away from a publically available source onto a site that has different terms of service. 83 | 84 | _**Data acquired outside terms of service is not usable.**_ 85 | 86 | ### 2b. Determine Scale of the Dataset 87 | 88 | If the dataset you're looking at is quite large -- say, more than 1000 documents -- capturing it may require more elaborate programming than is described here, and it may be difficult to complete in the timeframe of the event. In that case, you may want to look outside the scope of this document and read the documentation of tools such as the [EIS WARC archiver](https://github.com/edgi-govdata-archiving/eis-WARC-archiver), which shows how to initiate a larger, fully automated harvest on a web-based virtual machine. Talk to your DataRescue guide to determine how to best proceed. 89 | 90 | ## 3. Generate HTML, JSON & Directory 91 | 92 | To get started, click `Download Zip Starter`, which will download an empty zip archive structure for the data you are about to harvest. 
93 | The structure looks like this: 94 | 95 | ```no-highlight 96 | DAFD2E80-965F-4989-8A77-843DE716D899 97 | ├── DAFD2E80-965F-4989-8A77-843DE716D899.html 98 | ├── DAFD2E80-965F-4989-8A77-843DE716D899.json 99 | ├── /tools 100 | └── /data 101 | ``` 102 | 103 | Each row in the above is: 104 | 105 | ```no-highlight 106 | A directory named by the UUID 107 | ├── a .html "web archive" file of the URL for future reference, named with the ID 108 | ├── a .json metadata file that contains relevant metadata, named with the ID 109 | ├── a /tools directory to include any scripts, notes & files used to acquire the data 110 | └── a /data directory that contains the data in question 111 | ``` 112 | 113 | ### Folder Structure 114 | 115 | #### UUID 116 | 117 | The goal is to pass this finalized folder off for ["bagging"](bagging.md). We repeatedly use the UUID so that we can programmatically work through this data later. It is important that the ID be copied _exactly_ wherever it appears, with no leading or trailing spaces, and honoring case-sensitivity. 118 | 119 | #### [UUID].html file 120 | 121 | The zip starter archive will automatically include a copy of the page corresponding to the URL. The HTML file gives the archive a snapshot of the page at the time of archiving which we can use to monitor for changing data in the future, and corroborate the provenance of the archive itself. We can also use the `.html` in conjunction with the scripts you'll include in the tools directory to replicate the archive in the future. 122 | 123 | #### [UUID].json file 124 | 125 | You'll need to inspect the .json manifest to be sure all fields are correct. This file contains vital data, including the url that was archived and date of archiving. The manifest should contain the following fields: 126 | 127 | ``` 128 | { 129 | "Date of capture": "", 130 | "File formats contained in package": "", 131 | "Free text description of capture process": "", 132 | "Individual source or seed URL": "", 133 | "Institution facilitating the data capture creation and packaging": "", 134 | "Name of package creator": "", 135 | "Name of resource": "", 136 | "Type(s) of content in package": "", 137 | "recommended_approach": "", 138 | "significance": "", 139 | "title": "", 140 | "url": "" 141 | } 142 | ``` 143 | 144 | #### [UUID]/tools/ 145 | 146 | Directory containing any scripts, notes & files used to acquire the data. Put any scripts you write or tools you use into this directory. This is useful in case new data needs to be archived from the same site again at a later date. 147 | 148 | #### [UUID]/data/ 149 | 150 | Directory containing the data in question. 151 | 152 | ## 4. Acquire the Data 153 | 154 | Your method for doing this will depend on the shape and size of the data you're dealing with. A few methods are described below. 155 | 156 | ### 4a. Identify Data Links & Acquire Them in a wget Loop 157 | 158 | If you encounter a page that links to lots of data (for example a "downloads" page), this approach may work well. It's important to only use this approach when you encounter _data_, for example PDF's, .zip archives, .csv datasets, etc. 159 | 160 | The tricky part of this approach is generating a list of URLs to download from the page. If you're skilled with using scripts in combination with html-parsers (for example python's wonderful [beautiful-soup package](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)), go for it. 
Otherwise, we've included the [jquery-url-extraction guide](https://github.com/edgi-govdata-archiving/harvesting-tools/tree/master/jquery-url-extraction), which has the advantage of working within a browser and can operate on a page that has been modified by JavaScript. 161 | 162 | Our example dataset uses jquery-URL, [leveraging that tool to generate a list of URLs to feed the wget loop](https://github.com/edgi-govdata-archiving/harvesting-tools/tree/master/jquery-url-extraction/README.md). 163 | 164 | ### 4b. Identify Data Links & Acquire Them via WARCFactory 165 | 166 | For search results from large document sets, you may need to do more sophisticated "scraping" and "crawling" -- again, check out tools built at previous events such as the [EIS WARC archiver](https://github.com/edgi-govdata-archiving/eis-WARC-archiver) or the [EPA Search Utils](https://github.com/edgi-govdata-archiving/epa-search-utils) for ideas on how to proceed. 167 | 168 | ### 4c. FTP Download 169 | 170 | Government datasets are often stored on FTP. It's pretty easy to crawl these FTP sites with a simple Python script. Have a look at [download_ftp_tree.py](https://github.com/edgi-govdata-archiving/harvesting-tools/tree/master/ftp/download_ftp_tree.py) as an example. Note that the Internet Archive is doing an FTP crawl, so another option (especially if the dataset is large) would be to nominate this as a seed (though FTP seeds should be nominated **separately** from HTTP seeds). 171 | 172 | ### 4d. API Scrape / Custom Solution 173 | 174 | If you encounter an API, chances are you'll have to build some sort of custom solution, or investigate a social angle. For example: asking someone with greater access for a database dump. 175 | 176 | ### 4e. Automated Full Browser 177 | 178 | The last resort of harvesting should be to drive it with a full web browser. It is slower than other approaches such as `wget`, `curl`, or a headless browser. Additionally, this implementation is prone to issues where the resulting page is saved before it's done loading. There is a [ruby example](https://github.com/edgi-govdata-archiving/harvesting-tools/tree/master/ruby-watir-collect). 179 | 180 | ### Tips 181 | 182 | - If you encounter a search bar, try entering "*" to see if that returns "all results". 183 | - Leave the data unmodified. During the process, you may feel inclined to clean things up, add structure to the data, etc. Avoid temptation. Your finished archive will be hashed so we can compare it later for changes, and it's important that we archive original, unmodified content. 184 | 185 | ## 5. Complete [UUID].json & Add /tools 186 | 187 | From there you'll want to complete the [UUID].json. Use the template below as a guide. 
188 | 189 | - The json should match the information from the Researcher and use the following format: 190 | 191 | ``` 192 | { 193 | "Date of capture": "Fri Feb 24 2017 21:44:07 GMT-0800 (PST)", 194 | "File formats contained in package": ".xls, .zip", 195 | "Free text description of capture process": "Metadata was generated by viewing page, data was bulk downloaded using download_ftp_tree.py and then bagged.", 196 | "Individual source or seed URL": "ftp://podaac-ftp.jpl.nasa.gov/allData/nimbus7", 197 | "Institution facilitating the data capture creation and packaging": "DataRescue SF Bay", 198 | "Name of package creator": "JohnDoe", 199 | "Name of resource": "Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984", 200 | "Type(s) of content in package": "These files are the Wentz Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984. There are 72 files which breaks down to 1 file per month. So, NIMBUS7-SMMR00000.dat is January 1979, NIMBUS7-SMMR00001.dat is February 1979, etc.", 201 | "UUID": "F3499E3B-7517-4C06-A661-72B4DA13A2A2", 202 | "recommended_approach": "", 203 | "significance": "These files are the Wentz Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984. There are 72 files which breaks down to 1 file per month. So, NIMBUS7-SMMR00000.dat is January 1979, NIMBUS7-SMMR00001.dat is February 1979, etc.", 204 | "title": "Nimbus-7 SMMR global 60km gridded ocean parameters for 1979 - 1984", 205 | "url": "ftp://podaac-ftp.jpl.nasa.gov/allData/nimbus7" 206 | } 207 | ``` 208 | 209 | - Make sure to save this as a .json file. 210 | 211 | In addition, copy any scripts and tools you used into the /tools directory. It may seem strange to copy code multiple times, but this can help later to reconstruct the archiving process for further refinement later on. 212 | 213 | It's worth using some judgement here. If a "script" you used includes an entire copy of JAVA, or some suite beyond a simple script, it may be better to document your process in a file and leave that in the tools directory instead. 214 | 215 | ## 6. Uploading the data 216 | 217 | - Zip all the files pertaining to your dataset within the zip started archive structure and confirm that it is named with the original UUID. 218 | - Upload the zip file by selecting `Choose File` and then clicking `Upload` in the Archivers app. 219 | - Note that files beyond 5 Gigs must be uploaded through the more advanced `Generate Upload Token` option. This will require using the aws command line interface. 220 | - Please talk to your DataRescue guide or post on Slack in the Harvesters channel, if you are having issues with this more advanced method. 221 | 222 | 225 | 226 | ## 7. Finishing up 227 | 228 | - In the Archivers app, make sure to fill out as much information as possible to document your work. 229 | - Check the Harvest checkbox (far right on the same line as the "Harvest" section heading) to mark that step as completed. 230 | - Click `Save`. 231 | - Click `Checkin this URL`, to release it and allow someone else to work on the next step. 232 | - You're done! Move on to the next URL! 
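For reference, here is a minimal sketch of the wget loop described in step 4a above, assuming the Zip Starter has been unzipped into a folder named after the UUID and that you have already extracted the data links into a `tools/urls.txt` file (the UUID shown matches the structure example above; the `urls.txt` name is just a convention, not something the app produces):

```sh
UUID="DAFD2E80-965F-4989-8A77-843DE716D899"   # substitute your URL's UUID
cd "$UUID"
# Download every link listed in tools/urls.txt into the data/ directory,
# skipping files that are already present.
wget --no-clobber --directory-prefix=data --input-file=tools/urls.txt
# Keep urls.txt and any scripts in tools/ so the harvest can be reproduced later.
```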
233 | -------------------------------------------------------------------------------- /docs/img/button-checkin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-checkin.png -------------------------------------------------------------------------------- /docs/img/button-checkout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-checkout.png -------------------------------------------------------------------------------- /docs/img/button-save.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-save.png -------------------------------------------------------------------------------- /docs/img/button-upload-token.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-upload-token.png -------------------------------------------------------------------------------- /docs/img/button-upload.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-upload.png -------------------------------------------------------------------------------- /docs/img/button-zipstarter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/button-zipstarter.png -------------------------------------------------------------------------------- /docs/img/harvest-00-overview-homepage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-00-overview-homepage.png -------------------------------------------------------------------------------- /docs/img/harvest-01-UUID.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-01-UUID.png -------------------------------------------------------------------------------- /docs/img/harvest-01-zipstarter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-01-zipstarter.png -------------------------------------------------------------------------------- /docs/img/harvest-02-toolbar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-02-toolbar.png -------------------------------------------------------------------------------- /docs/img/harvest-02-upload.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-02-upload.png 
-------------------------------------------------------------------------------- /docs/img/harvest-03-upload-awstoken.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-03-upload-awstoken.png -------------------------------------------------------------------------------- /docs/img/harvest-04-add-details.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-04-add-details.png -------------------------------------------------------------------------------- /docs/img/harvest-05-notes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/harvest-05-notes.png -------------------------------------------------------------------------------- /docs/img/research-00-overview-homepage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-00-overview-homepage.png -------------------------------------------------------------------------------- /docs/img/research-01-title.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-01-title.png -------------------------------------------------------------------------------- /docs/img/research-02-dnh-section.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-02-dnh-section.png -------------------------------------------------------------------------------- /docs/img/research-03-recommended-approach.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-03-recommended-approach.png -------------------------------------------------------------------------------- /docs/img/research-04-formats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-04-formats.png -------------------------------------------------------------------------------- /docs/img/research-05-estimated-size.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-05-estimated-size.png -------------------------------------------------------------------------------- /docs/img/research-06-linkURL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-06-linkURL.png -------------------------------------------------------------------------------- /docs/img/research-07-check-as-done.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datarefuge/workflow/c6d688b4694b808b5992db2f5a1c0f9bc257561f/docs/img/research-07-check-as-done.png -------------------------------------------------------------------------------- /docs/index.md: -------------------------------------------------------------------------------- 1 | # DataRescue Workflow 2 | 3 | # Note that this workflow is being retired and will no longer be supported as of May 21 2017 4 | 5 | This document describes DataRescue activities at both in-person events and remotely, as developed by the [DataRefuge project](http://www.ppehlab.org/) and [EDGI](https://envirodatagov.org/). It explains the process that a URL/dataset goes through from the time it has been identified, either by a [Seeder](seeding.md) as difficult to preserve, or "uncrawlable," until it is made available as a record in the [datarefuge.org](http://www.datarefuge.org) data catalog. The process involves several stages and is designed to maximize smooth hand-offs. At each step the data is with someone with distinct expertise and the data is always being tracked for security. 6 | 7 | ********************** 8 | 9 | ## Event Organizers 10 | 11 | Learn about what you need to do to [before](/organizing/pre-event.md) and [after](/organizing/post-event.md) an event. 12 | 13 | ## Event Attendees 14 | 15 | - Join the event Slack team recommended by event organizers, this is often the [DataRefuge Slack](https://rauchg-slackin-qonsfhhvxs.now.sh/). During the event people share expertise and answer each other's questions here. 16 | - Pick your role from the paths below, get account credentials, and make sure you have access to the key documents and tools you need to work. Organizers will instruct you on these steps. 17 | - Review the relevant sections(s) of this workflow. 18 | 19 | ### Path I. Surveying 20 | 21 | #### [Surveying](surveying.md) 22 | 23 | Surveyors identify key programs, datasets, and documents on Federal Agency websites that are vulnerable to change and loss. Using templates and how-to guides, they create Main Agency Primers in order to introduce a particular agency, and Sub-Agency Primers in order to guide web archiving efforts by laying out a list of URLs that cover the breadth of an office. 24 | 25 | ### Path II. Website Archiving 26 | 27 | #### [Seeding](seeding.md) 28 | 29 | Seeders canvass the resources of a given government agency, identifying important URLs. They identify whether those URLs can be crawled by the [Internet Archive's](http://archive.org) web crawler. Using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en), Seeders nominate crawlable URLs to the Internet Archive or add them to the Archivers app if they require manual archiving. 30 | 31 | ### Path III. Archiving More Complex Datasets 32 | ### A. [Researching](researching.md) 33 | 34 | Researchers inspect the "uncrawlable" list to confirm that Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable), and investigate how the dataset could be best harvested. [Researching.md](researching.md) describes this process in more detail. 35 | 36 | *We recommend that Researchers and Harvesters (see below) work together in pairs, as much communication is needed between the two roles. In some cases, one person will fulfill both roles.* 37 | 38 | ### B. 
[Harvesting](harvesting.md) 39 | 40 | Harvesters take the "uncrawlable" data and try to figure out how to actually capture it based on the recommendations of the Researchers. This is a complex task which can require substantial technical expertise and different techniques for different tasks. Harvesters should also review the [Harvesting Toolkit](https://github.com/edgi-govdata-archiving/harvesting-tools) for tools. 41 | 42 | ### C. [Checking/Bagging](bagging.md) 43 | 44 | Checkers inspect a harvested dataset and make sure that it is complete. The main question the checkers need to answer is "will the bag make sense to a scientist"? Checkers need to have an in-depth understanding of harvesting goals and potential content variations for datasets.
**Note: Checking is currently performed by Baggers and does not exist as a separate stage in the Archivers app.** 45 | 46 | Baggers perform some quality assurance on the dataset to make sure the content is correct and corresponds to the original URL. Then they package the data into a bagit file (or "bag"), which includes basic technical metadata, and upload it to the final DataRefuge destination. 47 | 48 | ### D. [Describing](describing.md) 49 | 50 | Describers create a descriptive record in the DataRefuge CKAN repository for each bag. Then they link the record to the bag and make the record public. 51 | 52 | ********************** 53 | 54 | ## Partners 55 | 56 | DataRescue is a broad, grassroots effort with support from numerous local and nationwide networks. [DataRefuge](http://www.ppehlab.org/datarefuge/) and [EDGI](https://envirodatagov.org/) partner with local organizers in supporting these events. See more of our institutional partners on the [DataRefuge home page](http://www.ppehlab.org/datarefuge#partners). 57 | -------------------------------------------------------------------------------- /docs/organizing/post-event.md: -------------------------------------------------------------------------------- 1 | 2 | ## Key Steps 3 | 4 | 1. **Schedule a debrief call** with EDGI regarding the Seeding status using the [Agency Primers](https://envirodatagov.org/agencyprimers/) 5 | 1. **Follow up about the final disposition of the data** with [DataRefuge](http://www.ppehlab.org/) and [EDGI](https://envirodatagov.org/) 6 | - Are there large datasets that need to be uploaded to S3 by an authorized person? 7 | - Has someone taken responsibility for handing seeds off to the Internet Archive? 8 | 1. **Provide feedback about the event** processes to other DataRescue groups 9 | 10 | ## Ongoing Involvement 11 | 12 | DataRescue is a coalition and a movement. When the event is over, and exhaustion is setting in over a couple of rounds... there is still work to do. 13 | 14 | - Participants might want to continue the work started at the event, this is possible as our workflow is meant to function in-person as well as remotely. 15 | - There are emerging opportunities posted in the [DataRefuge](https://rauchg-slackin-qonsfhhvxs.now.sh/) and [Archivers](https://archivers-slack.herokuapp.com/) Slack teams about how to be involved in the data protection movement to contribute your ideas, energy, and good will to building a sustainable future for knowledge. 16 | -------------------------------------------------------------------------------- /docs/organizing/pre-event.md: -------------------------------------------------------------------------------- 1 | Below we've outlined the critical technical considerations for planning a DataRescue event. 2 | 3 | ## Key Steps 4 | 5 | 1. **Read the [DataRescue Paths](https://docs.google.com/document/d/19A_0W1QWBgaiu42XPjMyV5BPn8Wjw4vc01v82S-uT5w/edit)** available as part of [DataRefuge's Overview](http://www.ppehlab.org/datarescueworkflow) or [EDGI's DataRescue Event Toolkit](https://envirodatagov.org/event-toolkit/). 6 | 1. **Join the [DataRefuge Slack team](https://rauchg-slackin-qonsfhhvxs.now.sh/)** and start a channel for your event. 7 | 1. **Review the [workflow documentation](https://datarefuge.github.io/workflow/)** and decide which paths your event will have. 8 | 1. 
**Schedule a call with DataRefuge** to: 9 | - review the workflow and confirm event logistics like volunteer support 10 | - receive access to the [Archivers app](http://www.archivers.space/) to archive complex datasets 11 | 1. **Schedule a call with EDGI** to: 12 | - receive training on using [Agency Primers](https://envirodatagov.org/agencyprimers/) and EDGI's [Chrome Extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok) to identify and preserve web pages on federal government web sites 13 | - receive an orientation on event [harvesting tools](https://github.com/edgi-govdata-archiving/harvesting-tools) 14 | 15 | ## Event Preservation Tools 16 | 17 | ### Archivers App 18 | 19 | A DataRefuge organizer will set up your event in the app and coordinate initial account creation. The [Archivers app](http://www.archivers.space/) enables us to keep track of all the DataRescue event preservation and coordinate the work across different roles. 20 | 21 | The app includes URLs coming from two main sources: 22 | - URLs nominated by Seeders at previous DataRescue events 23 | - URLs identified by a Union of Concerned Scientists survey which asked the scientific community to list the most vulnerable and important data currently accessible through federal websites. 24 | 25 | ### Agency Primers and Chrome Extension for Seeding 26 | 27 | An EDGI coordinator will set up access to Agency Primer and Sub-primer documents as well as a seed progress spreadsheet. These documents will inform the work of the Seeders at your event. They will tell them which website or website sections they should be focusing on for URL discovery. 28 | 29 |
30 | **Crawl vs. Harvest: Where is the Data Stored?**
31 | The workflow is designed to triage whether a URL will be stored by the Internet Archive or in the DataRefuge repository, based on whether it can be automatically crawled by the Internet Archive web crawler or needs to be manually harvested.
32 |
33 | - Nominating crawlable URLs makes use of the Internet Archive's existing infrastructure. See Seeding for more information on this process.
34 | - Manually harvested datasets are uploaded through the Archivers app to Amazon S3 storage managed by DataRefuge.
35 |
36 |
37 | 38 | ## Permissions and Credentials 39 | 40 | - **All** Path II Attendees need to have an account on the [Archivers app](http://www.archivers.space/). 41 | - You will need to generate invites for each one [within the app](http://www.archivers.space/invites/new), and paste the URL generated in a Slack Direct Message or email. 42 | - Each participant invited will automatically "belong" to your event in the app. 43 | - In addition, Checkers and Baggers need to be given additional privileges in the app to access the Checking (i.e. "Finalize") and Bagging sections. 44 | 45 | ## Technical Resources 46 | 47 | - Access to Wi-Fi 48 | - Extra Power Strips and Extension Cords 49 | - Backup storage (e.g., large (>16GB) thumb drives) 50 | - Backup cloud compute resources 51 | -------------------------------------------------------------------------------- /docs/researching.md: -------------------------------------------------------------------------------- 1 | ## What Do Researchers Do? 2 | 3 | Researchers review "uncrawlables" identified during [Seeding](seeding.md), confirm the URL/dataset is indeed uncrawlable, and investigate how the dataset could be best harvested. Researchers need to have a good understanding of harvesting goals and have some familiarity with datasets. 4 | 5 |
6 | **Recommended Skills**
7 | Consider this path if you have strong front-end web experience and enjoy research. An understanding of how federal data is organized (e.g., where "master" datasets are) would be valuable. 8 |
9 | 10 | ## Getting Set up as a Researcher 11 | 12 | - Event organizers (in-person or remote) will tell you how to volunteer for the Researcher role, either through Slack or a form. 13 | - They will send you an invite to the [Archivers app](http://www.archivers.space/), which helps us coordinate all the data archiving work we do. 14 | - Click the invite link, and choose a username and a password. It is helpful to use the same username on the app and Slack. 15 | - Create an account on the DataRefuge Slack using this [slack-in](https://rauchg-slackin-qonsfhhvxs.now.sh/) or use the Slack team recommended by your event organizers. This is where people share expertise and answer each other's questions. 16 | - If you need any assistance: 17 | - Talk to your DataRescue guide if you are at an in-person event 18 | - Or post questions in the DataRefuge Slack `#researchers` channel (or other channel recommended by your event organizers). 19 | 20 |
21 | **Researchers and Harvesters**
22 |
23 | - Researchers and Harvesters should coordinate, as their work is closely related and benefits from close communication.
24 | - It may be most effective to work together in pairs or small groups, or for a single person to both research and harvest.
25 | - As a Researcher, make sure to check out the Harvesting documentation to familiarize yourself with that role.
26 |
27 |
28 | 29 | ## Claiming a Dataset to Research 30 | 31 |
32 | **Using Archivers App**
33 | Review our walkthrough video and refer to the FAQ for any additional questions about the Archivers app.
34 |
35 | *(walkthrough video embedded here in the rendered documentation)*
36 |
37 | 38 | - Researchers work on datasets that were listed as uncrawlable by Seeders. 39 | - Go to the [Archivers app](http://www.archivers.space/), click `URLS` and then `RESEARCH`: all the URLs listed are ready to be researched. 40 | - Available URLs are ones that have not been checked out by someone else, i.e. that do not have someone's name in the User column. 41 | - Priority is indicated by the “!” field. The range is from 0 to 10, with 10 being highest priority. 42 | - Select an available URL (you may decide to select a URL relevant to your area of expertise or assigned a high priority) and click its UUID to get to the detailed view, then click `Checkout this URL`. It is now ready for you to work on, and no one else can do anything to it while you have it checked out. 43 | - While you go through the research process, make sure to report as much information as possible in the Archivers app, as this is the place where we collectively keep track of all the work done. 44 | 45 |
46 | **URL vs UUID**
47 | The URL is the link to examine and harvest; the UUID is a canonical ID we use to connect the URL with the data in question. The UUID will have been generated earlier in the process. UUID stands for Universally Unique Identifier. 48 |
49 | 50 | ## Evaluating the Data 51 | 52 | Go to the URL and start inspecting the content. 53 | 54 | ### Is the data actually crawlable? 55 | 56 | Again, see [EDGI's Guides](https://edgi-govdata-archiving.github.io/guides/) for a mostly non-technical introduction to the crawler: 57 | 58 | - [Understanding the Internet Archive Web Crawler](https://edgi-govdata-archiving.github.io/guides/internet-archive-crawler/) 59 | - [Seeding the Internet Archive’s Web Crawler](https://edgi-govdata-archiving.github.io/guides/seeding-internet-archive/) 60 | - A written guide on using the Chrome Nomination tool, the EDGI Primer Database, and a video tutorial are available [in Seeders' Documentation](seeding/#crawlable-urls) 61 | 62 | Some additional technical notes for answering this: 63 | 64 | - There is no specific file size cutoff for what is crawlable, but large files should be manually captured anyway. 65 | - File types like ZIP, PDF, Excel, etc. are crawlable if they are linked, but it may be useful to archive them if they represent a meaningful dataset, or if there are many of them on a page. 66 | - The crawler can only follow HTTP links that appear directly in the DOM at load time. (That is, they should appear as `` tags in the page source.) 67 | If links are added by JavaScript or require submitting a form, they are not crawlable. 68 | - The crawler does not tolerate web frames (but it is straightforward to inspect a page to obtain the content in the frame directly, and then nominate *that*). 69 | - The crawler recently added the ability to crawl FTP, but we will not rely on this; we will treat resources served over FTP as uncrawlable. 70 | 71 | #### YES 72 | 73 | If the URL is crawlable or you locate a crawlable URL that accesses the underlying dataset: 74 | 75 | - Nominate it using the [EDGI Nomination Chrome Extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok). 76 | - Click the `Do not harvest` checkbox in the Research section in the Archivers app. 77 | - Click `Checkin this URL` and move on to another URL. 78 | 79 | #### NO 80 | 81 | If it is confirmed not crawlable: 82 | 83 | - Search agency websites and data.gov for dataset entry points for your dataset collection. 84 | - Tips: Try to understand what datasets are underlying the web pages. Look for related entries in the Archivers app, and ensure that you aren't harvesting a subdirectory if you can harvest the entire directory. Often, data underlying dozens of pages or multiple "access portal" apps is also available as one structured data file. 85 | - Make note of any better entry point in the `Recommended Approach for Harvesting Data` field, along with any other recommendations on how to proceed with this harvest. 86 | - Fill out all of the fields in the Research section to the best of your ability. 87 | - Occasionally, URL's will have been nominated separately, but are actually different interfaces built on the same dataset. We want to scrape all of this data and do it exactly one time. The `Link URL` field lets you search for associated URLs; add any URLs that should be grouped into a single record. 88 | 89 | #### YES and NO 90 | 91 | For example, FTP address, mixed content, big data sets: 92 | 93 | 94 | - Nominate it anyway, but also follow the steps for uncrawlable content above. 95 | - *While we understand that this may result in some dataset duplication, this is not a concern. 
We are ensuring that the data is fully preserved and accessible.* 96 | 97 | 100 | 101 | ## Finishing Up 102 | 103 | - In the Archivers app, make sure to fill out as much information as possible to document your work. 104 | - Check the Research checkbox (far right on the same line as the "Research" section heading) to mark that step as completed. 105 | - Click `Save`. 106 | - Click `Checkin this URL`, to release it and allow someone else to work on the next step. 107 | - You're done! Move on to the next URL! 108 | 109 | 112 | -------------------------------------------------------------------------------- /docs/seeding.md: -------------------------------------------------------------------------------- 1 | ## What Do Seeders Do? 2 | 3 | Seeders canvass the resources of a given government agency, identifying important URLs. They identify whether those URLs can be crawled by the [Internet Archive's](http://archive.org) web crawler. Using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en), Seeders nominate crawlable URLs to the Internet Archive or add them to the Archivers app if they require manual archiving. 4 | 5 |
6 | **Recommended Skills**
7 | Consider this path if you’re comfortable browsing the web and have great attention to detail. An understanding of how web pages are structured will help you with this task. 8 |
9 | 10 | ## Choosing the Website 11 | 12 | Seeders use the [EDGI Archiving Primers](https://envirodatagov.org/archiving/), or a similar set of resources, to identify important and at-risk data. Talk to the DataRescue organizers to learn more. 13 | 14 | ## Canvassing the Website and Evaluating Content 15 | 16 | - Start exploring the website assigned, identifying important URLs. 17 | - Decide whether the data on a page or website subsection can be automatically captured by the Internet Archive web crawler. 18 | - [EDGI's Guides](https://edgi-govdata-archiving.github.io/guides/) have information critical to the seeding and sorting process: 19 | - [Understanding the Internet Archive Web Crawler](https://edgi-govdata-archiving.github.io/guides/internet-archive-crawler/) 20 | - [Seeding the Internet Archive’s Web Crawler](https://edgi-govdata-archiving.github.io/guides/seeding-internet-archive/) 21 | 22 | ### Crawlable URLs 23 | 24 | - URLs judged to be crawlable are nominated ("seeded") to the Internet Archive, using the [EDGI Nomination Chrome extension](https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok?hl=en). 25 | 26 | To learn more about nominating URLs, refer to [this Google Doc](https://docs.google.com/document/d/1L_JYldCwCHxVEW_9nD6llWPtbh_VvPwp2UYE6cGE2g0/edit), watch this [training video on Agency Primers and EOT](https://youtu.be/Ro-f58Cecdg) or talk to the DataRescue organizers. 27 | 28 | **Wherever possible, add in the Agency Office Code from the [sub-primer database](https://envirodatagov.org/event-toolkit/primers-database/).** 29 | 30 | ### Uncrawlable URLs 31 | 32 | - If URL is judged not crawlable, check one of the checkboxes next to the four types of uncrawlables in the Chrome Extension. This will add the URL to the Researching queue in the Archivers app. 33 | - The URL will be automatically associated with a universal unique identifier (UUID). 34 | - You can check whether the page or some files are archived using the Internet Archive's [Wayback Machine Chrome Extension](https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak) 35 | 36 | ### Not Sure? 37 | 38 | - This sorting is only provisional; when in doubt, Seeders nominate the URL **and** mark it as possibly not crawlable. 39 | -------------------------------------------------------------------------------- /docs/surveying.md: -------------------------------------------------------------------------------- 1 | ## What Do Surveyors Do? 2 | 3 | Surveyors identify key programs, datasets, and documents on Federal Agency websites that are vulnerable to change and loss. Using templates and how-to guides, they create Main Agency Primers in order to introduce a particular agency, and Sub-Agency Primers in order to guide web archiving efforts by laying out a list of URLs that cover the breadth of an office. 4 | 5 |
6 | **Recommended Skills**
7 | Consider this path if you’re familiar with federal data, interested in particular offices or data sets that don’t already have a primer, or want to help create materials for use at other archiving events! 8 |
9 | 10 | ## Getting Set up as a Surveyor 11 | 12 | 13 | ## Anatomy of a Primer 14 | 15 | For each federal agency website, we create a set of Agency Primers that consist of a Main Agency Primer and several Sub-Agency Primers. 16 | 17 | **Main Agency Primers (MAPs)** describe the range of risks to the listed agencies’ current programming in the coming administration, focusing on offices with programs relating to climate change, renewable energy, sustainability, and the environment. Each agency or department has one MAP. MAPs are not used for web archiving, instead they are used as introductory documents to become familiar with a particular agency. 18 | 19 |
20 | **See Example MAP: Department of Energy** 21 |
22 | 23 | **Sub-Agency Primers (SAPs)** identify at-risk offices within agencies and guide the web archiving efforts. A particular agency or department can have several SAPs, one for each office. The list of SAPs is linked in the MAP for a particular department or agency. 24 | 25 | 28 | 29 | The full list of completed Agency Archiving Primers is available on [EDGI’s website](https://envirodatagov.org/agencyprimers/). Our [Google Drive](https://drive.google.com/drive/folders/0B1QUChqMY8ehMWJCaElHdzctRTg?usp=drive_web) has primers in the process of being prepared for use. 30 | 31 | ## Claiming a Primer 32 | 33 | 1. Identify a particular office you’d like to work with. 34 | 1. Check the Agency Office Database for the status of the primer for that office. The possible statuses for a primer are: 35 | - `No primer created for the office yet` 36 | claim the primer and start writing! 37 | - `A primer has been written but not checked` 38 | claim the primer and start checking! 39 | - `A primer has been written and checked` 40 | claim the primer and start web archiving! 41 | 1. Claim the primer for a specific purpose (writing, checking, or using) through the [Add/Edit a Primer form](https://airtable.com/shrAUSWqwE4CproKa). 42 | 43 | ## Creating or Checking a New Primer 44 | 45 | 1. Claim the primer for a specific office using the [Add/Edit a Primer form](https://airtable.com/shrAUSWqwE4CproKa). 46 | 1. Navigate to the Agency Primers [Google Drive](https://drive.google.com/drive/folders/0B1QUChqMY8ehMWJCaElHdzctRTg?usp=drive_web). 47 | 1. If there is **no existing primer** for that office, follow MAP or SAP instructions below: 48 | **MAP Path** 49 | 50 | - Create a MAP for a main agency by copying the [MAP template](https://docs.google.com/document/d/1Ekr0yH-jFF8VOpZzquevKaQyhb06oZywSBf6_PgXJM0/edit) and saving it to the agency folder (ex. DOJ folder). 51 | - Follow the [Primer How-To](https://docs.google.com/document/d/1nMHY5vvVcu_hsjcnvuSaXAvu95M2_nlTdxoQGp1q5HE/edit) (look for the MAP section) for guidance. 52 | 53 | **SAP Path** 54 | 55 | - Create SAP for individual office or division by copying the [SAP template](https://docs.google.com/document/d/1hBytasnJDGKgRIj3b90Xk8Zlz3Txe0Zmaq0bhbo5UY4/edit) and saving the copy to the agency folder (ex. Bureau of Land Management _in_ DOI folder). 56 | - Follow the [Primer How-To](https://docs.google.com/document/d/1nMHY5vvVcu_hsjcnvuSaXAvu95M2_nlTdxoQGp1q5HE/edit) (look for the SAP section) for guidance. 57 | 58 | 1. If there is an **existing primer** that needs to be checked: 59 | Work with others from the agency’s designated community (e.g. experts from communities, librarians, professional associations) who can provide input on vulnerable or at risk information. 60 | 1. When you’re done, update the Agency Office Database status and primer link by again submitting the [Add/Edit a Primer form](https://airtable.com/shrAUSWqwE4CproKa). 
61 | -------------------------------------------------------------------------------- /mkdocs.yml: -------------------------------------------------------------------------------- 1 | site_name: DataRescue Workflow 2 | pages: 3 | - Home: index.md 4 | - Surveying: 5 | - Surveying: surveying.md 6 | - Website Archiving: 7 | - Seeding: seeding.md 8 | - Archiving More Complex Datasets: 9 | - Researching: researching.md 10 | - Harvesting: harvesting.md 11 | - Checking/Bagging: bagging.md 12 | - Describing: describing.md 13 | - Event Organizing: 14 | - Before an Event: organizing/pre-event.md 15 | - After an Event: organizing/post-event.md 16 | - Additional Resources: 17 | - Archivers App FAQ: faq.md 18 | theme: readthedocs 19 | theme_dir: custom_theme 20 | extra_css: 21 | - css/extra.css 22 | --------------------------------------------------------------------------------