├── README.md ├── ait-implementation ├── README.md ├── wasapi │ ├── __init__.py │ ├── admin.py │ ├── filters.py │ ├── implemented-swagger.yaml │ ├── mailer.py │ ├── migrations │ │ └── __init__.py │ ├── models.py │ ├── selectors.py │ ├── serializers.py │ ├── swagger.yaml │ ├── tests.py │ ├── tests │ │ ├── __init__.py │ │ ├── fixtures.json │ │ ├── test_fixtures.py │ │ └── test_job_result.py │ ├── urls.py │ └── views.py └── webdata │ ├── __init__.py │ ├── decorators.py │ ├── urls.py │ └── views.py ├── ait-specification ├── README.md └── transfer_api_archive-it_v1.yaml ├── general-specification ├── README.md └── transfer_api_v1.yaml ├── lockss-implementation ├── README.md ├── default_controller.py └── wasapi-test.py └── utilities └── README.md /README.md: -------------------------------------------------------------------------------- 1 | ## WASAPI Data Transfer APIs 2 | 3 | This is the public repository for work on the Web Archiving Systems API (WASAPI) data transfer APIs. The intention for these APIs is to provide a standardized mechanism for export and import of web archive data (and perhaps, ultimately, derivative data and capture metadata) between diverse systems for preservation, replication, research use, data delivery, and other purposes. Design and development is being carried out by project partners on the [Institute of Museum and Library Services](https://www.imls.gov/)-funded [National Leadership Grant](https://www.imls.gov/grants/available/national-leadership-grants-libraries), [LG-71-15-0174](https://www.imls.gov/grants/awarded/lg-71-15-0174-15), "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf)" (PDF) in consultation with a technical working group and based on feedback from the web archiving community. 
4 | 5 | ## Clients & Utilities 6 | * Stanford University Digital Library Systems and Services: wasapi-downloader, https://github.com/sul-dlss/wasapi-downloader 7 | * UNT Libraries: py-wasapi-client, https://github.com/unt-libraries/py-wasapi-client 8 | * LOCKSS: LAAWS Live Demo, http://demo.laaws.lockss.org/ 9 | * Rutgers University: Location Mapper, https://github.com/mwe400/LocationMapper 10 | 11 | ## Documents 12 | * 2017-12-11: "[WASAPI Document Repository in archive.org](https://archive.org/details/wasapi)" 13 | * 2017-04-23: "[National Symposium on Web Archiving Interoperability: Agenda & Presentation Links](https://docs.google.com/document/d/1PM8u5nxAKUFb4oh1JTDARfl9hat7gOxgU1t2mGvn8Fg/edit#heading=h.n0bnn4za99v2)" 14 | * 2017-03-02: "[National Symposium on Web Archiving Interoperability Trip Report](https://web.archive.org/web/20180410063821/https://ws-dl.blogspot.fr/2017/03/2017-03-02-national-symposium-on-web.html)" 15 | * 2017-03-30: "[IMLS Year 1 Interim Performance Report Narrative](https://archive.org/details/WASAPIYearOneReport)" 16 | * 2016-11-29: "[Archive-It 2016 State of the WARC Report](https://archive-it.org/blog/post/2016-state-of-the-warc-our-second-annual-digital-preservation-survey-results/)" 17 | * 2016-08-19: "[WASAPI Survey on Data Transfer APIs](https://drive.google.com/file/d/0B7toWei7Sy_SOUJlZFhySHZYTWM/view?usp=sharing)" 18 | * 2015-09-15: "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/grants/awarded/lg-71-15-0174-15)" 19 | 20 | ## Presentations 21 | * 2017-12-11: "[Web Archiving Systems APIs (WASAPI)](https://docs.google.com/presentation/d/1lAjeNmnnJb_lLYofqR-ZlqcqxKZ_ithQ57vCPWdPFt4/edit?usp=sharing) 22 | * 2017-06-15: "[WASAPI: (Web Archiving Systems APIs) Project Updates and Data Transfer APIs Specifications and Demonstrations](https://docs.google.com/presentation/d/1nbfKd80V613-S7AH9CvbMZVp9SyLWW7ByMwMsBflM5s/edit?usp=sharing)" at [IIPC/ReSAW Web Archiving Week](http://netpreserve.org/wac2017/) 23 | * 2017-04-27: "[Stanford Web Archiving Work Cycle Inception Deck for Automating Web Archive Crawl Download and Accessioning](https://drive.google.com/file/d/0B7toWei7Sy_SU2VvWWNVUmRRQkk/view?usp=sharing)" 24 | * 2016-08-03: "[WASAPI: Web Archiving Systems Application Programming Interfaces](https://docs.google.com/presentation/d/1XajUcvETUTL_mSsr0vCno-fzSB15MsRsRmP_pikvGO8/edit?usp=sharing)" at the [SAA Web Archiving Roundtable Meeting](https://archives2016.sched.org/event/6niM/web-archiving) at [Archives Records 2016](http://www2.archivists.org/am2016) 25 | * 2016-06-14: "[WASAPI Web Archive Data Transfer APIs](http://www.slideshare.net/nullhandle/wasapi-web-archive-data-transfer-apis)" at [Archives Unleashed 2.0](http://archivesunleashed.com/) 26 | * 2016-05-26: "[Systems Interoperability and Collaborative Development for Web Archiving - Filling Gaps in the IMLS National Digital Platform](http://digital.library.unt.edu/ark:/67531/metadc848591/)" at [Texas Conference on Digital Libraries](https://conferences.tdl.org/tcdl/index.php/TCDL/TCDL2016) 27 | * 2016-04-12: "[Building API-Based Web Archiving Systems and Services](https://docs.google.com/presentation/d/1IJ9IcLG2cO118oNX0Z5rakiDVySuB9TBWwnVvHTEOAg/edit?usp=sharing)" at the [International Internet Preservation Consortium General Assembly](http://www.netpreserve.org/general-assembly/2016/overview) 28 | * 2016-04-05: "[Building National Web Archiving Capacity](https://drive.google.com/file/d/0BwW5mtdXJ3huLUowUnRZb0E0Z0E/view?usp=sharing)" at the [CNI 
Spring 2016 Meeting](https://www.cni.org/events/membership-meetings/past-meetings/spring-2016) 29 | 30 | ## Meeting Notes 31 | * 2016-12-13: [WASAPI Technical Working Group](https://docs.google.com/document/d/1q7m6pgINRAUOFGg3SMhCVD_IstAvYbF8cIJ5puO2HP8/edit) 32 | * 2016-03-30: [WASAPI Technical Working Group](https://docs.google.com/document/d/1kDbk3J_DVpqj2rBFQmQIoijYjwgWQKgY-19H6rckGkk/edit?ts=57c36d5a) 33 | 34 | ## How to connect 35 | * Join or send a message to the [WASAPI-community Google Group](https://groups.google.com/forum/#!forum/wasapi-community). 36 | * Join the [WASAPI Slack](https://docs.google.com/forms/d/e/1FAIpQLScsdTqssLrM9FinmpP8Mow2Hl8zJnfJZfjWxaeXddlvu2VjBw/viewform). 37 | -------------------------------------------------------------------------------- /ait-implementation/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This document should assist developers in building their own implementation of WASAPI based on Archive-It's implementation. 4 | 5 | # Archive-It WASAPI implementation 6 | 7 | The `archiveit/wasapi` application is the bulk of the code by which Archive-It implements the WASAPI specification. It was written within and then extracted from the [Django](https://www.djangoproject.com/) project (version 1.8.5) that serves [Archive-It's partner site](https://partner.archive-it.org/), so --while it can not be run alone-- it can be fit easily into another Django project. This document outlines implementation details, proposed changes to the WASAPI Data Transfer API general specification and the Archive-It additions beyond the minimum specifications. 8 | 9 | 10 | ## Formal specifications 11 | 12 | The [OpenAPI](https://www.openapis.org/) file `wasapi/swagger.yaml` describes Archive-It's ideal specification at the start of implementation (with few adjustments). The file `wasapi/implemented-swagger.yaml` shows what has been implemented. The difference between the two serves as a to-do list: note that you can submit new jobs and monitor their status but not yet retrieve their results. 13 | 14 | 15 | ## Re-integrating the code 16 | 17 | To use the `wasapi` application within another Django project, you must resolve some references it has to the Archive-It project. 18 | 19 | Archive-It's webdata files are modeled in `archiveit.archiveit.models.WarcFile`; replace that with your own. 20 | 21 | The `AitWasapiDateTimeField` field replaces `django.db.models.fields.DateTimeField` with the ability to parse abbreviated dates and adjust dates with timezones. 22 | 23 | The URL paths to the WASAPI endpoints (and also transport of webdata files) were established in `archiveit/urls.py`; add your own reference to your own routing file (as appropriate for the version of Django you are using): 24 | 25 | urlpatterns = ( 26 | # [...] 27 | patterns('', 28 | # [...] 29 | url(r'^wasapi/v1/', include('archiveit.wasapi.urls')), 30 | url(r'^webdatafile/', include('archiveit.webdata.urls')), 31 | ) 32 | ) 33 | 34 | The full URL to transport a webdata file was defined by WEBDATA_LOCATION_TEMPLATE in `archiveit.settings`. It uses named parameters from the webdata file model, eg `filename`. Make your own, eg: 35 | 36 | WEBDATA_LOCATION_TEMPLATE = BASEURL + '/webdatafile/%%(filename)s' 37 | 38 | 39 | ## The notification mailer and admin interface 40 | 41 | To fit with Archive-It's existing work flow, the implementation sends a notification email upon submission of a new job. 
We also provide [Django admin site](https://docs.djangoproject.com/en/dev/ref/contrib/admin/) to change the state of jobs. This is outside the scope of the WASAPI specification, but we include it here for completeness. 42 | 43 | ## The `webdata` application 44 | 45 | The `archiveit/webdata` application implements transport of webdata files. That is well outside the scope of the WASAPI specification, but we include it here also for completeness. It transparently serves webdata files from Archive-It's Petabox and HDFS stores. 46 | 47 | 48 | ## Proposed changes to the published minimum 49 | 50 | After more experience, we suggest that the minimum specification should be 51 | adjusted. 52 | 53 | ### Mandatory pagination syntax 54 | 55 | The specification should support pagination of large results. Simple 56 | implementations may give the full results in a single page, but adding the 57 | syntax later would be difficult. 58 | 59 | Unable to find consistent recommendations for pagination syntax, we adopt that 60 | of the [Django Rest Framework](http://www.django-rest-framework.org/). The 61 | client must accept `count`, `previous`, and `next` parameters. The 62 | implementation must provide the number of files/jobs/etc in `count`. The 63 | `previous` and `next` values can be either URLs by which to fetch other pages 64 | of results using a `page` parameter, be absent, or (as the Django Rest 65 | Framework does) hold an explicit `null`. 66 | 67 | ### Matching filenames 68 | 69 | Matching of filenames should consider only the basename and not any path of 70 | directories. The glob pattern should be matched against the complete basename 71 | (ie must match the beginning and end of the filename). An implementation that 72 | wants to match pathnames including directories (and consider eg whether `**` 73 | should match multiple directory separators eg `/`) may offer a different 74 | parameter. 75 | 76 | ### Simpler webdata file bundles 77 | 78 | We should drop `WebdataMenu` and `WebdataBundle`. The multiple `locations` of 79 | a `WebdataFile` provide most of their value. Rather than giving the client 80 | more information than would be used, an implementation can accept a request for 81 | specific transports and formats. 82 | 83 | ### Separate endpoint for results of a job; reporting on a failed job 84 | 85 | We replace `completion-time` with `termination-time` to ease polling for new 86 | information about jobs. Rather than a job that may include a successful result 87 | but gives the same indistinguishable lack of result for both progress and 88 | failure, we provide distinct endpoints: `/jobs/{jobtoken}/result` for a 89 | successful result and `/jobs/{jobtoken}/error` for reporting the error of a 90 | failed job. A client can easily poll `/jobs/{jobtoken}/result` and will be 91 | redirected to `/jobs/{jobtoken}/error` in the case that the job fails. 92 | 93 | ### Checksums of a webdata file 94 | 95 | Since it is useless to require the presence of checksums without mandating any 96 | specific checksum, every implementation should provide at least one of MD5 or 97 | SHA1. To allow evolution, the specification should use a dictionary instead of 98 | a single string. To ensure interoperability, all checksums should be 99 | represented as hexadecimal strings. 
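For example, a `checksums` dictionary covering both algorithms could look like the following (the values are the illustrative ones already used in `wasapi/swagger.yaml`):

    "checksums": {
        "sha1": "6b4f32a3408b1cd7db9372a63a2053c3ef25c731",
        "md5": "766ba6fd3a257edf35d9f42a8dd42a79"
    }
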
100 | 101 | ### Change label describing format of archive file 102 | 103 | Using the label `content-type` to describe the format of the archive files can 104 | be confused with the "content-type" or "MIME-type" of the resources within the 105 | archive. The label `content-type` should be reserved as a potentially valuable 106 | parameter to select such resources, and the current use should be replaced with 107 | `filetype`. Another label to consider is "archive-format" which explicitly 108 | references its subject. 109 | 110 | 111 | ## Extensions beyond the published minimum 112 | 113 | Archive-It extends the minimum specification with our own special parameters for our v1.0 release. 114 | 115 | ### Time range parameters 116 | 117 | We want to support date ranges, but we want to be careful about which time we 118 | refer to: the instant the crawl was requested, the instant a delayed crawl was 119 | scheduled to start, the instant the crawl started, the instant the resource is 120 | retrieved, the instant the archive file was written. For ease of 121 | implementation, we choose to operate on the time that the crawl started using 122 | the `crawl-start-after` and `crawl-start-before` parameters. 123 | 124 | ### Collection parameter 125 | 126 | The `collection` parameter accepts a numeric collection identifier as is used 127 | in the Archive-It application; multiple parameters allow matching files across 128 | multiple collections. 129 | 130 | ### Crawl parameter 131 | 132 | The `crawl` parameter accepts a numeric crawl identifier as is used in the 133 | Archive-It application. 134 | 135 | ### Functions 136 | 137 | Archive-It supports jobs of three functions: 138 | - `build-wat`: Build a WAT file with metadata from the matched archive files 139 | - `build-wane`: Build a WANE file with the named entities from the matched 140 | archive files 141 | - `build-cdx`: Build a CDX file indexing the matched archive files 142 | 143 | Archive-It functions do not yet accept any parameters. 144 | 145 | ### States of a job 146 | 147 | An Archive-It job can be described as being in one of five distinct states: 148 | - `queued`: Job has been submitted and is waiting to run. 149 | - `running`: Job is currently running. 150 | - `failed`: Job ran but failed. 151 | - `complete`: Job ran and successfully completed; result is available. 152 | - `gone`: Job ran, but the result is no longer available (eg deleted to save 153 | storage). 
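As a minimal sketch of how a client might exercise these functions and states end to end, the following Python drives the endpoints described in `wasapi/implemented-swagger.yaml`. The base URL, credentials, and use of HTTP basic authentication are assumptions made for illustration only; none of them is mandated by the specification.

    import time
    import requests

    # Assumed for illustration: the Archive-It base path and HTTP basic auth;
    # neither is required by the WASAPI specification itself.
    BASE = "https://partner.archive-it.org/wasapi/v1"
    AUTH = ("user@example.org", "password")  # hypothetical credentials

    # Submit a build-wat job; the query uses the same parameters as /webdata.
    resp = requests.post(BASE + "/jobs", auth=AUTH,
                         data={"query": "collection=1234", "function": "build-wat"})
    resp.raise_for_status()
    job = resp.json()
    token = job["jobtoken"]

    # Poll the job until it leaves the queued/running states.
    while job["state"] in ("queued", "running"):
        time.sleep(60)
        job = requests.get(BASE + "/jobs/" + token, auth=AUTH).json()

    if job["state"] == "complete":
        # The result is a FileSet: a count, optional previous/next links, and files.
        result = requests.get(BASE + "/jobs/" + token + "/result", auth=AUTH).json()
        for f in result["files"]:
            print(f["filename"], f["checksums"], f["locations"])
    else:
        print("job ended in state: " + job["state"])  # failed or gone
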
154 | 155 | ## Contacts 156 | 157 | *Archive-It (Internet Archive)* 158 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 159 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 160 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/admin.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from archiveit.wasapi.models import WasapiJob 3 | 4 | class WasapiJobAdmin(admin.ModelAdmin): 5 | search_fields = ['id', 'state', 'function', 'account__id', 'account__organization_name'] 6 | list_display = ['id', 'state', 'function', 'termination_time'] 7 | 8 | admin.site.register(WasapiJob, WasapiJobAdmin) 9 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/filters.py: -------------------------------------------------------------------------------- 1 | from rest_framework.filters import BaseFilterBackend 2 | from archiveit.wasapi.selectors import select_webdata_query, select_auth 3 | 4 | # A few words about selectors and "guts": We want to share functionality 5 | # between the "webdata" query and selecting the source files for a job, but 6 | # those two clients have different information in different structures. 7 | # Therefore, we extract (most of) the guts from the filter_queryset methods 8 | # into functions in selectors.py, giving them a different interface (narrower 9 | # than Django's, ie a querydict and kwargs that may or may not include an 10 | # account that may or may not get used) which WasapiJob.set_ideal_result can 11 | # use. 12 | 13 | class WasapiWebdataQueryFilterBackend(BaseFilterBackend): 14 | """Filtering on composition necessary for the webdata query""" 15 | 16 | def filter_queryset(self, request, queryset, view): 17 | return select_webdata_query(request.GET, queryset, 18 | account=request.user.account, user=request.user) 19 | 20 | 21 | class WasapiAuthFilterBackend(BaseFilterBackend): 22 | """Filtering on authorization""" 23 | def filter_queryset(self, request, queryset, view): 24 | return select_auth(request.GET, queryset, 25 | account=request.user.account, user=request.user) 26 | 27 | 28 | class WasapiAuthJobBackend(BaseFilterBackend): 29 | """Filtering on authorization to see the specific job""" 30 | def filter_queryset(self, request, queryset, view): 31 | # TODO: raise http error rather than empty result 32 | queryset = queryset.filter(job_id=view.kwargs['jobid']) 33 | if request.user.is_superuser: 34 | return queryset # no restriction 35 | elif request.user.is_anonymous(): 36 | return queryset.none() # ie hide it all 37 | else: 38 | return queryset.filter(job__account=request.user.account) 39 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/implemented-swagger.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: WASAPI Export API as implemented by Archive-It 4 | description: > 5 | WASAPI Export API. What Archive-It has implemented. 
6 | version: 1.0.0 7 | contact: 8 | name: Jefferson Bailey and Mark Sullivan 9 | url: https://github.com/WASAPI-Community/data-transfer-apis 10 | license: 11 | name: Apache 2.0 12 | url: http://www.apache.org/licenses/LICENSE-2.0.html 13 | consumes: 14 | - application/json 15 | produces: 16 | - application/json 17 | basePath: /wasapi/v1 18 | schemes: 19 | - https 20 | paths: 21 | /webdata: 22 | get: 23 | summary: Get the archive files I need 24 | description: > 25 | Produces a page of the list of the files accessible to the client 26 | matching all of the parameters. A parameter with multiple options 27 | matches when any option matches; a missing parameter implicitly 28 | matches. 29 | parameters: 30 | # pagination 31 | - $ref: '#/parameters/page' 32 | # basic query 33 | - $ref: '#/parameters/filename' 34 | # specific to Archive-It 35 | - $ref: '#/parameters/crawl-time-after' 36 | - $ref: '#/parameters/crawl-time-before' 37 | - $ref: '#/parameters/crawl-start-after' 38 | - $ref: '#/parameters/crawl-start-before' 39 | - $ref: '#/parameters/collection' 40 | - $ref: '#/parameters/crawl' 41 | responses: 42 | '200': 43 | description: Success 44 | schema: 45 | $ref: '#/definitions/FileSet' 46 | '400': 47 | description: The request could not be interpreted 48 | /jobs: 49 | get: 50 | summary: What jobs do I have? 51 | description: 52 | Show the jobs on this server accessible to the client 53 | parameters: 54 | - $ref: '#/parameters/page' 55 | responses: 56 | '200': 57 | description: > 58 | Success. Produces a page of the list of the jobs accessible to 59 | the client. 60 | schema: 61 | type: object 62 | required: 63 | - count 64 | - jobs 65 | properties: 66 | count: 67 | type: integer 68 | description: > 69 | The total number of jobs matching the query (across all pages) 70 | previous: 71 | description: > 72 | Link (if any) to the previous page of jobs; otherwise null 73 | type: [string, "null"] 74 | format: url 75 | next: 76 | description: > 77 | Link (if any) to the next page of jobs; otherwise null 78 | type: [string, "null"] 79 | format: url 80 | jobs: 81 | type: array 82 | items: 83 | $ref: '#/definitions/Job' 84 | post: 85 | summary: Make a new job 86 | description: 87 | Create a job to perform some task 88 | parameters: 89 | - name: query 90 | in: formData 91 | required: true 92 | description: > 93 | URL-encoded query as appropriate for /webdata end-point. The empty 94 | query (which matches everything) must explicitly be given as the 95 | empty string. 96 | type: string 97 | - $ref: '#/parameters/function' 98 | - name: parameters 99 | in: formData 100 | required: false 101 | description: > 102 | Other parameters specific to the function and implementation 103 | (URL-encoded). For example: level of compression, priority, time 104 | limit, space limit. Archive-It does not yet accept any such 105 | parameters. 106 | type: string 107 | responses: 108 | '201': 109 | description: > 110 | Job was successfully submitted. Body is the submitted job. 111 | schema: 112 | $ref: '#/definitions/Job' 113 | '400': 114 | description: The request could not be interpreted 115 | '/jobs/{jobtoken}': 116 | get: 117 | summary: How is my job doing? 118 | description: 119 | Retrieve information about a job, both the parameters of its submission 120 | and its current state. If the job is complete, the client can get the 121 | result through a separate request to `jobs/{jobtoken}/result`. 
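      # Illustrative only (not part of the formal schema): a successful response
      # from GET /wasapi/v1/jobs/21 might look like
      #   {"jobtoken": "21", "function": "build-wat", "query": "collection=1234",
      #    "submit-time": "2017-02-14T21:04:14Z", "state": "queued"}
      # where the token, query, and time are hypothetical values.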
122 | parameters: 123 | - $ref: '#/parameters/jobtoken' 124 | responses: 125 | '200': 126 | description: Success 127 | schema: 128 | $ref: '#/definitions/Job' 129 | '400': 130 | description: The request could not be interpreted 131 | '404': 132 | description: No such job visible to this client 133 | '/jobs/{jobtoken}/result': 134 | get: 135 | summary: What is the result of my job? 136 | description: > 137 | For a complete job, produces a page of the resulting files. 138 | parameters: 139 | - $ref: '#/parameters/page' 140 | - $ref: '#/parameters/jobtoken' 141 | responses: 142 | '200': 143 | description: Success 144 | schema: 145 | $ref: '#/definitions/FileSet' 146 | definitions: 147 | WebdataFile: 148 | description: > 149 | Description of a unit of distribution of web archival data. (This data 150 | type does not include the actual archival data.) Examples: a WARC file, 151 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 152 | type: object 153 | required: 154 | - filename 155 | - checksums 156 | - filetype 157 | - locations 158 | properties: 159 | filename: 160 | type: string 161 | description: The name of the webdata file 162 | filetype: 163 | type: string 164 | description: > 165 | The format of the archive file, eg `warc`, `wat`, `cdx` 166 | checksums: 167 | type: object 168 | items: 169 | type: string 170 | format: hexstring 171 | description: > 172 | Verification of the content of the file. Must include at least one 173 | of MD5 or SHA1. The key specifies the lowercase name of the 174 | algorithm; the element is a hexadecimal string of the checksum 175 | value. For example: 176 | {"sha1":"6b4f32a3408b1cd7db9372a63a2053c3ef25c731", 177 | "md5":"766ba6fd3a257edf35d9f42a8dd42a79"} 178 | size: 179 | type: integer 180 | format: int64 181 | description: The size in bytes of the webdata file 182 | collection: 183 | type: integer 184 | format: int64 185 | description: The numeric ID of the collection 186 | crawl: 187 | type: integer 188 | format: int64 189 | description: The numeric ID of the crawl 190 | crawl-time: 191 | type: string 192 | format: date-time 193 | description: Time the original content of the file was crawled 194 | crawl-start: 195 | type: string 196 | format: date-time 197 | description: Time the crawl started 198 | locations: 199 | type: array 200 | items: 201 | type: string 202 | format: url 203 | description: > 204 | A list of (mirrored) sources from which to retrieve (identical copies 205 | of) the webdata file, eg `https://partner.archive-it.org/webdatafile/ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz`, 206 | `/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e` 207 | FileSet: 208 | type: object 209 | required: 210 | - count 211 | - files 212 | properties: 213 | includes-extra: 214 | type: boolean 215 | description: > 216 | When false, the data in the `files` contains nothing extraneous from 217 | what is necessary to satisfy the query or job. When true or absent, 218 | the client must be prepared to handle irrelevant data within the 219 | referenced `files`. 
220 | count: 221 | type: integer 222 | description: The total number of files (across all pages) 223 | previous: 224 | description: > 225 | Link (if any) to the previous page of files; otherwise null 226 | type: [string, "null"] 227 | format: url 228 | next: 229 | description: > 230 | Link (if any) to the next page of files; otherwise null 231 | type: [string, "null"] 232 | format: url 233 | files: 234 | type: array 235 | items: 236 | $ref: '#/definitions/WebdataFile' 237 | Job: 238 | type: object 239 | description: > 240 | A job submitted to perform a task. Conceptually, a complete job has a 241 | `result` FileSet, but we avoid sending that potentially large data with 242 | every mention of every job. If the job is complete, the client can get 243 | the result through a separate request to `jobs/{jobtoken}/result`. 244 | required: 245 | - jobtoken 246 | - function 247 | - query 248 | - submit-time 249 | - state 250 | properties: 251 | jobtoken: 252 | type: string 253 | description: > 254 | Identifier unique across the implementation. Archive-It has chosen 255 | to use an increasing integer. 256 | function: 257 | $ref: '#/definitions/Function' 258 | query: 259 | type: string 260 | description: > 261 | The specification of what webdata to include in the job. Encoding is 262 | URL-style, eg `param=value&otherparam=othervalue`. 263 | submit-time: 264 | type: string 265 | format: date-time 266 | description: Time of submission, formatted according to RFC3339 267 | termination-time: 268 | type: string 269 | format: date-time 270 | description: > 271 | Time of completion or failure, formatted according to RFC3339 272 | state: 273 | type: string 274 | enum: 275 | - queued 276 | - running 277 | - failed 278 | - complete 279 | - gone 280 | # alas, can't use GFM 281 | description: > 282 | The state of the job through its lifecycle. 283 | `queued`: Job has been submitted and is waiting to run. 284 | `running`: Job is currently running. 285 | `failed`: Job ran but failed. 286 | `complete`: Job ran and successfully completed; result is available. 287 | `gone`: Job ran, but the result is no longer available (eg deleted 288 | to save storage). 289 | Function: 290 | type: string 291 | enum: 292 | - build-wat 293 | - build-wane 294 | - build-cdx 295 | # This would be the more meaningful place to document the concept of 296 | # "function", but the parameter gives the documentation more space and 297 | # handles GFM. 298 | description: > 299 | The function of the job. See the `function` parameter to the POST that 300 | created the job. 301 | parameters: 302 | # I wish OpenAPI offered a way to define and compose sets of parameters. 303 | # pagination: 304 | page: 305 | name: page 306 | in: query 307 | type: integer 308 | required: false 309 | description: > 310 | One-based index for pagination 311 | # job token: 312 | jobtoken: 313 | name: jobtoken 314 | in: path 315 | description: The job token returned from previous request 316 | required: true 317 | type: string 318 | # basic query: 319 | filename: 320 | name: filename 321 | in: query 322 | type: string 323 | required: false 324 | description: > 325 | A string exactly matching the webdata file's basename (ie must match the 326 | beginning and end of the filename, not the full path of directories). 
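  # Illustrative only: a request such as
  #   GET /wasapi/v1/webdata?filename=ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz
  # matches that exact basename wherever the file happens to be stored;
  # directory paths are never considered.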
327 | # Archive-It's implementation of a function 328 | function: 329 | name: function 330 | in: formData 331 | required: true 332 | description: > 333 | One of the following strings which have the following meanings: 334 | 335 | - `build-wat`: Build a WAT file with metadata from the matched archive 336 | files 337 | 338 | - `build-wane`: Build a WANE file with the named entities from the 339 | matched archive files 340 | 341 | - `build-cdx`: Build a CDX file indexing the matched archive files 342 | 343 | type: string 344 | enum: 345 | - build-wat 346 | - build-wane 347 | - build-cdx 348 | # time of crawl (specific to Archive-It): 349 | crawl-time-after: 350 | name: crawl-time-after 351 | type: string 352 | format: date-time 353 | in: query 354 | required: false 355 | description: > 356 | Match resources that were crawled at or after the time given according to 357 | RFC3339. A date given with no time of day means midnight. Coordinated 358 | Universal (UTC) is preferrred and assumed if no timezone is included. 359 | Because `crawl-time-after` matches equal time stamps while 360 | `crawl-time-before` excludes equal time stamps, and because we specify 361 | instants rather than durations implicit from our units, we can smoothly 362 | scale between days and seconds. That is, we specify ranges in the manner 363 | of the C programming language, eg low ≤ x < high. For example, matching 364 | the month of November of 2016 is specified by 365 | `crawl-time-after=2016-11 & crawl-time-before=2016-12` or 366 | equivalently by `crawl-time-after=2016-11-01T00:00:00Z & 367 | crawl-time-before=2016-11-30T16:00:00-08:00`. 368 | crawl-time-before: 369 | name: crawl-time-before 370 | type: string 371 | format: date-time 372 | in: query 373 | required: false 374 | description: > 375 | Match resources that were crawled strictly before the time given 376 | according to RFC3339. See more detail at `crawl-time-after`. 377 | crawl-start-after: 378 | name: crawl-start-after 379 | type: string 380 | format: date-time 381 | in: query 382 | required: false 383 | description: > 384 | Match resources that were crawled in a job that started at or after the 385 | time given according to RFC3339. (Note that the original content of a 386 | file could be crawled many days after the crawl job started; would you 387 | prefer `crawl-time-after` / `crawl-time-before`?) 388 | crawl-start-before: 389 | name: crawl-start-before 390 | type: string 391 | format: date-time 392 | in: query 393 | required: false 394 | description: > 395 | Match resources that were crawled in a job that started strictly before 396 | the time given according to RFC3339. See more detail at 397 | `crawl-start-after`. 398 | # collection (specific to Archive-It): 399 | collection: 400 | name: collection 401 | type: integer 402 | in: query 403 | required: false 404 | description: > 405 | The numeric ID of one or more collections, given as separate fields. 406 | For only this parameter, WASAPI accepts multiple values and will match 407 | items in any of the specified collections. For example, matching the 408 | items from two collections can be specified by `collection=1 & 409 | collection=2`. 
410 | # crawl (specific to Archive-It): 411 | crawl: 412 | name: crawl 413 | type: integer 414 | in: query 415 | required: false 416 | description: > 417 | The numeric ID of the crawl 418 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/mailer.py: -------------------------------------------------------------------------------- 1 | from django.conf import settings 2 | from django.core.mail import send_mail 3 | 4 | def new_wasapijob(job, covering_collection_times): 5 | message_template = """ 6 | Dear Web Archivists, 7 | 8 | {user_full_name:s} of account "{account_name:s}" ({account_id:d}) has submitted a new job ({job_id:d}): 9 | {function:s} 10 | {query:s} 11 | 12 | {urls:s} 13 | 14 | {todo} 15 | 16 | Love, 17 | WASAPI within AIT5 18 | """ 19 | if not covering_collection_times: 20 | todo = "The query matched no files, so the job is already complete." 21 | elif job.state == job.COMPLETE: 22 | todo = "The query matched only files already derived, so the job is already complete." 23 | else: 24 | todo = ( 25 | "Please derive from these collections over these time spans:\n" + 26 | "\n".join([ 27 | '%s %s %d %s %s' % (start, end, collection.id, 28 | collection.account.organization_name, collection.name) 29 | for collection,start,end in covering_collection_times ])) 30 | message = message_template.format( 31 | user_full_name = job.user.full_name, 32 | account_name = job.account.organization_name, 33 | account_id = job.account.id, 34 | job_id = job.id, 35 | function = job.function, 36 | query = job.query, 37 | urls = '\n'.join( 38 | [settings.BASE50URL + job.get_absolute_url()] + 39 | ([settings.BASE50URL + job.get_absolute_url() + '/result'] 40 | if job.state==job.COMPLETE else []) + 41 | ['%s/admin/wasapi/wasapijob/%d/' % (settings.BASE50URL, job.id)] ), 42 | todo = todo, 43 | ) 44 | send_mail( 45 | 'Research Services Dataset Request via WASAPI', 46 | message, 47 | 'donotreply@archive-it.org', 48 | settings.AITRESEARCHSERVICES_ADDRESS, 49 | fail_silently=False) 50 | 51 | 52 | def complete_wasapijob(job): 53 | message_template = """ 54 | Dear Web Archivists, 55 | 56 | The job ({job_id:d}) for {user_full_name:s} of account "{account_name:s}" ({account_id:d}) has completed: 57 | {function:s} 58 | {query:s} 59 | 60 | {urls:s} 61 | 62 | Love, 63 | WASAPI within AIT5 64 | """ 65 | message = message_template.format( 66 | user_full_name = job.user.full_name, 67 | account_name = job.account.organization_name, 68 | account_id = job.account.id, 69 | job_id = job.id, 70 | function = job.function, 71 | query = job.query, 72 | urls = '\n'.join([ 73 | settings.BASE50URL + job.get_absolute_url(), 74 | settings.BASE50URL + job.get_absolute_url() + '/result', 75 | '%s/admin/wasapi/wasapijob/%d/' % (settings.BASE50URL, job.id), 76 | ]) 77 | ) 78 | send_mail( 79 | 'WASAPI job %d completed' % (job.id), 80 | message, 81 | 'donotreply@archive-it.org', 82 | settings.AITRESEARCHSERVICES_ADDRESS, 83 | fail_silently=False) 84 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/migrations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/migrations/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/models.py: 
-------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from datetime import datetime 3 | import re 4 | from django.core.urlresolvers import reverse 5 | from django.db import models 6 | from .mailer import new_wasapijob, complete_wasapijob 7 | from django.db.models.signals import post_save, pre_save 8 | from django.http import QueryDict 9 | from archiveit.archiveit.models import WarcFile, DerivativeFile 10 | from archiveit.archiveit.model_fields import AitWasapiDateTimeField 11 | from archiveit.wasapi.selectors import select_webdata_query 12 | 13 | 14 | # This set of functions is specific to Archive-It: 15 | function_instances = {} 16 | class JobFunction(object): 17 | identifier = None 18 | code = None 19 | english = None 20 | @staticmethod 21 | def ideal_results_for(job, source_files): 22 | '''Transform DerivativeFile records to new WasapiJobResultFile 23 | records according to the job function''' 24 | assert NotImplementedError('Implement in concrete class') 25 | 26 | @classmethod 27 | def register(cls): 28 | assert cls.identifier not in function_instances, "Already got one" 29 | function_instances[cls.code] = cls 30 | 31 | class BuildWat(JobFunction): 32 | identifier = "BUILD_WAT" 33 | code = "build-wat" 34 | english = "Build a WAT file" 35 | @staticmethod 36 | def ideal_results_for(job, source_files): 37 | return [WasapiJobResultFile( 38 | job=job, 39 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.wat.gz', warcfile.filename) 40 | ) for warcfile in source_files] 41 | BuildWat.register() 42 | 43 | class BuildWane(JobFunction): 44 | identifier = "BUILD_WANE" 45 | code = "build-wane" 46 | english = "Build a WANE file" 47 | @staticmethod 48 | def ideal_results_for(job, source_files): 49 | return [WasapiJobResultFile( 50 | job=job, 51 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.wane.gz', warcfile.filename) 52 | ) for warcfile in source_files] 53 | BuildWane.register() 54 | 55 | class BuildCdx(JobFunction): 56 | identifier = "BUILD_CDX" 57 | code = "build-cdx" 58 | english = "Build a CDX file" 59 | @staticmethod 60 | def ideal_results_for(job, source_files): 61 | return [WasapiJobResultFile( 62 | job=job, 63 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.cdx.gz', warcfile.filename) # TODO: wait to support CDX 64 | ) for warcfile in source_files] 65 | #BuildCdx.register() # TODO: implement and register 66 | 67 | class BuildLga(JobFunction): 68 | identifier = "BUILD_LGA" 69 | code = "build-lga" 70 | english = "Build an LGA file" 71 | @staticmethod 72 | def ideal_results_for(job, source_files): 73 | # TODO: we ignore the results of the query, so don't execute it 74 | # note lack of list comprehension (because LGA is not one-to-one) 75 | return [WasapiJobResultFile( 76 | job=job, 77 | filename='ARCHIVEIT-%d-LONGITUDINAL-GRAPH-%4d-%02d-%02d.lga.tgz' % ( 78 | job.collection, # TODO: but jobs are too general, pull from query? 
79 | job.submitTime.year, job.submitTime.month, job.submitTime.day) 80 | )] 81 | #BuildLga.register() # TODO: implement and register 82 | 83 | 84 | class WasapiJob(models.Model): 85 | 86 | id = models.AutoField(primary_key=True) 87 | 88 | # fields for minimal, generic WASAPI: 89 | 90 | FUNCTION_CHOICES = [(concrete_instance.code, concrete_instance.english) 91 | for concrete_instance in function_instances.values()] 92 | function = models.CharField(max_length=32, null=False, choices=FUNCTION_CHOICES) 93 | @property 94 | def function_instance(self): 95 | return function_instances[self.function] 96 | 97 | query = models.CharField(max_length=1024, blank=True, null=False) 98 | submit_time = AitWasapiDateTimeField(db_column='submitTime', auto_now_add=True, null=False) 99 | termination_time = AitWasapiDateTimeField(db_column='terminationTime', null=True, blank=True) 100 | 101 | # This list of states is specific to Archive-It: 102 | STATES = [ 103 | # (identifier, code, english) 104 | ("QUEUED", "queued", "Queued"), 105 | ("RUNNING", "running", "Running"), 106 | ("FAILED", "failed", "Failed"), 107 | ("COMPLETE", "complete", "Complete"), 108 | ("GONE", "gone", "Gone")] 109 | for identifier, code, english in STATES: 110 | locals()[identifier] = code 111 | STATE_CHOICES = [(code, english) for identifier, code, english in STATES] 112 | state = models.CharField(max_length=32, null=False, choices=STATE_CHOICES) 113 | 114 | # fields specific to Archive-It: 115 | 116 | account = models.ForeignKey('accounts.Account', db_column='accountId', editable=False, blank=True) 117 | user = models.ForeignKey('accounts.User', db_column='userId', editable=False, blank=True) 118 | 119 | class Meta: 120 | managed = False 121 | db_table = 'WasapiJob' 122 | 123 | # the same as WebdataQueryViewSet 124 | queryset = WarcFile.objects.all().order_by('-id') 125 | def query_just_like_webdataqueryviewset(self): 126 | '''Create the same queryset that WebdataQueryViewSet executes''' 127 | querydict = QueryDict(self.query) 128 | return select_webdata_query(querydict, self.queryset, 129 | account=self.account, user=self.user) 130 | 131 | def set_ideal_result_and_state(self, source_files): 132 | '''Calculate WasapiJobResultFiles that would comprise result when 133 | complete; update state''' 134 | ideal_results = self.function_instance.ideal_results_for(self, source_files) 135 | # TODO: batch the DB query 136 | for ideal_result in ideal_results: 137 | ideal_result.update_state() 138 | if all(ideal_result.is_complete() for ideal_result in ideal_results): 139 | # vacuous (if ideal_results is empty) or freebie job 140 | self.state = self.COMPLETE 141 | self.termination_time = datetime.now() 142 | # new_wasapijob will say complete so needn't call complete_wasapijob 143 | self._ideal_results = [] 144 | else: 145 | # juicy job, ie have to do some work 146 | self.state = self.QUEUED 147 | self._ideal_results = ideal_results # save them after we get our id 148 | 149 | def save_results(self): 150 | '''Save the WasapiJobResultFile records now that they can refer to 151 | their WasapiJob record''' 152 | for ideal_result in self._ideal_results: 153 | ideal_result.job = self # job_id wasn't yet available at creation 154 | ideal_result.save() 155 | 156 | def update_state(self): 157 | if self.state in (self.RUNNING, self.QUEUED): 158 | ideal_results = WasapiJobResultFile.objects.filter(job=self) 159 | if all(ideal_result.is_complete() for ideal_result in ideal_results): 160 | self.state = self.COMPLETE 161 | self.termination_time = datetime.now() 162 
| complete_wasapijob(self) 163 | 164 | def __str__(self): 165 | return "" % (self.id, self.state) 166 | 167 | def get_absolute_url(self): 168 | return reverse('wasapijob-detail', kwargs={'pk':self.id}) 169 | 170 | @staticmethod 171 | def covering_collection_times(source_files): 172 | '''Returns a list of (collection,start,end) tuples. Deriving each 173 | collection for the given time range will generate a superset of the 174 | desired result files.''' 175 | by_collection = {} 176 | for source_file in source_files: 177 | partition = by_collection.get(source_file.collection, set()) 178 | by_collection[source_file.collection] = partition 179 | partition.add(source_file) 180 | return [ 181 | ( collection, 182 | min(sf.crawl_time for sf in sfs), 183 | max(sf.crawl_time for sf in sfs) ) 184 | for collection, sfs in by_collection.items()] 185 | 186 | @classmethod 187 | def pre_save(cls, instance, **kwargs): 188 | job = instance 189 | if not job.id: # freshly created job 190 | source_files = job.query_just_like_webdataqueryviewset() 191 | job.set_ideal_result_and_state(source_files) 192 | job._source_files = source_files # stash for mailer in post-save 193 | 194 | @classmethod 195 | def post_save(cls, instance, **kwargs): 196 | # would prefer to call the parameter "job", but "send" passes it by name 197 | job = instance 198 | if hasattr(job, '_source_files'): # freshly created job 199 | job.save_results() 200 | new_wasapijob(job, cls.covering_collection_times(job._source_files)) 201 | 202 | pre_save.connect(receiver=WasapiJob.pre_save, sender=WasapiJob) 203 | post_save.connect(receiver=WasapiJob.post_save, sender=WasapiJob) 204 | 205 | # Voodoo to patch bug exposed in restore_object: 206 | # TypeError: can only concatenate tuple (not "list") to tuple 207 | # at ait5/ lib/python3.5/site-packages/rest_framework/serializers.py:969 208 | # for field in meta.many_to_many + meta.virtual_fields: 209 | WasapiJob._meta.virtual_fields = () # was []; many_to_many is () 210 | 211 | 212 | def proxy_to_derivative(*fields): 213 | def decorator(cls): 214 | # invoke another method to prevent iterator variable from varying 215 | def install_proxy(cls, field): 216 | @property 217 | def ameth(self): 218 | return( self.derivative_file and 219 | getattr(self.derivative_file, field) ) 220 | setattr(cls, field, ameth) 221 | for field in fields: 222 | install_proxy(cls, field) 223 | return cls 224 | return decorator 225 | 226 | @proxy_to_derivative('filetype', 'md5', 'sha1', 'size', 'store_time', 227 | 'crawl_time', 'account_id', 'collection_id', 'crawl_job', 'crawl_job_id', 228 | 'pbox_item', 'hdfs_path') 229 | class WasapiJobResultFile(models.Model): 230 | id = models.AutoField(primary_key=True) 231 | job = models.ForeignKey('WasapiJob', db_column='jobId') 232 | filename = models.CharField(max_length=4000) 233 | derivative_file = models.ForeignKey('archiveit.DerivativeFile', db_column='derivativeFileId', null=True) 234 | 235 | class Meta: 236 | managed = False 237 | db_table = 'WasapiJobResultFile' 238 | 239 | def is_complete(self): 240 | return self.derivative_file 241 | 242 | def update_state(self): 243 | '''Update reference to any newly existing derivative_file; return 244 | whether state changed (ie whether need to propagate changes further)''' 245 | if self.derivative_file: 246 | return False 247 | self.derivative_file = DerivativeFile.objects.filter(filename=self.filename).first() 248 | return self.derivative_file 249 | 250 | def dict_for_location(self): 251 | '''Return a mapping that can fill location 
templates''' 252 | d = self.__dict__ 253 | d.update(pbox_item=self.pbox_item) 254 | return d 255 | 256 | def __repr__(self): 257 | return( 'WasapiJobResultFile(id=%s,filename=%s,job_id=%s,%s)' % 258 | (self.id, self.filename, self.job_id, 259 | "complete" if self.derivative_file else "incomplete")) 260 | 261 | @classmethod 262 | def update_completed_result_files(cls): 263 | '''Find and update result files that completed without notification''' 264 | result_files = cls.objects.raw(''' 265 | select WasapiJobResultFile.id, WasapiJobResultFile.jobId 266 | from WasapiJobResultFile 267 | join DerivativeFile 268 | on WasapiJobResultFile.filename=DerivativeFile.filename 269 | where WasapiJobResultFile.derivativeFileId is null''') 270 | # TODO: use that result rather than refetching for each result file 271 | cls.update_states(result_files) 272 | 273 | @staticmethod 274 | def update_states(result_files): 275 | jobs_to_update = set() 276 | for result_file in result_files: 277 | if result_file.update_state(): 278 | result_file.save() 279 | jobs_to_update.add(result_file.job) 280 | # TODO: batch the DB queries 281 | for job in jobs_to_update: 282 | job.update_state() 283 | job.save() 284 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/selectors.py: -------------------------------------------------------------------------------- 1 | # Functions that selectively filter a queryset, ie the "guts" of filters. 2 | 3 | # Filtering functions, both abstract (could be in the DRF library) and specific 4 | # to WASAPI, and composite 5 | 6 | 7 | # for WasapiWebdataQueryFilterBackend and executing query of new WasapiJob 8 | def select_webdata_query(querydict, queryset, **kwargs): 9 | sub_filters = [ 10 | select_auth, 11 | select_wasapi_direct_fields, 12 | select_wasapi_mapped_fields, 13 | ] 14 | for filter in sub_filters: 15 | queryset = filter(querydict, queryset, **kwargs) 16 | return queryset 17 | 18 | 19 | # for WasapiAuthFilterBackend 20 | def select_auth(querydict, queryset, **kwargs): 21 | if kwargs['user'].is_superuser: 22 | return queryset # no restriction 23 | if kwargs['user'].is_anonymous(): 24 | return queryset.none() # ie hide it all 25 | return queryset.filter(account_id=kwargs['account'].id) 26 | 27 | 28 | def generate_select_direct_fields(*args): 29 | """Simple filtering on equality and inclusion of fields 30 | 31 | Any given field that is included in the class's multi_field_names is tested 32 | against any of potentially multiple arguments given in the request. 
Any 33 | other (existing) field is tested for equality with the given value.""" 34 | multi_field_names = set(args) 35 | def select_direct_fields(querydict, queryset, **kwargs): 36 | field_names = set( field.name for field in queryset.model._meta.get_fields() ) 37 | for key, value in querydict.items(): 38 | if key in multi_field_names: 39 | filter_value = { key+'__in': querydict.getlist(key) } 40 | queryset = queryset.filter(**filter_value) 41 | elif key in field_names: 42 | queryset = queryset.filter(**{ key:value }) 43 | return queryset 44 | return select_direct_fields 45 | 46 | # for WebdataDirectFieldFilterBackend 47 | select_wasapi_direct_fields = generate_select_direct_fields('collection') 48 | 49 | 50 | def generate_select_mapped_fields(filter_for_parameter): 51 | """Map parameters to ORM filters 52 | 53 | Based on `filter_for_parameter` dictionary mapping HTTP parameter name to 54 | ORM query filter""" 55 | 56 | def select_mapped_fields(querydict, queryset, **kwargs): 57 | for parameter_name, filter_name in filter_for_parameter.items(): 58 | value = querydict.get(parameter_name) 59 | if value: 60 | queryset = queryset.filter(**{filter_name:value}) 61 | return queryset 62 | return select_mapped_fields 63 | 64 | # for WebdataMappedFieldFilterBackend 65 | select_wasapi_mapped_fields = generate_select_mapped_fields({ 66 | 'crawl': 'crawl_job_id', 67 | 'crawl-time-after': 'crawl_time__gte', 68 | 'crawl-time-before': 'crawl_time__lt', 69 | 'crawl-start-after': 'crawl_job__original_start_date__gte', 70 | 'crawl-start-before': 'crawl_job__original_start_date__lt', 71 | }) 72 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/serializers.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from django.conf import settings 3 | from django.core.exceptions import PermissionDenied 4 | from rest_framework import serializers 5 | from rest_framework.pagination import PaginationSerializer 6 | from archiveit.archiveit.models import WarcFile 7 | from archiveit.wasapi.models import WasapiJob 8 | 9 | class WebdataFileSerializer(serializers.HyperlinkedModelSerializer): 10 | # explicitly adding to locals() lets us include '-' in name of fields 11 | locals().update({ 12 | 'filetype': serializers.CharField(), 13 | 'checksums': serializers.SerializerMethodField('checksums_method'), 14 | 'account': serializers.SerializerMethodField('account_method'), 15 | 'collection': serializers.SerializerMethodField('collection_method'), 16 | 'crawl': serializers.SerializerMethodField('crawl_method'), 17 | 'crawl-time': serializers.DateTimeField(source='crawl_time'), 18 | 'crawl-start': serializers.SerializerMethodField('crawl_start_method'), 19 | 'locations': serializers.SerializerMethodField('locations_method')}) 20 | def checksums_method(self, obj): 21 | return { 22 | 'sha1': obj.sha1, 23 | 'md5': obj.md5 } 24 | def account_method(self, obj): 25 | return obj.account_id 26 | def collection_method(self, obj): 27 | return obj.collection_id 28 | def crawl_method(self, obj): 29 | return obj.crawl_job_id 30 | def crawl_start_method(self, obj): 31 | crawl_job = obj.crawl_job 32 | return crawl_job and crawl_job.original_start_date 33 | def locations_method(self, obj): 34 | return ( 35 | [settings.WEBDATA_LOCATION_TEMPLATE % obj.dict_for_location()] + 36 | ([settings.PBOX_LOCATION_TEMPLATE % obj.dict_for_location()] 37 | if obj.pbox_item else [])); 38 | class Meta: 39 | model = WarcFile 40 | fields = ( 
41 | 'filename', 42 | 'filetype', 43 | 'checksums', 44 | 'account', 45 | 'size', 46 | 'collection', 47 | 'crawl', 48 | 'crawl-time', 49 | 'crawl-start', 50 | 'locations') 51 | 52 | class PaginationSerializerOfFiles(PaginationSerializer): 53 | 'Pagination serializer that labels the "results" as "files"' 54 | results_field = 'files' 55 | 56 | 57 | class JobSerializer(serializers.HyperlinkedModelSerializer): 58 | 59 | # explicitly adding to locals() lets us include '-' in name of fields 60 | locals().update({ 61 | 'state': serializers.CharField(read_only=True), 62 | 'account': serializers.PrimaryKeyRelatedField(read_only=True), 63 | 'jobtoken': serializers.SerializerMethodField('jobtoken_method'), 64 | 'submit-time': serializers.DateTimeField(source='submit_time', read_only=True), 65 | 'termination-time': serializers.DateTimeField(source='termination_time', read_only=True)}) 66 | 67 | def jobtoken_method(self, obj): 68 | return str(obj.id) 69 | 70 | class Meta: 71 | model = WasapiJob 72 | fields = ( 73 | 'jobtoken', 74 | 'function', 75 | 'query', 76 | 'submit-time', 77 | 'termination-time', 78 | 'state', 79 | 'account') 80 | 81 | def validate(self, value): 82 | # I'd prefer to use a dedicated method rather than hooking into 83 | # validation routines, but my "to_internal_value" and "create" methods 84 | # don't get called. 85 | # It would be nice if we let each field set its value, but I don't see 86 | # an easy way to do that. 87 | value = self.set_account(value) 88 | value = self.set_user(value) 89 | value = self.set_submit_time(value) 90 | return value 91 | 92 | def set_account(self, value): 93 | account = self.context['request'].user.account 94 | if not account: # eg "system" user 95 | raise PermissionDenied 96 | value['account'] = account 97 | return value 98 | 99 | def set_user(self, value): 100 | value['user'] = self.context['request'].user 101 | return value 102 | 103 | def set_submit_time(self, value): 104 | value['submit_time'] = datetime.now() 105 | return value 106 | 107 | class PaginationSerializerOfJobs(PaginationSerializer): 108 | 'Pagination serializer that labels the "results" as "jobs"' 109 | results_field = 'jobs' 110 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/swagger.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: Draft WASAPI Export API by Archive-It 4 | description: > 5 | WASAPI Export API. What Archive-It will implement. 6 | version: 1.0.0 7 | contact: 8 | name: Jefferson Bailey and Mark Sullivan 9 | url: https://github.com/WASAPI-Community/data-transfer-apis 10 | license: 11 | name: Apache 2.0 12 | url: http://www.apache.org/licenses/LICENSE-2.0.html 13 | consumes: 14 | - application/json 15 | produces: 16 | - application/json 17 | basePath: /v1 18 | schemes: 19 | - https 20 | paths: 21 | /webdata: 22 | get: 23 | summary: Get the archive files I need 24 | description: > 25 | Produces a page of the list of the files accessible to the client 26 | matching all of the parameters. A parameter with multiple options 27 | matches when any option matches; a missing parameter implicitly 28 | matches. 
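      # Illustrative only: GET /v1/webdata?collection=1234&crawl-start-after=2016-11
      # (hypothetical values) pages through the files of collection 1234 whose
      # crawl started at or after the beginning of November 2016.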
29 | parameters: 30 | # pagination 31 | - $ref: '#/parameters/page' 32 | # basic query 33 | - $ref: '#/parameters/filename' 34 | - $ref: '#/parameters/filetype' 35 | # specific to Archive-It 36 | - $ref: '#/parameters/crawl-time-after' 37 | - $ref: '#/parameters/crawl-time-before' 38 | - $ref: '#/parameters/crawl-start-after' 39 | - $ref: '#/parameters/crawl-start-before' 40 | - $ref: '#/parameters/collection' 41 | - $ref: '#/parameters/crawl' 42 | responses: 43 | '200': 44 | description: Success 45 | schema: 46 | $ref: '#/definitions/FileSet' 47 | '400': 48 | description: The request could not be interpreted 49 | '401': 50 | description: The request was unauthorized 51 | /jobs: 52 | get: 53 | summary: What jobs do I have? 54 | description: 55 | Show the jobs on this server accessible to the client 56 | parameters: 57 | - $ref: '#/parameters/page' 58 | responses: 59 | '200': 60 | description: > 61 | Success. Produces a page of the list of the jobs accessible to 62 | the client. 63 | schema: 64 | type: object 65 | required: 66 | - count 67 | - jobs 68 | properties: 69 | count: 70 | type: integer 71 | description: > 72 | The total number of jobs matching the query (across all pages) 73 | previous: 74 | description: > 75 | Link (if any) to the previous page of jobs; otherwise null 76 | type: [string, "null"] 77 | format: url 78 | next: 79 | description: > 80 | Link (if any) to the next page of jobs; otherwise null 81 | type: [string, "null"] 82 | format: url 83 | jobs: 84 | type: array 85 | items: 86 | $ref: '#/definitions/Job' 87 | post: 88 | summary: Make a new job 89 | description: 90 | Create a job to perform some task 91 | parameters: 92 | - name: query 93 | in: formData 94 | required: true 95 | description: > 96 | URL-encoded query as appropriate for /webdata end-point. The empty 97 | query (which matches everything) must explicitly be given as the 98 | empty string. 99 | type: string 100 | - $ref: '#/parameters/function' 101 | - name: parameters 102 | in: formData 103 | required: false 104 | description: > 105 | Other parameters specific to the function and implementation 106 | (URL-encoded). For example: level of compression, priority, time 107 | limit, space limit. Archive-It does not yet accept any such 108 | parameters. 109 | type: string 110 | responses: 111 | '201': 112 | description: > 113 | Job was successfully submitted. Body is the submitted job. 114 | schema: 115 | $ref: '#/definitions/Job' 116 | '400': 117 | description: The request could not be interpreted 118 | '401': 119 | description: The request was unauthorized 120 | '/jobs/{jobtoken}': 121 | get: 122 | summary: How is my job doing? 123 | description: 124 | Retrieve information about a job, both the parameters of its submission 125 | and its current state. If the job is complete, the client can get the 126 | result through a separate request to `jobs/{jobtoken}/result`. 127 | parameters: 128 | - $ref: '#/parameters/jobtoken' 129 | responses: 130 | '200': 131 | description: Success 132 | schema: 133 | $ref: '#/definitions/Job' 134 | '400': 135 | description: The request could not be interpreted 136 | '401': 137 | description: The request was unauthorized 138 | '403': 139 | description: Forbidden 140 | '404': 141 | description: No such job 142 | '410': 143 | description: > 144 | Gone / invalidated. Body may include non-result information about 145 | the job. 146 | '/jobs/{jobtoken}/result': 147 | get: 148 | summary: What is the result of my job? 
149 | description: > 150 | For a complete job, produces a page of the resulting files. 151 | parameters: 152 | - $ref: '#/parameters/page' 153 | - $ref: '#/parameters/jobtoken' 154 | responses: 155 | '200': 156 | description: Success 157 | schema: 158 | $ref: '#/definitions/FileSet' 159 | '301': 160 | description: The job is in a failed state; get details elsewhere 161 | '307': 162 | description: > 163 | The job failed and will never be fixed; get details elsewhere 164 | '400': 165 | description: The request could not be interpreted 166 | '401': 167 | description: The request was unauthorized 168 | '403': 169 | description: Forbidden 170 | '404': 171 | description: No such job or it is not complete 172 | '410': 173 | description: Job is gone / invalidated 174 | '/jobs/{jobtoken}/error': 175 | get: 176 | summary: Why did my job fail? 177 | description: > 178 | Give details about a failed job 179 | parameters: 180 | - $ref: '#/parameters/jobtoken' 181 | responses: 182 | '200': 183 | description: Success (of reporting the error, not of the job itself) 184 | '400': 185 | description: The request could not be interpreted 186 | '401': 187 | description: The request was unauthorized 188 | '403': 189 | description: Forbidden 190 | '404': 191 | description: No such job or it did not fail 192 | '410': 193 | description: Job is gone / invalidated 194 | definitions: 195 | WebdataFile: 196 | description: > 197 | Description of a unit of distribution of web archival data. (This data 198 | type does not include the actual archival data.) Examples: a WARC file, 199 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 200 | type: object 201 | required: 202 | - filename 203 | - checksums 204 | - filetype 205 | - locations 206 | properties: 207 | filename: 208 | type: string 209 | description: The name of the webdata file 210 | filetype: 211 | # TODO: handle compression etc 212 | type: string 213 | description: > 214 | The format of the archive file, eg `warc`, `wat`, `cdx` 215 | checksums: 216 | type: object 217 | items: 218 | type: string 219 | format: hexstring 220 | description: > 221 | Verification of the content of the file. Must include at least one 222 | of MD5 or SHA1. The key specifies the lowercase name of the 223 | algorithm; the element is a hexadecimal string of the checksum 224 | value. 
For example: 225 | {"sha1":"6b4f32a3408b1cd7db9372a63a2053c3ef25c731", 226 | "md5":"766ba6fd3a257edf35d9f42a8dd42a79"} 227 | size: 228 | type: integer 229 | format: int64 230 | description: The size in bytes of the webdata file 231 | collection: 232 | type: integer 233 | format: int64 234 | description: The numeric ID of the collection 235 | crawl: 236 | type: integer 237 | format: int64 238 | description: The numeric ID of the crawl 239 | crawl-time: 240 | type: string 241 | format: date-time 242 | description: Time the original content of the file was crawled 243 | crawl-start: 244 | type: string 245 | format: date-time 246 | description: Time the crawl started 247 | locations: 248 | type: array 249 | items: 250 | type: string 251 | format: url 252 | description: > 253 | A list of (mirrored) sources from which to retrieve (identical copies 254 | of) the webdata file, eg `https://partner.archive-it.org/webdatafile/ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz`, 255 | `/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e` 256 | FileSet: 257 | type: object 258 | required: 259 | - count 260 | - files 261 | properties: 262 | includes-extra: 263 | type: boolean 264 | description: > 265 | When false, the data in the `files` contains nothing extraneous from 266 | what is necessary to satisfy the query or job. When true or absent, 267 | the client must be prepared to handle irrelevant data within the 268 | referenced `files`. 269 | count: 270 | type: integer 271 | description: The total number of files (across all pages) 272 | previous: 273 | description: > 274 | Link (if any) to the previous page of files; otherwise null 275 | type: [string, "null"] 276 | format: url 277 | next: 278 | description: > 279 | Link (if any) to the next page of files; otherwise null 280 | type: [string, "null"] 281 | format: url 282 | files: 283 | type: array 284 | items: 285 | $ref: '#/definitions/WebdataFile' 286 | Job: 287 | type: object 288 | description: > 289 | A job submitted to perform a task. Conceptually, a complete job has a 290 | `result` FileSet, but we avoid sending that potentially large data with 291 | every mention of every job. If the job is complete, the client can get 292 | the result through a separate request to `jobs/{jobtoken}/result`. 293 | required: 294 | - jobtoken 295 | - function 296 | - query 297 | - submit-time 298 | - state 299 | properties: 300 | jobtoken: 301 | type: string 302 | description: > 303 | Identifier unique across the implementation. The implementation 304 | chooses the format. For example: GUID, increasing integer. 305 | function: 306 | $ref: '#/definitions/Function' 307 | query: 308 | type: string 309 | description: > 310 | The specification of what webdata to include in the job. Encoding is 311 | URL-style, eg `param=value&otherparam=othervalue`. 312 | submit-time: 313 | type: string 314 | format: date-time 315 | description: Time of submission, formatted according to RFC3339 316 | termination-time: 317 | type: string 318 | format: date-time 319 | description: > 320 | Time of completion or failure, formatted according to RFC3339 321 | state: 322 | type: string 323 | enum: 324 | - queued 325 | - running 326 | - failed 327 | - complete 328 | - gone 329 | # alas, can't use GFM 330 | description: > 331 | The state of the job through its lifecycle. 332 | `queued`: Job has been submitted and is waiting to run. 333 | `running`: Job is currently running. 334 | `failed`: Job ran but failed. 
335 | `complete`: Job ran and successfully completed; result is available. 336 | `gone`: Job ran, but the result is no longer available (eg deleted 337 | to save storage). 338 | Function: 339 | type: string 340 | enum: 341 | - build-wat 342 | - build-wane 343 | - build-cdx 344 | # This would be the more meaningful place to document the concept of 345 | # "function", but the parameter gives the documentation more space and 346 | # handles GFM. 347 | description: > 348 | The function of the job. See the `function` parameter to the POST that 349 | created the job. 350 | parameters: 351 | # I wish OpenAPI offered a way to define and compose sets of parameters. 352 | # pagination: 353 | page: 354 | name: page 355 | in: query 356 | type: integer 357 | required: false 358 | description: > 359 | One-based index for pagination 360 | # job token: 361 | jobtoken: 362 | name: jobtoken 363 | in: path 364 | description: The job token returned from previous request 365 | required: true 366 | type: string 367 | # basic query: 368 | filename: 369 | name: filename 370 | in: query 371 | type: string 372 | required: false 373 | description: > 374 | A semicolon-separated list of "glob" patterns. In each pattern, a 375 | star `*` matches any string of characters, and a question mark `?` 376 | matches exactly one character. The pattern is matched against the 377 | full basename (ie must match the beginning and end of the filename, 378 | not the full path of directories). 379 | filetype: 380 | name: filetype 381 | in: query 382 | type: string 383 | required: false 384 | description: > 385 | A semicolon-separated list of formats of acceptable archive file, eg 386 | `warc`, `wat`, `cdx` 387 | # Archive-It's implementation of a function 388 | function: 389 | name: function 390 | in: formData 391 | required: true 392 | description: > 393 | One of the following strings which have the following meanings: 394 | 395 | - `build-wat`: Build a WAT file with metadata from the matched archive 396 | files 397 | 398 | - `build-wane`: Build a WANE file with the named entities from the 399 | matched archive files 400 | 401 | - `build-cdx`: Build a CDX file indexing the matched archive files 402 | 403 | type: string 404 | enum: 405 | - build-wat 406 | - build-wane 407 | - build-cdx 408 | # time of crawl (specific to Archive-It): 409 | crawl-time-after: 410 | name: crawl-time-after 411 | type: string 412 | format: date-time 413 | in: query 414 | required: false 415 | description: > 416 | Match resources that were crawled at or after the time given according to 417 | RFC3339. A date given with no time of day means midnight. Coordinated 418 | Universal (UTC) is preferrred and assumed if no timezone is included. 419 | Because `crawl-time-after` matches equal time stamps while 420 | `crawl-time-before` excludes equal time stamps, and because we specify 421 | instants rather than durations implicit from our units, we can smoothly 422 | scale between days and seconds. That is, we specify ranges in the manner 423 | of the C programming language, eg low ≤ x < high. For example, matching 424 | the month of November of 2016 is specified by 425 | `crawl-time-after=2016-11 & crawl-time-before=2016-12` or 426 | equivalently by `crawl-time-after=2016-11-01T00:00:00Z & 427 | crawl-time-before=2016-11-30T16:00:00-08:00`. 
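  # Non-normative example: because `crawl-time-after` is inclusive, `crawl-time-before`
  # is exclusive, and a bare date means midnight UTC, a single calendar day can be
  # selected with, eg, crawl-time-after=2016-11-05&crawl-time-before=2016-11-06,
  # which matches everything crawled on 2016-11-05 (UTC).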
428 | crawl-time-before: 429 | name: crawl-time-before 430 | type: string 431 | format: date-time 432 | in: query 433 | required: false 434 | description: > 435 | Match resources that were crawled strictly before the time given 436 | according to RFC3339. See more detail at `crawl-time-after`. 437 | crawl-start-after: 438 | name: crawl-start-after 439 | type: string 440 | format: date-time 441 | in: query 442 | required: false 443 | description: > 444 | Match resources that were crawled in a job that started at or after the 445 | time given according to RFC3339. (Note that the original content of a 446 | file could be crawled many days after the crawl job started; would you 447 | prefer `crawl-time-after` / `crawl-time-before`?) 448 | crawl-start-before: 449 | name: crawl-start-before 450 | type: string 451 | format: date-time 452 | in: query 453 | required: false 454 | description: > 455 | Match resources that were crawled in a job that started strictly before 456 | the time given according to RFC3339. See more detail at 457 | `crawl-start-after`. 458 | # collection (specific to Archive-It): 459 | collection: 460 | name: collection 461 | type: integer 462 | in: query 463 | required: false 464 | description: > 465 | The numeric ID of one or more collections, given as separate fields. 466 | For only this parameter, WASAPI accepts multiple values and will match 467 | items in any of the specified collections. For example, matching the 468 | items from two collections can be specified by `collection=1 & 469 | collection=2`. 470 | # crawl (specific to Archive-It): 471 | crawl: 472 | name: crawl 473 | type: integer 474 | in: query 475 | required: false 476 | description: > 477 | The numeric ID of the crawl 478 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | # Create your tests here. 
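# Note: this module is just the default Django scaffold; the substantive WASAPI
# tests live in the wasapi/tests/ package (test_fixtures.py and test_job_result.py).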
4 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/tests/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/fixtures.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "pk": 1, 4 | "fields": { 5 | "public_site_settings": null, 6 | "google_analytics_id": "", 7 | "annual_crawl_budget": 9223372036854775807, 8 | "metadata_public": true, 9 | "max_concurrent_test_crawls": null, 10 | "created_by": "", 11 | "partner_description": "", 12 | "created_date": null, 13 | "ignore_robots_option_visible": false, 14 | "public_registry_enabled": 1, 15 | "subscription_end_date": null, 16 | "total_data_budget_in_bytes": null, 17 | "annual_data_budget_in_gbs": 9223372036854775807, 18 | "partner_type": "", 19 | "private_metadata_fields": "", 20 | "deleted": false, 21 | "member_since_date": null, 22 | "partner_url": "", 23 | "active": true, 24 | "account_type": "", 25 | "last_updated_date": "2014-07-10T00:28:56.454", 26 | "organization_name": "test origanization", 27 | "custom_name": "", 28 | "logo_blob": null, 29 | "last_updated_by": "", 30 | "billing_period_start_date": "2014-07-10T00:28:56.454", 31 | "tag": "", 32 | "feed_enabled": false 33 | }, 34 | "model": "accounts.account" 35 | }, 36 | { 37 | "pk": 2, 38 | "fields": { 39 | "public_site_settings": null, 40 | "google_analytics_id": "", 41 | "annual_crawl_budget": 9223372036854775807, 42 | "metadata_public": true, 43 | "max_concurrent_test_crawls": null, 44 | "created_by": "", 45 | "partner_description": "", 46 | "created_date": null, 47 | "ignore_robots_option_visible": false, 48 | "public_registry_enabled": 1, 49 | "subscription_end_date": null, 50 | "total_data_budget_in_bytes": null, 51 | "annual_data_budget_in_gbs": 9223372036854775807, 52 | "partner_type": "", 53 | "private_metadata_fields": "", 54 | "deleted": false, 55 | "member_since_date": null, 56 | "partner_url": "", 57 | "active": true, 58 | "account_type": "", 59 | "last_updated_date": "2014-07-10T00:29:40.286", 60 | "organization_name": "another test origanization", 61 | "custom_name": "", 62 | "logo_blob": null, 63 | "last_updated_by": "", 64 | "billing_period_start_date": "2014-07-10T00:29:40.286", 65 | "tag": "", 66 | "feed_enabled": false 67 | }, 68 | "model": "accounts.account" 69 | }, 70 | { 71 | "pk": 1, 72 | "fields": { 73 | "email": "system@archive-it.org", 74 | "beta_opt_in": 1, 75 | "deleted": 0, 76 | "full_name": "", 77 | "date_joined": "2014-07-10T00:27:24.045", 78 | "last_updated_date": null, 79 | "username": "system", 80 | "time_zone_abbreviation": "", 81 | "created_by": "", 82 | "created_date": null, 83 | "temporary_password": null, 84 | "last_updated_by": "", 85 | "password": "", 86 | "last_login": "2014-07-10T00:27:24.045", 87 | "require_change_password": 0, 88 | "password_hash": "cf9c14fc1bcf5d4de8b912a8f2456e6bb46a470f", 89 | "language": "" 90 | }, 91 | "model": "accounts.user" 92 | }, 93 | { 94 | "pk": 2, 95 | "fields": { 96 | "email": "", 97 | "beta_opt_in": 1, 98 | "deleted": 0, 99 | "full_name": "", 100 | "date_joined": null, 101 | "last_updated_date": null, 102 | "username": "authuser", 103 | "time_zone_abbreviation": "", 104 | "created_by": "", 105 | 
"created_date": null, 106 | "temporary_password": null, 107 | "last_updated_by": "", 108 | "password": "", 109 | "last_login": "2014-07-10T00:29:01.162", 110 | "require_change_password": 0, 111 | "password_hash": "cb879e21d380ed6ac814cdf38a1aff5dec2a13ef", 112 | "language": "" 113 | }, 114 | "model": "accounts.user" 115 | }, 116 | { 117 | "pk": 3, 118 | "fields": { 119 | "email": "", 120 | "beta_opt_in": 1, 121 | "deleted": 0, 122 | "full_name": "", 123 | "date_joined": null, 124 | "last_updated_date": null, 125 | "username": "authuser2", 126 | "time_zone_abbreviation": "", 127 | "created_by": "", 128 | "created_date": null, 129 | "temporary_password": null, 130 | "last_updated_by": "", 131 | "password": "", 132 | "last_login": "2014-07-10T00:29:17.850", 133 | "require_change_password": 0, 134 | "password_hash": "a8d9d1882162af890fb2b60738bf3fc3e8090f33", 135 | "language": "" 136 | }, 137 | "model": "accounts.user" 138 | }, 139 | { 140 | "pk": 4, 141 | "fields": { 142 | "email": "", 143 | "beta_opt_in": 1, 144 | "deleted": 0, 145 | "full_name": "", 146 | "date_joined": null, 147 | "last_updated_date": null, 148 | "username": "anotheruser", 149 | "time_zone_abbreviation": "", 150 | "created_by": "", 151 | "created_date": null, 152 | "temporary_password": null, 153 | "last_updated_by": "", 154 | "password": "", 155 | "last_login": "2014-07-10T00:29:44.106", 156 | "require_change_password": 0, 157 | "password_hash": "7f965560c9f2ce126407eda7c7dbbdb75037ef4d", 158 | "language": "" 159 | }, 160 | "model": "accounts.user" 161 | }, 162 | { 163 | "pk": 1, 164 | "fields": { 165 | "topics": "", 166 | "deleted": 0, 167 | "image": null, 168 | "account": 1, 169 | "last_updated_date": "2016-06-14T19:30:50.999", 170 | "state": "", 171 | "created_by": "authuser2", 172 | "created_date": "2014-07-10T00:29:25.215", 173 | "name": "Private Test Collection", 174 | "last_updated_by": null, 175 | "tag": "", 176 | "publicly_visible": false, 177 | "oai_exported": false 178 | }, 179 | "model": "archiveit.collection" 180 | }, 181 | { 182 | "pk": 2, 183 | "fields": { 184 | "topics": "", 185 | "deleted": 0, 186 | "image": null, 187 | "account": 1, 188 | "last_updated_date": "2016-06-14T19:30:51.032", 189 | "state": "", 190 | "created_by": "authuser", 191 | "created_date": "2014-07-10T00:29:33.943", 192 | "name": "Public Test Collection", 193 | "last_updated_by": null, 194 | "tag": "", 195 | "publicly_visible": true, 196 | "oai_exported": false 197 | }, 198 | "model": "archiveit.collection" 199 | }, 200 | { 201 | "pk": 3, 202 | "fields": { 203 | "topics": "", 204 | "deleted": 0, 205 | "image": null, 206 | "account": 2, 207 | "last_updated_date": "2016-06-14T19:30:51.064", 208 | "state": "", 209 | "created_by": "anotheruser", 210 | "created_date": "2014-07-10T00:29:56.855", 211 | "name": "Another Public Test Collection", 212 | "last_updated_by": null, 213 | "tag": "", 214 | "publicly_visible": true, 215 | "oai_exported": false 216 | }, 217 | "model": "archiveit.collection" 218 | }, 219 | { 220 | "pk": 1, 221 | "fields": { 222 | "patch_for_qa_job": null, 223 | "byte_limit": null, 224 | "test": false, 225 | "time_limit": 600, 226 | "collection": 1, 227 | "account": 1, 228 | "patch_ignore_robots": 0, 229 | "one_time_subtype": "", 230 | "document_limit": null, 231 | "recurrence_type": "", 232 | "pdfs_only": false 233 | }, 234 | "model": "archiveit.crawldefinition" 235 | }, 236 | { 237 | "pk": 2, 238 | "fields": { 239 | "patch_for_qa_job": null, 240 | "byte_limit": null, 241 | "test": false, 242 | "time_limit": 600, 243 | 
"collection": 2, 244 | "account": 1, 245 | "patch_ignore_robots": 0, 246 | "one_time_subtype": "", 247 | "document_limit": null, 248 | "recurrence_type": "", 249 | "pdfs_only": false 250 | }, 251 | "model": "archiveit.crawldefinition" 252 | }, 253 | { 254 | "pk": 1, 255 | "fields": { 256 | "host": "http://example.org", 257 | "end_date": null, 258 | "duplicate_count": null, 259 | "queued_count": null, 260 | "type": "0", 261 | "resumption_count": null, 262 | "document_limit": 100000, 263 | "test_crawl_state_changed_by": null, 264 | "patch_for_qa_job": null, 265 | "downloaded_count": null, 266 | "total_data_in_kbs": null, 267 | "url": "", 268 | "crawl_stop_requested": null, 269 | "novel_count": null, 270 | "novel_bytes": null, 271 | "port": 80, 272 | "kb_rate": null, 273 | "recurrence_type": "NONE", 274 | "pdfs_only": 0, 275 | "thread_count": null, 276 | "test": 0, 277 | "uid": 791728987711, 278 | "collection": 1, 279 | "account": 1, 280 | "processing_end_date": null, 281 | "doc_rate": null, 282 | "discovered_count": null, 283 | "status": "", 284 | "warc_revisit_count": null, 285 | "warc_content_bytes": null, 286 | "download_failures": null, 287 | "description": "", 288 | "duplicate_bytes": null, 289 | "byte_limit": null, 290 | "job_name": "1", 291 | "original_start_date": "2014-07-10T00:29:25.234", 292 | "warc_compressed_bytes": null, 293 | "scheduled_crawl_event": null, 294 | "start_date": "2014-07-10T00:29:25.234", 295 | "workflow_step": null, 296 | "elapsed_ms": null, 297 | "current_kb_rate": null, 298 | "current_doc_rate": null, 299 | "time_limit": 100000, 300 | "test_crawl_state": "False", 301 | "warc_url_count": null, 302 | "warc_uncompressed_bytes": null 303 | }, 304 | "model": "archiveit.crawljob" 305 | }, 306 | { 307 | "pk": 2, 308 | "fields": { 309 | "host": "http://example.com", 310 | "end_date": null, 311 | "duplicate_count": null, 312 | "queued_count": null, 313 | "type": "0", 314 | "resumption_count": null, 315 | "document_limit": 100000, 316 | "test_crawl_state_changed_by": null, 317 | "patch_for_qa_job": null, 318 | "downloaded_count": null, 319 | "total_data_in_kbs": null, 320 | "url": "", 321 | "crawl_stop_requested": null, 322 | "novel_count": null, 323 | "novel_bytes": null, 324 | "port": 80, 325 | "kb_rate": null, 326 | "recurrence_type": "NONE", 327 | "pdfs_only": 0, 328 | "thread_count": null, 329 | "test": 0, 330 | "uid": 728791872212, 331 | "collection": 2, 332 | "account": 1, 333 | "processing_end_date": null, 334 | "doc_rate": null, 335 | "discovered_count": null, 336 | "status": "", 337 | "warc_revisit_count": null, 338 | "warc_content_bytes": null, 339 | "download_failures": null, 340 | "description": "", 341 | "duplicate_bytes": null, 342 | "byte_limit": null, 343 | "job_name": "2", 344 | "original_start_date": "2014-07-10T00:29:33.963", 345 | "warc_compressed_bytes": null, 346 | "scheduled_crawl_event": null, 347 | "start_date": "2014-07-10T00:29:33.963", 348 | "workflow_step": null, 349 | "elapsed_ms": null, 350 | "current_kb_rate": null, 351 | "current_doc_rate": null, 352 | "time_limit": 100000, 353 | "test_crawl_state": "False", 354 | "warc_url_count": null, 355 | "warc_uncompressed_bytes": null 356 | }, 357 | "model": "archiveit.crawljob" 358 | }, 359 | { 360 | "pk": 3, 361 | "fields": { 362 | "host": "http://examyank.com", 363 | "end_date": null, 364 | "duplicate_count": null, 365 | "queued_count": null, 366 | "type": "0", 367 | "resumption_count": null, 368 | "document_limit": 100000, 369 | "test_crawl_state_changed_by": null, 370 | "patch_for_qa_job": 
null, 371 | "downloaded_count": null, 372 | "total_data_in_kbs": null, 373 | "url": "", 374 | "crawl_stop_requested": null, 375 | "novel_count": null, 376 | "novel_bytes": null, 377 | "port": 80, 378 | "kb_rate": null, 379 | "recurrence_type": "NONE", 380 | "pdfs_only": 0, 381 | "thread_count": null, 382 | "test": 0, 383 | "uid": 728791872212, 384 | "collection": 2, 385 | "account": 1, 386 | "processing_end_date": null, 387 | "doc_rate": null, 388 | "discovered_count": null, 389 | "status": "", 390 | "warc_revisit_count": null, 391 | "warc_content_bytes": null, 392 | "download_failures": null, 393 | "description": "", 394 | "duplicate_bytes": null, 395 | "byte_limit": null, 396 | "job_name": "2", 397 | "original_start_date": "2015-05-15T15:55:55.543", 398 | "warc_compressed_bytes": null, 399 | "scheduled_crawl_event": null, 400 | "start_date": "2015-05-15T15:55:55.543", 401 | "workflow_step": null, 402 | "elapsed_ms": null, 403 | "current_kb_rate": null, 404 | "current_doc_rate": null, 405 | "time_limit": 100000, 406 | "test_crawl_state": "False", 407 | "warc_url_count": null, 408 | "warc_uncompressed_bytes": null 409 | }, 410 | "model": "archiveit.crawljob" 411 | }, 412 | { 413 | "pk": 4, 414 | "fields": { 415 | "host": "http://yaketty.com", 416 | "end_date": null, 417 | "duplicate_count": null, 418 | "queued_count": null, 419 | "type": "0", 420 | "resumption_count": null, 421 | "document_limit": 100000, 422 | "test_crawl_state_changed_by": null, 423 | "patch_for_qa_job": null, 424 | "downloaded_count": null, 425 | "total_data_in_kbs": null, 426 | "url": "", 427 | "crawl_stop_requested": null, 428 | "novel_count": null, 429 | "novel_bytes": null, 430 | "port": 80, 431 | "kb_rate": null, 432 | "recurrence_type": "NONE", 433 | "pdfs_only": 0, 434 | "thread_count": null, 435 | "test": 0, 436 | "uid": 728791872212, 437 | "collection": 3, 438 | "account": 2, 439 | "processing_end_date": null, 440 | "doc_rate": null, 441 | "discovered_count": null, 442 | "status": "", 443 | "warc_revisit_count": null, 444 | "warc_content_bytes": null, 445 | "download_failures": null, 446 | "description": "", 447 | "duplicate_bytes": null, 448 | "byte_limit": null, 449 | "job_name": "2", 450 | "original_start_date": "2015-05-15T15:55:55.543", 451 | "warc_compressed_bytes": null, 452 | "scheduled_crawl_event": null, 453 | "start_date": "2015-05-15T15:55:55.543", 454 | "workflow_step": null, 455 | "elapsed_ms": null, 456 | "current_kb_rate": null, 457 | "current_doc_rate": null, 458 | "time_limit": 100000, 459 | "test_crawl_state": "False", 460 | "warc_url_count": null, 461 | "warc_uncompressed_bytes": null 462 | }, 463 | "model": "archiveit.crawljob" 464 | }, 465 | { 466 | "pk": "DAILY", 467 | "fields": { 468 | "interval_sec": 86400 469 | }, 470 | "model": "archiveit.frequency" 471 | }, 472 | { 473 | "pk": "WEEKLY", 474 | "fields": { 475 | "interval_sec": 604800 476 | }, 477 | "model": "archiveit.frequency" 478 | }, 479 | { 480 | "pk": 1, 481 | "fields": { 482 | "last_checked_http_response_code": null, 483 | "deleted": false, 484 | "collection": 1, 485 | "login_password": "", 486 | "seed_group": null, 487 | "active": true, 488 | "last_updated_date": "2016-06-14T19:30:50.913", 489 | "seed_type": "", 490 | "http_response_code": null, 491 | "created_by": null, 492 | "crawl_definition": 1, 493 | "created_date": "2014-07-10T00:29:25.260", 494 | "valid": false, 495 | "last_updated_by": null, 496 | "canonical_url": "http://example.com/", 497 | "login_username": "", 498 | "publicly_visible": false, 499 | "url": 
"http://example.com" 500 | }, 501 | "model": "archiveit.seed" 502 | }, 503 | { 504 | "pk": 2, 505 | "fields": { 506 | "last_checked_http_response_code": null, 507 | "deleted": false, 508 | "collection": 2, 509 | "login_password": "", 510 | "seed_group": null, 511 | "active": true, 512 | "last_updated_date": "2016-06-14T19:30:50.953", 513 | "seed_type": "", 514 | "http_response_code": null, 515 | "created_by": null, 516 | "crawl_definition": 2, 517 | "created_date": "2014-07-10T00:29:33.992", 518 | "valid": false, 519 | "last_updated_by": null, 520 | "canonical_url": "http://google.com/", 521 | "login_username": "", 522 | "publicly_visible": true, 523 | "url": "http://www.google.com" 524 | }, 525 | "model": "archiveit.seed" 526 | }, 527 | { 528 | "pk": 1, 529 | "fields": { 530 | "name": "Test SeedGroup", 531 | "collection": 1 532 | }, 533 | "model": "archiveit.seedgroup" 534 | }, 535 | { 536 | "pk": 2, 537 | "fields": { 538 | "name": "Test SeedGroup 2", 539 | "collection": 2 540 | }, 541 | "model": "archiveit.seedgroup" 542 | }, 543 | { 544 | "pk": 1, 545 | "fields": { 546 | "user": 2, 547 | "nickname": "", 548 | "crawl_email_enabled": false, 549 | "account": 1 550 | }, 551 | "model": "accounts.accountmember" 552 | }, 553 | { 554 | "pk": 2, 555 | "fields": { 556 | "user": 3, 557 | "nickname": "", 558 | "crawl_email_enabled": false, 559 | "account": 1 560 | }, 561 | "model": "accounts.accountmember" 562 | }, 563 | { 564 | "pk": 3, 565 | "fields": { 566 | "user": 4, 567 | "nickname": "", 568 | "crawl_email_enabled": false, 569 | "account": 2 570 | }, 571 | "model": "accounts.accountmember" 572 | }, 573 | { 574 | "pk": 1, 575 | "fields": { 576 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000.warc.gz", 577 | "md5": "119ceaf851143f0036f83fe6f9d59711", 578 | "sha1": "11e928116cf3ff3efe8818c706b616447b7b8e11", 579 | "size": 10910469, 580 | "store_time": "2014-07-10T02:22:22.234", 581 | "crawl_time": "2014-07-10T00:35:44.123", 582 | "account": 1, 583 | "collection": 2, 584 | "crawl_job": 2, 585 | "pbox_item": null, 586 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000.warc.gz", 587 | "all_pbox_items": null 588 | }, 589 | "model": "archiveit.warcfile" 590 | }, 591 | { 592 | "pk": 2, 593 | "fields": { 594 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000.warc.gz", 595 | "md5": "229ceaf851143f0036f83fe6f9d59722", 596 | "sha1": "22e928116cf3ff3efe8818c706b616447b7b8e22", 597 | "size": 10220229, 598 | "store_time": "2014-07-10T02:22:24.424", 599 | "crawl_time": "2014-07-10T01:40:44.456", 600 | "account": 1, 601 | "collection": 2, 602 | "crawl_job": 2, 603 | "pbox_item": null, 604 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000.warc.gz", 605 | "all_pbox_items": null 606 | }, 607 | "model": "archiveit.warcfile" 608 | }, 609 | { 610 | "pk": 3, 611 | "fields": { 612 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000.warc.gz", 613 | "md5": "339ceaf851143f0036f83fe6f9d59733", 614 | "sha1": "33e928116cf3ff3efe8818c706b616447b7b8e33", 615 | "size": 10220229, 616 | "store_time": "2014-07-10T02:01:01.012", 617 | "crawl_time": "2014-07-10T01:01:01.012", 618 | "account": 1, 619 | "collection": 2, 620 | "crawl_job": 3, 621 | "pbox_item": null, 622 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 623 | "all_pbox_items": null 624 | }, 625 | "model": "archiveit.warcfile" 
626 | }, 627 | { 628 | "pk": 4, 629 | "fields": { 630 | "filename": "ARCHIVEIT-1-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000.warc.gz", 631 | "md5": "449ceaf851143f0036f83fe6f9d59744", 632 | "sha1": "44e928116cf3ff3efe8818c706b616447b7b8e44", 633 | "size": 10220229, 634 | "store_time": "2014-07-10T02:01:01.012", 635 | "crawl_time": "2014-07-10T01:01:01.012", 636 | "account": 1, 637 | "collection": 1, 638 | "crawl_job": 1, 639 | "pbox_item": null, 640 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 641 | "all_pbox_items": null 642 | }, 643 | "model": "archiveit.warcfile" 644 | }, 645 | { 646 | "pk": 5, 647 | "fields": { 648 | "filename": "ARCHIVEIT-1-CRAWL_SELECTED_SEEDS-JOB1-20140710010101012-00000.warc.gz", 649 | "md5": "559ceaf851143f0036f83fe6f9d59755", 650 | "sha1": "55e928116cf3ff3efe8818c706b616447b7b8e55", 651 | "size": 10220229, 652 | "store_time": "2014-07-10T02:01:01.012", 653 | "crawl_time": "2014-07-10T01:01:01.012", 654 | "account": 2, 655 | "collection": 3, 656 | "crawl_job": 4, 657 | "pbox_item": null, 658 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 659 | "all_pbox_items": null 660 | }, 661 | "model": "archiveit.warcfile" 662 | } 663 | ] 664 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/test_fixtures.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | from archiveit.archiveit.models import CrawlJob, WarcFile 4 | 5 | class TestFixtures(TestCase): 6 | '''Ensure we have the set of fixtures other tests depend on''' 7 | fixtures = ['archiveit/wasapi/tests/fixtures.json'] 8 | 9 | def test_nested_sets_of_warcfiles(self): 10 | '''To test an automatically built query, we need some WarcFiles that 11 | match it and some that don't match it''' 12 | # the interesting WarcFile at the core of the nested sets 13 | awarc = WarcFile.objects.get(id=1) 14 | self.assertGreater( 15 | len(WarcFile.objects.filter(crawl_job_id=awarc.crawl_job_id)), 16 | 1, 17 | "Should have multiple WarcFiles in the crawl job") 18 | self.assertGreater( 19 | len(WarcFile.objects.filter(collection_id=awarc.collection_id)), 20 | len(WarcFile.objects.filter(crawl_job_id=awarc.crawl_job_id)), 21 | "Should have WarcFiles in the collection outside the crawl job") 22 | self.assertGreater( 23 | len(WarcFile.objects.filter(account_id=awarc.account_id)), 24 | len(WarcFile.objects.filter(collection_id=awarc.collection_id)), 25 | "Should have WarcFiles in the account outside the collection") 26 | self.assertGreater( 27 | len(WarcFile.objects.all()), 28 | len(WarcFile.objects.filter(account_id=awarc.account_id)), 29 | "Should have WarcFiles outside the account") 30 | 31 | def test_fields_of_warcfiles_are_unique(self): 32 | for fieldname in ['filename','md5','sha1']: 33 | self.assertEqual( 34 | conflicts(WarcFile.objects.all(), fieldname), {}, 35 | "%s values should be unique across WarcFiles" % (fieldname)) 36 | 37 | def test_denormalization(self): 38 | self.assertEqual( 39 | [warcfile for warcfile in WarcFile.objects.all() 40 | if warcfile.crawl_job and 41 | warcfile.collection_id != warcfile.crawl_job.collection_id], 42 | [], 43 | "each warcfile should match its crawl job's collection") 44 | self.assertEqual( 45 | [warcfile for warcfile in WarcFile.objects.all() 46 | if warcfile.crawl_job and 47 | warcfile.account_id != 
warcfile.crawl_job.account_id], 48 | [], 49 | "each warcfile should match its crawl job's account") 50 | self.assertEqual( 51 | [crawl_job for crawl_job in CrawlJob.objects.all() 52 | if crawl_job.collection.account_id != crawl_job.account_id], 53 | [], 54 | "each crawl job should match its collection's account") 55 | 56 | 57 | def conflicts(col, fieldname): 58 | '''Returns a dict of lists of conflicting elements keyed by their elements' value from partition_key''' 59 | partition_key = lambda obj: obj.__getattribute__(fieldname) 60 | return dict( 61 | [k,v] for k,v in partition(col, partition_key, list).items() 62 | if len(v) > 1 ) 63 | 64 | def partition(col, partition_key, empty_col=None): 65 | '''Returns a dict to the set of elements keyed by their value from partition_key''' 66 | if empty_col == None: 67 | empty_col = type(col) 68 | ret = {} 69 | for item in col: 70 | key = partition_key(item) 71 | ret[key] = ret.get(key, empty_col()) 72 | ret[key].append(item) 73 | return ret 74 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/test_job_result.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | from archiveit.accounts.models import Account, User 4 | from archiveit.archiveit.models import Collection, CrawlJob, DerivativeFile 5 | from archiveit.wasapi.models import WasapiJob, WasapiJobResultFile 6 | from archiveit.wasapi.views import WebdataQueryViewSet 7 | 8 | class TestJobResult(TestCase): 9 | fixtures = ['archiveit/wasapi/tests/fixtures.json'] 10 | 11 | def test_query_execution(self): 12 | cases = [ 13 | ('crawl=2', {1,2}, "Query by crawl job"), 14 | ('collection=2', {1,2,3}, "Query by collection"), 15 | ('', {1,2,3,4}, "Empty query ie query by account"), 16 | ] 17 | for query, ideal, msg in cases: 18 | with self.subTest(query=query): 19 | source_files = WasapiJob( 20 | query=query, 21 | account=Account.objects.get(id=1), 22 | user=User.objects.get(username='authuser'), 23 | ).query_just_like_webdataqueryviewset() 24 | self.assertEqual(set(wf.id for wf in source_files), ideal, msg) 25 | 26 | def test_creation_of_resultfiles(self): 27 | job = WasapiJob( 28 | query='collection=2', 29 | function='build-wat', 30 | account=Account.objects.get(id=1), 31 | user=User.objects.get(username='authuser')) 32 | job.save() 33 | result_files = WasapiJobResultFile.objects.filter(job_id=job.id) 34 | ideal = { 35 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz", 36 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000_warc.wat.gz", 37 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000_warc.wat.gz"} 38 | self.assertEqual( set(f.filename for f in result_files), ideal, 39 | "Create result files") 40 | 41 | def test_vacuous_job_should_update_its_own_state(self): 42 | '''"Vacuous" means the job needs no result files''' 43 | # partner creates job via DRF 44 | job = WasapiJob( 45 | query="filename=nonesuch", 46 | function='build-wat', 47 | account=Account.objects.get(id=1), 48 | user=User.objects.get(username='authuser'), 49 | ) 50 | job.save() # DRF calls save 51 | self.assertEqual(job.state, WasapiJob.COMPLETE, 52 | "Vacuous job should be in state complete after save") 53 | self.assertIsNotNone(job.termination_time, 54 | "Vacuous job should have a termination time after save") 55 | 56 | def test_freebie_job_should_update_its_own_state(self): 57 | '''"Freebie" means the job is satisfied by pre-existing result 
files''' 58 | account = Account.objects.get(id=1) 59 | user = User.objects.get(username='authuser') 60 | collection = Collection.objects.get(id=2) 61 | # some earlier job derives some files 62 | earlier_job = WasapiJob(function='build-wat',account=account,user=user) 63 | earlier_job.save() 64 | already_deriveds = [ 65 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000", 66 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 67 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 68 | ] 69 | for rootname in already_deriveds: 70 | filename = rootname + '_warc.wat.gz' 71 | df = DerivativeFile(filename=filename, 72 | size=1, account=account, collection=collection) 73 | df.save() 74 | rf = WasapiJobResultFile(filename=filename, 75 | derivative_file=df, job=earlier_job) 76 | rf.save() 77 | # partner creates job via DRF 78 | job = WasapiJob( 79 | query='collection=%d' % (collection.id), 80 | function='build-wat', 81 | account=account, 82 | user=user) 83 | job.save() # DRF calls save 84 | self.assertEqual(job.state, WasapiJob.COMPLETE, 85 | "Freebie job should be in state complete after save") 86 | self.assertIsNotNone(job.termination_time, 87 | "Freebie job should have a termination time after save") 88 | 89 | def test_juicy_job_should_update_state_upon_derive(self): 90 | '''"Juicy" means the job needs fresh result files''' 91 | account = Account.objects.get(id=1) 92 | user = User.objects.get(username='authuser') 93 | collection = Collection.objects.get(id=2) 94 | # partner creates job via DRF 95 | job = WasapiJob( 96 | query='collection=%d' % (collection.id), 97 | function='build-wat', 98 | account=account, 99 | user=user, 100 | ) 101 | job.save() # DRF calls save 102 | self.assertEqual(job.state, WasapiJob.QUEUED, 103 | "Juicy job should remain in state queued after save") 104 | self.assertIsNone(job.termination_time, 105 | "Juicy job should not have a termination time after save") 106 | 107 | # archivist manually changes its state 108 | job.state = WasapiJob.RUNNING 109 | job.save() 110 | self.assertEqual(job.state, WasapiJob.RUNNING, 111 | "Juicy job runs") 112 | 113 | # the first file is derived 114 | first_rf = WasapiJobResultFile.objects.get(filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz') 115 | first_df = DerivativeFile( 116 | filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz', 117 | size=1, account=account, collection=collection) 118 | first_df.save() 119 | # script notifies us of first derivative 120 | WasapiJobResultFile.update_states(WasapiJobResultFile.objects.filter(filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz')) 121 | self.assertEqual(job.state, WasapiJob.RUNNING, 122 | "Juicy job should remain in state running during derivation") 123 | self.assertIsNone(job.termination_time, 124 | "Juicy job should not have a termination time during derivation") 125 | 126 | # the other files are derived 127 | other_basenames = [ 128 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 129 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 130 | ] 131 | for basename in other_basenames: 132 | other_rf = WasapiJobResultFile.objects.get( 133 | filename=basename+'_warc.wat.gz') 134 | other_df = DerivativeFile(filename=basename+'_warc.wat.gz', 135 | size=1, account=account, collection=collection) 136 | other_df.save() 137 | # script notifies us of other derivatives 138 | for basename in other_basenames: 139 | 
WasapiJobResultFile.update_states(WasapiJobResultFile.objects.filter(filename=basename+'_warc.wat.gz')) 140 | job.refresh_from_db() 141 | self.assertEqual(job.state, WasapiJob.COMPLETE, 142 | "Juicy job should change to state completed when files derived") 143 | self.assertIsNotNone(job.termination_time, 144 | "Juicy job should get a termination time when files derived") 145 | 146 | def test_juicy_job_should_update_state_upon_cron(self): 147 | '''"Juicy" means the job needs fresh result files''' 148 | account = Account.objects.get(id=1) 149 | user = User.objects.get(username='authuser') 150 | collection = Collection.objects.get(id=2) 151 | # partner creates job via DRF 152 | job = WasapiJob( 153 | query='collection=%d' % (collection.id), 154 | function='build-wat', 155 | account=account, 156 | user=user) 157 | job.save() # DRF calls save 158 | self.assertEqual(job.state, WasapiJob.QUEUED, 159 | "Juicy job should remain in state queued after save") 160 | self.assertIsNone(job.termination_time, 161 | "Juicy job should not have a termination time after save") 162 | 163 | # archivist manually changes its state 164 | job.state = WasapiJob.RUNNING 165 | job.save() 166 | self.assertEqual(job.state, WasapiJob.RUNNING, 167 | "Juicy job runs") 168 | 169 | # the files are derived 170 | other_basenames = [ 171 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000", 172 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 173 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 174 | ] 175 | for basename in other_basenames: 176 | other_rf = WasapiJobResultFile.objects.get( 177 | filename=basename+'_warc.wat.gz') 178 | other_df = DerivativeFile(filename=basename+'_warc.wat.gz', 179 | size=1, account=account, collection=collection) 180 | other_df.save() 181 | # but the script somehow doesn't notify us of other derivatives 182 | self.assertEqual(job.state, WasapiJob.RUNNING, 183 | "Juicy job should remain in state running without notification") 184 | self.assertIsNone(job.termination_time, 185 | "Juicy job should remain without a termination time without notification") 186 | # cronjob triggers clean up 187 | WasapiJobResultFile.update_completed_result_files() 188 | job.refresh_from_db() 189 | self.assertEqual(job.state, WasapiJob.COMPLETE, 190 | "Juicy job should change to state completed after clean up") 191 | self.assertIsNotNone(job.termination_time, 192 | "Juicy job should get a termination time after clean up") 193 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/urls.py: -------------------------------------------------------------------------------- 1 | from django.core.urlresolvers import RegexURLPattern 2 | from rest_framework.routers import DefaultRouter 3 | from archiveit.wasapi.views import WebdataQueryViewSet, JobsViewSet, update_result_file_state, JobResultViewSet 4 | 5 | router = DefaultRouter(trailing_slash=False) 6 | router.register(r'webdata', WebdataQueryViewSet) 7 | router.register(r'jobs/(?P\d+)/result', JobResultViewSet) 8 | router.register(r'jobs', JobsViewSet) 9 | router.urls.append(RegexURLPattern(r'^update_result_file_state/(?P.*)', update_result_file_state)) 10 | urlpatterns = router.urls 11 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/views.py: -------------------------------------------------------------------------------- 1 | from django.http import HttpResponse 2 | from rest_framework import viewsets 3 | from 
rest_framework.response import Response 4 | from archiveit.wasapi.serializers import WebdataFileSerializer, PaginationSerializerOfFiles, JobSerializer, PaginationSerializerOfJobs 5 | from archiveit.wasapi.filters import WasapiWebdataQueryFilterBackend, WasapiAuthFilterBackend, WasapiAuthJobBackend 6 | from archiveit.archiveit.models import WarcFile 7 | from archiveit.wasapi.models import WasapiJob, WasapiJobResultFile 8 | 9 | 10 | class WebdataQueryViewSet(viewsets.ModelViewSet): 11 | """ 12 | API endpoint that allows webdata files to be queried for and listed. 13 | """ 14 | # TODO: decide how to order results 15 | queryset = WarcFile.objects.all().order_by('-id') 16 | serializer_class = WebdataFileSerializer 17 | # selector shared with WasapiJob.query_just_like_webdataqueryviewset 18 | filter_backends = [WasapiWebdataQueryFilterBackend] 19 | pagination_serializer_class = PaginationSerializerOfFiles 20 | paginate_by_param = 'page_size' 21 | paginate_by = 100 22 | max_paginate_by = 2000 23 | 24 | def list(self, request, *args, **kwargs): 25 | """Cloned (but trimmed) from ModelViewSet.list""" 26 | self.object_list = self.filter_queryset(self.get_queryset()) 27 | # always paginate our responses because that's how we implement the spec 28 | page = self.paginate_queryset(self.object_list) 29 | serializer = self.get_pagination_serializer(page) 30 | # The change to add other fields: 31 | # The current implementation doesn't support any query that could 32 | # include extra data, so we can hard-code False. We must revisit 33 | # this as we add other queries. 34 | serializer.fields['includes-extra'] = WedgeValueIntoObjectField( 35 | value=False, label='includes-extra') 36 | serializer.fields['request-url'] = WedgeValueIntoObjectField( 37 | value=request._request.build_absolute_uri(), label='request-url') 38 | return Response(serializer.data) 39 | 40 | 41 | class JobsViewSet(viewsets.ModelViewSet): 42 | """ 43 | API endpoint that allows WASAPI jobs to be created and monitored. 44 | """ 45 | queryset = WasapiJob.objects.all().order_by('-id') 46 | serializer_class = JobSerializer 47 | filter_backends = [WasapiAuthFilterBackend] 48 | pagination_serializer_class = PaginationSerializerOfJobs 49 | paginate_by_param = 'page_size' 50 | paginate_by = 100 51 | max_paginate_by = 2000 52 | 53 | 54 | def update_result_file_state(request, filename): 55 | result_files = WasapiJobResultFile.objects.filter(filename=filename) 56 | if not result_files: 57 | return HttpResponse("", status=404) 58 | WasapiJobResultFile.update_states(result_files) 59 | return HttpResponse("") 60 | 61 | 62 | class JobResultViewSet(viewsets.ModelViewSet): 63 | """ 64 | API endpoint that gives the result of a WASAPI job. 
65 | """ 66 | queryset = WasapiJobResultFile.objects.all().order_by('-id') 67 | serializer_class = WebdataFileSerializer 68 | filter_backends = [ 69 | WasapiAuthJobBackend 70 | # don't need WasapiAuthFilterBackend since we already filtered on the job 71 | ] 72 | pagination_serializer_class = PaginationSerializerOfFiles 73 | paginate_by_param = 'page_size' 74 | paginate_by = 100 75 | max_paginate_by = 2000 76 | 77 | 78 | class WedgeValueIntoObjectField(object): 79 | read_only = False 80 | def __init__(self, value, label): 81 | self.value = value 82 | self.label = label 83 | def initialize(self, parent, field_name): 84 | pass 85 | def field_to_native(self, obj, field_name): 86 | return self.value 87 | -------------------------------------------------------------------------------- /ait-implementation/webdata/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/webdata/__init__.py -------------------------------------------------------------------------------- /ait-implementation/webdata/decorators.py: -------------------------------------------------------------------------------- 1 | # The following 83 lines are copied verbatim from https://www.djangosnippets.org/snippets/243/ 2 | # which is a simple implementation of per-view HTTP Basic Authentication 3 | 4 | import base64 5 | 6 | from django.http import HttpResponse 7 | from django.contrib.auth import authenticate, login 8 | 9 | ############################################################################# 10 | # 11 | def view_or_basicauth(view, request, test_func, realm = "", *args, **kwargs): 12 | """ 13 | This is a helper function used by both 'logged_in_or_basicauth' and 14 | 'has_perm_or_basicauth' that does the nitty of determining if they 15 | are already logged in or if they have provided proper http-authorization 16 | and returning the view if all goes well, otherwise responding with a 401. 17 | """ 18 | if test_func(request.user): 19 | # Already logged in, just return the view. 20 | # 21 | return view(request, *args, **kwargs) 22 | 23 | # They are not logged in. See if they provided login credentials 24 | # 25 | if 'HTTP_AUTHORIZATION' in request.META: 26 | auth = request.META['HTTP_AUTHORIZATION'].split() 27 | if len(auth) == 2: 28 | # NOTE: We are only support basic authentication for now. 29 | # 30 | if auth[0].lower() == "basic": 31 | uname, passwd = base64.b64decode(auth[1].encode('UTF-8')).split(b':') 32 | user = authenticate(username=uname.decode(), password=passwd.decode()) 33 | if user is not None: 34 | if user.is_active: 35 | login(request, user) 36 | request.user = user 37 | return view(request, *args, **kwargs) 38 | 39 | # Either they did not provide an authorization header or 40 | # something in the authorization attempt failed. Send a 401 41 | # back to them to ask them to authenticate. 42 | # 43 | response = HttpResponse() 44 | response.status_code = 401 45 | response['WWW-Authenticate'] = 'Basic realm="%s"' % realm 46 | return response 47 | 48 | ############################################################################# 49 | # 50 | def logged_in_or_basicauth(realm = ""): 51 | """ 52 | A simple decorator that requires a user to be logged in. If they are not 53 | logged in the request is examined for a 'authorization' header. 
54 | 55 | If the header is present it is tested for basic authentication and 56 | the user is logged in with the provided credentials. 57 | 58 | If the header is not present a http 401 is sent back to the 59 | requestor to provide credentials. 60 | 61 | The purpose of this is that in several django projects I have needed 62 | several specific views that need to support basic authentication, yet the 63 | web site as a whole used django's provided authentication. 64 | 65 | The uses for this are for urls that are access programmatically such as 66 | by rss feed readers, yet the view requires a user to be logged in. Many rss 67 | readers support supplying the authentication credentials via http basic 68 | auth (and they do NOT support a redirect to a form where they post a 69 | username/password.) 70 | 71 | Use is simple: 72 | 73 | @logged_in_or_basicauth 74 | def your_view: 75 | ... 76 | 77 | You can provide the name of the realm to ask for authentication within. 78 | """ 79 | def view_decorator(func): 80 | def wrapper(request, *args, **kwargs): 81 | return view_or_basicauth(func, request, 82 | lambda u: u.is_authenticated(), 83 | realm, *args, **kwargs) 84 | return wrapper 85 | return view_decorator 86 | -------------------------------------------------------------------------------- /ait-implementation/webdata/urls.py: -------------------------------------------------------------------------------- 1 | from django.conf.urls import url 2 | from . import views 3 | 4 | urlpatterns = [ 5 | url(r'(?P.*)', views.index, name='index'), 6 | ] 7 | -------------------------------------------------------------------------------- /ait-implementation/webdata/views.py: -------------------------------------------------------------------------------- 1 | import re 2 | from subprocess import Popen, PIPE 3 | from django.shortcuts import render 4 | from django.http import HttpResponse, FileResponse 5 | from django.conf import settings 6 | 7 | from internetarchive import get_session 8 | 9 | from archiveit.archiveit.models import WarcFile, DerivativeFile 10 | from archiveit.webdata.decorators import logged_in_or_basicauth 11 | 12 | VERBOTEN_FILENAMES = re.compile(r'EXTRACTED|EXTRACTION|HISTORICAL') 13 | 14 | @logged_in_or_basicauth() 15 | def index(request, filename): 16 | 17 | # check authorization: 18 | if request.user.is_anonymous(): 19 | # TODO: rfc7235#section-4.1 says to include WWW-Authenticate header 20 | return HttpResponse('You are not authenticated; please log in to download the requested file: %s'%filename, 21 | content_type='text/plain', status=401) 22 | match = re.search(r'^(?:ARCHIVEIT-)?(\d+)-', filename) 23 | if not match: 24 | return HttpResponse( 25 | 'Failed to parse collection id from requested filename: %s'%filename, 26 | content_type='text/plain', status=404) 27 | file_collection_id = int(match.group(1)) 28 | if(VERBOTEN_FILENAMES.match(filename) or 29 | not may_access_collection_id(request, file_collection_id)): 30 | return HttpResponse( 31 | 'You are not authorized to download the requested file: %s'%filename, 32 | content_type='text/plain', status=403) 33 | 34 | # fetch the file's db record: 35 | webdatafile = ( 36 | WarcFile.objects.filter(filename=filename).first() or 37 | DerivativeFile.objects.filter(filename=filename).first() ) 38 | if not webdatafile: 39 | return HttpResponse('404 Not Found', 40 | content_type='text/plain', status=404) 41 | 42 | # get the file's content from somewhere: 43 | stream = ( 44 | webdatafile.pbox_item and 45 | stream_from_pbox(webdatafile.pbox_item, 
filename) or 46 | webdatafile.hdfs_path and 47 | stream_from_hdfs(webdatafile.hdfs_path, filename) ) 48 | if not stream: 49 | return HttpResponse("500 Can't fetch file", 50 | content_type='text/plain', status=500) 51 | 52 | # give it all back to the client: 53 | response = FileResponse(stream) 54 | response['Content-Type'] = 'application/octet-stream' 55 | response['Content-Length'] = webdatafile.size 56 | return response 57 | 58 | def stream_from_pbox(itemname, filename): 59 | # TODO: handle errors etc 60 | archive_session = get_session(config_file=settings.IATOOL_CONFIG_PATH) 61 | item = archive_session.get_item(itemname) 62 | files = item.get_files(filename) 63 | file = files.__next__() 64 | return file.download(return_responses=True) 65 | 66 | def stream_from_hdfs(hdfs_path, filename): 67 | # TODO: consider using python3 snakebite 68 | # TODO: handle errors etc; would be nice to examine returncode 69 | # (but halfway through a big stream is too late to tell the client, right?) 70 | hdfs_cat = Popen([settings.HDFS_EXE, 'dfs', '-cat', hdfs_path], 71 | env=settings.HADOOP_ENV, stdout=PIPE) 72 | return hdfs_cat.stdout 73 | 74 | def may_access_collection_id(request, collection_id): 75 | if request.user.is_superuser: 76 | return True 77 | return collection_id in request.user.account.collection_set.values_list('id', flat=True) 78 | -------------------------------------------------------------------------------- /ait-specification/README.md: -------------------------------------------------------------------------------- 1 | # **Archive-It WASAPI Data Transfer API v1.0** 2 | 3 | 4 | ## Introduction 5 | 6 | This document serves to specify v1.0 of Archive-It's implementation of the WASAPI Data Transfer API. It is intended to document how a client can use the API to find and select web archive files for transfer and to submit jobs for the creation and transfer of derivative web archive files. The API is designed according to the WASAPI data transfer [general specification](https://github.com/WASAPI-Community/data-transfer-apis/tree/master/general-specification). For context, as of June 2017 the Archive-It repository contains over 3,766,068 WARC files, all of which are accessible to the relevant, authenticated Archive-It partners via this API. 7 | 8 | The interface provides two primary services: querying existing files and managing jobs for creating derivative files. The WASAPI data transfer general specification does not mandate how to transfer the webdata files for export, but Archive-It's implementation provides straight-forward HTTPS links. We use the syntax `webdata` file to recognize that the API supports working with both web archive files (WARCs) as well as with derivative files created from WARCs (such as WATs or CDX). 9 | 10 | ## Authentication 11 | 12 | Archive-It restricts access to those clients with an Archive-It account. The WASAPI data transfer general specification allows publicly accessible resources, so Archive-It's implementation will show empty results until you authenticate. You have two options for authentication: 13 | 14 | ### Authentication via browser cookies 15 | 16 | To try some simple queries or manually download your data with a web browser, you can authenticate with cookies in your web browser. 17 | 18 | Point your web browser to `https://partner.archive-it.org/login` and log in to your Archive-It account with your username and password. This will set cookies in your browser for subsequent WASAPI requests and downloading files. 
19 | 20 | ### Authentication via basic access authentication 21 | 22 | For automated scripts, you should use HTTP [basic access authentication](https://en.wikipedia.org/wiki/Basic_access_authentication). 23 | 24 | For example, if your account has username `teddy` and password `schellenberg`, you could use this [cURL](https://curl.haxx.se/) invocation: 25 | 26 | curl --user 'teddy:schellenberg' https://partner.archive-it.org/wasapi/v1/webdata 27 | 28 | ## Querying 29 | 30 | Archive-It's data transfer API implementation lets you identify webdata files via a number of parameters. Start building the URL for your query with `https://partner.archive-it.org/wasapi/v1/webdata`, then append parameters to make your specific query. 31 | 32 | To find all webdata files in your account: 33 | 34 | https://partner.archive-it.org/wasapi/v1/webdata 35 | 36 | ### Overview of Query Parameters 37 | 38 | The basic parameters for querying for webdata files are: 39 | 40 | - `filename`: the exact webdata filename 41 | - `filetype`: the type of webdata file, eg `warc`, `wat`, `cdx` 42 | - `collection`: Archive-It collection identifier 43 | - `crawl`: Archive-It crawl job identifier 44 | - `crawl-time-after` & `crawl-time-before`: date of webdata file creation during a crawl job 45 | - `crawl-start-after` & `crawl-start-before`: date of crawl job start 46 | 47 | ### Query parameters 48 | 49 | #### `filename` query parameter 50 | 51 | The `filename` parameter restricts the query to webdata files whose filename exactly matches the parameter's value. That is, it must match the beginning and end of the filename; the full path of directories is ignored. API v1.0 matches exact filenames only, but later versions will recognize "globbing," i.e. matching with `*` and `?` patterns. 52 | 53 | To find a specific file: 54 | 55 | https://partner.archive-it.org/wasapi/v1/webdata?filename=ARCHIVEIT-8232-WEEKLY-JOB300208-20170513202120098-00001.warc.gz 56 | 57 | #### `filetype` query parameter 58 | 59 | The `filetype` parameter restricts the query to those web archive files with the specified type, such as `warc`, `wat`, `cdx`. API v1.0 supports querying by `warc`; later versions will support querying by derivative formats. 60 | 61 | #### `collection` query parameter 62 | 63 | The `collection` parameter restricts the query to those web archive files within the specified collection. Archive-It users may want to reference the documentation on how to [find your collection's ID number](https://support.archive-it.org/hc/en-us/articles/208000916-Find-your-collection-s-ID-number). 64 | 65 | To find the files from the "Occupy Movement 2011/2012" collection: 66 | 67 | https://partner.archive-it.org/wasapi/v1/webdata?collection=2950 68 | 69 | The API supports multiple `collection` parameters in a query. To find the files from the "Occupy Movement 2011/2012" collection and the "#blacklivesmatter Web Archive" collection: 70 | 71 | https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&collection=4783 72 | 73 | #### `crawl` query parameter 74 | 75 | The `crawl` parameter restricts the query to webdata files within a specified crawl, per the crawl job identifier. Archive-It users may want to reference the documentation on [how to find a crawl ID number](https://support.archive-it.org/hc/en-us/articles/115002803383-Finding-your-crawl-ID-number-). Some older Archive-It WARCs and webdata files lack an associated crawl job ID (and, thus, also an associated `crawl-start-time`).
Efforts are underway to backfill this data, which should alleviate, if not eliminate, the null values for `crawl` for some historical WARCs. If users receive null results for a known `crawl` identifier, they should contact Archive-It support or use other parameters, which are known to be exhaustive historically. 76 | 77 | To find the files from a specific crawl: 78 | 79 | https://partner.archive-it.org/wasapi/v1/webdata?crawl=300208 80 | 81 | #### `crawl-time-after` and `crawl-time-before` query parameters 82 | 83 | The `crawl-time-after` and `crawl-time-before` parameters restrict the query to those web archive files crawled within the given time range; see [time formats](#time-formats) for the syntax. Specify the lower bound (if any) with `crawl-time-after` and the upper bound (if any) with `crawl-time-before`. This field uses the time the WARC file was created, the same timestamp represented in the WARC filename. 84 | 85 | To find the files crawled in the first quarter of 2016: 86 | 87 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-after=2015-12-31&crawl-time-before=2016-04-01 88 | 89 | To find all files crawled since 2016: 90 | 91 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-after=2016-01-01 92 | 93 | To find all files crawled prior to 2014: 94 | 95 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-before=2014-01-01 96 | 97 | #### `crawl-start-after` and `crawl-start-before` query parameters 98 | 99 | The `crawl-start-after` and `crawl-start-before` parameters restrict the query to those web archive files gathered from crawl jobs that started within the given time range; see [time formats](#time-formats) for the syntax. They reference the crawl job start date (in contrast to `crawl-time-after` and `-before`, which relate to the individual WARC file creation date). Specify the lower bound (if any) with `crawl-start-after` and the upper bound (if any) with `crawl-start-before`. Since `crawl-start` is associated with the `crawl` parameter, the above caveats apply in that some older Archive-It WARCs and web archive files will lack an associated `crawl-start`. Efforts are underway to backfill this data; in the meantime, contact Archive-It support or use other parameters, which are known to be exhaustive historically. 100 | 101 | To find the files from a Q1 2016 crawl: 102 | 103 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-start-after=2015-12-31&crawl-start-before=2016-04-01 104 | 105 | To find all files from crawls started since 2016: 106 | 107 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-start-after=2016-01-01 108 | 109 | #### Pagination parameters 110 | 111 | The [parameters for pagination](#parameters-for-pagination) apply to queries. 112 | 113 | ### Query results 114 | 115 | The response to a query is a JSON object with [fields for pagination](#fields-for-pagination), an `includes-extra` field, a `request-url` field, and the result in the `files` field. 116 | 117 | The `count` field represents the total number of web archive files corresponding to the query. 118 | 119 | The `includes-extra` field is currently always false in the API v1.0, as all query parameters return exact matches and the data in the `files` contains nothing beyond what is necessary to satisfy the query or job. The `includes-extra` field is mandated by the general specification because some implementations may return results that include webdata files containing content beyond the specific query.
For instance, were `url` a query parameter, a request by URL could return results that contain webdata files (i.e. WARCs) that contain data from that URL as well as data from other URLs, due to the way crawlers write WARC files. When Archive-It (or other implementations) supports these type queries, `includes-extra` could have a true value to indicate that the referenced `files` may contain data outside the specific query. 120 | 121 | The `request-url` field represents the submitted query URL. 122 | 123 | The `files` field is a list of a subset (check the [pagination fields](#fields-for-pagination)) of the results of the query, with each webdata file represented by a JSON object with the following keys: 124 | 125 | - `account`: the numeric Archive-It account identifier 126 | 127 | - `checksums`: an object with `md5` and `sha1` keys and hexadecimal values of 128 | the webdata file's checksums 129 | 130 | - `collection`: the numeric Archive-It identifier of the collection that includes the 131 | webdata file 132 | 133 | - `crawl`: the numeric Archive-It identifier of the crawl that created the webdata file 134 | 135 | - `crawl-start`: an optional RFC3339 date stamp of the time the crawl job started 136 | 137 | - `crawl-time`: an RFC3339 date stamp of the time the webdata file was 138 | [created](#crawl-time-after-and-crawl-time-before-query-parameters) 139 | 140 | - `filename`: the name of the webdata file (without any path of directories) 141 | 142 | - `filetype`: the format of the webdata file, eg `warc`, `wat`, `wane`, `cdx` 143 | 144 | - `locations`: a list of sources from which to retrieve the webdata file 145 | 146 | - `size`: the size in bytes of the webdata file 147 | 148 | For example: 149 | 150 | { 151 | "count": 601, 152 | "includes-extra": false, 153 | "next": "https://partner.archive-it.org/wasapi/v1/webdata?collection=8232&page=2", 154 | "previous": null, 155 | "files": [ 156 | { 157 | "account": 89, 158 | "checksums": { 159 | "md5": "073f2a905ce23462204606329ca545c3", 160 | "sha1": "1b796f61dc22f2ca246fa7055e97cd25341bfe98" 161 | }, 162 | "collection": 8232, 163 | "crawl": 304244, 164 | "crawl-start": "2017-05-31T22:15:34Z", 165 | "crawl-time": "2017-05-31T22:15:40Z", 166 | "filename": "ARCHIVEIT-8232-WEEKLY-JOB304244-20170531221540622-00000.warc.gz", 167 | "filetype": "warc", 168 | "locations": [ 169 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-8232-WEEKLY-JOB304244-20170531221540622-00000.warc.gz" 170 | ], 171 | "size": 1000000858 172 | }, 173 | { 174 | "account": 89, 175 | "checksums": { 176 | "md5": "610e1849cfc2ad692773348dd34697b4", 177 | "sha1": "9048d063a9adaf606e1ec2321cde3a29a1ee6490" 178 | }, 179 | "collection": 8232, 180 | "crawl": 303042, 181 | "crawl-start": "2017-05-24T22:15:36Z", 182 | "crawl-time": "2017-05-26T17:51:37Z", 183 | "filename": "ARCHIVEIT-8232-WEEKLY-JOB303042-20170526175137981-00002.warc.gz", 184 | "filetype": "warc", 185 | "locations": [ 186 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-8232-WEEKLY-JOB303042-20170526175137981-00002.warc.gz" 187 | ], 188 | "size": 40723812 189 | }, 190 | [ ... ] 191 | ] 192 | } 193 | 194 | 195 | ## Jobs 196 | 197 | The Archive-It data transfer API allows users to submit "jobs" for the creation of derivative files from existing resources. This serves the broader goal of WASAPI data transfer APIs to facilitate use of web archives in data-driven scholarship, research and computational analysis, and to support use, and transport, of files derived from WARCs and original archival web data. 
The Archive-It WASAPI data transfer API v1.0 allows an Archive-It user or approved researcher to: 198 | 199 | - Submit a query and be returned a results list of webdata files 200 | - Submit a job to derive different types of datasets from that results list 201 | - Receive a job submission token and job submission status 202 | - Poll the API for current job status 203 | - Upon job completion, get a results list of the generated derived webdata files 204 | 205 | ### Submitting a new job 206 | 207 | Submit a new job with an HTTP POST to `https://partner.archive-it.org/wasapi/v1/jobs`. 208 | 209 | Select a `function` from those supported. The Archive-It API v1.0 currently supports creation of three types of derivative datasets, all of which have a one-to-one correspondence with WARC files. Future development will allow for job submission for original datasets. The current job `function` list: 210 | 211 | - `build-wat`: build a WAT (Web Archive Transformation) file from the matched web archive files 212 | 213 | - `build-wane`: build a WANE (Web Archive Named Entities) file from the matched web archive files 214 | 215 | - `build-cdx`: build a CDX (Capture Index) file from the matched web archive files 216 | 217 | For more on WATs and WANEs, see their description at [Archive-It Research Services](https://webarchive.jira.com/wiki/display/ARS/Archive-It+Research+Services). For more on CDX, see the documentation for the [CDX Server API](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md). 218 | 219 | Build an appropriate `query` in the same manner as for the [`/webdata` endpoint](#query-parameters). 220 | 221 | For example, to build WAT files from the WARCs in collection 4783 and crawled in 2016: 222 | 223 | curl --user 'teddy:schellenberg' -H 'Content-Type: application/json' -d '{"function": "build-wat","query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01"}' https://partner.archive-it.org/wasapi/v1/jobs 224 | 225 | If all goes well, the server will record the job, set its `submit-time` to the current time and its `state` to `queued`, and return a `201 Created` response, including a `jobtoken` which can be used to [check its 226 | status](#checking-the-status-of-a-job) later: 227 | 228 | { 229 | "account": 89, 230 | "function": "build-wat", 231 | "jobtoken": "136", 232 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 233 | "state": "queued", 234 | "submit-time": "2017-06-03T22:49:13.869698Z", 235 | "termination-time": null 236 | } 237 | 238 | If you want to match everything, you must still provide an explicit empty string for the query parameter. For example, to build a CDX index of all your resources: 239 | 240 | curl --user 'teddy:schellenberg' -H 'Content-Type: application/json' -d '{"function":"build-cdx","query":""}' https://partner.archive-it.org/wasapi/v1/jobs 241 | 242 | ### Checking the status of a job 243 | 244 | To check the [state](#states-of-a-job) of your job, build a URL by appending its job token to `https://partner.archive-it.org/wasapi/v1/jobs/`. For example: 245 | 246 | curl --user 'teddy:schellenberg' https://partner.archive-it.org/wasapi/v1/jobs/136 247 | 248 | Immediately after submitting it, the job will be in the `queued` `state`, and the response will be the same as the response to the submission.
Once Archive-It starts running the job, its `state` will change, for example: 249 | 250 | { 251 | "account": 89, 252 | "function": "build-wat", 253 | "jobtoken": "136", 254 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 255 | "state": "running", 256 | "submit-time": "2017-06-03T22:49:13Z", 257 | "termination-time": null 258 | } 259 | 260 | And when it is `complete`, the `termination-time` will be set with the time: 261 | 262 | { 263 | "account": 89, 264 | "function": "build-wat", 265 | "jobtoken": "136", 266 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 267 | "state": "complete", 268 | "submit-time": "2017-06-03T22:49:13Z", 269 | "termination-time": "2017-06-06T01:37:54Z" 270 | } 271 | 272 | You can also check the [states](#states-of-a-job) of all your jobs at `https://partner.archive-it.org/wasapi/v1/jobs`, which is [paginated](#pagination). For example: 273 | 274 | { 275 | "count": 16, 276 | "next": "http://partner.archive-it.org/wasapi/v1/jobs?page_size=10&page=2", 277 | "previous": null, 278 | "jobs": [ 279 | { 280 | "account": 89, 281 | "function": "build-cdx", 282 | "jobtoken": "137", 283 | "query": "", 284 | "state": "running", 285 | "submit-time": "2017-06-03T23:55:51Z", 286 | "termination-time": null 287 | }, 288 | { 289 | "account": 89, 290 | "function": "build-wat", 291 | "jobtoken": "136", 292 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 293 | "state": "complete", 294 | "submit-time": "2017-06-03T22:49:13Z", 295 | "termination-time": "2017-06-06T01:37:54Z" 296 | }, 297 | [ ... ] 298 | ] 299 | } 300 | 301 | ### Checking the result of a failed job 302 | 303 | If your job has a `failed` `state`, build a URL of the form `https://partner.archive-it.org/wasapi/v1/jobs/{jobtoken}/error`. This is in development and not currently implemented. 304 | 305 | ### Checking the result of a complete job 306 | 307 | To retrieve the result of your `complete` job, build a URL of the form `https://partner.archive-it.org/wasapi/v1/jobs/{jobtoken}/result`. The response is similar to [results of a query](#query-results). For example: 308 | 309 | { 310 | "count": 4, 311 | "next": null, 312 | "previous": null, 313 | "files": [ 314 | { 315 | "account": 89, 316 | "checksums": { 317 | "md5": "11a0ddb3575da3b9f6dd9dff665ce181", 318 | "sha1": "0b2a17969b8b45fc14e41441c1ecc7afcf974150" 319 | }, 320 | "collection": 4783, 321 | "crawl": 16473, 322 | "crawl-start": "2016-05-12T15:05:31Z", 323 | "crawl-time": "2016-05-12T15:05:36Z", 324 | "filename": "ARCHIVEIT-4783-TEST-JOB16473-20160512150536534-00000_warc.wat.gz", 325 | "filetype": "wat", 326 | "locations": [ 327 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-4783-TEST-JOB16473-20160512150536534-00000_warc.wat.gz" 328 | ], 329 | "size": 8016108 330 | }, 331 | { 332 | "account": 89, 333 | "checksums": { 334 | "md5": "f762e933a3fd412325e6497457ea2be0", 335 | "sha1": "08beda59a9b6df9a26ea4783f69d92fd1d1ba5c2" 336 | }, 337 | "collection": 4783, 338 | "crawl": 16473, 339 | "crawl-start": "2016-05-12T15:05:31Z", 340 | "crawl-time": "2016-05-12T15:05:36Z", 341 | "filename": "ARCHIVEIT-4783-CRAWL_SELECTED_SEEDS-JOB16472-20160512144021684-00000_warc.wat.gz", 342 | "filetype": "wat", 343 | "locations": [ 344 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-4783-CRAWL_SELECTED_SEEDS-JOB16472-20160512144021684-00000_warc.wat.gz" 345 | ], 346 | "size": 149888 347 | }, 348 | [ ...
] 349 | ] 350 | } 351 | 352 | 353 | ## Common WASAPI infrastructure 354 | 355 | ### Pagination 356 | 357 | Results of queries and lists of jobs are paginated. The full results may fit on one page (especially if you set `page_size=2000`), but the pagination fields are always present. You needn't manipulate the `page` parameter directly: after your first request with no `page` parameter, you should iteratively follow non-null `next` links to fetch the full results. 358 | 359 | #### Fields for pagination 360 | 361 | The top-level JSON object of the response includes pagination information with the following keys: 362 | 363 | - `count`: The number of items in the full result (files or jobs, across all 364 | pages) 365 | 366 | - `previous`: Link (if any) to the previous page of items; otherwise null 367 | 368 | - `next`: Link (if any) to the next page of items; otherwise null 369 | 370 | #### Parameters for pagination 371 | 372 | ##### `page` query parameter 373 | 374 | The `page` parameter requests a specific page of the full result. It defaults to 1, giving the first page. 375 | 376 | ##### `page_size` query parameter 377 | 378 | The `page_size` parameter sets the size of each page. It defaults to 100 and has a maximum value of 2000. 379 | 380 | ### Time formats 381 | 382 | Date and time parameters should satisfy RFC3339, eg `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS`, but Archive-It also recognizes abbreviations like `YYYY-MM` or `YYYY`, which are interpreted as the first of the month or year. We recommend using [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time); the implementation recognizes a trailing `Z` as well as an explicit timezone offset. 383 | 384 | Formats that work: 385 | - `2017-01-01` 386 | - `2017-01-01T12:34:56` 387 | - `2017-01-01 12:34:56` 388 | - `2017-01-01T12:34:56Z` 389 | - `2017-01-01 12:34:56-0700` 390 | - `2017` 391 | - `2017-01` 392 | 393 | ## Recipes and other resources 394 | 395 | Archive-It is in the midst of creating a recipe book of sample API queries. Both Archive-It and WASAPI grant partners are also creating a number of local utilities for working with this API and implementing it in preservation and research workflows. These utilities will also be posted in this GitHub account for public reference. Stanford has created a number of demonstration videos outlining their tool development for using this API to ingest their Archive-It WARCs into their preservation repository. These can be seen in the [WASAPI collection](https://archive.org/details/wasapi) in the Internet Archive and on Stanford Libraries' [YouTube channel](https://www.youtube.com/channel/UCc2CQuHkhKGZ-2ZLTZVGE2A). 396 | 397 | For Archive-It's proposed changes to the WASAPI data transfer API general specification and other build details, visit the [Archive-It implementation repository](https://github.com/WASAPI-Community/data-transfer-apis/tree/master/ait-implementation).
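In the meantime, the sketch below strings together the pieces documented above (basic access authentication, a `/webdata` query, pagination via `next`, download from `locations`, and checksum verification) into a minimal download loop. It is illustrative only: the credentials and collection identifier are placeholders from the earlier examples, it assumes the download locations accept the same credentials, and error handling is omitted.

    import hashlib
    import requests

    AUTH = ("teddy", "schellenberg")   # placeholder credentials
    url = "https://partner.archive-it.org/wasapi/v1/webdata?collection=2950"

    while url:
        page = requests.get(url, auth=AUTH).json()
        for f in page["files"]:
            # Download from the first listed location and verify the md5 checksum.
            with requests.get(f["locations"][0], auth=AUTH, stream=True) as download:
                download.raise_for_status()
                digest = hashlib.md5()
                with open(f["filename"], "wb") as out:
                    for chunk in download.iter_content(chunk_size=1024 * 1024):
                        out.write(chunk)
                        digest.update(chunk)
            if digest.hexdigest() != f["checksums"]["md5"]:
                print("checksum mismatch:", f["filename"])
        url = page["next"]   # follow pagination links until exhausted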
398 | 399 | ## Contacts 400 | 401 | *Archive-It (Internet Archive)* 402 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 403 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 404 | -------------------------------------------------------------------------------- /ait-specification/transfer_api_archive-it_v1.yaml: -------------------------------------------------------------------------------- 1 | ../ait-implementation/wasapi/swagger.yaml -------------------------------------------------------------------------------- /general-specification/README.md: -------------------------------------------------------------------------------- 1 | ## **WASAPI Data Transfer API General Specification v0.1** 2 | 3 | **Introduction** 4 | 5 | This document serves to outline v0.1 of the Web Archive Data Export API developed as part of the Web Archiving Systems API (WASAPI) project of the [Institute of Museum and Library Services](https://www.imls.gov/)-funded [National Leadership Grant, LG-71-15-0174](https://www.imls.gov/grants/available/national-leadership-grants-libraries), "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf)" (PDF). Primary development of this API specification as well as the creation of multiple reference implementations is being led by Internet Archive (Archive-It) and Stanford University Libraries (DLSS and LOCKSS). Other project partners are University of North Texas, Rutgers University, a Technical Working Group, and an Advisory Board. More information on the WASAPI project can be found at the [WASAPI-Community GitHub workspace](https://github.com/WASAPI-Community) the [WASAPI-community Google Group](https://groups.google.com/forum/#!forum/wasapi-community), and on the [WASAPI Slack](https://docs.google.com/forms/d/e/1FAIpQLScsdTqssLrM9FinmpP8Mow2Hl8zJnfJZfjWxaeXddlvu2VjBw/viewform) channel. 6 | 7 | **General Usage** 8 | 9 | This is a **_generalized_** specification representing only a **_minimum_** set of requirements for development of APIs to facilitate the transfer of archived web data between custodians, systems, repositories, end users, and others. This API focuses on the transfer of WARC and WARC-derived web data and aims to standardize vocabularies and features to inform those institutions, services, and developers building reference implementations local to their organizations. The basic purpose of this API is to return a results list of WARC files and derivative files originating from WARCs and corresponding essential metadata related to location/transfers in response to a user-defined query that includes implementation-specific parameters. The specification includes the notion of a "job" i.e. the ability to submit a job, receive a job token, and track a job status for the creation of derivative web data files that need to be generated locally and will be available via the API upon job completion. This allows the API to meet the intended goals of supporting both WARC file transfer for preservation replication as well as the ability to allow for derivative datasets to be delivered to researchers and users. 10 | 11 | **Assumptions & Exclusions** 12 | 13 | * Implementation APIs built using this specification should be RESTful. 14 | * Implementation APIs should, at minimum, produce application/json. 15 | * Implementation APIs must support GET and POST. 
16 | * This specification does not cover authentication and access control, which are considered to be institution and implementation specific. 17 | * This specification abstracts away institution-specific details in many areas to remain generalizable and minimal. Additional paths, methods, and functions can be added in implementations as desired. 18 | * The specification allows for both the return of results considered "complete" and “includes extra.” This allows requesters to know whether the returned results fully meet the original query or whether additional data is contained in the file list returned. Details are below. 19 | 20 | **Issues for General Discussion** 21 | 22 | * The "transfer confirmation" functionality originally proposed by the development team was dropped. This functionality was intended to verify a successful transfer once the transfer was complete. It seemed too challenging to force every implementation to develop and support this as a bare minimum, as it could get technically complicated. Implementations can still build it into their API. To facilitate the ability to confirm a transfer, checksum was made a required return value for **/webdata**. 23 | * A true/false result for "includes-extra" will be part of all **/webdata** query returns to denote the “inclusive v. exclusive” issues discussed in design meetings. 24 | * Some questions/decisions remained around whether a "filename" should match directories in a “full path,” and if so, how “\*” wildcard/globs should match directory separators, for instance, *\/webdata?filename=ARCHIVEIT-1234-2016\*.warc.gz. For now, we suggest an implementation may support searching the full path only as an extension to the set of query parameters. 25 | * There was debate around how APIs should define and document themselves. The original idea was to have a **/registry** path under each main path that would return implementation-specific information, such as parameters to **/webdata** and functions to **/jobs**. Alternately, the base path could simply return a Swagger YAML file that defines the full API. This remains open for discussion and may ultimately be an implementation detail. The WASAPI team will decide where this information lives and what is required of implementations (if anything). 26 | * We should determine the right way of specifying compression along with the format of files. 27 | * Are we offering too much flexibility with WebdataMenu and WebdataBundles and the multiple "locations" of a WebdataFile? How frequently do implementations offer mirrored files versus multiple transport methods for the same webdata? What is the level of requirement for granularity here? 28 | 29 | **Paths & Examples** 30 | 31 | * **/webdata** 32 | 33 | *(example: https://partner.archive-it.org/v0/export/api/webdata)* 34 | 35 | The most basic query using the **/webdata** path returns a list of all web data files on the server which are available to the client, basic metadata about those files, and their download information. Parameters to modify **/webdata** will be determined by institutions building their own implementations. Potential parameters can be as simple as */webdata?directoryName=[name]* or can support an extensive list of parameters to modify a request. Examples of possible parameters could include those defining identifiers for things like account, collection, seed, crawl job, harvest event, session, date range, archival identifier, administrative unit, repository, bucket, and more.
All institution-specific query filters and modifiers should be parameters to the **/webdata** path. 36 | 37 | * **Example queries and results** 38 | 39 | *https://partner.archive-it.org/v0/export/api/webdata?filename=2016-08-30-blah.warc.gz* 40 | 41 | The above query would return a list of a single WARC file (though it may be available via multiple transports). 42 | 43 | ``` 44 | { 45 | "includes-extra": false, 46 | "files": [ 47 | { 48 | "checksum": "md5:b1c3cd...57; sha1:011c65...a7", 49 | "content-type": "application/warc", 50 | "filename": "2016-08-30-blah.warc.gz", 51 | "locations": [ 52 | "http://archive-it.org/.../2016-08-30-blaugh.warc.gz", 53 | "gridftp globus://...", 54 | "ipfs/Qmbeef0484098..." 55 | ] 56 | } 57 | ] 58 | } 59 | ``` 60 | 61 | *https://partner.archive-it.org/v0/export/api/webdata?acccountId=123&collectionId=456&startDate=01012014&endDate=12312015* 62 | 63 | The above query would return a list of all the WARCs (with metadata and download links) from between January 1, 2014 and December 31, 2015 for Account 123, Collection 456. 64 | 65 | ``` 66 | { 67 | "includes-extra": false, 68 | "files": [ 69 | { 70 | "checksum": "md5:beefface09384509", 71 | "content-type": "application/warc", 72 | "filename": "2014-01-01-blah.warc.gz", 73 | "locations": [ 74 | "http://archive-it.org/.../2014-01-01-blah.warc.gz", 75 | "/ipfs/Qmde62f92ea12c42dc0b0c0ab3952e52e1" 76 | ] 77 | }, 78 | { 79 | "checksum": "md5:beefface09384510", 80 | "content-type": "application/warc", 81 | "filename": "2014-01-02-blah.warc.gz", 82 | "locations": [ 83 | "http://archive.org/.../2014-01-02-blah.warc.gz", 84 | "/ipfs/Qmbda3f7abccdad41977fb308453566f84" 85 | ] 86 | } 87 | ] 88 | } 89 | ``` 90 | 91 | * **/jobs** 92 | 93 | *(example: https://partner.archive-it.org/v0/export/api/jobs)* 94 | 95 | The **/jobs** path shows the jobs on this server accessible to the client. This enables the request and delivery of WARC derivative webdata files. The **/jobs** path supports GET and POST methods. Implementations that do not include the ability to submit a job should still support this path and simply return that no jobs are possible for the client on this server. 96 | 97 | * **Example queries and results** 98 | 99 | GET *https://partner.archive-it.org/v0/export/api/jobs* 100 | 101 | Results here depend on whether jobs have been submitted. If no jobs have been submitted, you get an empty list. If you have submitted jobs, you get something similar to the below. 102 | 103 | ``` 104 | [ 105 | { 106 | "jobtoken": "21EC2020-08002B30309D", 107 | "function": "build-wat", 108 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 109 | "submit-time": "2016-08-30Z15:52:53", 110 | "state": "complete", 111 | "result": [ 112 | [ 113 | { 114 | "checksum": "md5:beefface09384509", 115 | "content-type": "application/wat", 116 | "filename": "2014-01-01-blah.wat.gz", 117 | "locations": [ 118 | "http://archive-it.org/.../2014-01-01-blah.wat.gz" 119 | ] 120 | }, 121 | { 122 | "checksum": "md5:beefface09384510", 123 | "content-type": "application/wat", 124 | "filename": "2014-01-02-blah.wat.gz", 125 | "locations": [ 126 | "http://archive-it.org/.../2014-01-02-blah.wat.gz" 127 | ] 128 | } 129 | ] 130 | ] 131 | } 132 | ] 133 | ``` 134 | 135 | POST *https://partner.archive-it.org/v0/export/api/jobs* 136 | 137 | The POST method includes a string matching a **/webdata** query string plus an implementation-specific function available to the client. 
In this specification, POST requests remain a bit of an abstraction, as they are dependent upon the implementation-specific parameters supported under **/webdata**. 138 | 139 | Using the previous **/webdata** example, the below POST request would return a job token for creating WATs for WARC files matching that **/webdata** query: 140 | 141 | *https://partner.archive-it.org/v0/export/api/jobs?acccountId=123&collectionId=456&startDate=2014&endDate=2015&function=build-wat* 142 | 143 | ``` 144 | { 145 | "jobtoken": "21EC2020-08002B30309D", 146 | "function": "build-wat", 147 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 148 | "submit-time": "2016-08-30Z15:52:53", 149 | "state": "queued" 150 | } 151 | ``` 152 | 153 | * **/jobs/{jobToken}** 154 | 155 | *(example: https://partner.archive-it.org/v0/export/api/jobs/123456)* 156 | 157 | The **/jobs/{jobToken}** path returns the status of a submitted job. 158 | 159 | * **Example queries and results** 160 | 161 | GET *https://partner.archive-it.org/v0/export/api/jobs/21EC2020-08002B30309D* 162 | 163 | Retrieve status for a submitted job, some metadata, including the original query, time it was requested, etc. Includes results list if job is finished. Results are not necessarily available indefinitely. May return "410 Gone" if derivatives generated by this job have been replaced (e.g. by the results of a newer job), or if job has been expired by some other policy. An implementation may (but is not required to) make results later available under /webdata queries. 164 | 165 | ``` 166 | { 167 | "jobtoken": "21EC2020-08002B30309D", 168 | "function": "build-wat", 169 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 170 | "submit-time": "2016-08-30Z15:52:53", 171 | "state": "complete", 172 | "result": [ 173 | [ 174 | { 175 | "checksum": "md5:beefface09384509", 176 | "content-type": "application/wat", 177 | "filename": "2014-01-01-blah.wat.gz", 178 | "locations": [ 179 | "http://archive-it.org/.../2014-01-01-blah.wat.gz" 180 | ] 181 | }, 182 | { 183 | "checksum": "md5:beefface09384510", 184 | "content-type": "application/wat", 185 | "filename": "2014-01-02-blah.wat.gz", 186 | "locations": [ 187 | "http://archive-it.org/.../2014-01-02-blah.wat.gz" 188 | ] 189 | } 190 | ] 191 | ] 192 | } 193 | ``` 194 | 195 | **Additional Definitions** 196 | 197 | The result of a /webdata query or result of a job can be represented in multiple formats and offered via multiple transports. To express this and allow the client to select the most appropriate, an implementation includes a "WebdataMenu" in each result. A WebdataMenu offers a number of “WebdataBundles”, each of which provide the complete result with a distinct format and transport. Each WebdataBundle contains one or more WebdataFiles. The client chooses a WebdataBundle with appropriate format and transport. 198 | 199 | Here’s an example of a single WebdataMenu which contains two WebdataBundles. The first WebdataBundle contains three WebdataFiles; the second contains one. 
200 | 201 | ``` 202 | [ 203 | [ 'http://partner.archive-it.org/.../2016-08-30-blah.warc.gz', 204 | 'http://partner.archive-it.org/.../2016-08-30-blah1.warc.gz', 205 | 'http://partner.archive-it.org/.../2016-08-30-blah2.warc.gz' 206 | ], 207 | [ 'ipfs/Qm67e26534d15bc305340ce4b2e5944ffc' ] 208 | ] 209 | ``` 210 | 211 | **Timeline & Contacts** 212 | 213 | This document and the accompanying Swagger .yaml file were shared across the primary development project team for comment and input in early September 2016. The document will be shared in late September with the full grant team, Technical Working Group, and program managers and engineers of the web archiving community attending the IIPC Steering Committee meeting and Crawler Hackathon at British Library the week of September 19, 2016. After a period of comments, the spec and doc will be shared with the full web archiving community for additional feedback. Reference implementations of the specification will be developed by Internet Archive (Archive-It) and Stanford (LOCKSS) in Q4 of 2016. Testing, iterative development, and other ongoing activities will take place in 2017. 214 | 215 | *Internet Archive (Archive-It)* 216 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 217 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 218 | 219 | *Stanford University Libraries (DLSS & LOCKSS)* 220 | * Nicholas Taylor, Web Archiving Service Manager, ntay@stanford.edu 221 | * David Rosenthal, LOCKSS Chief Information Scientist, dshr@stanford.edu 222 | -------------------------------------------------------------------------------- /general-specification/transfer_api_v1.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: WASAPI Export API 4 | description: > 5 | WASAPI Export API. A draft of the minimum that a Web Archiving Systems API 6 | server must implement. 7 | version: 0.1.0 8 | contact: 9 | name: Jefferson Bailey and Mark Sullivan 10 | url: https://github.com/WASAPI-Community/data-transfer-apis 11 | license: 12 | name: Apache 2.0 13 | url: http://www.apache.org/licenses/LICENSE-2.0.html 14 | consumes: 15 | - application/json 16 | produces: 17 | - application/json 18 | basePath: /v0 19 | schemes: 20 | - https 21 | paths: 22 | /webdata: 23 | get: 24 | parameters: 25 | - name: filename 26 | in: query 27 | type: string 28 | description: > 29 | A semicolon-separated list of globs. In each glob, a star `*` 30 | matches any string of characters, and a question mark `?` matches 31 | exactly one character. Are the globs matched against the full 32 | pathname (ie with directories) vs just the basename?, and if 33 | pathname, is the slash `/` specially matched (cf `**`)?
34 | - name: content-type 35 | in: query 36 | type: string 37 | description: A semicolon-separated list of acceptable MIME-types 38 | responses: 39 | '200': 40 | description: Success 41 | schema: 42 | type: object 43 | properties: 44 | includes-extra: 45 | type: boolean 46 | files: 47 | $ref: '#/definitions/WebdataMenu' 48 | '400': 49 | description: The request could not be interpreted 50 | '401': 51 | description: The Request was unauthorized 52 | /jobs: 53 | get: 54 | summary: Show the jobs on this server accessible to the client 55 | responses: 56 | '200': 57 | description: The list of jobs 58 | schema: 59 | type: array 60 | items: 61 | $ref: '#/definitions/Job' 62 | post: 63 | parameters: 64 | - name: query 65 | in: formData 66 | description: URL-encoded query as appropriate for /webdata end-point 67 | type: string 68 | - name: function 69 | in: formData 70 | description: > 71 | An implementation-specific identifier for some function the 72 | implementation supports 73 | type: string 74 | - name: parameters 75 | in: formData 76 | description: > 77 | Other parameters specific to the function and implementation 78 | (URL-encoded). For example: level of compression, priority, time 79 | limit, space limit. 80 | type: string 81 | responses: 82 | '201': 83 | description: > 84 | Job was successfully submitted. Body is the submitted job. 85 | schema: 86 | $ref: '#/definitions/Job' 87 | '400': 88 | description: The request could not be interpreted 89 | '401': 90 | description: The Request was unauthorized 91 | '/jobs/{jobToken}': 92 | get: 93 | summary: Retrieve status for job 94 | parameters: 95 | - name: jobToken 96 | in: path 97 | description: The job token returned from previous request 98 | required: true 99 | type: string 100 | responses: 101 | '200': 102 | description: Success 103 | schema: 104 | $ref: '#/definitions/Job' 105 | '400': 106 | description: The request could not be interpreted 107 | '401': 108 | description: The Request was unauthorized 109 | '403': 110 | description: Forbidden 111 | '404': 112 | description: No such job 113 | '410': 114 | description: > 115 | Gone / invalidated. Body may include non-result information about 116 | the job. 117 | definitions: 118 | WebdataFile: 119 | description: > 120 | The unit of distribution of web archival data. Examples: a WARC file, 121 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 
122 | type: object 123 | required: 124 | - filename 125 | - checksum 126 | - content-type 127 | - locations 128 | properties: 129 | filename: 130 | type: string 131 | description: The name of the webdata file 132 | content-type: 133 | # TODO: handle compression etc 134 | type: string 135 | description: > 136 | The MIME-type for the webdata file, eg `application/warc`, 137 | `application/pdf` 138 | checksum: 139 | type: string 140 | description: > 141 | Checksum for the webdata file, eg "sha1:beefface09781234897", 142 | "md5:dad0dada09823098" 143 | size: 144 | type: integer 145 | format: int64 146 | description: The size in bytes of the webdata file 147 | locations: 148 | type: array 149 | items: 150 | type: string 151 | format: url 152 | description: > 153 | A list of (mirrored) sources from which to retrieve (identical copies 154 | of) the webdata file, eg "http://archive.org/...", 155 | "/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e" 156 | WebdataBundle: 157 | description: > 158 | A "bundle" of webdata files that together satisfy a query, job, etc 159 | type: array 160 | items: 161 | $ref: '#/definitions/WebdataFile' 162 | WebdataMenu: 163 | description: > 164 | A set of alternative webdata bundles, each of which satisfies a given 165 | query, job, etc. An implementation may offer a different bundle (with 166 | differing number of webdata files) for each of its available transports, 167 | etc. 168 | type: array 169 | items: 170 | $ref: '#/definitions/WebdataBundle' 171 | Job: 172 | type: object 173 | description: A submitted job with optional results 174 | required: 175 | - jobtoken 176 | - function 177 | - query 178 | - submit-time 179 | - state 180 | properties: 181 | jobtoken: 182 | type: string 183 | description: > 184 | Identifier unique across the implementation. The implementation 185 | chooses the format. For example: GUID, increasing integer. 186 | function: 187 | type: string # enum 188 | description: eg `build-WAT`, `build-index` 189 | query: 190 | type: string 191 | description: > 192 | The specification of what webdata to include in the job. Encoding is 193 | URL-style, eg `param=value&otherparam=othervalue`. 194 | submit-time: 195 | type: string 196 | format: date-time 197 | description: Time of submission, formatted according to RFC3339 198 | state: 199 | type: string # enum 200 | description: > 201 | Implementation-defined, eg `queued`, `running`, `failed`, `complete`, 202 | `gone` 203 | result: 204 | allOf: 205 | - description: > 206 | This property indicates whether the job has completed (without 207 | having been cleaned away). When present, it is a list of URLs to 208 | webdata files. Should its absense be expressed as omission vs 209 | null/undef/etc vs empty list?, and how do we write that in swagger? 210 | - $ref: '#/definitions/WebdataMenu' 211 | -------------------------------------------------------------------------------- /lockss-implementation/README.md: -------------------------------------------------------------------------------- 1 | # LOCKSS WASAPI implementation 2 | 3 | Code related to the LOCKSS implementation of the WASAPI general specification. 4 | 5 | * The default_controller.py generated from the WASAPI API spec by FLASK as modified to interface to the LOCKSS daemon's SOAP-y export API. 6 | * A minimal WASAPI client just sufficient to test the server. 7 | 8 | Note that the server has XXX comments mostly about mismatches between the LOCKSS SOAP-y API and WASAPI. 
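For orientation, the request that the minimal test client issues boils down to the following sketch; the service URL, the `auid` function, and the test AUID are the defaults hard-coded in `wasapi-test.py` for a local test daemon, not a stable public interface.

    import requests

    # Defaults taken from wasapi-test.py: a local server and the simulated-content AUID.
    SERVICE = "http://localhost:8080/v0"
    TEST_AUID = "org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F"

    response = requests.post(SERVICE + "/jobs",
                             data={"function": "auid", "query": TEST_AUID})
    print(response.status_code)
    # Job creation is synchronous in this implementation, so the body already
    # describes the exported WARC (filename, location, size, and sha1 checksum).
    print(response.json())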
9 | 10 | This works, in that a LOCKSS daemon in our test framework can be run to create an AU full of synthetic content, export it via WASAPI and HTTP, and verify the SHA1 of the result. Not tested at scale yet. Will need fixing when some suggested changes are made to the LOCKSS daemon. 11 | -------------------------------------------------------------------------------- /lockss-implementation/default_controller.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os, sys 3 | import suds 4 | from suds.client import Client 5 | import json 6 | from datetime import datetime, timezone 7 | from base64 import b64decode 8 | import hashlib 9 | 10 | persistentFile = '/tmp/wsapiJobs' 11 | testAUID = 'org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F' 12 | # LOCKSS SOAP export web service 13 | # NB - org.lockss.export.enabled=true must be set when the daemon starts 14 | # for the returned URL to be valid. 15 | # XXX these should be arguments 16 | host = 'localhost:8081' 17 | url = 'http://'+host+'/ws/ExportService?wsdl' 18 | user = 'lockss-u' 19 | pwd4lockss = 'lockss-p' 20 | 21 | # WASAPI API 22 | def jobs_post(query = None, function = None, parameters = None) -> str: 23 | # WARC creation is supposed to be asynchronous, but for now we 24 | # make it synchronous, so all jobs are in status 'complete' 25 | if function != 'auid': 26 | return "" 27 | # XXX for now, query is simply an AUID 28 | # XXX query should be a list of AUIDs, not just one 29 | client = Client(url, username=user, password=pwd4lockss) 30 | params = makeParams(client, query) 31 | results = client.service.createExportFiles(params) 32 | jobId = makeJobId() 33 | putJob(jobId, params, results) 34 | return makePostResponse(jobId, params, encodeResults(results)) 35 | 36 | def webdata_get(filename = None, contentType = None) -> str: 37 | # XXX webdata requires that content be returned selectively 38 | # XXX if it matches specified URL globbing and/or mime type. 39 | # The LOCKSS daemon's SOAP-y API does not support this. 40 | message = { 41 | 'status': '400', 42 | 'message': 'Content filtering not supported' 43 | } 44 | # flask's jsonify is not imported in this module, so return a JSON body 45 | # and let the framework apply the 400 status from the (body, status) tuple. 46 | return json.dumps(message), 400 47 | 48 | def jobs_get() -> str: 49 | ret = {} 50 | jobs = getJobs() 51 | ret = json.dumps(jobs) 52 | return ret 53 | 54 | def jobs_job_token_get(jobToken) -> str: 55 | ret = {} # dict, not list: string keys are assigned below 56 | job = getJob(jobToken) 57 | if job != None: 58 | params = job['params'] 59 | ret['function'] = 'auid' # the only function jobs_post accepts 60 | ret['jobtoken'] = jobToken 61 | ret['query'] = params['auid'] # stored params (see encodeParams) keep the AUID under 'auid' 62 | ret['state'] = 'complete' 63 | ret['submit-time'] = job['submit-time'] 64 | return json.dumps(ret) 65 | 66 | # Persistent state - i.e.
jobs database 67 | def putJob(jobId, params, results = None): 68 | ret = None 69 | state = {'params':encodeParams(params), 70 | 'results':encodeResults(results), 71 | 'submit-time':encodeTime()} 72 | try: 73 | fileState = os.stat(persistentFile) 74 | except FileNotFoundError as ex: 75 | jobs = {} 76 | else: 77 | if fileState.st_size > 0: 78 | with open(persistentFile, 'r') as f: 79 | jobs = json.load(f) 80 | else: 81 | jobs = {} 82 | with open(persistentFile, 'w') as f: 83 | jobs[jobId] = state 84 | json.dump(jobs, f) 85 | ret = jobId 86 | return ret 87 | 88 | def getJob(jobId): 89 | try: 90 | with open(persistentFile, 'r') as f: 91 | # XXX need to lock file 92 | jobs = json.load(f) 93 | return jobs[jobId] 94 | except FileNotFoundError: 95 | return None 96 | def getJobs(): 97 | try: 98 | with open(persistentFile, 'r') as f: 99 | # XXX need to lock file 100 | jobs = json.load(f) 101 | return jobs 102 | except FileNotFoundError: 103 | return [] 104 | 105 | # LOCKSS SOAP-y export web service 106 | def makeParams(client, auid): 107 | typeEnum = client.factory.create(u'typeEnum') 108 | filenameTranslationEnum = client.factory.create(u'filenameTranslationEnum') 109 | params = client.factory.create(u'exportServiceParams') 110 | params.auid = auid 111 | params.compress = 1 112 | params.excludeDirNodes = 0 113 | params.fileType = typeEnum.WARC_RESOURCE 114 | params.filePrefix = 'lockss-export-' 115 | params.maxSize = -1 116 | params.maxVersions = -1 117 | params.xlateFilenames = filenameTranslationEnum.XLATE_NONE 118 | return params 119 | 120 | # Encode exportServiceParams as a Dictionary 121 | def encodeParams(params): 122 | ret = {} 123 | ret['auid'] = params.auid 124 | ret['compress'] = params.compress 125 | ret['excludeDirNodes'] = params.excludeDirNodes 126 | ret['fileType'] = params.fileType 127 | ret['filePrefix'] = params.filePrefix 128 | ret['maxSize'] = params.maxSize 129 | ret['maxVersions'] = params.maxVersions 130 | ret['xlateFilenames'] = params.xlateFilenames 131 | return ret 132 | 133 | # Encode exportServiceWsResult as Dictionary 134 | # XXX there is currently no option to the SOAP-y API to select 135 | # XXX between streaming the WARC and placing it in the export 136 | # XXX directory - it does both. 137 | def encodeResults(results): 138 | ret = {} 139 | ret['auid'] = results.auId 140 | ret['name'] = results.dataHandlerWrappers[0].name 141 | ret['size'] = results.dataHandlerWrappers[0].size 142 | # XXX we need the LOCKSS daemon to compute the 143 | # XXX checksum of the files it puts in the export 144 | # XXX directory so that exports can be validated. 145 | # XXX We would fetch the checksum file here to get 146 | # XXX the checksum. 147 | # XXX Instead we compute the checksum of the streamed 148 | # XXX WARC content but this isn't an end-to-end check. 
149 | b = bytes(results.dataHandlerWrappers[0].dataHandler, "utf-8") 150 | warc = b64decode(b) 151 | m = hashlib.sha1() 152 | m.update(warc) 153 | ret['sha1'] = m.hexdigest() 154 | return ret 155 | 156 | def encodeTime(): 157 | local_time = datetime.now(timezone.utc).astimezone() 158 | return local_time.isoformat() 159 | 160 | def makeJobId(): 161 | # XXX for now, jobId is submission time 162 | return encodeTime() 163 | 164 | # Return the body of the POST response 165 | def makePostResponse(jobId, params, results): 166 | ret = { 167 | "includes-extras":False, 168 | "files":[ 169 | { 170 | "checksum":"sha1:" + results['sha1'], 171 | "content-type":"application/warc", 172 | "filename":results['name'], 173 | "locations":[ 174 | "http://"+host+"/export/"+results['name'] 175 | ], 176 | "size":results['size'] 177 | } 178 | ] 179 | } 180 | return json.dumps(ret) 181 | -------------------------------------------------------------------------------- /lockss-implementation/wasapi-test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # A minimal client for the WASAPI transfer API, with 4 | # defaults matching the LOCKSS implementation 5 | # 6 | # Arguments: 7 | # -f [function] or --function [function] default auid 8 | # -q [query] or --query [query] default AUID for SimulatedContent 9 | 10 | from argparse import ArgumentParser 11 | import requests 12 | import json 13 | 14 | # URL prefix for WSAPI service 15 | service = "http://localhost:8080/v0" 16 | testAUID = 'org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F' 17 | err = "Error: " 18 | 19 | # Return a Dictionary with the params for the WSAPI request 20 | def makeWsapiParams(): 21 | parser = ArgumentParser() 22 | parser.add_argument("-f", "--function", dest="myFunction", help="WASAPI function", default="auid") 23 | parser.add_argument("-q", "--query", dest="myQuery", default=testAUID, help="WASAPI query") 24 | args = parser.parse_args() 25 | ret = None 26 | if (args.myQuery != None and args.myFunction != None): 27 | ret = {} 28 | ret['query'] = args.myQuery 29 | ret['function'] = args.myFunction 30 | return ret 31 | 32 | params1 = makeWsapiParams() 33 | if params1 != None: 34 | # query the service 35 | wasapiResponse = requests.post(service + "/jobs", data=params1) 36 | status = wasapiResponse.status_code 37 | if(status == 200): 38 | # WASAPI request successful 39 | err = "" 40 | # parse the JSON we got back 41 | wasapiData = wasapiResponse.json() 42 | message = json.dumps(wasapiData) 43 | else: 44 | # WASAPI request unsuccessful 45 | message = "WSAPI request error: {0}\n{1}".format(status,wasapiResponse) 46 | else: 47 | message = "Usage: wasapi-test -q [query] -f [function]" 48 | print('WASAPI test') 49 | print("{0}{1}".format(err,message)) 50 | print() 51 | -------------------------------------------------------------------------------- /utilities/README.md: -------------------------------------------------------------------------------- 1 | # WASAPI Utilities # 2 | 3 | A list of open-source utilities, downloaders, and processing tools that make use of the WASAPI APIs for data transfer into local systems. Utilities by Stanford University Libraries, University of North Texas Libraries, and Rutgers University were developed as part of the main WASAPI grant project. 
4 | 5 | ## Stanford University Libraries ## 6 | 7 | https://github.com/sul-dlss/wasapi-downloader 8 | 9 | ## University of North Texas Libraries ## 10 | 11 | https://github.com/unt-libraries/py-wasapi-client 12 | --------------------------------------------------------------------------------