├── README.md ├── ait-implementation ├── README.md ├── wasapi │ ├── __init__.py │ ├── admin.py │ ├── filters.py │ ├── implemented-swagger.yaml │ ├── mailer.py │ ├── migrations │ │ └── __init__.py │ ├── models.py │ ├── selectors.py │ ├── serializers.py │ ├── swagger.yaml │ ├── tests.py │ ├── tests │ │ ├── __init__.py │ │ ├── fixtures.json │ │ ├── test_fixtures.py │ │ └── test_job_result.py │ ├── urls.py │ └── views.py └── webdata │ ├── __init__.py │ ├── decorators.py │ ├── urls.py │ └── views.py ├── ait-specification ├── README.md └── transfer_api_archive-it_v1.yaml ├── general-specification ├── README.md └── transfer_api_v1.yaml ├── lockss-implementation ├── README.md ├── default_controller.py └── wasapi-test.py └── utilities └── README.md /README.md: -------------------------------------------------------------------------------- 1 | ## WASAPI Data Transfer APIs 2 | 3 | This is the public repository for work on the Web Archiving Systems API (WASAPI) data transfer APIs. The intention for these APIs is to provide a standardized mechanism for export and import of web archive data (and perhaps, ultimately, derivative data and capture metadata) between diverse systems for preservation, replication, research use, data delivery, and other purposes. Design and development is being carried out by project partners on the [Institute of Museum and Library Services](https://www.imls.gov/)-funded [National Leadership Grant](https://www.imls.gov/grants/available/national-leadership-grants-libraries), [LG-71-15-0174](https://www.imls.gov/grants/awarded/lg-71-15-0174-15), "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf)" (PDF) in consultation with a technical working group and based on feedback from the web archiving community. 
4 | 5 | ## Clients & Utilities 6 | * Stanford University Digital Library Systems and Services: wasapi-downloader, https://github.com/sul-dlss/wasapi-downloader 7 | * UNT Libraries: py-wasapi-client, https://github.com/unt-libraries/py-wasapi-client 8 | * LOCKSS: LAAWS Live Demo, http://demo.laaws.lockss.org/ 9 | * Rutgers University: Location Mapper, https://github.com/mwe400/LocationMapper 10 | 11 | ## Documents 12 | * 2017-12-11: "[WASAPI Document Repository in archive.org](https://archive.org/details/wasapi)" 13 | * 2017-04-23: "[National Symposium on Web Archiving Interoperability: Agenda & Presentation Links](https://docs.google.com/document/d/1PM8u5nxAKUFb4oh1JTDARfl9hat7gOxgU1t2mGvn8Fg/edit#heading=h.n0bnn4za99v2)" 14 | * 2017-03-02: "[National Symposium on Web Archiving Interoperability Trip Report](https://web.archive.org/web/20180410063821/https://ws-dl.blogspot.fr/2017/03/2017-03-02-national-symposium-on-web.html)" 15 | * 2017-03-30: "[IMLS Year 1 Interim Performance Report Narrative](https://archive.org/details/WASAPIYearOneReport)" 16 | * 2016-11-29: "[Archive-It 2016 State of the WARC Report](https://archive-it.org/blog/post/2016-state-of-the-warc-our-second-annual-digital-preservation-survey-results/)" 17 | * 2016-08-19: "[WASAPI Survey on Data Transfer APIs](https://drive.google.com/file/d/0B7toWei7Sy_SOUJlZFhySHZYTWM/view?usp=sharing)" 18 | * 2015-09-15: "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/grants/awarded/lg-71-15-0174-15)" 19 | 20 | ## Presentations 21 | * 2017-12-11: "[Web Archiving Systems APIs (WASAPI)](https://docs.google.com/presentation/d/1lAjeNmnnJb_lLYofqR-ZlqcqxKZ_ithQ57vCPWdPFt4/edit?usp=sharing) 22 | * 2017-06-15: "[WASAPI: (Web Archiving Systems APIs) Project Updates and Data Transfer APIs Specifications and Demonstrations](https://docs.google.com/presentation/d/1nbfKd80V613-S7AH9CvbMZVp9SyLWW7ByMwMsBflM5s/edit?usp=sharing)" at [IIPC/ReSAW Web Archiving Week](http://netpreserve.org/wac2017/) 23 | * 2017-04-27: "[Stanford Web Archiving Work Cycle Inception Deck for Automating Web Archive Crawl Download and Accessioning](https://drive.google.com/file/d/0B7toWei7Sy_SU2VvWWNVUmRRQkk/view?usp=sharing)" 24 | * 2016-08-03: "[WASAPI: Web Archiving Systems Application Programming Interfaces](https://docs.google.com/presentation/d/1XajUcvETUTL_mSsr0vCno-fzSB15MsRsRmP_pikvGO8/edit?usp=sharing)" at the [SAA Web Archiving Roundtable Meeting](https://archives2016.sched.org/event/6niM/web-archiving) at [Archives Records 2016](http://www2.archivists.org/am2016) 25 | * 2016-06-14: "[WASAPI Web Archive Data Transfer APIs](http://www.slideshare.net/nullhandle/wasapi-web-archive-data-transfer-apis)" at [Archives Unleashed 2.0](http://archivesunleashed.com/) 26 | * 2016-05-26: "[Systems Interoperability and Collaborative Development for Web Archiving - Filling Gaps in the IMLS National Digital Platform](http://digital.library.unt.edu/ark:/67531/metadc848591/)" at [Texas Conference on Digital Libraries](https://conferences.tdl.org/tcdl/index.php/TCDL/TCDL2016) 27 | * 2016-04-12: "[Building API-Based Web Archiving Systems and Services](https://docs.google.com/presentation/d/1IJ9IcLG2cO118oNX0Z5rakiDVySuB9TBWwnVvHTEOAg/edit?usp=sharing)" at the [International Internet Preservation Consortium General Assembly](http://www.netpreserve.org/general-assembly/2016/overview) 28 | * 2016-04-05: "[Building National Web Archiving Capacity](https://drive.google.com/file/d/0BwW5mtdXJ3huLUowUnRZb0E0Z0E/view?usp=sharing)" at the [CNI 
Spring 2016 Meeting](https://www.cni.org/events/membership-meetings/past-meetings/spring-2016) 29 | 30 | ## Meeting Notes 31 | * 2016-12-13: [WASAPI Technical Working Group](https://docs.google.com/document/d/1q7m6pgINRAUOFGg3SMhCVD_IstAvYbF8cIJ5puO2HP8/edit) 32 | * 2016-03-30: [WASAPI Technical Working Group](https://docs.google.com/document/d/1kDbk3J_DVpqj2rBFQmQIoijYjwgWQKgY-19H6rckGkk/edit?ts=57c36d5a) 33 | 34 | ## How to connect 35 | * Join or send a message to the [WASAPI-community Google Group](https://groups.google.com/forum/#!forum/wasapi-community). 36 | * Join the [WASAPI Slack](https://docs.google.com/forms/d/e/1FAIpQLScsdTqssLrM9FinmpP8Mow2Hl8zJnfJZfjWxaeXddlvu2VjBw/viewform). 37 | -------------------------------------------------------------------------------- /ait-implementation/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This document should assist developers in building their own implementation of WASAPI based on Archive-It's implementation. 4 | 5 | # Archive-It WASAPI implementation 6 | 7 | The `archiveit/wasapi` application is the bulk of the code by which Archive-It implements the WASAPI specification. It was written within and then extracted from the [Django](https://www.djangoproject.com/) project (version 1.8.5) that serves [Archive-It's partner site](https://partner.archive-it.org/), so --while it can not be run alone-- it can be fit easily into another Django project. This document outlines implementation details, proposed changes to the WASAPI Data Transfer API general specification and the Archive-It additions beyond the minimum specifications. 8 | 9 | 10 | ## Formal specifications 11 | 12 | The [OpenAPI](https://www.openapis.org/) file `wasapi/swagger.yaml` describes Archive-It's ideal specification at the start of implementation (with few adjustments). The file `wasapi/implemented-swagger.yaml` shows what has been implemented. The difference between the two serves as a to-do list: note that you can submit new jobs and monitor their status but not yet retrieve their results. 13 | 14 | 15 | ## Re-integrating the code 16 | 17 | To use the `wasapi` application within another Django project, you must resolve some references it has to the Archive-It project. 18 | 19 | Archive-It's webdata files are modeled in `archiveit.archiveit.models.WarcFile`; replace that with your own. 20 | 21 | The `AitWasapiDateTimeField` field replaces `django.db.models.fields.DateTimeField` with the ability to parse abbreviated dates and adjust dates with timezones. 22 | 23 | The URL paths to the WASAPI endpoints (and also transport of webdata files) were established in `archiveit/urls.py`; add your own reference to your own routing file (as appropriate for the version of Django you are using): 24 | 25 | urlpatterns = ( 26 | # [...] 27 | patterns('', 28 | # [...] 29 | url(r'^wasapi/v1/', include('archiveit.wasapi.urls')), 30 | url(r'^webdatafile/', include('archiveit.webdata.urls')), 31 | ) 32 | ) 33 | 34 | The full URL to transport a webdata file was defined by WEBDATA_LOCATION_TEMPLATE in `archiveit.settings`. It uses named parameters from the webdata file model, eg `filename`. Make your own, eg: 35 | 36 | WEBDATA_LOCATION_TEMPLATE = BASEURL + '/webdatafile/%%(filename)s' 37 | 38 | 39 | ## The notification mailer and admin interface 40 | 41 | To fit with Archive-It's existing work flow, the implementation sends a notification email upon submission of a new job. 
We also provide [Django admin site](https://docs.djangoproject.com/en/dev/ref/contrib/admin/) to change the state of jobs. This is outside the scope of the WASAPI specification, but we include it here for completeness. 42 | 43 | ## The `webdata` application 44 | 45 | The `archiveit/webdata` application implements transport of webdata files. That is well outside the scope of the WASAPI specification, but we include it here also for completeness. It transparently serves webdata files from Archive-It's Petabox and HDFS stores. 46 | 47 | 48 | ## Proposed changes to the published minimum 49 | 50 | After more experience, we suggest that the minimum specification should be 51 | adjusted. 52 | 53 | ### Mandatory pagination syntax 54 | 55 | The specification should support pagination of large results. Simple 56 | implementations may give the full results in a single page, but adding the 57 | syntax later would be difficult. 58 | 59 | Unable to find consistent recommendations for pagination syntax, we adopt that 60 | of the [Django Rest Framework](http://www.django-rest-framework.org/). The 61 | client must accept `count`, `previous`, and `next` parameters. The 62 | implementation must provide the number of files/jobs/etc in `count`. The 63 | `previous` and `next` values can be either URLs by which to fetch other pages 64 | of results using a `page` parameter, be absent, or (as the Django Rest 65 | Framework does) hold an explicit `null`. 66 | 67 | ### Matching filenames 68 | 69 | Matching of filenames should consider only the basename and not any path of 70 | directories. The glob pattern should be matched against the complete basename 71 | (ie must match the beginning and end of the filename). An implementation that 72 | wants to match pathnames including directories (and consider eg whether `**` 73 | should match multiple directory separators eg `/`) may offer a different 74 | parameter. 75 | 76 | ### Simpler webdata file bundles 77 | 78 | We should drop `WebdataMenu` and `WebdataBundle`. The multiple `locations` of 79 | a `WebdataFile` provide most of their value. Rather than giving the client 80 | more information than would be used, an implementation can accept a request for 81 | specific transports and formats. 82 | 83 | ### Separate endpoint for results of a job; reporting on a failed job 84 | 85 | We replace `completion-time` with `termination-time` to ease polling for new 86 | information about jobs. Rather than a job that may include a successful result 87 | but gives the same indistinguishable lack of result for both progress and 88 | failure, we provide distinct endpoints: `/jobs/{jobtoken}/result` for a 89 | successful result and `/jobs/{jobtoken}/error` for reporting the error of a 90 | failed job. A client can easily poll `/jobs/{jobtoken}/result` and will be 91 | redirected to `/jobs/{jobtoken}/error` in the case that the job fails. 92 | 93 | ### Checksums of a webdata file 94 | 95 | Since it is useless to require the presence of checksums without mandating any 96 | specific checksum, every implementation should provide at least one of MD5 or 97 | SHA1. To allow evolution, the specification should use a dictionary instead of 98 | a single string. To ensure interoperability, all checksums should be 99 | represented as hexadecimal strings. 
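For example, a `checksums` dictionary covering both algorithms could look like the following (the values are the illustrative ones already used in `wasapi/swagger.yaml`):

    "checksums": {
        "sha1": "6b4f32a3408b1cd7db9372a63a2053c3ef25c731",
        "md5": "766ba6fd3a257edf35d9f42a8dd42a79"
    }
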
100 | 101 | ### Change label describing format of archive file 102 | 103 | Using the label `content-type` to describe the format of the archive files can 104 | be confused with the "content-type" or "MIME-type" of the resources within the 105 | archive. The label `content-type` should be reserved as a potentially valuable 106 | parameter to select such resources, and the current use should be replaced with 107 | `filetype`. Another label to consider is "archive-format" which explicitly 108 | references its subject. 109 | 110 | 111 | ## Extensions beyond the published minimum 112 | 113 | Archive-It extends the minimum specification with our own special parameters for our v1.0 release. 114 | 115 | ### Time range parameters 116 | 117 | We want to support date ranges, but we want to be careful about which time we 118 | refer to: the instant the crawl was requested, the instant a delayed crawl was 119 | scheduled to start, the instant the crawl started, the instant the resource is 120 | retrieved, the instant the archive file was written. For ease of 121 | implementation, we choose to operate on the time that the crawl started using 122 | the `crawl-start-after` and `crawl-start-before` parameters. 123 | 124 | ### Collection parameter 125 | 126 | The `collection` parameter accepts a numeric collection identifier as is used 127 | in the Archive-It application; multiple parameters allow matching files across 128 | multiple collections. 129 | 130 | ### Crawl parameter 131 | 132 | The `crawl` parameter accepts a numeric crawl identifier as is used in the 133 | Archive-It application. 134 | 135 | ### Functions 136 | 137 | Archive-It supports jobs of three functions: 138 | - `build-wat`: Build a WAT file with metadata from the matched archive files 139 | - `build-wane`: Build a WANE file with the named entities from the matched 140 | archive files 141 | - `build-cdx`: Build a CDX file indexing the matched archive files 142 | 143 | Archive-It functions do not yet accept any parameters. 144 | 145 | ### States of a job 146 | 147 | An Archive-It job can be described as being in one of five distinct states: 148 | - `queued`: Job has been submitted and is waiting to run. 149 | - `running`: Job is currently running. 150 | - `failed`: Job ran but failed. 151 | - `complete`: Job ran and successfully completed; result is available. 152 | - `gone`: Job ran, but the result is no longer available (eg deleted to save 153 | storage). 
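As a minimal sketch of how a client might exercise these functions and states end to end, the following Python drives the endpoints described in `wasapi/implemented-swagger.yaml`. The base URL, credentials, and use of HTTP basic authentication are assumptions made for illustration only; none of them is mandated by the specification.

    import time
    import requests

    # Assumed for illustration: the Archive-It base path and HTTP basic auth;
    # neither is required by the WASAPI specification itself.
    BASE = "https://partner.archive-it.org/wasapi/v1"
    AUTH = ("user@example.org", "password")  # hypothetical credentials

    # Submit a build-wat job; the query uses the same parameters as /webdata.
    resp = requests.post(BASE + "/jobs", auth=AUTH,
                         data={"query": "collection=1234", "function": "build-wat"})
    resp.raise_for_status()
    job = resp.json()
    token = job["jobtoken"]

    # Poll the job until it leaves the queued/running states.
    while job["state"] in ("queued", "running"):
        time.sleep(60)
        job = requests.get(BASE + "/jobs/" + token, auth=AUTH).json()

    if job["state"] == "complete":
        # The result is a FileSet: a count, optional previous/next links, and files.
        result = requests.get(BASE + "/jobs/" + token + "/result", auth=AUTH).json()
        for f in result["files"]:
            print(f["filename"], f["checksums"], f["locations"])
    else:
        print("job ended in state: " + job["state"])  # failed or gone
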
154 | 155 | ## Contacts 156 | 157 | *Archive-It (Internet Archive)* 158 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 159 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 160 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/admin.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from archiveit.wasapi.models import WasapiJob 3 | 4 | class WasapiJobAdmin(admin.ModelAdmin): 5 | search_fields = ['id', 'state', 'function', 'account__id', 'account__organization_name'] 6 | list_display = ['id', 'state', 'function', 'termination_time'] 7 | 8 | admin.site.register(WasapiJob, WasapiJobAdmin) 9 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/filters.py: -------------------------------------------------------------------------------- 1 | from rest_framework.filters import BaseFilterBackend 2 | from archiveit.wasapi.selectors import select_webdata_query, select_auth 3 | 4 | # A few words about selectors and "guts": We want to share functionality 5 | # between the "webdata" query and selecting the source files for a job, but 6 | # those two clients have different information in different structures. 7 | # Therefore, we extract (most of) the guts from the filter_queryset methods 8 | # into functions in selectors.py, giving them a different interface (narrower 9 | # than Django's, ie a querydict and kwargs that may or may not include an 10 | # account that may or may not get used) which WasapiJob.set_ideal_result can 11 | # use. 12 | 13 | class WasapiWebdataQueryFilterBackend(BaseFilterBackend): 14 | """Filtering on composition necessary for the webdata query""" 15 | 16 | def filter_queryset(self, request, queryset, view): 17 | return select_webdata_query(request.GET, queryset, 18 | account=request.user.account, user=request.user) 19 | 20 | 21 | class WasapiAuthFilterBackend(BaseFilterBackend): 22 | """Filtering on authorization""" 23 | def filter_queryset(self, request, queryset, view): 24 | return select_auth(request.GET, queryset, 25 | account=request.user.account, user=request.user) 26 | 27 | 28 | class WasapiAuthJobBackend(BaseFilterBackend): 29 | """Filtering on authorization to see the specific job""" 30 | def filter_queryset(self, request, queryset, view): 31 | # TODO: raise http error rather than empty result 32 | queryset = queryset.filter(job_id=view.kwargs['jobid']) 33 | if request.user.is_superuser: 34 | return queryset # no restriction 35 | elif request.user.is_anonymous(): 36 | return queryset.none() # ie hide it all 37 | else: 38 | return queryset.filter(job__account=request.user.account) 39 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/implemented-swagger.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: WASAPI Export API as implemented by Archive-It 4 | description: > 5 | WASAPI Export API. What Archive-It has implemented. 
6 | version: 1.0.0 7 | contact: 8 | name: Jefferson Bailey and Mark Sullivan 9 | url: https://github.com/WASAPI-Community/data-transfer-apis 10 | license: 11 | name: Apache 2.0 12 | url: http://www.apache.org/licenses/LICENSE-2.0.html 13 | consumes: 14 | - application/json 15 | produces: 16 | - application/json 17 | basePath: /wasapi/v1 18 | schemes: 19 | - https 20 | paths: 21 | /webdata: 22 | get: 23 | summary: Get the archive files I need 24 | description: > 25 | Produces a page of the list of the files accessible to the client 26 | matching all of the parameters. A parameter with multiple options 27 | matches when any option matches; a missing parameter implicitly 28 | matches. 29 | parameters: 30 | # pagination 31 | - $ref: '#/parameters/page' 32 | # basic query 33 | - $ref: '#/parameters/filename' 34 | # specific to Archive-It 35 | - $ref: '#/parameters/crawl-time-after' 36 | - $ref: '#/parameters/crawl-time-before' 37 | - $ref: '#/parameters/crawl-start-after' 38 | - $ref: '#/parameters/crawl-start-before' 39 | - $ref: '#/parameters/collection' 40 | - $ref: '#/parameters/crawl' 41 | responses: 42 | '200': 43 | description: Success 44 | schema: 45 | $ref: '#/definitions/FileSet' 46 | '400': 47 | description: The request could not be interpreted 48 | /jobs: 49 | get: 50 | summary: What jobs do I have? 51 | description: 52 | Show the jobs on this server accessible to the client 53 | parameters: 54 | - $ref: '#/parameters/page' 55 | responses: 56 | '200': 57 | description: > 58 | Success. Produces a page of the list of the jobs accessible to 59 | the client. 60 | schema: 61 | type: object 62 | required: 63 | - count 64 | - jobs 65 | properties: 66 | count: 67 | type: integer 68 | description: > 69 | The total number of jobs matching the query (across all pages) 70 | previous: 71 | description: > 72 | Link (if any) to the previous page of jobs; otherwise null 73 | type: [string, "null"] 74 | format: url 75 | next: 76 | description: > 77 | Link (if any) to the next page of jobs; otherwise null 78 | type: [string, "null"] 79 | format: url 80 | jobs: 81 | type: array 82 | items: 83 | $ref: '#/definitions/Job' 84 | post: 85 | summary: Make a new job 86 | description: 87 | Create a job to perform some task 88 | parameters: 89 | - name: query 90 | in: formData 91 | required: true 92 | description: > 93 | URL-encoded query as appropriate for /webdata end-point. The empty 94 | query (which matches everything) must explicitly be given as the 95 | empty string. 96 | type: string 97 | - $ref: '#/parameters/function' 98 | - name: parameters 99 | in: formData 100 | required: false 101 | description: > 102 | Other parameters specific to the function and implementation 103 | (URL-encoded). For example: level of compression, priority, time 104 | limit, space limit. Archive-It does not yet accept any such 105 | parameters. 106 | type: string 107 | responses: 108 | '201': 109 | description: > 110 | Job was successfully submitted. Body is the submitted job. 111 | schema: 112 | $ref: '#/definitions/Job' 113 | '400': 114 | description: The request could not be interpreted 115 | '/jobs/{jobtoken}': 116 | get: 117 | summary: How is my job doing? 118 | description: 119 | Retrieve information about a job, both the parameters of its submission 120 | and its current state. If the job is complete, the client can get the 121 | result through a separate request to `jobs/{jobtoken}/result`. 
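      # Illustrative only (not part of the formal schema): a successful response
      # from GET /wasapi/v1/jobs/21 might look like
      #   {"jobtoken": "21", "function": "build-wat", "query": "collection=1234",
      #    "submit-time": "2017-02-14T21:04:14Z", "state": "queued"}
      # where the token, query, and time are hypothetical values.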
122 | parameters: 123 | - $ref: '#/parameters/jobtoken' 124 | responses: 125 | '200': 126 | description: Success 127 | schema: 128 | $ref: '#/definitions/Job' 129 | '400': 130 | description: The request could not be interpreted 131 | '404': 132 | description: No such job visible to this client 133 | '/jobs/{jobtoken}/result': 134 | get: 135 | summary: What is the result of my job? 136 | description: > 137 | For a complete job, produces a page of the resulting files. 138 | parameters: 139 | - $ref: '#/parameters/page' 140 | - $ref: '#/parameters/jobtoken' 141 | responses: 142 | '200': 143 | description: Success 144 | schema: 145 | $ref: '#/definitions/FileSet' 146 | definitions: 147 | WebdataFile: 148 | description: > 149 | Description of a unit of distribution of web archival data. (This data 150 | type does not include the actual archival data.) Examples: a WARC file, 151 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 152 | type: object 153 | required: 154 | - filename 155 | - checksums 156 | - filetype 157 | - locations 158 | properties: 159 | filename: 160 | type: string 161 | description: The name of the webdata file 162 | filetype: 163 | type: string 164 | description: > 165 | The format of the archive file, eg `warc`, `wat`, `cdx` 166 | checksums: 167 | type: object 168 | items: 169 | type: string 170 | format: hexstring 171 | description: > 172 | Verification of the content of the file. Must include at least one 173 | of MD5 or SHA1. The key specifies the lowercase name of the 174 | algorithm; the element is a hexadecimal string of the checksum 175 | value. For example: 176 | {"sha1":"6b4f32a3408b1cd7db9372a63a2053c3ef25c731", 177 | "md5":"766ba6fd3a257edf35d9f42a8dd42a79"} 178 | size: 179 | type: integer 180 | format: int64 181 | description: The size in bytes of the webdata file 182 | collection: 183 | type: integer 184 | format: int64 185 | description: The numeric ID of the collection 186 | crawl: 187 | type: integer 188 | format: int64 189 | description: The numeric ID of the crawl 190 | crawl-time: 191 | type: string 192 | format: date-time 193 | description: Time the original content of the file was crawled 194 | crawl-start: 195 | type: string 196 | format: date-time 197 | description: Time the crawl started 198 | locations: 199 | type: array 200 | items: 201 | type: string 202 | format: url 203 | description: > 204 | A list of (mirrored) sources from which to retrieve (identical copies 205 | of) the webdata file, eg `https://partner.archive-it.org/webdatafile/ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz`, 206 | `/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e` 207 | FileSet: 208 | type: object 209 | required: 210 | - count 211 | - files 212 | properties: 213 | includes-extra: 214 | type: boolean 215 | description: > 216 | When false, the data in the `files` contains nothing extraneous from 217 | what is necessary to satisfy the query or job. When true or absent, 218 | the client must be prepared to handle irrelevant data within the 219 | referenced `files`. 
220 | count: 221 | type: integer 222 | description: The total number of files (across all pages) 223 | previous: 224 | description: > 225 | Link (if any) to the previous page of files; otherwise null 226 | type: [string, "null"] 227 | format: url 228 | next: 229 | description: > 230 | Link (if any) to the next page of files; otherwise null 231 | type: [string, "null"] 232 | format: url 233 | files: 234 | type: array 235 | items: 236 | $ref: '#/definitions/WebdataFile' 237 | Job: 238 | type: object 239 | description: > 240 | A job submitted to perform a task. Conceptually, a complete job has a 241 | `result` FileSet, but we avoid sending that potentially large data with 242 | every mention of every job. If the job is complete, the client can get 243 | the result through a separate request to `jobs/{jobtoken}/result`. 244 | required: 245 | - jobtoken 246 | - function 247 | - query 248 | - submit-time 249 | - state 250 | properties: 251 | jobtoken: 252 | type: string 253 | description: > 254 | Identifier unique across the implementation. Archive-It has chosen 255 | to use an increasing integer. 256 | function: 257 | $ref: '#/definitions/Function' 258 | query: 259 | type: string 260 | description: > 261 | The specification of what webdata to include in the job. Encoding is 262 | URL-style, eg `param=value&otherparam=othervalue`. 263 | submit-time: 264 | type: string 265 | format: date-time 266 | description: Time of submission, formatted according to RFC3339 267 | termination-time: 268 | type: string 269 | format: date-time 270 | description: > 271 | Time of completion or failure, formatted according to RFC3339 272 | state: 273 | type: string 274 | enum: 275 | - queued 276 | - running 277 | - failed 278 | - complete 279 | - gone 280 | # alas, can't use GFM 281 | description: > 282 | The state of the job through its lifecycle. 283 | `queued`: Job has been submitted and is waiting to run. 284 | `running`: Job is currently running. 285 | `failed`: Job ran but failed. 286 | `complete`: Job ran and successfully completed; result is available. 287 | `gone`: Job ran, but the result is no longer available (eg deleted 288 | to save storage). 289 | Function: 290 | type: string 291 | enum: 292 | - build-wat 293 | - build-wane 294 | - build-cdx 295 | # This would be the more meaningful place to document the concept of 296 | # "function", but the parameter gives the documentation more space and 297 | # handles GFM. 298 | description: > 299 | The function of the job. See the `function` parameter to the POST that 300 | created the job. 301 | parameters: 302 | # I wish OpenAPI offered a way to define and compose sets of parameters. 303 | # pagination: 304 | page: 305 | name: page 306 | in: query 307 | type: integer 308 | required: false 309 | description: > 310 | One-based index for pagination 311 | # job token: 312 | jobtoken: 313 | name: jobtoken 314 | in: path 315 | description: The job token returned from previous request 316 | required: true 317 | type: string 318 | # basic query: 319 | filename: 320 | name: filename 321 | in: query 322 | type: string 323 | required: false 324 | description: > 325 | A string exactly matching the webdata file's basename (ie must match the 326 | beginning and end of the filename, not the full path of directories). 
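  # Illustrative only: a request such as
  #   GET /wasapi/v1/webdata?filename=ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz
  # matches that exact basename wherever the file happens to be stored;
  # directory paths are never considered.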
327 | # Archive-It's implementation of a function 328 | function: 329 | name: function 330 | in: formData 331 | required: true 332 | description: > 333 | One of the following strings which have the following meanings: 334 | 335 | - `build-wat`: Build a WAT file with metadata from the matched archive 336 | files 337 | 338 | - `build-wane`: Build a WANE file with the named entities from the 339 | matched archive files 340 | 341 | - `build-cdx`: Build a CDX file indexing the matched archive files 342 | 343 | type: string 344 | enum: 345 | - build-wat 346 | - build-wane 347 | - build-cdx 348 | # time of crawl (specific to Archive-It): 349 | crawl-time-after: 350 | name: crawl-time-after 351 | type: string 352 | format: date-time 353 | in: query 354 | required: false 355 | description: > 356 | Match resources that were crawled at or after the time given according to 357 | RFC3339. A date given with no time of day means midnight. Coordinated 358 | Universal (UTC) is preferrred and assumed if no timezone is included. 359 | Because `crawl-time-after` matches equal time stamps while 360 | `crawl-time-before` excludes equal time stamps, and because we specify 361 | instants rather than durations implicit from our units, we can smoothly 362 | scale between days and seconds. That is, we specify ranges in the manner 363 | of the C programming language, eg low ≤ x < high. For example, matching 364 | the month of November of 2016 is specified by 365 | `crawl-time-after=2016-11 & crawl-time-before=2016-12` or 366 | equivalently by `crawl-time-after=2016-11-01T00:00:00Z & 367 | crawl-time-before=2016-11-30T16:00:00-08:00`. 368 | crawl-time-before: 369 | name: crawl-time-before 370 | type: string 371 | format: date-time 372 | in: query 373 | required: false 374 | description: > 375 | Match resources that were crawled strictly before the time given 376 | according to RFC3339. See more detail at `crawl-time-after`. 377 | crawl-start-after: 378 | name: crawl-start-after 379 | type: string 380 | format: date-time 381 | in: query 382 | required: false 383 | description: > 384 | Match resources that were crawled in a job that started at or after the 385 | time given according to RFC3339. (Note that the original content of a 386 | file could be crawled many days after the crawl job started; would you 387 | prefer `crawl-time-after` / `crawl-time-before`?) 388 | crawl-start-before: 389 | name: crawl-start-before 390 | type: string 391 | format: date-time 392 | in: query 393 | required: false 394 | description: > 395 | Match resources that were crawled in a job that started strictly before 396 | the time given according to RFC3339. See more detail at 397 | `crawl-start-after`. 398 | # collection (specific to Archive-It): 399 | collection: 400 | name: collection 401 | type: integer 402 | in: query 403 | required: false 404 | description: > 405 | The numeric ID of one or more collections, given as separate fields. 406 | For only this parameter, WASAPI accepts multiple values and will match 407 | items in any of the specified collections. For example, matching the 408 | items from two collections can be specified by `collection=1 & 409 | collection=2`. 
410 | # crawl (specific to Archive-It): 411 | crawl: 412 | name: crawl 413 | type: integer 414 | in: query 415 | required: false 416 | description: > 417 | The numeric ID of the crawl 418 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/mailer.py: -------------------------------------------------------------------------------- 1 | from django.conf import settings 2 | from django.core.mail import send_mail 3 | 4 | def new_wasapijob(job, covering_collection_times): 5 | message_template = """ 6 | Dear Web Archivists, 7 | 8 | {user_full_name:s} of account "{account_name:s}" ({account_id:d}) has submitted a new job ({job_id:d}): 9 | {function:s} 10 | {query:s} 11 | 12 | {urls:s} 13 | 14 | {todo} 15 | 16 | Love, 17 | WASAPI within AIT5 18 | """ 19 | if not covering_collection_times: 20 | todo = "The query matched no files, so the job is already complete." 21 | elif job.state == job.COMPLETE: 22 | todo = "The query matched only files already derived, so the job is already complete." 23 | else: 24 | todo = ( 25 | "Please derive from these collections over these time spans:\n" + 26 | "\n".join([ 27 | '%s %s %d %s %s' % (start, end, collection.id, 28 | collection.account.organization_name, collection.name) 29 | for collection,start,end in covering_collection_times ])) 30 | message = message_template.format( 31 | user_full_name = job.user.full_name, 32 | account_name = job.account.organization_name, 33 | account_id = job.account.id, 34 | job_id = job.id, 35 | function = job.function, 36 | query = job.query, 37 | urls = '\n'.join( 38 | [settings.BASE50URL + job.get_absolute_url()] + 39 | ([settings.BASE50URL + job.get_absolute_url() + '/result'] 40 | if job.state==job.COMPLETE else []) + 41 | ['%s/admin/wasapi/wasapijob/%d/' % (settings.BASE50URL, job.id)] ), 42 | todo = todo, 43 | ) 44 | send_mail( 45 | 'Research Services Dataset Request via WASAPI', 46 | message, 47 | 'donotreply@archive-it.org', 48 | settings.AITRESEARCHSERVICES_ADDRESS, 49 | fail_silently=False) 50 | 51 | 52 | def complete_wasapijob(job): 53 | message_template = """ 54 | Dear Web Archivists, 55 | 56 | The job ({job_id:d}) for {user_full_name:s} of account "{account_name:s}" ({account_id:d}) has completed: 57 | {function:s} 58 | {query:s} 59 | 60 | {urls:s} 61 | 62 | Love, 63 | WASAPI within AIT5 64 | """ 65 | message = message_template.format( 66 | user_full_name = job.user.full_name, 67 | account_name = job.account.organization_name, 68 | account_id = job.account.id, 69 | job_id = job.id, 70 | function = job.function, 71 | query = job.query, 72 | urls = '\n'.join([ 73 | settings.BASE50URL + job.get_absolute_url(), 74 | settings.BASE50URL + job.get_absolute_url() + '/result', 75 | '%s/admin/wasapi/wasapijob/%d/' % (settings.BASE50URL, job.id), 76 | ]) 77 | ) 78 | send_mail( 79 | 'WASAPI job %d completed' % (job.id), 80 | message, 81 | 'donotreply@archive-it.org', 82 | settings.AITRESEARCHSERVICES_ADDRESS, 83 | fail_silently=False) 84 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/migrations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/migrations/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/models.py: 
-------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from datetime import datetime 3 | import re 4 | from django.core.urlresolvers import reverse 5 | from django.db import models 6 | from .mailer import new_wasapijob, complete_wasapijob 7 | from django.db.models.signals import post_save, pre_save 8 | from django.http import QueryDict 9 | from archiveit.archiveit.models import WarcFile, DerivativeFile 10 | from archiveit.archiveit.model_fields import AitWasapiDateTimeField 11 | from archiveit.wasapi.selectors import select_webdata_query 12 | 13 | 14 | # This set of functions is specific to Archive-It: 15 | function_instances = {} 16 | class JobFunction(object): 17 | identifier = None 18 | code = None 19 | english = None 20 | @staticmethod 21 | def ideal_results_for(job, source_files): 22 | '''Transform DerivativeFile records to new WasapiJobResultFile 23 | records according to the job function''' 24 | assert NotImplementedError('Implement in concrete class') 25 | 26 | @classmethod 27 | def register(cls): 28 | assert cls.identifier not in function_instances, "Already got one" 29 | function_instances[cls.code] = cls 30 | 31 | class BuildWat(JobFunction): 32 | identifier = "BUILD_WAT" 33 | code = "build-wat" 34 | english = "Build a WAT file" 35 | @staticmethod 36 | def ideal_results_for(job, source_files): 37 | return [WasapiJobResultFile( 38 | job=job, 39 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.wat.gz', warcfile.filename) 40 | ) for warcfile in source_files] 41 | BuildWat.register() 42 | 43 | class BuildWane(JobFunction): 44 | identifier = "BUILD_WANE" 45 | code = "build-wane" 46 | english = "Build a WANE file" 47 | @staticmethod 48 | def ideal_results_for(job, source_files): 49 | return [WasapiJobResultFile( 50 | job=job, 51 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.wane.gz', warcfile.filename) 52 | ) for warcfile in source_files] 53 | BuildWane.register() 54 | 55 | class BuildCdx(JobFunction): 56 | identifier = "BUILD_CDX" 57 | code = "build-cdx" 58 | english = "Build a CDX file" 59 | @staticmethod 60 | def ideal_results_for(job, source_files): 61 | return [WasapiJobResultFile( 62 | job=job, 63 | filename=re.sub(r'\.(w?arc)\.gz', '_\\1.cdx.gz', warcfile.filename) # TODO: wait to support CDX 64 | ) for warcfile in source_files] 65 | #BuildCdx.register() # TODO: implement and register 66 | 67 | class BuildLga(JobFunction): 68 | identifier = "BUILD_LGA" 69 | code = "build-lga" 70 | english = "Build an LGA file" 71 | @staticmethod 72 | def ideal_results_for(job, source_files): 73 | # TODO: we ignore the results of the query, so don't execute it 74 | # note lack of list comprehension (because LGA is not one-to-one) 75 | return [WasapiJobResultFile( 76 | job=job, 77 | filename='ARCHIVEIT-%d-LONGITUDINAL-GRAPH-%4d-%02d-%02d.lga.tgz' % ( 78 | job.collection, # TODO: but jobs are too general, pull from query? 
79 | job.submitTime.year, job.submitTime.month, job.submitTime.day) 80 | )] 81 | #BuildLga.register() # TODO: implement and register 82 | 83 | 84 | class WasapiJob(models.Model): 85 | 86 | id = models.AutoField(primary_key=True) 87 | 88 | # fields for minimal, generic WASAPI: 89 | 90 | FUNCTION_CHOICES = [(concrete_instance.code, concrete_instance.english) 91 | for concrete_instance in function_instances.values()] 92 | function = models.CharField(max_length=32, null=False, choices=FUNCTION_CHOICES) 93 | @property 94 | def function_instance(self): 95 | return function_instances[self.function] 96 | 97 | query = models.CharField(max_length=1024, blank=True, null=False) 98 | submit_time = AitWasapiDateTimeField(db_column='submitTime', auto_now_add=True, null=False) 99 | termination_time = AitWasapiDateTimeField(db_column='terminationTime', null=True, blank=True) 100 | 101 | # This list of states is specific to Archive-It: 102 | STATES = [ 103 | # (identifier, code, english) 104 | ("QUEUED", "queued", "Queued"), 105 | ("RUNNING", "running", "Running"), 106 | ("FAILED", "failed", "Failed"), 107 | ("COMPLETE", "complete", "Complete"), 108 | ("GONE", "gone", "Gone")] 109 | for identifier, code, english in STATES: 110 | locals()[identifier] = code 111 | STATE_CHOICES = [(code, english) for identifier, code, english in STATES] 112 | state = models.CharField(max_length=32, null=False, choices=STATE_CHOICES) 113 | 114 | # fields specific to Archive-It: 115 | 116 | account = models.ForeignKey('accounts.Account', db_column='accountId', editable=False, blank=True) 117 | user = models.ForeignKey('accounts.User', db_column='userId', editable=False, blank=True) 118 | 119 | class Meta: 120 | managed = False 121 | db_table = 'WasapiJob' 122 | 123 | # the same as WebdataQueryViewSet 124 | queryset = WarcFile.objects.all().order_by('-id') 125 | def query_just_like_webdataqueryviewset(self): 126 | '''Create the same queryset that WebdataQueryViewSet executes''' 127 | querydict = QueryDict(self.query) 128 | return select_webdata_query(querydict, self.queryset, 129 | account=self.account, user=self.user) 130 | 131 | def set_ideal_result_and_state(self, source_files): 132 | '''Calculate WasapiJobResultFiles that would comprise result when 133 | complete; update state''' 134 | ideal_results = self.function_instance.ideal_results_for(self, source_files) 135 | # TODO: batch the DB query 136 | for ideal_result in ideal_results: 137 | ideal_result.update_state() 138 | if all(ideal_result.is_complete() for ideal_result in ideal_results): 139 | # vacuous (if ideal_results is empty) or freebie job 140 | self.state = self.COMPLETE 141 | self.termination_time = datetime.now() 142 | # new_wasapijob will say complete so needn't call complete_wasapijob 143 | self._ideal_results = [] 144 | else: 145 | # juicy job, ie have to do some work 146 | self.state = self.QUEUED 147 | self._ideal_results = ideal_results # save them after we get our id 148 | 149 | def save_results(self): 150 | '''Save the WasapiJobResultFile records now that they can refer to 151 | their WasapiJob record''' 152 | for ideal_result in self._ideal_results: 153 | ideal_result.job = self # job_id wasn't yet available at creation 154 | ideal_result.save() 155 | 156 | def update_state(self): 157 | if self.state in (self.RUNNING, self.QUEUED): 158 | ideal_results = WasapiJobResultFile.objects.filter(job=self) 159 | if all(ideal_result.is_complete() for ideal_result in ideal_results): 160 | self.state = self.COMPLETE 161 | self.termination_time = datetime.now() 162 
| complete_wasapijob(self) 163 | 164 | def __str__(self): 165 | return "" % (self.id, self.state) 166 | 167 | def get_absolute_url(self): 168 | return reverse('wasapijob-detail', kwargs={'pk':self.id}) 169 | 170 | @staticmethod 171 | def covering_collection_times(source_files): 172 | '''Returns a list of (collection,start,end) tuples. Deriving each 173 | collection for the given time range will generate a superset of the 174 | desired result files.''' 175 | by_collection = {} 176 | for source_file in source_files: 177 | partition = by_collection.get(source_file.collection, set()) 178 | by_collection[source_file.collection] = partition 179 | partition.add(source_file) 180 | return [ 181 | ( collection, 182 | min(sf.crawl_time for sf in sfs), 183 | max(sf.crawl_time for sf in sfs) ) 184 | for collection, sfs in by_collection.items()] 185 | 186 | @classmethod 187 | def pre_save(cls, instance, **kwargs): 188 | job = instance 189 | if not job.id: # freshly created job 190 | source_files = job.query_just_like_webdataqueryviewset() 191 | job.set_ideal_result_and_state(source_files) 192 | job._source_files = source_files # stash for mailer in post-save 193 | 194 | @classmethod 195 | def post_save(cls, instance, **kwargs): 196 | # would prefer to call the parameter "job", but "send" passes it by name 197 | job = instance 198 | if hasattr(job, '_source_files'): # freshly created job 199 | job.save_results() 200 | new_wasapijob(job, cls.covering_collection_times(job._source_files)) 201 | 202 | pre_save.connect(receiver=WasapiJob.pre_save, sender=WasapiJob) 203 | post_save.connect(receiver=WasapiJob.post_save, sender=WasapiJob) 204 | 205 | # Voodoo to patch bug exposed in restore_object: 206 | # TypeError: can only concatenate tuple (not "list") to tuple 207 | # at ait5/ lib/python3.5/site-packages/rest_framework/serializers.py:969 208 | # for field in meta.many_to_many + meta.virtual_fields: 209 | WasapiJob._meta.virtual_fields = () # was []; many_to_many is () 210 | 211 | 212 | def proxy_to_derivative(*fields): 213 | def decorator(cls): 214 | # invoke another method to prevent iterator variable from varying 215 | def install_proxy(cls, field): 216 | @property 217 | def ameth(self): 218 | return( self.derivative_file and 219 | getattr(self.derivative_file, field) ) 220 | setattr(cls, field, ameth) 221 | for field in fields: 222 | install_proxy(cls, field) 223 | return cls 224 | return decorator 225 | 226 | @proxy_to_derivative('filetype', 'md5', 'sha1', 'size', 'store_time', 227 | 'crawl_time', 'account_id', 'collection_id', 'crawl_job', 'crawl_job_id', 228 | 'pbox_item', 'hdfs_path') 229 | class WasapiJobResultFile(models.Model): 230 | id = models.AutoField(primary_key=True) 231 | job = models.ForeignKey('WasapiJob', db_column='jobId') 232 | filename = models.CharField(max_length=4000) 233 | derivative_file = models.ForeignKey('archiveit.DerivativeFile', db_column='derivativeFileId', null=True) 234 | 235 | class Meta: 236 | managed = False 237 | db_table = 'WasapiJobResultFile' 238 | 239 | def is_complete(self): 240 | return self.derivative_file 241 | 242 | def update_state(self): 243 | '''Update reference to any newly existing derivative_file; return 244 | whether state changed (ie whether need to propagate changes further)''' 245 | if self.derivative_file: 246 | return False 247 | self.derivative_file = DerivativeFile.objects.filter(filename=self.filename).first() 248 | return self.derivative_file 249 | 250 | def dict_for_location(self): 251 | '''Return a mapping that can fill location 
templates''' 252 | d = self.__dict__ 253 | d.update(pbox_item=self.pbox_item) 254 | return d 255 | 256 | def __repr__(self): 257 | return( 'WasapiJobResultFile(id=%s,filename=%s,job_id=%s,%s)' % 258 | (self.id, self.filename, self.job_id, 259 | "complete" if self.derivative_file else "incomplete")) 260 | 261 | @classmethod 262 | def update_completed_result_files(cls): 263 | '''Find and update result files that completed without notification''' 264 | result_files = cls.objects.raw(''' 265 | select WasapiJobResultFile.id, WasapiJobResultFile.jobId 266 | from WasapiJobResultFile 267 | join DerivativeFile 268 | on WasapiJobResultFile.filename=DerivativeFile.filename 269 | where WasapiJobResultFile.derivativeFileId is null''') 270 | # TODO: use that result rather than refetching for each result file 271 | cls.update_states(result_files) 272 | 273 | @staticmethod 274 | def update_states(result_files): 275 | jobs_to_update = set() 276 | for result_file in result_files: 277 | if result_file.update_state(): 278 | result_file.save() 279 | jobs_to_update.add(result_file.job) 280 | # TODO: batch the DB queries 281 | for job in jobs_to_update: 282 | job.update_state() 283 | job.save() 284 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/selectors.py: -------------------------------------------------------------------------------- 1 | # Functions that selectively filter a queryset, ie the "guts" of filters. 2 | 3 | # Filtering functions, both abstract (could be in the DRF library) and specific 4 | # to WASAPI, and composite 5 | 6 | 7 | # for WasapiWebdataQueryFilterBackend and executing query of new WasapiJob 8 | def select_webdata_query(querydict, queryset, **kwargs): 9 | sub_filters = [ 10 | select_auth, 11 | select_wasapi_direct_fields, 12 | select_wasapi_mapped_fields, 13 | ] 14 | for filter in sub_filters: 15 | queryset = filter(querydict, queryset, **kwargs) 16 | return queryset 17 | 18 | 19 | # for WasapiAuthFilterBackend 20 | def select_auth(querydict, queryset, **kwargs): 21 | if kwargs['user'].is_superuser: 22 | return queryset # no restriction 23 | if kwargs['user'].is_anonymous(): 24 | return queryset.none() # ie hide it all 25 | return queryset.filter(account_id=kwargs['account'].id) 26 | 27 | 28 | def generate_select_direct_fields(*args): 29 | """Simple filtering on equality and inclusion of fields 30 | 31 | Any given field that is included in the class's multi_field_names is tested 32 | against any of potentially multiple arguments given in the request. 
Any 33 | other (existing) field is tested for equality with the given value.""" 34 | multi_field_names = set(args) 35 | def select_direct_fields(querydict, queryset, **kwargs): 36 | field_names = set( field.name for field in queryset.model._meta.get_fields() ) 37 | for key, value in querydict.items(): 38 | if key in multi_field_names: 39 | filter_value = { key+'__in': querydict.getlist(key) } 40 | queryset = queryset.filter(**filter_value) 41 | elif key in field_names: 42 | queryset = queryset.filter(**{ key:value }) 43 | return queryset 44 | return select_direct_fields 45 | 46 | # for WebdataDirectFieldFilterBackend 47 | select_wasapi_direct_fields = generate_select_direct_fields('collection') 48 | 49 | 50 | def generate_select_mapped_fields(filter_for_parameter): 51 | """Map parameters to ORM filters 52 | 53 | Based on `filter_for_parameter` dictionary mapping HTTP parameter name to 54 | ORM query filter""" 55 | 56 | def select_mapped_fields(querydict, queryset, **kwargs): 57 | for parameter_name, filter_name in filter_for_parameter.items(): 58 | value = querydict.get(parameter_name) 59 | if value: 60 | queryset = queryset.filter(**{filter_name:value}) 61 | return queryset 62 | return select_mapped_fields 63 | 64 | # for WebdataMappedFieldFilterBackend 65 | select_wasapi_mapped_fields = generate_select_mapped_fields({ 66 | 'crawl': 'crawl_job_id', 67 | 'crawl-time-after': 'crawl_time__gte', 68 | 'crawl-time-before': 'crawl_time__lt', 69 | 'crawl-start-after': 'crawl_job__original_start_date__gte', 70 | 'crawl-start-before': 'crawl_job__original_start_date__lt', 71 | }) 72 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/serializers.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from django.conf import settings 3 | from django.core.exceptions import PermissionDenied 4 | from rest_framework import serializers 5 | from rest_framework.pagination import PaginationSerializer 6 | from archiveit.archiveit.models import WarcFile 7 | from archiveit.wasapi.models import WasapiJob 8 | 9 | class WebdataFileSerializer(serializers.HyperlinkedModelSerializer): 10 | # explicitly adding to locals() lets us include '-' in name of fields 11 | locals().update({ 12 | 'filetype': serializers.CharField(), 13 | 'checksums': serializers.SerializerMethodField('checksums_method'), 14 | 'account': serializers.SerializerMethodField('account_method'), 15 | 'collection': serializers.SerializerMethodField('collection_method'), 16 | 'crawl': serializers.SerializerMethodField('crawl_method'), 17 | 'crawl-time': serializers.DateTimeField(source='crawl_time'), 18 | 'crawl-start': serializers.SerializerMethodField('crawl_start_method'), 19 | 'locations': serializers.SerializerMethodField('locations_method')}) 20 | def checksums_method(self, obj): 21 | return { 22 | 'sha1': obj.sha1, 23 | 'md5': obj.md5 } 24 | def account_method(self, obj): 25 | return obj.account_id 26 | def collection_method(self, obj): 27 | return obj.collection_id 28 | def crawl_method(self, obj): 29 | return obj.crawl_job_id 30 | def crawl_start_method(self, obj): 31 | crawl_job = obj.crawl_job 32 | return crawl_job and crawl_job.original_start_date 33 | def locations_method(self, obj): 34 | return ( 35 | [settings.WEBDATA_LOCATION_TEMPLATE % obj.dict_for_location()] + 36 | ([settings.PBOX_LOCATION_TEMPLATE % obj.dict_for_location()] 37 | if obj.pbox_item else [])); 38 | class Meta: 39 | model = WarcFile 40 | fields = ( 
41 | 'filename', 42 | 'filetype', 43 | 'checksums', 44 | 'account', 45 | 'size', 46 | 'collection', 47 | 'crawl', 48 | 'crawl-time', 49 | 'crawl-start', 50 | 'locations') 51 | 52 | class PaginationSerializerOfFiles(PaginationSerializer): 53 | 'Pagination serializer that labels the "results" as "files"' 54 | results_field = 'files' 55 | 56 | 57 | class JobSerializer(serializers.HyperlinkedModelSerializer): 58 | 59 | # explicitly adding to locals() lets us include '-' in name of fields 60 | locals().update({ 61 | 'state': serializers.CharField(read_only=True), 62 | 'account': serializers.PrimaryKeyRelatedField(read_only=True), 63 | 'jobtoken': serializers.SerializerMethodField('jobtoken_method'), 64 | 'submit-time': serializers.DateTimeField(source='submit_time', read_only=True), 65 | 'termination-time': serializers.DateTimeField(source='termination_time', read_only=True)}) 66 | 67 | def jobtoken_method(self, obj): 68 | return str(obj.id) 69 | 70 | class Meta: 71 | model = WasapiJob 72 | fields = ( 73 | 'jobtoken', 74 | 'function', 75 | 'query', 76 | 'submit-time', 77 | 'termination-time', 78 | 'state', 79 | 'account') 80 | 81 | def validate(self, value): 82 | # I'd prefer to use a dedicated method rather than hooking into 83 | # validation routines, but my "to_internal_value" and "create" methods 84 | # don't get called. 85 | # It would be nice if we let each field set its value, but I don't see 86 | # an easy way to do that. 87 | value = self.set_account(value) 88 | value = self.set_user(value) 89 | value = self.set_submit_time(value) 90 | return value 91 | 92 | def set_account(self, value): 93 | account = self.context['request'].user.account 94 | if not account: # eg "system" user 95 | raise PermissionDenied 96 | value['account'] = account 97 | return value 98 | 99 | def set_user(self, value): 100 | value['user'] = self.context['request'].user 101 | return value 102 | 103 | def set_submit_time(self, value): 104 | value['submit_time'] = datetime.now() 105 | return value 106 | 107 | class PaginationSerializerOfJobs(PaginationSerializer): 108 | 'Pagination serializer that labels the "results" as "jobs"' 109 | results_field = 'jobs' 110 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/swagger.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: Draft WASAPI Export API by Archive-It 4 | description: > 5 | WASAPI Export API. What Archive-It will implement. 6 | version: 1.0.0 7 | contact: 8 | name: Jefferson Bailey and Mark Sullivan 9 | url: https://github.com/WASAPI-Community/data-transfer-apis 10 | license: 11 | name: Apache 2.0 12 | url: http://www.apache.org/licenses/LICENSE-2.0.html 13 | consumes: 14 | - application/json 15 | produces: 16 | - application/json 17 | basePath: /v1 18 | schemes: 19 | - https 20 | paths: 21 | /webdata: 22 | get: 23 | summary: Get the archive files I need 24 | description: > 25 | Produces a page of the list of the files accessible to the client 26 | matching all of the parameters. A parameter with multiple options 27 | matches when any option matches; a missing parameter implicitly 28 | matches. 
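      # Illustrative only: GET /v1/webdata?collection=1234&crawl-start-after=2016-11
      # (hypothetical values) pages through the files of collection 1234 whose
      # crawl started at or after the beginning of November 2016.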
29 | parameters: 30 | # pagination 31 | - $ref: '#/parameters/page' 32 | # basic query 33 | - $ref: '#/parameters/filename' 34 | - $ref: '#/parameters/filetype' 35 | # specific to Archive-It 36 | - $ref: '#/parameters/crawl-time-after' 37 | - $ref: '#/parameters/crawl-time-before' 38 | - $ref: '#/parameters/crawl-start-after' 39 | - $ref: '#/parameters/crawl-start-before' 40 | - $ref: '#/parameters/collection' 41 | - $ref: '#/parameters/crawl' 42 | responses: 43 | '200': 44 | description: Success 45 | schema: 46 | $ref: '#/definitions/FileSet' 47 | '400': 48 | description: The request could not be interpreted 49 | '401': 50 | description: The request was unauthorized 51 | /jobs: 52 | get: 53 | summary: What jobs do I have? 54 | description: 55 | Show the jobs on this server accessible to the client 56 | parameters: 57 | - $ref: '#/parameters/page' 58 | responses: 59 | '200': 60 | description: > 61 | Success. Produces a page of the list of the jobs accessible to 62 | the client. 63 | schema: 64 | type: object 65 | required: 66 | - count 67 | - jobs 68 | properties: 69 | count: 70 | type: integer 71 | description: > 72 | The total number of jobs matching the query (across all pages) 73 | previous: 74 | description: > 75 | Link (if any) to the previous page of jobs; otherwise null 76 | type: [string, "null"] 77 | format: url 78 | next: 79 | description: > 80 | Link (if any) to the next page of jobs; otherwise null 81 | type: [string, "null"] 82 | format: url 83 | jobs: 84 | type: array 85 | items: 86 | $ref: '#/definitions/Job' 87 | post: 88 | summary: Make a new job 89 | description: 90 | Create a job to perform some task 91 | parameters: 92 | - name: query 93 | in: formData 94 | required: true 95 | description: > 96 | URL-encoded query as appropriate for /webdata end-point. The empty 97 | query (which matches everything) must explicitly be given as the 98 | empty string. 99 | type: string 100 | - $ref: '#/parameters/function' 101 | - name: parameters 102 | in: formData 103 | required: false 104 | description: > 105 | Other parameters specific to the function and implementation 106 | (URL-encoded). For example: level of compression, priority, time 107 | limit, space limit. Archive-It does not yet accept any such 108 | parameters. 109 | type: string 110 | responses: 111 | '201': 112 | description: > 113 | Job was successfully submitted. Body is the submitted job. 114 | schema: 115 | $ref: '#/definitions/Job' 116 | '400': 117 | description: The request could not be interpreted 118 | '401': 119 | description: The request was unauthorized 120 | '/jobs/{jobtoken}': 121 | get: 122 | summary: How is my job doing? 123 | description: 124 | Retrieve information about a job, both the parameters of its submission 125 | and its current state. If the job is complete, the client can get the 126 | result through a separate request to `jobs/{jobtoken}/result`. 127 | parameters: 128 | - $ref: '#/parameters/jobtoken' 129 | responses: 130 | '200': 131 | description: Success 132 | schema: 133 | $ref: '#/definitions/Job' 134 | '400': 135 | description: The request could not be interpreted 136 | '401': 137 | description: The request was unauthorized 138 | '403': 139 | description: Forbidden 140 | '404': 141 | description: No such job 142 | '410': 143 | description: > 144 | Gone / invalidated. Body may include non-result information about 145 | the job. 146 | '/jobs/{jobtoken}/result': 147 | get: 148 | summary: What is the result of my job? 
149 | description: > 150 | For a complete job, produces a page of the resulting files. 151 | parameters: 152 | - $ref: '#/parameters/page' 153 | - $ref: '#/parameters/jobtoken' 154 | responses: 155 | '200': 156 | description: Success 157 | schema: 158 | $ref: '#/definitions/FileSet' 159 | '301': 160 | description: The job is in a failed state; get details elsewhere 161 | '307': 162 | description: > 163 | The job failed and will never be fixed; get details elsewhere 164 | '400': 165 | description: The request could not be interpreted 166 | '401': 167 | description: The request was unauthorized 168 | '403': 169 | description: Forbidden 170 | '404': 171 | description: No such job or it is not complete 172 | '410': 173 | description: Job is gone / invalidated 174 | '/jobs/{jobtoken}/error': 175 | get: 176 | summary: Why did my job fail? 177 | description: > 178 | Give details about a failed job 179 | parameters: 180 | - $ref: '#/parameters/jobtoken' 181 | responses: 182 | '200': 183 | description: Success (of reporting the error, not of the job itself) 184 | '400': 185 | description: The request could not be interpreted 186 | '401': 187 | description: The request was unauthorized 188 | '403': 189 | description: Forbidden 190 | '404': 191 | description: No such job or it did not fail 192 | '410': 193 | description: Job is gone / invalidated 194 | definitions: 195 | WebdataFile: 196 | description: > 197 | Description of a unit of distribution of web archival data. (This data 198 | type does not include the actual archival data.) Examples: a WARC file, 199 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 200 | type: object 201 | required: 202 | - filename 203 | - checksums 204 | - filetype 205 | - locations 206 | properties: 207 | filename: 208 | type: string 209 | description: The name of the webdata file 210 | filetype: 211 | # TODO: handle compression etc 212 | type: string 213 | description: > 214 | The format of the archive file, eg `warc`, `wat`, `cdx` 215 | checksums: 216 | type: object 217 | items: 218 | type: string 219 | format: hexstring 220 | description: > 221 | Verification of the content of the file. Must include at least one 222 | of MD5 or SHA1. The key specifies the lowercase name of the 223 | algorithm; the element is a hexadecimal string of the checksum 224 | value. 
For example: 225 | {"sha1":"6b4f32a3408b1cd7db9372a63a2053c3ef25c731", 226 | "md5":"766ba6fd3a257edf35d9f42a8dd42a79"} 227 | size: 228 | type: integer 229 | format: int64 230 | description: The size in bytes of the webdata file 231 | collection: 232 | type: integer 233 | format: int64 234 | description: The numeric ID of the collection 235 | crawl: 236 | type: integer 237 | format: int64 238 | description: The numeric ID of the crawl 239 | crawl-time: 240 | type: string 241 | format: date-time 242 | description: Time the original content of the file was crawled 243 | crawl-start: 244 | type: string 245 | format: date-time 246 | description: Time the crawl started 247 | locations: 248 | type: array 249 | items: 250 | type: string 251 | format: url 252 | description: > 253 | A list of (mirrored) sources from which to retrieve (identical copies 254 | of) the webdata file, eg `https://partner.archive-it.org/webdatafile/ARCHIVEIT-4567-CRAWL_SELECTED_SEEDS-JOB1000016543-20170107214356419-00005.warc.gz`, 255 | `/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e` 256 | FileSet: 257 | type: object 258 | required: 259 | - count 260 | - files 261 | properties: 262 | includes-extra: 263 | type: boolean 264 | description: > 265 | When false, the data in the `files` contains nothing extraneous from 266 | what is necessary to satisfy the query or job. When true or absent, 267 | the client must be prepared to handle irrelevant data within the 268 | referenced `files`. 269 | count: 270 | type: integer 271 | description: The total number of files (across all pages) 272 | previous: 273 | description: > 274 | Link (if any) to the previous page of files; otherwise null 275 | type: [string, "null"] 276 | format: url 277 | next: 278 | description: > 279 | Link (if any) to the next page of files; otherwise null 280 | type: [string, "null"] 281 | format: url 282 | files: 283 | type: array 284 | items: 285 | $ref: '#/definitions/WebdataFile' 286 | Job: 287 | type: object 288 | description: > 289 | A job submitted to perform a task. Conceptually, a complete job has a 290 | `result` FileSet, but we avoid sending that potentially large data with 291 | every mention of every job. If the job is complete, the client can get 292 | the result through a separate request to `jobs/{jobtoken}/result`. 293 | required: 294 | - jobtoken 295 | - function 296 | - query 297 | - submit-time 298 | - state 299 | properties: 300 | jobtoken: 301 | type: string 302 | description: > 303 | Identifier unique across the implementation. The implementation 304 | chooses the format. For example: GUID, increasing integer. 305 | function: 306 | $ref: '#/definitions/Function' 307 | query: 308 | type: string 309 | description: > 310 | The specification of what webdata to include in the job. Encoding is 311 | URL-style, eg `param=value&otherparam=othervalue`. 312 | submit-time: 313 | type: string 314 | format: date-time 315 | description: Time of submission, formatted according to RFC3339 316 | termination-time: 317 | type: string 318 | format: date-time 319 | description: > 320 | Time of completion or failure, formatted according to RFC3339 321 | state: 322 | type: string 323 | enum: 324 | - queued 325 | - running 326 | - failed 327 | - complete 328 | - gone 329 | # alas, can't use GFM 330 | description: > 331 | The state of the job through its lifecycle. 332 | `queued`: Job has been submitted and is waiting to run. 333 | `running`: Job is currently running. 334 | `failed`: Job ran but failed. 
335 | `complete`: Job ran and successfully completed; result is available. 336 | `gone`: Job ran, but the result is no longer available (eg deleted 337 | to save storage). 338 | Function: 339 | type: string 340 | enum: 341 | - build-wat 342 | - build-wane 343 | - build-cdx 344 | # This would be the more meaningful place to document the concept of 345 | # "function", but the parameter gives the documentation more space and 346 | # handles GFM. 347 | description: > 348 | The function of the job. See the `function` parameter to the POST that 349 | created the job. 350 | parameters: 351 | # I wish OpenAPI offered a way to define and compose sets of parameters. 352 | # pagination: 353 | page: 354 | name: page 355 | in: query 356 | type: integer 357 | required: false 358 | description: > 359 | One-based index for pagination 360 | # job token: 361 | jobtoken: 362 | name: jobtoken 363 | in: path 364 | description: The job token returned from previous request 365 | required: true 366 | type: string 367 | # basic query: 368 | filename: 369 | name: filename 370 | in: query 371 | type: string 372 | required: false 373 | description: > 374 | A semicolon-separated list of "glob" patterns. In each pattern, a 375 | star `*` matches any string of characters, and a question mark `?` 376 | matches exactly one character. The pattern is matched against the 377 | full basename (ie must match the beginning and end of the filename, 378 | not the full path of directories). 379 | filetype: 380 | name: filetype 381 | in: query 382 | type: string 383 | required: false 384 | description: > 385 | A semicolon-separated list of formats of acceptable archive file, eg 386 | `warc`, `wat`, `cdx` 387 | # Archive-It's implementation of a function 388 | function: 389 | name: function 390 | in: formData 391 | required: true 392 | description: > 393 | One of the following strings which have the following meanings: 394 | 395 | - `build-wat`: Build a WAT file with metadata from the matched archive 396 | files 397 | 398 | - `build-wane`: Build a WANE file with the named entities from the 399 | matched archive files 400 | 401 | - `build-cdx`: Build a CDX file indexing the matched archive files 402 | 403 | type: string 404 | enum: 405 | - build-wat 406 | - build-wane 407 | - build-cdx 408 | # time of crawl (specific to Archive-It): 409 | crawl-time-after: 410 | name: crawl-time-after 411 | type: string 412 | format: date-time 413 | in: query 414 | required: false 415 | description: > 416 | Match resources that were crawled at or after the time given according to 417 | RFC3339. A date given with no time of day means midnight. Coordinated 418 | Universal (UTC) is preferrred and assumed if no timezone is included. 419 | Because `crawl-time-after` matches equal time stamps while 420 | `crawl-time-before` excludes equal time stamps, and because we specify 421 | instants rather than durations implicit from our units, we can smoothly 422 | scale between days and seconds. That is, we specify ranges in the manner 423 | of the C programming language, eg low ≤ x < high. For example, matching 424 | the month of November of 2016 is specified by 425 | `crawl-time-after=2016-11 & crawl-time-before=2016-12` or 426 | equivalently by `crawl-time-after=2016-11-01T00:00:00Z & 427 | crawl-time-before=2016-11-30T16:00:00-08:00`. 
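  # Non-normative example: because `crawl-time-after` is inclusive, `crawl-time-before`
  # is exclusive, and a bare date means midnight UTC, a single calendar day can be
  # selected with, eg, crawl-time-after=2016-11-05&crawl-time-before=2016-11-06,
  # which matches everything crawled on 2016-11-05 (UTC).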
428 | crawl-time-before: 429 | name: crawl-time-before 430 | type: string 431 | format: date-time 432 | in: query 433 | required: false 434 | description: > 435 | Match resources that were crawled strictly before the time given 436 | according to RFC3339. See more detail at `crawl-time-after`. 437 | crawl-start-after: 438 | name: crawl-start-after 439 | type: string 440 | format: date-time 441 | in: query 442 | required: false 443 | description: > 444 | Match resources that were crawled in a job that started at or after the 445 | time given according to RFC3339. (Note that the original content of a 446 | file could be crawled many days after the crawl job started; would you 447 | prefer `crawl-time-after` / `crawl-time-before`?) 448 | crawl-start-before: 449 | name: crawl-start-before 450 | type: string 451 | format: date-time 452 | in: query 453 | required: false 454 | description: > 455 | Match resources that were crawled in a job that started strictly before 456 | the time given according to RFC3339. See more detail at 457 | `crawl-start-after`. 458 | # collection (specific to Archive-It): 459 | collection: 460 | name: collection 461 | type: integer 462 | in: query 463 | required: false 464 | description: > 465 | The numeric ID of one or more collections, given as separate fields. 466 | For only this parameter, WASAPI accepts multiple values and will match 467 | items in any of the specified collections. For example, matching the 468 | items from two collections can be specified by `collection=1 & 469 | collection=2`. 470 | # crawl (specific to Archive-It): 471 | crawl: 472 | name: crawl 473 | type: integer 474 | in: query 475 | required: false 476 | description: > 477 | The numeric ID of the crawl 478 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | # Create your tests here. 
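# Note: this module is just the default Django scaffold; the substantive WASAPI
# tests live in the wasapi/tests/ package (test_fixtures.py and test_job_result.py).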
4 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/wasapi/tests/__init__.py -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/fixtures.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "pk": 1, 4 | "fields": { 5 | "public_site_settings": null, 6 | "google_analytics_id": "", 7 | "annual_crawl_budget": 9223372036854775807, 8 | "metadata_public": true, 9 | "max_concurrent_test_crawls": null, 10 | "created_by": "", 11 | "partner_description": "", 12 | "created_date": null, 13 | "ignore_robots_option_visible": false, 14 | "public_registry_enabled": 1, 15 | "subscription_end_date": null, 16 | "total_data_budget_in_bytes": null, 17 | "annual_data_budget_in_gbs": 9223372036854775807, 18 | "partner_type": "", 19 | "private_metadata_fields": "", 20 | "deleted": false, 21 | "member_since_date": null, 22 | "partner_url": "", 23 | "active": true, 24 | "account_type": "", 25 | "last_updated_date": "2014-07-10T00:28:56.454", 26 | "organization_name": "test origanization", 27 | "custom_name": "", 28 | "logo_blob": null, 29 | "last_updated_by": "", 30 | "billing_period_start_date": "2014-07-10T00:28:56.454", 31 | "tag": "", 32 | "feed_enabled": false 33 | }, 34 | "model": "accounts.account" 35 | }, 36 | { 37 | "pk": 2, 38 | "fields": { 39 | "public_site_settings": null, 40 | "google_analytics_id": "", 41 | "annual_crawl_budget": 9223372036854775807, 42 | "metadata_public": true, 43 | "max_concurrent_test_crawls": null, 44 | "created_by": "", 45 | "partner_description": "", 46 | "created_date": null, 47 | "ignore_robots_option_visible": false, 48 | "public_registry_enabled": 1, 49 | "subscription_end_date": null, 50 | "total_data_budget_in_bytes": null, 51 | "annual_data_budget_in_gbs": 9223372036854775807, 52 | "partner_type": "", 53 | "private_metadata_fields": "", 54 | "deleted": false, 55 | "member_since_date": null, 56 | "partner_url": "", 57 | "active": true, 58 | "account_type": "", 59 | "last_updated_date": "2014-07-10T00:29:40.286", 60 | "organization_name": "another test origanization", 61 | "custom_name": "", 62 | "logo_blob": null, 63 | "last_updated_by": "", 64 | "billing_period_start_date": "2014-07-10T00:29:40.286", 65 | "tag": "", 66 | "feed_enabled": false 67 | }, 68 | "model": "accounts.account" 69 | }, 70 | { 71 | "pk": 1, 72 | "fields": { 73 | "email": "system@archive-it.org", 74 | "beta_opt_in": 1, 75 | "deleted": 0, 76 | "full_name": "", 77 | "date_joined": "2014-07-10T00:27:24.045", 78 | "last_updated_date": null, 79 | "username": "system", 80 | "time_zone_abbreviation": "", 81 | "created_by": "", 82 | "created_date": null, 83 | "temporary_password": null, 84 | "last_updated_by": "", 85 | "password": "", 86 | "last_login": "2014-07-10T00:27:24.045", 87 | "require_change_password": 0, 88 | "password_hash": "cf9c14fc1bcf5d4de8b912a8f2456e6bb46a470f", 89 | "language": "" 90 | }, 91 | "model": "accounts.user" 92 | }, 93 | { 94 | "pk": 2, 95 | "fields": { 96 | "email": "", 97 | "beta_opt_in": 1, 98 | "deleted": 0, 99 | "full_name": "", 100 | "date_joined": null, 101 | "last_updated_date": null, 102 | "username": "authuser", 103 | "time_zone_abbreviation": "", 104 | "created_by": "", 105 | 
"created_date": null, 106 | "temporary_password": null, 107 | "last_updated_by": "", 108 | "password": "", 109 | "last_login": "2014-07-10T00:29:01.162", 110 | "require_change_password": 0, 111 | "password_hash": "cb879e21d380ed6ac814cdf38a1aff5dec2a13ef", 112 | "language": "" 113 | }, 114 | "model": "accounts.user" 115 | }, 116 | { 117 | "pk": 3, 118 | "fields": { 119 | "email": "", 120 | "beta_opt_in": 1, 121 | "deleted": 0, 122 | "full_name": "", 123 | "date_joined": null, 124 | "last_updated_date": null, 125 | "username": "authuser2", 126 | "time_zone_abbreviation": "", 127 | "created_by": "", 128 | "created_date": null, 129 | "temporary_password": null, 130 | "last_updated_by": "", 131 | "password": "", 132 | "last_login": "2014-07-10T00:29:17.850", 133 | "require_change_password": 0, 134 | "password_hash": "a8d9d1882162af890fb2b60738bf3fc3e8090f33", 135 | "language": "" 136 | }, 137 | "model": "accounts.user" 138 | }, 139 | { 140 | "pk": 4, 141 | "fields": { 142 | "email": "", 143 | "beta_opt_in": 1, 144 | "deleted": 0, 145 | "full_name": "", 146 | "date_joined": null, 147 | "last_updated_date": null, 148 | "username": "anotheruser", 149 | "time_zone_abbreviation": "", 150 | "created_by": "", 151 | "created_date": null, 152 | "temporary_password": null, 153 | "last_updated_by": "", 154 | "password": "", 155 | "last_login": "2014-07-10T00:29:44.106", 156 | "require_change_password": 0, 157 | "password_hash": "7f965560c9f2ce126407eda7c7dbbdb75037ef4d", 158 | "language": "" 159 | }, 160 | "model": "accounts.user" 161 | }, 162 | { 163 | "pk": 1, 164 | "fields": { 165 | "topics": "", 166 | "deleted": 0, 167 | "image": null, 168 | "account": 1, 169 | "last_updated_date": "2016-06-14T19:30:50.999", 170 | "state": "", 171 | "created_by": "authuser2", 172 | "created_date": "2014-07-10T00:29:25.215", 173 | "name": "Private Test Collection", 174 | "last_updated_by": null, 175 | "tag": "", 176 | "publicly_visible": false, 177 | "oai_exported": false 178 | }, 179 | "model": "archiveit.collection" 180 | }, 181 | { 182 | "pk": 2, 183 | "fields": { 184 | "topics": "", 185 | "deleted": 0, 186 | "image": null, 187 | "account": 1, 188 | "last_updated_date": "2016-06-14T19:30:51.032", 189 | "state": "", 190 | "created_by": "authuser", 191 | "created_date": "2014-07-10T00:29:33.943", 192 | "name": "Public Test Collection", 193 | "last_updated_by": null, 194 | "tag": "", 195 | "publicly_visible": true, 196 | "oai_exported": false 197 | }, 198 | "model": "archiveit.collection" 199 | }, 200 | { 201 | "pk": 3, 202 | "fields": { 203 | "topics": "", 204 | "deleted": 0, 205 | "image": null, 206 | "account": 2, 207 | "last_updated_date": "2016-06-14T19:30:51.064", 208 | "state": "", 209 | "created_by": "anotheruser", 210 | "created_date": "2014-07-10T00:29:56.855", 211 | "name": "Another Public Test Collection", 212 | "last_updated_by": null, 213 | "tag": "", 214 | "publicly_visible": true, 215 | "oai_exported": false 216 | }, 217 | "model": "archiveit.collection" 218 | }, 219 | { 220 | "pk": 1, 221 | "fields": { 222 | "patch_for_qa_job": null, 223 | "byte_limit": null, 224 | "test": false, 225 | "time_limit": 600, 226 | "collection": 1, 227 | "account": 1, 228 | "patch_ignore_robots": 0, 229 | "one_time_subtype": "", 230 | "document_limit": null, 231 | "recurrence_type": "", 232 | "pdfs_only": false 233 | }, 234 | "model": "archiveit.crawldefinition" 235 | }, 236 | { 237 | "pk": 2, 238 | "fields": { 239 | "patch_for_qa_job": null, 240 | "byte_limit": null, 241 | "test": false, 242 | "time_limit": 600, 243 | 
"collection": 2, 244 | "account": 1, 245 | "patch_ignore_robots": 0, 246 | "one_time_subtype": "", 247 | "document_limit": null, 248 | "recurrence_type": "", 249 | "pdfs_only": false 250 | }, 251 | "model": "archiveit.crawldefinition" 252 | }, 253 | { 254 | "pk": 1, 255 | "fields": { 256 | "host": "http://example.org", 257 | "end_date": null, 258 | "duplicate_count": null, 259 | "queued_count": null, 260 | "type": "0", 261 | "resumption_count": null, 262 | "document_limit": 100000, 263 | "test_crawl_state_changed_by": null, 264 | "patch_for_qa_job": null, 265 | "downloaded_count": null, 266 | "total_data_in_kbs": null, 267 | "url": "", 268 | "crawl_stop_requested": null, 269 | "novel_count": null, 270 | "novel_bytes": null, 271 | "port": 80, 272 | "kb_rate": null, 273 | "recurrence_type": "NONE", 274 | "pdfs_only": 0, 275 | "thread_count": null, 276 | "test": 0, 277 | "uid": 791728987711, 278 | "collection": 1, 279 | "account": 1, 280 | "processing_end_date": null, 281 | "doc_rate": null, 282 | "discovered_count": null, 283 | "status": "", 284 | "warc_revisit_count": null, 285 | "warc_content_bytes": null, 286 | "download_failures": null, 287 | "description": "", 288 | "duplicate_bytes": null, 289 | "byte_limit": null, 290 | "job_name": "1", 291 | "original_start_date": "2014-07-10T00:29:25.234", 292 | "warc_compressed_bytes": null, 293 | "scheduled_crawl_event": null, 294 | "start_date": "2014-07-10T00:29:25.234", 295 | "workflow_step": null, 296 | "elapsed_ms": null, 297 | "current_kb_rate": null, 298 | "current_doc_rate": null, 299 | "time_limit": 100000, 300 | "test_crawl_state": "False", 301 | "warc_url_count": null, 302 | "warc_uncompressed_bytes": null 303 | }, 304 | "model": "archiveit.crawljob" 305 | }, 306 | { 307 | "pk": 2, 308 | "fields": { 309 | "host": "http://example.com", 310 | "end_date": null, 311 | "duplicate_count": null, 312 | "queued_count": null, 313 | "type": "0", 314 | "resumption_count": null, 315 | "document_limit": 100000, 316 | "test_crawl_state_changed_by": null, 317 | "patch_for_qa_job": null, 318 | "downloaded_count": null, 319 | "total_data_in_kbs": null, 320 | "url": "", 321 | "crawl_stop_requested": null, 322 | "novel_count": null, 323 | "novel_bytes": null, 324 | "port": 80, 325 | "kb_rate": null, 326 | "recurrence_type": "NONE", 327 | "pdfs_only": 0, 328 | "thread_count": null, 329 | "test": 0, 330 | "uid": 728791872212, 331 | "collection": 2, 332 | "account": 1, 333 | "processing_end_date": null, 334 | "doc_rate": null, 335 | "discovered_count": null, 336 | "status": "", 337 | "warc_revisit_count": null, 338 | "warc_content_bytes": null, 339 | "download_failures": null, 340 | "description": "", 341 | "duplicate_bytes": null, 342 | "byte_limit": null, 343 | "job_name": "2", 344 | "original_start_date": "2014-07-10T00:29:33.963", 345 | "warc_compressed_bytes": null, 346 | "scheduled_crawl_event": null, 347 | "start_date": "2014-07-10T00:29:33.963", 348 | "workflow_step": null, 349 | "elapsed_ms": null, 350 | "current_kb_rate": null, 351 | "current_doc_rate": null, 352 | "time_limit": 100000, 353 | "test_crawl_state": "False", 354 | "warc_url_count": null, 355 | "warc_uncompressed_bytes": null 356 | }, 357 | "model": "archiveit.crawljob" 358 | }, 359 | { 360 | "pk": 3, 361 | "fields": { 362 | "host": "http://examyank.com", 363 | "end_date": null, 364 | "duplicate_count": null, 365 | "queued_count": null, 366 | "type": "0", 367 | "resumption_count": null, 368 | "document_limit": 100000, 369 | "test_crawl_state_changed_by": null, 370 | "patch_for_qa_job": 
null, 371 | "downloaded_count": null, 372 | "total_data_in_kbs": null, 373 | "url": "", 374 | "crawl_stop_requested": null, 375 | "novel_count": null, 376 | "novel_bytes": null, 377 | "port": 80, 378 | "kb_rate": null, 379 | "recurrence_type": "NONE", 380 | "pdfs_only": 0, 381 | "thread_count": null, 382 | "test": 0, 383 | "uid": 728791872212, 384 | "collection": 2, 385 | "account": 1, 386 | "processing_end_date": null, 387 | "doc_rate": null, 388 | "discovered_count": null, 389 | "status": "", 390 | "warc_revisit_count": null, 391 | "warc_content_bytes": null, 392 | "download_failures": null, 393 | "description": "", 394 | "duplicate_bytes": null, 395 | "byte_limit": null, 396 | "job_name": "2", 397 | "original_start_date": "2015-05-15T15:55:55.543", 398 | "warc_compressed_bytes": null, 399 | "scheduled_crawl_event": null, 400 | "start_date": "2015-05-15T15:55:55.543", 401 | "workflow_step": null, 402 | "elapsed_ms": null, 403 | "current_kb_rate": null, 404 | "current_doc_rate": null, 405 | "time_limit": 100000, 406 | "test_crawl_state": "False", 407 | "warc_url_count": null, 408 | "warc_uncompressed_bytes": null 409 | }, 410 | "model": "archiveit.crawljob" 411 | }, 412 | { 413 | "pk": 4, 414 | "fields": { 415 | "host": "http://yaketty.com", 416 | "end_date": null, 417 | "duplicate_count": null, 418 | "queued_count": null, 419 | "type": "0", 420 | "resumption_count": null, 421 | "document_limit": 100000, 422 | "test_crawl_state_changed_by": null, 423 | "patch_for_qa_job": null, 424 | "downloaded_count": null, 425 | "total_data_in_kbs": null, 426 | "url": "", 427 | "crawl_stop_requested": null, 428 | "novel_count": null, 429 | "novel_bytes": null, 430 | "port": 80, 431 | "kb_rate": null, 432 | "recurrence_type": "NONE", 433 | "pdfs_only": 0, 434 | "thread_count": null, 435 | "test": 0, 436 | "uid": 728791872212, 437 | "collection": 3, 438 | "account": 2, 439 | "processing_end_date": null, 440 | "doc_rate": null, 441 | "discovered_count": null, 442 | "status": "", 443 | "warc_revisit_count": null, 444 | "warc_content_bytes": null, 445 | "download_failures": null, 446 | "description": "", 447 | "duplicate_bytes": null, 448 | "byte_limit": null, 449 | "job_name": "2", 450 | "original_start_date": "2015-05-15T15:55:55.543", 451 | "warc_compressed_bytes": null, 452 | "scheduled_crawl_event": null, 453 | "start_date": "2015-05-15T15:55:55.543", 454 | "workflow_step": null, 455 | "elapsed_ms": null, 456 | "current_kb_rate": null, 457 | "current_doc_rate": null, 458 | "time_limit": 100000, 459 | "test_crawl_state": "False", 460 | "warc_url_count": null, 461 | "warc_uncompressed_bytes": null 462 | }, 463 | "model": "archiveit.crawljob" 464 | }, 465 | { 466 | "pk": "DAILY", 467 | "fields": { 468 | "interval_sec": 86400 469 | }, 470 | "model": "archiveit.frequency" 471 | }, 472 | { 473 | "pk": "WEEKLY", 474 | "fields": { 475 | "interval_sec": 604800 476 | }, 477 | "model": "archiveit.frequency" 478 | }, 479 | { 480 | "pk": 1, 481 | "fields": { 482 | "last_checked_http_response_code": null, 483 | "deleted": false, 484 | "collection": 1, 485 | "login_password": "", 486 | "seed_group": null, 487 | "active": true, 488 | "last_updated_date": "2016-06-14T19:30:50.913", 489 | "seed_type": "", 490 | "http_response_code": null, 491 | "created_by": null, 492 | "crawl_definition": 1, 493 | "created_date": "2014-07-10T00:29:25.260", 494 | "valid": false, 495 | "last_updated_by": null, 496 | "canonical_url": "http://example.com/", 497 | "login_username": "", 498 | "publicly_visible": false, 499 | "url": 
"http://example.com" 500 | }, 501 | "model": "archiveit.seed" 502 | }, 503 | { 504 | "pk": 2, 505 | "fields": { 506 | "last_checked_http_response_code": null, 507 | "deleted": false, 508 | "collection": 2, 509 | "login_password": "", 510 | "seed_group": null, 511 | "active": true, 512 | "last_updated_date": "2016-06-14T19:30:50.953", 513 | "seed_type": "", 514 | "http_response_code": null, 515 | "created_by": null, 516 | "crawl_definition": 2, 517 | "created_date": "2014-07-10T00:29:33.992", 518 | "valid": false, 519 | "last_updated_by": null, 520 | "canonical_url": "http://google.com/", 521 | "login_username": "", 522 | "publicly_visible": true, 523 | "url": "http://www.google.com" 524 | }, 525 | "model": "archiveit.seed" 526 | }, 527 | { 528 | "pk": 1, 529 | "fields": { 530 | "name": "Test SeedGroup", 531 | "collection": 1 532 | }, 533 | "model": "archiveit.seedgroup" 534 | }, 535 | { 536 | "pk": 2, 537 | "fields": { 538 | "name": "Test SeedGroup 2", 539 | "collection": 2 540 | }, 541 | "model": "archiveit.seedgroup" 542 | }, 543 | { 544 | "pk": 1, 545 | "fields": { 546 | "user": 2, 547 | "nickname": "", 548 | "crawl_email_enabled": false, 549 | "account": 1 550 | }, 551 | "model": "accounts.accountmember" 552 | }, 553 | { 554 | "pk": 2, 555 | "fields": { 556 | "user": 3, 557 | "nickname": "", 558 | "crawl_email_enabled": false, 559 | "account": 1 560 | }, 561 | "model": "accounts.accountmember" 562 | }, 563 | { 564 | "pk": 3, 565 | "fields": { 566 | "user": 4, 567 | "nickname": "", 568 | "crawl_email_enabled": false, 569 | "account": 2 570 | }, 571 | "model": "accounts.accountmember" 572 | }, 573 | { 574 | "pk": 1, 575 | "fields": { 576 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000.warc.gz", 577 | "md5": "119ceaf851143f0036f83fe6f9d59711", 578 | "sha1": "11e928116cf3ff3efe8818c706b616447b7b8e11", 579 | "size": 10910469, 580 | "store_time": "2014-07-10T02:22:22.234", 581 | "crawl_time": "2014-07-10T00:35:44.123", 582 | "account": 1, 583 | "collection": 2, 584 | "crawl_job": 2, 585 | "pbox_item": null, 586 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000.warc.gz", 587 | "all_pbox_items": null 588 | }, 589 | "model": "archiveit.warcfile" 590 | }, 591 | { 592 | "pk": 2, 593 | "fields": { 594 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000.warc.gz", 595 | "md5": "229ceaf851143f0036f83fe6f9d59722", 596 | "sha1": "22e928116cf3ff3efe8818c706b616447b7b8e22", 597 | "size": 10220229, 598 | "store_time": "2014-07-10T02:22:24.424", 599 | "crawl_time": "2014-07-10T01:40:44.456", 600 | "account": 1, 601 | "collection": 2, 602 | "crawl_job": 2, 603 | "pbox_item": null, 604 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000.warc.gz", 605 | "all_pbox_items": null 606 | }, 607 | "model": "archiveit.warcfile" 608 | }, 609 | { 610 | "pk": 3, 611 | "fields": { 612 | "filename": "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000.warc.gz", 613 | "md5": "339ceaf851143f0036f83fe6f9d59733", 614 | "sha1": "33e928116cf3ff3efe8818c706b616447b7b8e33", 615 | "size": 10220229, 616 | "store_time": "2014-07-10T02:01:01.012", 617 | "crawl_time": "2014-07-10T01:01:01.012", 618 | "account": 1, 619 | "collection": 2, 620 | "crawl_job": 3, 621 | "pbox_item": null, 622 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 623 | "all_pbox_items": null 624 | }, 625 | "model": "archiveit.warcfile" 
626 | }, 627 | { 628 | "pk": 4, 629 | "fields": { 630 | "filename": "ARCHIVEIT-1-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000.warc.gz", 631 | "md5": "449ceaf851143f0036f83fe6f9d59744", 632 | "sha1": "44e928116cf3ff3efe8818c706b616447b7b8e44", 633 | "size": 10220229, 634 | "store_time": "2014-07-10T02:01:01.012", 635 | "crawl_time": "2014-07-10T01:01:01.012", 636 | "account": 1, 637 | "collection": 1, 638 | "crawl_job": 1, 639 | "pbox_item": null, 640 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 641 | "all_pbox_items": null 642 | }, 643 | "model": "archiveit.warcfile" 644 | }, 645 | { 646 | "pk": 5, 647 | "fields": { 648 | "filename": "ARCHIVEIT-1-CRAWL_SELECTED_SEEDS-JOB1-20140710010101012-00000.warc.gz", 649 | "md5": "559ceaf851143f0036f83fe6f9d59755", 650 | "sha1": "55e928116cf3ff3efe8818c706b616447b7b8e55", 651 | "size": 10220229, 652 | "store_time": "2014-07-10T02:01:01.012", 653 | "crawl_time": "2014-07-10T01:01:01.012", 654 | "account": 2, 655 | "collection": 3, 656 | "crawl_job": 4, 657 | "pbox_item": null, 658 | "hdfs_path": "/ait/qa/h3-wayback/warcs/ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044455-00000.warc.gz", 659 | "all_pbox_items": null 660 | }, 661 | "model": "archiveit.warcfile" 662 | } 663 | ] 664 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/test_fixtures.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | from archiveit.archiveit.models import CrawlJob, WarcFile 4 | 5 | class TestFixtures(TestCase): 6 | '''Ensure we have the set of fixtures other tests depend on''' 7 | fixtures = ['archiveit/wasapi/tests/fixtures.json'] 8 | 9 | def test_nested_sets_of_warcfiles(self): 10 | '''To test an automatically built query, we need some WarcFiles that 11 | match it and some that don't match it''' 12 | # the interesting WarcFile at the core of the nested sets 13 | awarc = WarcFile.objects.get(id=1) 14 | self.assertGreater( 15 | len(WarcFile.objects.filter(crawl_job_id=awarc.crawl_job_id)), 16 | 1, 17 | "Should have multiple WarcFiles in the crawl job") 18 | self.assertGreater( 19 | len(WarcFile.objects.filter(collection_id=awarc.collection_id)), 20 | len(WarcFile.objects.filter(crawl_job_id=awarc.crawl_job_id)), 21 | "Should have WarcFiles in the collection outside the crawl job") 22 | self.assertGreater( 23 | len(WarcFile.objects.filter(account_id=awarc.account_id)), 24 | len(WarcFile.objects.filter(collection_id=awarc.collection_id)), 25 | "Should have WarcFiles in the account outside the collection") 26 | self.assertGreater( 27 | len(WarcFile.objects.all()), 28 | len(WarcFile.objects.filter(account_id=awarc.account_id)), 29 | "Should have WarcFiles outside the account") 30 | 31 | def test_fields_of_warcfiles_are_unique(self): 32 | for fieldname in ['filename','md5','sha1']: 33 | self.assertEqual( 34 | conflicts(WarcFile.objects.all(), fieldname), {}, 35 | "%s values should be unique across WarcFiles" % (fieldname)) 36 | 37 | def test_denormalization(self): 38 | self.assertEqual( 39 | [warcfile for warcfile in WarcFile.objects.all() 40 | if warcfile.crawl_job and 41 | warcfile.collection_id != warcfile.crawl_job.collection_id], 42 | [], 43 | "each warcfile should match its crawl job's collection") 44 | self.assertEqual( 45 | [warcfile for warcfile in WarcFile.objects.all() 46 | if warcfile.crawl_job and 47 | warcfile.account_id != 
warcfile.crawl_job.account_id], 48 | [], 49 | "each warcfile should match its crawl job's account") 50 | self.assertEqual( 51 | [crawl_job for crawl_job in CrawlJob.objects.all() 52 | if crawl_job.collection.account_id != crawl_job.account_id], 53 | [], 54 | "each crawl job should match its collection's account") 55 | 56 | 57 | def conflicts(col, fieldname): 58 | '''Returns a dict of lists of conflicting elements keyed by their elements' value from partition_key''' 59 | partition_key = lambda obj: obj.__getattribute__(fieldname) 60 | return dict( 61 | [k,v] for k,v in partition(col, partition_key, list).items() 62 | if len(v) > 1 ) 63 | 64 | def partition(col, partition_key, empty_col=None): 65 | '''Returns a dict to the set of elements keyed by their value from partition_key''' 66 | if empty_col == None: 67 | empty_col = type(col) 68 | ret = {} 69 | for item in col: 70 | key = partition_key(item) 71 | ret[key] = ret.get(key, empty_col()) 72 | ret[key].append(item) 73 | return ret 74 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/tests/test_job_result.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | from archiveit.accounts.models import Account, User 4 | from archiveit.archiveit.models import Collection, CrawlJob, DerivativeFile 5 | from archiveit.wasapi.models import WasapiJob, WasapiJobResultFile 6 | from archiveit.wasapi.views import WebdataQueryViewSet 7 | 8 | class TestJobResult(TestCase): 9 | fixtures = ['archiveit/wasapi/tests/fixtures.json'] 10 | 11 | def test_query_execution(self): 12 | cases = [ 13 | ('crawl=2', {1,2}, "Query by crawl job"), 14 | ('collection=2', {1,2,3}, "Query by collection"), 15 | ('', {1,2,3,4}, "Empty query ie query by account"), 16 | ] 17 | for query, ideal, msg in cases: 18 | with self.subTest(query=query): 19 | source_files = WasapiJob( 20 | query=query, 21 | account=Account.objects.get(id=1), 22 | user=User.objects.get(username='authuser'), 23 | ).query_just_like_webdataqueryviewset() 24 | self.assertEqual(set(wf.id for wf in source_files), ideal, msg) 25 | 26 | def test_creation_of_resultfiles(self): 27 | job = WasapiJob( 28 | query='collection=2', 29 | function='build-wat', 30 | account=Account.objects.get(id=1), 31 | user=User.objects.get(username='authuser')) 32 | job.save() 33 | result_files = WasapiJobResultFile.objects.filter(job_id=job.id) 34 | ideal = { 35 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz", 36 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000_warc.wat.gz", 37 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000_warc.wat.gz"} 38 | self.assertEqual( set(f.filename for f in result_files), ideal, 39 | "Create result files") 40 | 41 | def test_vacuous_job_should_update_its_own_state(self): 42 | '''"Vacuous" means the job needs no result files''' 43 | # partner creates job via DRF 44 | job = WasapiJob( 45 | query="filename=nonesuch", 46 | function='build-wat', 47 | account=Account.objects.get(id=1), 48 | user=User.objects.get(username='authuser'), 49 | ) 50 | job.save() # DRF calls save 51 | self.assertEqual(job.state, WasapiJob.COMPLETE, 52 | "Vacuous job should be in state complete after save") 53 | self.assertIsNotNone(job.termination_time, 54 | "Vacuous job should have a termination time after save") 55 | 56 | def test_freebie_job_should_update_its_own_state(self): 57 | '''"Freebie" means the job is satisfied by pre-existing result 
files''' 58 | account = Account.objects.get(id=1) 59 | user = User.objects.get(username='authuser') 60 | collection = Collection.objects.get(id=2) 61 | # some earlier job derives some files 62 | earlier_job = WasapiJob(function='build-wat',account=account,user=user) 63 | earlier_job.save() 64 | already_deriveds = [ 65 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000", 66 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 67 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 68 | ] 69 | for rootname in already_deriveds: 70 | filename = rootname + '_warc.wat.gz' 71 | df = DerivativeFile(filename=filename, 72 | size=1, account=account, collection=collection) 73 | df.save() 74 | rf = WasapiJobResultFile(filename=filename, 75 | derivative_file=df, job=earlier_job) 76 | rf.save() 77 | # partner creates job via DRF 78 | job = WasapiJob( 79 | query='collection=%d' % (collection.id), 80 | function='build-wat', 81 | account=account, 82 | user=user) 83 | job.save() # DRF calls save 84 | self.assertEqual(job.state, WasapiJob.COMPLETE, 85 | "Freebie job should be in state complete after save") 86 | self.assertIsNotNone(job.termination_time, 87 | "Freebie job should have a termination time after save") 88 | 89 | def test_juicy_job_should_update_state_upon_derive(self): 90 | '''"Juicy" means the job needs fresh result files''' 91 | account = Account.objects.get(id=1) 92 | user = User.objects.get(username='authuser') 93 | collection = Collection.objects.get(id=2) 94 | # partner creates job via DRF 95 | job = WasapiJob( 96 | query='collection=%d' % (collection.id), 97 | function='build-wat', 98 | account=account, 99 | user=user, 100 | ) 101 | job.save() # DRF calls save 102 | self.assertEqual(job.state, WasapiJob.QUEUED, 103 | "Juicy job should remain in state queued after save") 104 | self.assertIsNone(job.termination_time, 105 | "Juicy job should not have a termination time after save") 106 | 107 | # archivist manually changes its state 108 | job.state = WasapiJob.RUNNING 109 | job.save() 110 | self.assertEqual(job.state, WasapiJob.RUNNING, 111 | "Juicy job runs") 112 | 113 | # the first file is derived 114 | first_rf = WasapiJobResultFile.objects.get(filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz') 115 | first_df = DerivativeFile( 116 | filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz', 117 | size=1, account=account, collection=collection) 118 | first_df.save() 119 | # script notifies us of first derivative 120 | WasapiJobResultFile.update_states(WasapiJobResultFile.objects.filter(filename='ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000_warc.wat.gz')) 121 | self.assertEqual(job.state, WasapiJob.RUNNING, 122 | "Juicy job should remain in state running during derivation") 123 | self.assertIsNone(job.termination_time, 124 | "Juicy job should not have a termination time during derivation") 125 | 126 | # the other files are derived 127 | other_basenames = [ 128 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 129 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 130 | ] 131 | for basename in other_basenames: 132 | other_rf = WasapiJobResultFile.objects.get( 133 | filename=basename+'_warc.wat.gz') 134 | other_df = DerivativeFile(filename=basename+'_warc.wat.gz', 135 | size=1, account=account, collection=collection) 136 | other_df.save() 137 | # script notifies us of other derivatives 138 | for basename in other_basenames: 139 | 
WasapiJobResultFile.update_states(WasapiJobResultFile.objects.filter(filename=basename+'_warc.wat.gz')) 140 | job.refresh_from_db() 141 | self.assertEqual(job.state, WasapiJob.COMPLETE, 142 | "Juicy job should change to state completed when files derived") 143 | self.assertIsNotNone(job.termination_time, 144 | "Juicy job should get a termination time when files derived") 145 | 146 | def test_juicy_job_should_update_state_upon_cron(self): 147 | '''"Juicy" means the job needs fresh result files''' 148 | account = Account.objects.get(id=1) 149 | user = User.objects.get(username='authuser') 150 | collection = Collection.objects.get(id=2) 151 | # partner creates job via DRF 152 | job = WasapiJob( 153 | query='collection=%d' % (collection.id), 154 | function='build-wat', 155 | account=account, 156 | user=user) 157 | job.save() # DRF calls save 158 | self.assertEqual(job.state, WasapiJob.QUEUED, 159 | "Juicy job should remain in state queued after save") 160 | self.assertIsNone(job.termination_time, 161 | "Juicy job should not have a termination time after save") 162 | 163 | # archivist manually changes its state 164 | job.state = WasapiJob.RUNNING 165 | job.save() 166 | self.assertEqual(job.state, WasapiJob.RUNNING, 167 | "Juicy job runs") 168 | 169 | # the files are derived 170 | other_basenames = [ 171 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710003544123-00000", 172 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB2-20140710014044456-00000", 173 | "ARCHIVEIT-2-CRAWL_SELECTED_SEEDS-JOB1-20140710014044455-00000", 174 | ] 175 | for basename in other_basenames: 176 | other_rf = WasapiJobResultFile.objects.get( 177 | filename=basename+'_warc.wat.gz') 178 | other_df = DerivativeFile(filename=basename+'_warc.wat.gz', 179 | size=1, account=account, collection=collection) 180 | other_df.save() 181 | # but the script somehow doesn't notify us of other derivatives 182 | self.assertEqual(job.state, WasapiJob.RUNNING, 183 | "Juicy job should remain in state running without notification") 184 | self.assertIsNone(job.termination_time, 185 | "Juicy job should remain without a termination time without notification") 186 | # cronjob triggers clean up 187 | WasapiJobResultFile.update_completed_result_files() 188 | job.refresh_from_db() 189 | self.assertEqual(job.state, WasapiJob.COMPLETE, 190 | "Juicy job should change to state completed after clean up") 191 | self.assertIsNotNone(job.termination_time, 192 | "Juicy job should get a termination time after clean up") 193 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/urls.py: -------------------------------------------------------------------------------- 1 | from django.core.urlresolvers import RegexURLPattern 2 | from rest_framework.routers import DefaultRouter 3 | from archiveit.wasapi.views import WebdataQueryViewSet, JobsViewSet, update_result_file_state, JobResultViewSet 4 | 5 | router = DefaultRouter(trailing_slash=False) 6 | router.register(r'webdata', WebdataQueryViewSet) 7 | router.register(r'jobs/(?P\d+)/result', JobResultViewSet) 8 | router.register(r'jobs', JobsViewSet) 9 | router.urls.append(RegexURLPattern(r'^update_result_file_state/(?P.*)', update_result_file_state)) 10 | urlpatterns = router.urls 11 | -------------------------------------------------------------------------------- /ait-implementation/wasapi/views.py: -------------------------------------------------------------------------------- 1 | from django.http import HttpResponse 2 | from rest_framework import viewsets 3 | from 
rest_framework.response import Response 4 | from archiveit.wasapi.serializers import WebdataFileSerializer, PaginationSerializerOfFiles, JobSerializer, PaginationSerializerOfJobs 5 | from archiveit.wasapi.filters import WasapiWebdataQueryFilterBackend, WasapiAuthFilterBackend, WasapiAuthJobBackend 6 | from archiveit.archiveit.models import WarcFile 7 | from archiveit.wasapi.models import WasapiJob, WasapiJobResultFile 8 | 9 | 10 | class WebdataQueryViewSet(viewsets.ModelViewSet): 11 | """ 12 | API endpoint that allows webdata files to be queried for and listed. 13 | """ 14 | # TODO: decide how to order results 15 | queryset = WarcFile.objects.all().order_by('-id') 16 | serializer_class = WebdataFileSerializer 17 | # selector shared with WasapiJob.query_just_like_webdataqueryviewset 18 | filter_backends = [WasapiWebdataQueryFilterBackend] 19 | pagination_serializer_class = PaginationSerializerOfFiles 20 | paginate_by_param = 'page_size' 21 | paginate_by = 100 22 | max_paginate_by = 2000 23 | 24 | def list(self, request, *args, **kwargs): 25 | """Cloned (but trimmed) from ModelViewSet.list""" 26 | self.object_list = self.filter_queryset(self.get_queryset()) 27 | # always paginate our responses because that's how we implement the spec 28 | page = self.paginate_queryset(self.object_list) 29 | serializer = self.get_pagination_serializer(page) 30 | # The change to add other fields: 31 | # The current implementation doesn't support any query that could 32 | # include extra data, so we can hard-code False. We must revisit 33 | # this as we add other queries. 34 | serializer.fields['includes-extra'] = WedgeValueIntoObjectField( 35 | value=False, label='includes-extra') 36 | serializer.fields['request-url'] = WedgeValueIntoObjectField( 37 | value=request._request.build_absolute_uri(), label='request-url') 38 | return Response(serializer.data) 39 | 40 | 41 | class JobsViewSet(viewsets.ModelViewSet): 42 | """ 43 | API endpoint that allows WASAPI jobs to be created and monitored. 44 | """ 45 | queryset = WasapiJob.objects.all().order_by('-id') 46 | serializer_class = JobSerializer 47 | filter_backends = [WasapiAuthFilterBackend] 48 | pagination_serializer_class = PaginationSerializerOfJobs 49 | paginate_by_param = 'page_size' 50 | paginate_by = 100 51 | max_paginate_by = 2000 52 | 53 | 54 | def update_result_file_state(request, filename): 55 | result_files = WasapiJobResultFile.objects.filter(filename=filename) 56 | if not result_files: 57 | return HttpResponse("", status=404) 58 | WasapiJobResultFile.update_states(result_files) 59 | return HttpResponse("") 60 | 61 | 62 | class JobResultViewSet(viewsets.ModelViewSet): 63 | """ 64 | API endpoint that gives the result of a WASAPI job. 
65 | """ 66 | queryset = WasapiJobResultFile.objects.all().order_by('-id') 67 | serializer_class = WebdataFileSerializer 68 | filter_backends = [ 69 | WasapiAuthJobBackend 70 | # don't need WasapiAuthFilterBackend since we already filtered on the job 71 | ] 72 | pagination_serializer_class = PaginationSerializerOfFiles 73 | paginate_by_param = 'page_size' 74 | paginate_by = 100 75 | max_paginate_by = 2000 76 | 77 | 78 | class WedgeValueIntoObjectField(object): 79 | read_only = False 80 | def __init__(self, value, label): 81 | self.value = value 82 | self.label = label 83 | def initialize(self, parent, field_name): 84 | pass 85 | def field_to_native(self, obj, field_name): 86 | return self.value 87 | -------------------------------------------------------------------------------- /ait-implementation/webdata/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/WASAPI-Community/data-transfer-apis/4fab8164f40dcb16a601d4606f9ab67889076d6b/ait-implementation/webdata/__init__.py -------------------------------------------------------------------------------- /ait-implementation/webdata/decorators.py: -------------------------------------------------------------------------------- 1 | # The following 83 lines are copied verbatim from https://www.djangosnippets.org/snippets/243/ 2 | # which is a simple implementation of per-view HTTP Basic Authentication 3 | 4 | import base64 5 | 6 | from django.http import HttpResponse 7 | from django.contrib.auth import authenticate, login 8 | 9 | ############################################################################# 10 | # 11 | def view_or_basicauth(view, request, test_func, realm = "", *args, **kwargs): 12 | """ 13 | This is a helper function used by both 'logged_in_or_basicauth' and 14 | 'has_perm_or_basicauth' that does the nitty of determining if they 15 | are already logged in or if they have provided proper http-authorization 16 | and returning the view if all goes well, otherwise responding with a 401. 17 | """ 18 | if test_func(request.user): 19 | # Already logged in, just return the view. 20 | # 21 | return view(request, *args, **kwargs) 22 | 23 | # They are not logged in. See if they provided login credentials 24 | # 25 | if 'HTTP_AUTHORIZATION' in request.META: 26 | auth = request.META['HTTP_AUTHORIZATION'].split() 27 | if len(auth) == 2: 28 | # NOTE: We are only support basic authentication for now. 29 | # 30 | if auth[0].lower() == "basic": 31 | uname, passwd = base64.b64decode(auth[1].encode('UTF-8')).split(b':') 32 | user = authenticate(username=uname.decode(), password=passwd.decode()) 33 | if user is not None: 34 | if user.is_active: 35 | login(request, user) 36 | request.user = user 37 | return view(request, *args, **kwargs) 38 | 39 | # Either they did not provide an authorization header or 40 | # something in the authorization attempt failed. Send a 401 41 | # back to them to ask them to authenticate. 42 | # 43 | response = HttpResponse() 44 | response.status_code = 401 45 | response['WWW-Authenticate'] = 'Basic realm="%s"' % realm 46 | return response 47 | 48 | ############################################################################# 49 | # 50 | def logged_in_or_basicauth(realm = ""): 51 | """ 52 | A simple decorator that requires a user to be logged in. If they are not 53 | logged in the request is examined for a 'authorization' header. 
54 | 55 | If the header is present it is tested for basic authentication and 56 | the user is logged in with the provided credentials. 57 | 58 | If the header is not present a http 401 is sent back to the 59 | requestor to provide credentials. 60 | 61 | The purpose of this is that in several django projects I have needed 62 | several specific views that need to support basic authentication, yet the 63 | web site as a whole used django's provided authentication. 64 | 65 | The uses for this are for urls that are access programmatically such as 66 | by rss feed readers, yet the view requires a user to be logged in. Many rss 67 | readers support supplying the authentication credentials via http basic 68 | auth (and they do NOT support a redirect to a form where they post a 69 | username/password.) 70 | 71 | Use is simple: 72 | 73 | @logged_in_or_basicauth 74 | def your_view: 75 | ... 76 | 77 | You can provide the name of the realm to ask for authentication within. 78 | """ 79 | def view_decorator(func): 80 | def wrapper(request, *args, **kwargs): 81 | return view_or_basicauth(func, request, 82 | lambda u: u.is_authenticated(), 83 | realm, *args, **kwargs) 84 | return wrapper 85 | return view_decorator 86 | -------------------------------------------------------------------------------- /ait-implementation/webdata/urls.py: -------------------------------------------------------------------------------- 1 | from django.conf.urls import url 2 | from . import views 3 | 4 | urlpatterns = [ 5 | url(r'(?P.*)', views.index, name='index'), 6 | ] 7 | -------------------------------------------------------------------------------- /ait-implementation/webdata/views.py: -------------------------------------------------------------------------------- 1 | import re 2 | from subprocess import Popen, PIPE 3 | from django.shortcuts import render 4 | from django.http import HttpResponse, FileResponse 5 | from django.conf import settings 6 | 7 | from internetarchive import get_session 8 | 9 | from archiveit.archiveit.models import WarcFile, DerivativeFile 10 | from archiveit.webdata.decorators import logged_in_or_basicauth 11 | 12 | VERBOTEN_FILENAMES = re.compile(r'EXTRACTED|EXTRACTION|HISTORICAL') 13 | 14 | @logged_in_or_basicauth() 15 | def index(request, filename): 16 | 17 | # check authorization: 18 | if request.user.is_anonymous(): 19 | # TODO: rfc7235#section-4.1 says to include WWW-Authenticate header 20 | return HttpResponse('You are not authenticated; please log in to download the requested file: %s'%filename, 21 | content_type='text/plain', status=401) 22 | match = re.search(r'^(?:ARCHIVEIT-)?(\d+)-', filename) 23 | if not match: 24 | return HttpResponse( 25 | 'Failed to parse collection id from requested filename: %s'%filename, 26 | content_type='text/plain', status=404) 27 | file_collection_id = int(match.group(1)) 28 | if(VERBOTEN_FILENAMES.match(filename) or 29 | not may_access_collection_id(request, file_collection_id)): 30 | return HttpResponse( 31 | 'You are not authorized to download the requested file: %s'%filename, 32 | content_type='text/plain', status=403) 33 | 34 | # fetch the file's db record: 35 | webdatafile = ( 36 | WarcFile.objects.filter(filename=filename).first() or 37 | DerivativeFile.objects.filter(filename=filename).first() ) 38 | if not webdatafile: 39 | return HttpResponse('404 Not Found', 40 | content_type='text/plain', status=404) 41 | 42 | # get the file's content from somewhere: 43 | stream = ( 44 | webdatafile.pbox_item and 45 | stream_from_pbox(webdatafile.pbox_item, 
filename) or 46 | webdatafile.hdfs_path and 47 | stream_from_hdfs(webdatafile.hdfs_path, filename) ) 48 | if not stream: 49 | return HttpResponse("500 Can't fetch file", 50 | content_type='text/plain', status=500) 51 | 52 | # give it all back to the client: 53 | response = FileResponse(stream) 54 | response['Content-Type'] = 'application/octet-stream' 55 | response['Content-Length'] = webdatafile.size 56 | return response 57 | 58 | def stream_from_pbox(itemname, filename): 59 | # TODO: handle errors etc 60 | archive_session = get_session(config_file=settings.IATOOL_CONFIG_PATH) 61 | item = archive_session.get_item(itemname) 62 | files = item.get_files(filename) 63 | file = files.__next__() 64 | return file.download(return_responses=True) 65 | 66 | def stream_from_hdfs(hdfs_path, filename): 67 | # TODO: consider using python3 snakebite 68 | # TODO: handle errors etc; would be nice to examine returncode 69 | # (but halfway through a big stream is too late to tell the client, right?) 70 | hdfs_cat = Popen([settings.HDFS_EXE, 'dfs', '-cat', hdfs_path], 71 | env=settings.HADOOP_ENV, stdout=PIPE) 72 | return hdfs_cat.stdout 73 | 74 | def may_access_collection_id(request, collection_id): 75 | if request.user.is_superuser: 76 | return True 77 | return collection_id in request.user.account.collection_set.values_list('id', flat=True) 78 | -------------------------------------------------------------------------------- /ait-specification/README.md: -------------------------------------------------------------------------------- 1 | # **Archive-It WASAPI Data Transfer API v1.0** 2 | 3 | 4 | ## Introduction 5 | 6 | This document serves to specify v1.0 of Archive-It's implementation of the WASAPI Data Transfer API. It is intended to document how a client can use the API to find and select web archive files for transfer and to submit jobs for the creation and transfer of derivative web archive files. The API is designed according to the WASAPI data transfer [general specification](https://github.com/WASAPI-Community/data-transfer-apis/tree/master/general-specification). For context, as of June 2017 the Archive-It repository contains over 3,766,068 WARC files, all of which are accessible to the relevant, authenticated Archive-It partners via this API. 7 | 8 | The interface provides two primary services: querying existing files and managing jobs for creating derivative files. The WASAPI data transfer general specification does not mandate how to transfer the webdata files for export, but Archive-It's implementation provides straight-forward HTTPS links. We use the syntax `webdata` file to recognize that the API supports working with both web archive files (WARCs) as well as with derivative files created from WARCs (such as WATs or CDX). 9 | 10 | ## Authentication 11 | 12 | Archive-It restricts access to those clients with an Archive-It account. The WASAPI data transfer general specification allows publicly accessible resources, so Archive-It's implementation will show empty results until you authenticate. You have two options for authentication: 13 | 14 | ### Authentication via browser cookies 15 | 16 | To try some simple queries or manually download your data with a web browser, you can authenticate with cookies in your web browser. 17 | 18 | Point your web browser to `https://partner.archive-it.org/login` and log in to your Archive-It account with your username and password. This will set cookies in your browser for subsequent WASAPI requests and downloading files. 
19 | 20 | ### Authentication via basic access authentication 21 | 22 | For automated scripts, you should use HTTP [basic access authentication](https://en.wikipedia.org/wiki/Basic_access_authentication). 23 | 24 | For example, if your account has username `teddy` and password `schellenberg`, you could use this [cURL](https://curl.haxx.se/) invocation: 25 | 26 | curl --user 'teddy:schellenberg' https://partner.archive-it.org/wasapi/v1/webdata 27 | 28 | ## Querying 29 | 30 | Archive-It's data transfer API implementation lets you identify webdata files via a number of parameters. Start building the URL for your query with `https://partner.archive-it.org/wasapi/v1/webdata`, then append parameters to make your specific query. 31 | 32 | To find all webdata files in your account: 33 | 34 | https://partner.archive-it.org/wasapi/v1/webdata 35 | 36 | ### Overview of Query Parameters 37 | 38 | The basic parameters for querying for webdata files are: 39 | 40 | - `filename`: the exact webdata filename 41 | - `filetype`: the type of webdata file, eg `warc`, `wat`, `cdx` 42 | - `collection`: Archive-It collection identifier 43 | - `crawl`: Archive-It crawl job identifier 44 | - `crawl-time-after` & `crawl-time-before`: date of webdata file creation during a crawl job 45 | - `crawl-start-after` & `crawl-start-before`: date of crawl job start 46 | 47 | ### Query parameters 48 | 49 | #### `filename` query parameter 50 | 51 | The `filename` parameter restricts the query to webdata files whose filename exactly matches the parameter's value. That is, it must match the beginning and end of the filename; the full path of directories is ignored. API v1.0 matches exact filenames only, but later versions will recognize "globbing," i.e. matching with `*` and `?` patterns. 52 | 53 | To find a specific file: 54 | 55 | https://partner.archive-it.org/wasapi/v1/webdata?filename=ARCHIVEIT-8232-WEEKLY-JOB300208-20170513202120098-00001.warc.gz 56 | 57 | #### `filetype` query parameter 58 | 59 | The `filetype` parameter restricts the query to those web archive files with the specified type, such as `warc`, `wat`, `cdx`. API v1.0 supports querying by `warc`; later versions will support querying by derivative formats. 60 | 61 | #### `collection` query parameter 62 | 63 | The `collection` parameter restricts the query to those web archive files within the specified collection. Archive-It users may want to reference the documentation on how to [find your collection's ID number](https://support.archive-it.org/hc/en-us/articles/208000916-Find-your-collection-s-ID-number). 64 | 65 | To find the files from the "Occupy Movement 2011/2012" collection: 66 | 67 | https://partner.archive-it.org/wasapi/v1/webdata?collection=2950 68 | 69 | The API supports multiple `collection` parameters in a query. To find the files from the "Occupy Movement 2011/2012" collection and the "#blacklivesmatter Web Archive" collection: 70 | 71 | https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&collection=4783 72 | 73 | #### `crawl` query parameter 74 | 75 | The `crawl` parameter restricts the query to webdata files within a specified crawl, per the crawl job identifier. Archive-It users may want to reference the documentation on [how to find a crawl ID number](https://support.archive-it.org/hc/en-us/articles/115002803383-Finding-your-crawl-ID-number-). Some older Archive-It WARCs and webdata files lack an associated crawl job ID (and, thus, also an associated `crawl-start-time`).
Efforts are underway to backfill this data, which should alleviate, if not eliminate, the null values for `crawl` for some historical WARCs. If users receive null results for a known `crawl` identifier, they should contact Archive-It support or use other parameters, which are known to be exhaustive historically. 76 | 77 | To find the files from a specific crawl: 78 | 79 | https://partner.archive-it.org/wasapi/v1/webdata?crawl=300208 80 | 81 | #### `crawl-time-after` and `crawl-time-before` query parameters 82 | 83 | The `crawl-time-after` and `crawl-time-before` parameters restrict the query to those web archive files crawled within the given time range; see [time formats](#time-formats) for the syntax. Specify the lower bound (if any) with `crawl-time-after` and the upper bound (if any) with `crawl-time-before`. This field uses the time the WARC file was created, the same timestamp represented in the WARC filename. 84 | 85 | To find the files crawled in the first quarter of 2016: 86 | 87 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-after=2015-12-31&crawl-time-before=2016-04-01 88 | 89 | To find all files crawled since 2016: 90 | 91 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-after=2016-01-01 92 | 93 | To find all files crawled prior to 2014: 94 | 95 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-time-before=2014-01-01 96 | 97 | #### `crawl-start-after` and `crawl-start-before` query parameters 98 | 99 | The `crawl-start-after` and `crawl-start-before` parameters restrict the query to those web archive files gathered from crawl jobs that started within the given time range; see [time formats](#time-formats) for the syntax. They reference the crawl job start date (in contrast to `crawl-time-after` and `-before`, which relate to the individual WARC file creation date). Specify the lower bound (if any) with `crawl-start-after` and the upper bound (if any) with `crawl-start-before`. Since `crawl-start` is associated with the `crawl` parameter, the above caveats apply in that some older Archive-It WARCs and web archive files will lack an associated `crawl-start`. Efforts are underway to backfill this data; in the meantime, contact Archive-It support or use other parameters, which are known to be exhaustive historically. 100 | 101 | To find the files from a Q1 2016 crawl: 102 | 103 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-start-after=2015-12-31&crawl-start-before=2016-04-01 104 | 105 | To find all files from crawls started since 2016: 106 | 107 | https://partner.archive-it.org/wasapi/v1/webdata?crawl-start-after=2016-01-01 108 | 109 | #### Pagination parameters 110 | 111 | The [parameters for pagination](#parameters-for-pagination) apply to queries. 112 | 113 | ### Query results 114 | 115 | The response to a query is a JSON object with [fields for pagination](#fields-for-pagination), an `includes-extra` field, a `request-url` field, and the result in the `files` field. 116 | 117 | The `count` field represents the total number of web archive files corresponding to the query. 118 | 119 | The `includes-extra` field is currently always false in the API v1.0, as all query parameters return exact matches and the data in the `files` contains nothing beyond what is necessary to satisfy the query or job. The `includes-extra` field is mandated by the general specification because some implementations may return results that include webdata files containing content beyond the specific query.
For instance, were `url` a query parameter, a request by URL could return results that contain webdata files (i.e. WARCs) that contain data from that URL as well as data from other URLs, due to the way crawlers write WARC files. When Archive-It (or other implementations) supports these type queries, `includes-extra` could have a true value to indicate that the referenced `files` may contain data outside the specific query. 120 | 121 | The `request-url` field represents the submitted query URL. 122 | 123 | The `files` field is a list of a subset (check the [pagination fields](#fields-for-pagination)) of the results of the query, with each webdata file represented by a JSON object with the following keys: 124 | 125 | - `account`: the numeric Archive-It account identifier 126 | 127 | - `checksums`: an object with `md5` and `sha1` keys and hexadecimal values of 128 | the webdata file's checksums 129 | 130 | - `collection`: the numeric Archive-It identifier of the collection that includes the 131 | webdata file 132 | 133 | - `crawl`: the numeric Archive-It identifier of the crawl that created the webdata file 134 | 135 | - `crawl-start`: an optional RFC3339 date stamp of the time the crawl job started 136 | 137 | - `crawl-time`: an RFC3339 date stamp of the time the webdata file was 138 | [created](#crawl-time-after-and-crawl-time-before-query-parameters) 139 | 140 | - `filename`: the name of the webdata file (without any path of directories) 141 | 142 | - `filetype`: the format of the webdata file, eg `warc`, `wat`, `wane`, `cdx` 143 | 144 | - `locations`: a list of sources from which to retrieve the webdata file 145 | 146 | - `size`: the size in bytes of the webdata file 147 | 148 | For example: 149 | 150 | { 151 | "count": 601, 152 | "includes-extra": false, 153 | "next": "https://partner.archive-it.org/wasapi/v1/webdata?collection=8232&page=2", 154 | "previous": null, 155 | "files": [ 156 | { 157 | "account": 89, 158 | "checksums": { 159 | "md5": "073f2a905ce23462204606329ca545c3", 160 | "sha1": "1b796f61dc22f2ca246fa7055e97cd25341bfe98" 161 | }, 162 | "collection": 8232, 163 | "crawl": 304244, 164 | "crawl-start": "2017-05-31T22:15:34Z", 165 | "crawl-time": "2017-05-31T22:15:40Z", 166 | "filename": "ARCHIVEIT-8232-WEEKLY-JOB304244-20170531221540622-00000.warc.gz", 167 | "filetype": "warc", 168 | "locations": [ 169 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-8232-WEEKLY-JOB304244-20170531221540622-00000.warc.gz" 170 | ], 171 | "size": 1000000858 172 | }, 173 | { 174 | "account": 89, 175 | "checksums": { 176 | "md5": "610e1849cfc2ad692773348dd34697b4", 177 | "sha1": "9048d063a9adaf606e1ec2321cde3a29a1ee6490" 178 | }, 179 | "collection": 8232, 180 | "crawl": 303042, 181 | "crawl-start": "2017-05-24T22:15:36Z", 182 | "crawl-time": "2017-05-26T17:51:37Z", 183 | "filename": "ARCHIVEIT-8232-WEEKLY-JOB303042-20170526175137981-00002.warc.gz", 184 | "filetype": "warc", 185 | "locations": [ 186 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-8232-WEEKLY-JOB303042-20170526175137981-00002.warc.gz" 187 | ], 188 | "size": 40723812 189 | }, 190 | [ ... ] 191 | ] 192 | } 193 | 194 | 195 | ## Jobs 196 | 197 | The Archive-It data transfer API allows users to submit "jobs" for the creation of derivative files from existing resources. This serves the broader goal of WASAPI data transfer APIs to facilitate use of web archives in data-driven scholarship, research and computational analysis, and to support use, and transport, of files derived from WARCs and original archival web data. 
The Archive-It WASAPI data transfer API v1.0 allows an Archive-It user or approved researcher to: 198 | 199 | - Submit a query and be returned a results list of webdata files 200 | - Submit a job to derive different types of datasets from that results list 201 | - Receive a job submission token and job submission status 202 | - Poll the API for current job status 203 | - Upon job completion, get a results list of the generated derived webdata files 204 | 205 | ### Submitting a new job 206 | 207 | Submit a new job with an HTTP POST to `https://partner.archive-it.org/wasapi/v1/jobs`. 208 | 209 | Select a `function` from those supported. The Archive-It API v1.0 currently supports creation of three types of derivative datasets, all of which have a one-to-one correspondence with WARC files. Future development will allow for job submission for original datasets. The current job `function` list: 210 | 211 | - `build-wat`: build a WAT (Web Archive Transformation) file from the matched web archive files 212 | 213 | - `build-wane`: build a WANE (Web Archive Named Entities) file from the matched web archive files 214 | 215 | - `build-cdx`: build a CDX (Capture Index) file from the matched web archive files 216 | 217 | For more on WATs and WANEs, see their description at [Archive-It Research Services](https://webarchive.jira.com/wiki/display/ARS/Archive-It+Research+Services). For more on CDX, see the documentation for the [CDX Server API](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md). 218 | 219 | Build an appropriate `query` in the same manner as for the [`/webdata` endpoint](#query-parameters). 220 | 221 | For example, to build WAT files from the WARCs in collection 4783 and crawled in 2016: 222 | 223 | curl --user 'teddy:schellenberg' -H 'Content-Type: application/json' -d '{"function": "build-wat","query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01"}' https://partner.archive-it.org/wasapi/v1/jobs 224 | 225 | If all goes well, the server will record the job, set its `submit-time` to the current time and its `state` to `queued`, and return a `201 Created` response, including a `jobtoken` which can be used to [check its 226 | status](#checking-the-status-of-a-job) later: 227 | 228 | { 229 | "account": 89, 230 | "function": "build-wat", 231 | "jobtoken": "136", 232 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 233 | "state": "queued", 234 | "submit-time": "2017-06-03T22:49:13.869698Z", 235 | "termination-time": null 236 | } 237 | 238 | If you want to match everything, you must still provide an explicit empty string for the query parameter. For example, to build a CDX index of all your resources: 239 | 240 | curl --user 'teddy:schellenberg' -H 'Content-Type: application/json' -d '{"function":"build-cdx","query":""}' https://partner.archive-it.org/wasapi/v1/jobs 241 | 242 | ### Checking the status of a job 243 | 244 | To check the [state](#states-of-a-job) of your job, build a URL by appending its job token to `https://partner.archive-it.org/wasapi/v1/jobs/`. For example: 245 | 246 | curl --user 'teddy:schellenberg' https://partner.archive-it.org/wasapi/v1/jobs/136 247 | 248 | Immediately after submitting it, the job will be in the `queued` `state`, and the response will be the same as the response to the submission.
Once Archive-It starts running the job, its `state` will change, for example: 249 | 250 | { 251 | "account": 89, 252 | "function": "build-wat", 253 | "jobtoken": "136", 254 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 255 | "state": "running", 256 | "submit-time": "2017-06-03T22:49:13Z", 257 | "termination-time": null 258 | } 259 | 260 | And when it is `complete`, the `termination-time` will be set with the time: 261 | 262 | { 263 | "account": 89, 264 | "function": "build-wat", 265 | "jobtoken": "136", 266 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 267 | "state": "complete", 268 | "submit-time": "2017-06-03T22:49:13Z", 269 | "termination-time": "2017-06-06T01:37:54Z" 270 | } 271 | 272 | You can also check the [states](#states-of-a-job) of all your jobs at `https://partner.archive-it.org/wasapi/v1/jobs`, which is [paginated](#pagination). For example: 273 | 274 | { 275 | "count": 16, 276 | "next": "http://partner.archive-it.org/wasapi/v1/jobs?page_size=10&page=2", 277 | "previous": null, 278 | "jobs": [ 279 | { 280 | "account": 89, 281 | "function": "build-cdx", 282 | "jobtoken": "137", 283 | "query": "", 284 | "state": "running", 285 | "submit-time": "2017-06-03T23:55:51Z", 286 | "termination-time": null 287 | }, 288 | { 289 | "account": 89, 290 | "function": "build-wat", 291 | "jobtoken": "136", 292 | "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01", 293 | "state": "complete", 294 | "submit-time": "2017-06-03T22:49:13Z", 295 | "termination-time": "2017-06-06T01:37:54Z" 296 | }, 297 | [ ... ] 298 | ] 299 | } 300 | 301 | ### Checking the result of a failed job 302 | 303 | If your job has a `failed` `state`, build a URL of the form `https://partner.archive-it.org/wasapi/v1/jobs/{jobtoken}/error`. This is in development and not currently implemented. 304 | 305 | ### Checking the result of a complete job 306 | 307 | To retrieve the result of your `complete` job, build a URL of the form `https://partner.archive-it.org/wasapi/v1/jobs/{jobtoken}/result`. The response is similar to [results of a query](#query-results). For example: 308 | 309 | { 310 | "count": 4, 311 | "next": null, 312 | "previous": null, 313 | "files": [ 314 | { 315 | "account": 89, 316 | "checksums": { 317 | "md5": "11a0ddb3575da3b9f6dd9dff665ce181", 318 | "sha1": "0b2a17969b8b45fc14e41441c1ecc7afcf974150" 319 | }, 320 | "collection": 4783, 321 | "crawl": 16473, 322 | "crawl-start": "2016-05-12T15:05:31Z", 323 | "crawl-time": "2016-05-12T15:05:36Z", 324 | "filename": "ARCHIVEIT-4783-TEST-JOB16473-20160512150536534-00000_warc.wat.gz", 325 | "filetype": "wat", 326 | "locations": [ 327 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-4783-TEST-JOB16473-20160512150536534-00000_warc.wat.gz" 328 | ], 329 | "size": 8016108 330 | }, 331 | { 332 | "account": 89, 333 | "checksums": { 334 | "md5": "f762e933a3fd412325e6497457ea2be0", 335 | "sha1": "08beda59a9b6df9a26ea4783f69d92fd1d1ba5c2" 336 | }, 337 | "collection": 4783, 338 | "crawl": 16473, 339 | "crawl-start": "2016-05-12T15:05:31Z", 340 | "crawl-time": "2016-05-12T15:05:36Z", 341 | "filename": "ARCHIVEIT-4783-CRAWL_SELECTED_SEEDS-JOB16472-20160512144021684-00000_warc.wat.gz", 342 | "filetype": "wat", 343 | "locations": [ 344 | "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-4783-CRAWL_SELECTED_SEEDS-JOB16472-20160512144021684-00000_warc.wat.gz" 345 | ], 346 | "size": 149888 347 | }, 348 | [ ...
] 349 | ] 350 | } 351 | 352 | 353 | ## Common WASAPI infrastructure 354 | 355 | ### Pagination 356 | 357 | Results of queries and lists of jobs are paginated. The full results may fit on one page (especially if you set `page_size=2000`), but the pagination fields are always present. You needn't manipulate the `page` parameter directly: after your first request with no `page` parameter, you should iteratively follow non-null `next` links to fetch the full results. 358 | 359 | #### Fields for pagination 360 | 361 | The top-level JSON object of the response includes pagination information with the following keys: 362 | 363 | - `count`: The number of items in the full result (files or jobs, across all 364 | pages) 365 | 366 | - `previous`: Link (if any) to the previous page of items; otherwise null 367 | 368 | - `next`: Link (if any) to the next page of items; otherwise null 369 | 370 | #### Parameters for pagination 371 | 372 | ##### `page` query parameter 373 | 374 | The `page` parameter requests a specific page of the full result. It defaults to 1, giving the first page. 375 | 376 | ##### `page_size` query parameter 377 | 378 | The `page_size` parameter sets the size of each page. It defaults to 100 and has a maximum value of 2000. 379 | 380 | ### Time formats 381 | 382 | Date and time parameters should satisfy RFC3339, eg `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS`, but Archive-It also recognizes abbreviations like `YYYY-MM` or `YYYY`, which are interpreted as the first of the month or year. We recommend using [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time); the implementation recognizes a trailing `Z` as well as an explicit timezone offset. 383 | 384 | Formats that work: 385 | - `2017-01-01` 386 | - `2017-01-01T12:34:56` 387 | - `2017-01-01 12:34:56` 388 | - `2017-01-01T12:34:56Z` 389 | - `2017-01-01 12:34:56-0700` 390 | - `2017` 391 | - `2017-01` 392 | 393 | ## Recipes and other resources 394 | 395 | Archive-It is in the midst of creating a recipe book of sample API queries. Both Archive-It and WASAPI grant partners are also creating a number of local utilities for working with this API and implementing it in preservation and research workflows. These utilities will also be posted in this GitHub account for public reference. Stanford has created a number of demonstration videos outlining their tool development for using this API to ingest their Archive-It WARCs into their preservation repository. These can be seen in the [WASAPI collection](https://archive.org/details/wasapi) in the Internet Archive and on Stanford Libraries' [YouTube channel](https://www.youtube.com/channel/UCc2CQuHkhKGZ-2ZLTZVGE2A). 396 | 397 | For Archive-It's proposed changes to the WASAPI data transfer API general specification and other build details, visit the [Archive-It implementation repository](https://github.com/WASAPI-Community/data-transfer-apis/tree/master/ait-implementation).
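In the meantime, the sketch below strings together the pieces documented above (basic access authentication, a `/webdata` query, pagination via `next`, download from `locations`, and checksum verification) into a minimal download loop. It is illustrative only: the credentials and collection identifier are placeholders from the earlier examples, it assumes the download locations accept the same credentials, and error handling is omitted.

    import hashlib
    import requests

    AUTH = ("teddy", "schellenberg")   # placeholder credentials
    url = "https://partner.archive-it.org/wasapi/v1/webdata?collection=2950"

    while url:
        page = requests.get(url, auth=AUTH).json()
        for f in page["files"]:
            # Download from the first listed location and verify the md5 checksum.
            with requests.get(f["locations"][0], auth=AUTH, stream=True) as download:
                download.raise_for_status()
                digest = hashlib.md5()
                with open(f["filename"], "wb") as out:
                    for chunk in download.iter_content(chunk_size=1024 * 1024):
                        out.write(chunk)
                        digest.update(chunk)
            if digest.hexdigest() != f["checksums"]["md5"]:
                print("checksum mismatch:", f["filename"])
        url = page["next"]   # follow pagination links until exhausted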
398 | 399 | ## Contacts 400 | 401 | *Archive-It (Internet Archive)* 402 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 403 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 404 | -------------------------------------------------------------------------------- /ait-specification/transfer_api_archive-it_v1.yaml: -------------------------------------------------------------------------------- 1 | ../ait-implementation/wasapi/swagger.yaml -------------------------------------------------------------------------------- /general-specification/README.md: -------------------------------------------------------------------------------- 1 | ## **WASAPI Data Transfer API General Specification v0.1** 2 | 3 | **Introduction** 4 | 5 | This document serves to outline v0.1 of the Web Archive Data Export API developed as part of the Web Archiving Systems API (WASAPI) project of the [Institute of Museum and Library Services](https://www.imls.gov/)-funded [National Leadership Grant, LG-71-15-0174](https://www.imls.gov/grants/available/national-leadership-grants-libraries), "[Systems Interoperability and Collaborative Development for Web Archiving](https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf)" (PDF). Primary development of this API specification as well as the creation of multiple reference implementations is being led by Internet Archive (Archive-It) and Stanford University Libraries (DLSS and LOCKSS). Other project partners are University of North Texas, Rutgers University, a Technical Working Group, and an Advisory Board. More information on the WASAPI project can be found at the [WASAPI-Community GitHub workspace](https://github.com/WASAPI-Community) the [WASAPI-community Google Group](https://groups.google.com/forum/#!forum/wasapi-community), and on the [WASAPI Slack](https://docs.google.com/forms/d/e/1FAIpQLScsdTqssLrM9FinmpP8Mow2Hl8zJnfJZfjWxaeXddlvu2VjBw/viewform) channel. 6 | 7 | **General Usage** 8 | 9 | This is a **_generalized_** specification representing only a **_minimum_** set of requirements for development of APIs to facilitate the transfer of archived web data between custodians, systems, repositories, end users, and others. This API focuses on the transfer of WARC and WARC-derived web data and aims to standardize vocabularies and features to inform those institutions, services, and developers building reference implementations local to their organizations. The basic purpose of this API is to return a results list of WARC files and derivative files originating from WARCs and corresponding essential metadata related to location/transfers in response to a user-defined query that includes implementation-specific parameters. The specification includes the notion of a "job" i.e. the ability to submit a job, receive a job token, and track a job status for the creation of derivative web data files that need to be generated locally and will be available via the API upon job completion. This allows the API to meet the intended goals of supporting both WARC file transfer for preservation replication as well as the ability to allow for derivative datasets to be delivered to researchers and users. 10 | 11 | **Assumptions & Exclusions** 12 | 13 | * Implementation APIs built using this specification should be RESTful. 14 | * Implementation APIs should, at minimum, produce application/json. 15 | * Implementation APIs must support GET and POST. 
16 | * This specification does not cover authentication and access control, which are considered to be institution and implementation specific. 17 | * This specification abstracts away institution-specific details in many areas to remain generalizable and minimal. Additional paths, methods, and functions can be added in implementations as desired. 18 | * The specification allows for both the return of results considered "complete" and “includes extra.” This allows requesters to know whether the returned results fully meet the original query or whether additional data is contained in the file list returned. Details are below. 19 | 20 | **Issues for General Discussion** 21 | 22 | * The "transfer confirmation" functionality originally proposed by the development team was dropped. This functionality was intended to verify a successful transfer once the transfer was complete. It seemed too challenging to force every implementation to develop and support this as a bare minimum, as it could get technically complicated. Implementations can still build it into their API. To facilitate the ability to confirm a transfer, checksum was made a required return value for **/webdata**. 23 | * A true/false result for "includes-extra" will be part of all **/webdata** query returns to denote the “inclusive v. exclusive” issues discussed in design meetings. 24 | * Some questions/decisions remained around whether a "filename" should match directories in a “full path,” and if so, how “\*” wildcard/globs should match directory separators, for instance, *\/webdata?filename=ARCHIVEIT-1234-2016\*.warc.gz. For now, we suggest an implementation may support searching the full path only as an extension to the set of query parameters. 25 | * There was debate around how APIs should define and document themselves. The original idea was to have a **/registry** path under each main path that would return implementation-specific information, such as parameters to **/webdata** and functions to **/jobs**. Alternately, the base path could simply return a Swagger YAML file that defines the full API. This remains open for discussion and may ultimately be an implementation detail. The WASAPI team will decide where this information lives and what is required of implementations (if anything). 26 | * We should determine the right way of specifying compression along with the format of files. 27 | * Are we offering too much flexibility with WebdataMenu and WebdataBundles and the multiple "locations" of a WebdataFile? How frequently do implementations offer mirrored files versus multiple transport methods for the same webdata? What is the level of requirement for granularity here? 28 | 29 | **Paths & Examples** 30 | 31 | * **/webdata** 32 | 33 | *(example: https://partner.archive-it.org/v0/export/api/webdata)* 34 | 35 | The most basic query using the **/webdata** path returns a list of all web data files on the server which are available to the client, basic metadata about those files, and their download information. Parameters to modify **/webdata** will be determined by institutions building their own implementations. Potential parameters can be as simple as */webdata?directoryName=[name]* or can support an extensive list of parameters to modify a request. Examples of possible parameters could include those defining identifiers for things like account, collection, seed, crawl job, harvest event, session, date range, archival identifier, administrative unit, repository, bucket, and more.
All institution-specific query filters and modifiers should be parameters to the **/webdata** path. 36 | 37 | * **Example queries and results** 38 | 39 | *https://partner.archive-it.org/v0/export/api/webdata?filename=2016-08-30-blah.warc.gz* 40 | 41 | The above query would return a list of a single WARC file (though it may be available via multiple transports). 42 | 43 | ``` 44 | { 45 | "includes-extra": false, 46 | "files": [ 47 | { 48 | "checksum": "md5:b1c3cd...57; sha1:011c65...a7", 49 | "content-type": "application/warc", 50 | "filename": "2016-08-30-blah.warc.gz", 51 | "locations": [ 52 | "http://archive-it.org/.../2016-08-30-blaugh.warc.gz", 53 | "gridftp globus://...", 54 | "ipfs/Qmbeef0484098..." 55 | ] 56 | } 57 | ] 58 | } 59 | ``` 60 | 61 | *https://partner.archive-it.org/v0/export/api/webdata?acccountId=123&collectionId=456&startDate=01012014&endDate=12312015* 62 | 63 | The above query would return a list of all the WARCs (with metadata and download links) from between January 1, 2014 and December 31, 2015 for Account 123, Collection 456. 64 | 65 | ``` 66 | { 67 | "includes-extra": false, 68 | "files": [ 69 | { 70 | "checksum": "md5:beefface09384509", 71 | "content-type": "application/warc", 72 | "filename": "2014-01-01-blah.warc.gz", 73 | "locations": [ 74 | "http://archive-it.org/.../2014-01-01-blah.warc.gz", 75 | "/ipfs/Qmde62f92ea12c42dc0b0c0ab3952e52e1" 76 | ] 77 | }, 78 | { 79 | "checksum": "md5:beefface09384510", 80 | "content-type": "application/warc", 81 | "filename": "2014-01-02-blah.warc.gz", 82 | "locations": [ 83 | "http://archive.org/.../2014-01-02-blah.warc.gz", 84 | "/ipfs/Qmbda3f7abccdad41977fb308453566f84" 85 | ] 86 | } 87 | ] 88 | } 89 | ``` 90 | 91 | * **/jobs** 92 | 93 | *(example: https://partner.archive-it.org/v0/export/api/jobs)* 94 | 95 | The **/jobs** path shows the jobs on this server accessible to the client. This enables the request and delivery of WARC derivative webdata files. The **/jobs** path supports GET and POST methods. Implementations that do not include the ability to submit a job should still support this path and simply return that no jobs are possible for the client on this server. 96 | 97 | * **Example queries and results** 98 | 99 | GET *https://partner.archive-it.org/v0/export/api/jobs* 100 | 101 | Results here depend on whether jobs have been submitted. If no jobs have been submitted, you get an empty list. If you have submitted jobs, you get something similar to the below. 102 | 103 | ``` 104 | [ 105 | { 106 | "jobtoken": "21EC2020-08002B30309D", 107 | "function": "build-wat", 108 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 109 | "submit-time": "2016-08-30Z15:52:53", 110 | "state": "complete", 111 | "result": [ 112 | [ 113 | { 114 | "checksum": "md5:beefface09384509", 115 | "content-type": "application/wat", 116 | "filename": "2014-01-01-blah.wat.gz", 117 | "locations": [ 118 | "http://archive-it.org/.../2014-01-01-blah.wat.gz" 119 | ] 120 | }, 121 | { 122 | "checksum": "md5:beefface09384510", 123 | "content-type": "application/wat", 124 | "filename": "2014-01-02-blah.wat.gz", 125 | "locations": [ 126 | "http://archive-it.org/.../2014-01-02-blah.wat.gz" 127 | ] 128 | } 129 | ] 130 | ] 131 | } 132 | ] 133 | ``` 134 | 135 | POST *https://partner.archive-it.org/v0/export/api/jobs* 136 | 137 | The POST method includes a string matching a **/webdata** query string plus an implementation-specific function available to the client. 
In this specification, POST requests remain a bit of an abstraction, as they are dependent upon the implementation-specific parameters supported under **/webdata**. 138 | 139 | Using the previous **/webdata** example, the below POST request would return a job token for creating WATs for WARC files matching that **/webdata** query: 140 | 141 | *https://partner.archive-it.org/v0/export/api/jobs?acccountId=123&collectionId=456&startDate=2014&endDate=2015&function=build-wat* 142 | 143 | ``` 144 | { 145 | "jobtoken": "21EC2020-08002B30309D", 146 | "function": "build-wat", 147 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 148 | "submit-time": "2016-08-30Z15:52:53", 149 | "state": "queued" 150 | } 151 | ``` 152 | 153 | * **/jobs/{jobToken}** 154 | 155 | *(example: https://partner.archive-it.org/v0/export/api/jobs/123456)* 156 | 157 | The **/jobs/{jobToken}** path returns the status of a submitted job. 158 | 159 | * **Example queries and results** 160 | 161 | GET *https://partner.archive-it.org/v0/export/api/jobs/21EC2020-08002B30309D* 162 | 163 | Retrieve status for a submitted job, some metadata, including the original query, time it was requested, etc. Includes results list if job is finished. Results are not necessarily available indefinitely. May return "410 Gone" if derivatives generated by this job have been replaced (e.g. by the results of a newer job), or if job has been expired by some other policy. An implementation may (but is not required to) make results later available under /webdata queries. 164 | 165 | ``` 166 | { 167 | "jobtoken": "21EC2020-08002B30309D", 168 | "function": "build-wat", 169 | "query": "acccountId=123&collectionId=456&startDate=2014&endDate=2015", 170 | "submit-time": "2016-08-30Z15:52:53", 171 | "state": "complete", 172 | "result": [ 173 | [ 174 | { 175 | "checksum": "md5:beefface09384509", 176 | "content-type": "application/wat", 177 | "filename": "2014-01-01-blah.wat.gz", 178 | "locations": [ 179 | "http://archive-it.org/.../2014-01-01-blah.wat.gz" 180 | ] 181 | }, 182 | { 183 | "checksum": "md5:beefface09384510", 184 | "content-type": "application/wat", 185 | "filename": "2014-01-02-blah.wat.gz", 186 | "locations": [ 187 | "http://archive-it.org/.../2014-01-02-blah.wat.gz" 188 | ] 189 | } 190 | ] 191 | ] 192 | } 193 | ``` 194 | 195 | **Additional Definitions** 196 | 197 | The result of a /webdata query or result of a job can be represented in multiple formats and offered via multiple transports. To express this and allow the client to select the most appropriate, an implementation includes a "WebdataMenu" in each result. A WebdataMenu offers a number of “WebdataBundles”, each of which provide the complete result with a distinct format and transport. Each WebdataBundle contains one or more WebdataFiles. The client chooses a WebdataBundle with appropriate format and transport. 198 | 199 | Here’s an example of a single WebdataMenu which contains two WebdataBundles. The first WebdataBundle contains three WebdataFiles; the second contains one. 
200 | 201 | ``` 202 | [ 203 | [ 'http://partner.archive-it.org/.../2016-08-30-blah.warc.gz', 204 | 'http://partner.archive-it.org/.../2016-08-30-blah1.warc.gz', 205 | 'http://partner.archive-it.org/.../2016-08-30-blah2.warc.gz' 206 | ], 207 | [ 'ipfs/Qm67e26534d15bc305340ce4b2e5944ffc' ] 208 | ] 209 | ``` 210 | 211 | **Timeline & Contacts** 212 | 213 | This document and the accompanying Swagger .yaml file were shared across the primary development project team for comment and input in early September 2016. The document will be shared in late September with the full grant team, Technical Working Group, and program managers and engineers of the web archiving community attending the IIPC Steering Committee meeting and Crawler Hackathon at British Library the week of September 19, 2016. After a period of comments, the spec and doc will be shared with the full web archiving community for additional feedback. Reference implementations of the specification will be developed by Internet Archive (Archive-It) and Stanford (LOCKSS) in Q4 of 2016. Testing, iterative development, and other ongoing activities will take place in 2017. 214 | 215 | *Internet Archive (Archive-It)* 216 | * Jefferson Bailey, Director, Web Archiving, jefferson@archive.org 217 | * Mark Sullivan, Web Archiving Software Engineer, msullivan@archive.org 218 | 219 | *Stanford University Libraries (DLSS & LOCKSS)* 220 | * Nicholas Taylor, Web Archiving Service Manager, ntay@stanford.edu 221 | * David Rosenthal, LOCKSS Chief Information Scientist, dshr@stanford.edu 222 | -------------------------------------------------------------------------------- /general-specification/transfer_api_v1.yaml: -------------------------------------------------------------------------------- 1 | swagger: '2.0' 2 | info: 3 | title: WASAPI Export API 4 | description: > 5 | WASAPI Export API. A draft of the minimum that a Web Archiving Systems API 6 | server must implement. 7 | version: 0.1.0 8 | contact: 9 | name: Jefferson Bailey and Mark Sullivan 10 | url: https://github.com/WASAPI-Community/data-transfer-apis 11 | license: 12 | name: Apache 2.0 13 | url: http://www.apache.org/licenses/LICENSE-2.0.html 14 | consumes: 15 | - application/json 16 | produces: 17 | - application/json 18 | basePath: /v0 19 | schemes: 20 | - https 21 | paths: 22 | /webdata: 23 | get: 24 | parameters: 25 | - name: filename 26 | in: query 27 | type: string 28 | description: > 29 | A semicolon-separated list of globs. In each glob, a star `*` 30 | matches any string of characters, and a question mark `?` matches 31 | exactly one character. Are the globs matched against the full 32 | pathname (ie with directories) vs just the basename?, and if 33 | pathname, is the slash `/` specially matched (cf `**`)?
34 | - name: content-type 35 | in: query 36 | type: string 37 | description: A semicolon-separated list of acceptable MIME-types 38 | responses: 39 | '200': 40 | description: Success 41 | schema: 42 | type: object 43 | properties: 44 | includes-extra: 45 | type: boolean 46 | files: 47 | $ref: '#/definitions/WebdataMenu' 48 | '400': 49 | description: The request could not be interpreted 50 | '401': 51 | description: The Request was unauthorized 52 | /jobs: 53 | get: 54 | summary: Show the jobs on this server accessible to the client 55 | responses: 56 | '200': 57 | description: The list of jobs 58 | schema: 59 | type: array 60 | items: 61 | $ref: '#/definitions/Job' 62 | post: 63 | parameters: 64 | - name: query 65 | in: formData 66 | description: URL-encoded query as appropriate for /webdata end-point 67 | type: string 68 | - name: function 69 | in: formData 70 | description: > 71 | An implementation-specific identifier for some function the 72 | implementation supports 73 | type: string 74 | - name: parameters 75 | in: formData 76 | description: > 77 | Other parameters specific to the function and implementation 78 | (URL-encoded). For example: level of compression, priority, time 79 | limit, space limit. 80 | type: string 81 | responses: 82 | '201': 83 | description: > 84 | Job was successfully submitted. Body is the submitted job. 85 | schema: 86 | $ref: '#/definitions/Job' 87 | '400': 88 | description: The request could not be interpreted 89 | '401': 90 | description: The Request was unauthorized 91 | '/jobs/{jobToken}': 92 | get: 93 | summary: Retrieve status for job 94 | parameters: 95 | - name: jobToken 96 | in: path 97 | description: The job token returned from previous request 98 | required: true 99 | type: string 100 | responses: 101 | '200': 102 | description: Success 103 | schema: 104 | $ref: '#/definitions/Job' 105 | '400': 106 | description: The request could not be interpreted 107 | '401': 108 | description: The Request was unauthorized 109 | '403': 110 | description: Forbidden 111 | '404': 112 | description: No such job 113 | '410': 114 | description: > 115 | Gone / invalidated. Body may include non-result information about 116 | the job. 117 | definitions: 118 | WebdataFile: 119 | description: > 120 | The unit of distribution of web archival data. Examples: a WARC file, 121 | an ARC file, a CDX file, a WAT file, a DAT file, a tarball. 
122 | type: object 123 | required: 124 | - filename 125 | - checksum 126 | - content-type 127 | - locations 128 | properties: 129 | filename: 130 | type: string 131 | description: The name of the webdata file 132 | content-type: 133 | # TODO: handle compression etc 134 | type: string 135 | description: > 136 | The MIME-type for the webdata file, eg `application/warc`, 137 | `application/pdf` 138 | checksum: 139 | type: string 140 | description: > 141 | Checksum for the webdata file, eg "sha1:beefface09781234897", 142 | "md5:dad0dada09823098" 143 | size: 144 | type: integer 145 | format: int64 146 | description: The size in bytes of the webdata file 147 | locations: 148 | type: array 149 | items: 150 | type: string 151 | format: url 152 | description: > 153 | A list of (mirrored) sources from which to retrieve (identical copies 154 | of) the webdata file, eg "http://archive.org/...", 155 | "/ipfs/Qmee6d6b05c21d1ba2f2020fe2db7db34e" 156 | WebdataBundle: 157 | description: > 158 | A "bundle" of webdata files that together satisfy a query, job, etc 159 | type: array 160 | items: 161 | $ref: '#/definitions/WebdataFile' 162 | WebdataMenu: 163 | description: > 164 | A set of alternative webdata bundles, each of which satisfies a given 165 | query, job, etc. An implementation may offer a different bundle (with 166 | differing number of webdata files) for each of its available transports, 167 | etc. 168 | type: array 169 | items: 170 | $ref: '#/definitions/WebdataBundle' 171 | Job: 172 | type: object 173 | description: A submitted job with optional results 174 | required: 175 | - jobtoken 176 | - function 177 | - query 178 | - submit-time 179 | - state 180 | properties: 181 | jobtoken: 182 | type: string 183 | description: > 184 | Identifier unique across the implementation. The implementation 185 | chooses the format. For example: GUID, increasing integer. 186 | function: 187 | type: string # enum 188 | description: eg `build-WAT`, `build-index` 189 | query: 190 | type: string 191 | description: > 192 | The specification of what webdata to include in the job. Encoding is 193 | URL-style, eg `param=value&otherparam=othervalue`. 194 | submit-time: 195 | type: string 196 | format: date-time 197 | description: Time of submission, formatted according to RFC3339 198 | state: 199 | type: string # enum 200 | description: > 201 | Implementation-defined, eg `queued`, `running`, `failed`, `complete`, 202 | `gone` 203 | result: 204 | allOf: 205 | - description: > 206 | This property indicates whether the job has completed (without 207 | having been cleaned away). When present, it is a list of URLs to 208 | webdata files. Should its absense be expressed as omission vs 209 | null/undef/etc vs empty list?, and how do we write that in swagger? 210 | - $ref: '#/definitions/WebdataMenu' 211 | -------------------------------------------------------------------------------- /lockss-implementation/README.md: -------------------------------------------------------------------------------- 1 | # LOCKSS WASAPI implementation 2 | 3 | Code related to the LOCKSS implementation of the WASAPI general specification. 4 | 5 | * The default_controller.py generated from the WASAPI API spec by FLASK as modified to interface to the LOCKSS daemon's SOAP-y export API. 6 | * A minimal WASAPI client just sufficient to test the server. 7 | 8 | Note that the server has XXX comments mostly about mismatches between the LOCKSS SOAP-y API and WASAPI. 
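For orientation, the request that the minimal test client issues boils down to the following sketch; the service URL, the `auid` function, and the test AUID are the defaults hard-coded in `wasapi-test.py` for a local test daemon, not a stable public interface.

    import requests

    # Defaults taken from wasapi-test.py: a local server and the simulated-content AUID.
    SERVICE = "http://localhost:8080/v0"
    TEST_AUID = "org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F"

    response = requests.post(SERVICE + "/jobs",
                             data={"function": "auid", "query": TEST_AUID})
    print(response.status_code)
    # Job creation is synchronous in this implementation, so the body already
    # describes the exported WARC (filename, location, size, and sha1 checksum).
    print(response.json())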
9 | 10 | This works, in that a LOCKSS daemon in our test framework can be run to create an AU full of synthetic content, export it via WASAPI and HTTP, and verify the SHA1 of the result. Not tested at scale yet. Will need fixing when some suggested changes are made to the LOCKSS daemon. 11 | -------------------------------------------------------------------------------- /lockss-implementation/default_controller.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os, sys 3 | import suds 4 | from suds.client import Client 5 | import json 6 | from datetime import datetime, timezone 7 | from base64 import b64decode 8 | import hashlib 9 | 10 | persistentFile = '/tmp/wsapiJobs' 11 | testAUID = 'org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F' 12 | # LOCKSS SOAP export web service 13 | # NB - org.lockss.export.enabled=true must be set when the daemon starts 14 | # for the returned URL to be valid. 15 | # XXX these should be arguments 16 | host = 'localhost:8081' 17 | url = 'http://'+host+'/ws/ExportService?wsdl' 18 | user = 'lockss-u' 19 | pwd4lockss = 'lockss-p' 20 | 21 | # WASAPI API 22 | def jobs_post(query = None, function = None, parameters = None) -> str: 23 | # WARC creation is supposed to be asynchronous, but for now we 24 | # make it synchronous, so all jobs are in status 'complete' 25 | if function != 'auid': 26 | return "" 27 | # XXX for now, query is simply an AUID 28 | # XXX query should be a list of AUIDs, not just one 29 | client = Client(url, username=user, password=pwd4lockss) 30 | params = makeParams(client, query) 31 | results = client.service.createExportFiles(params) 32 | jobId = makeJobId() 33 | putJob(jobId, params, results) 34 | return makePostResponse(jobId, params, encodeResults(results)) 35 | 36 | def webdata_get(filename = None, contentType = None) -> str: 37 | # XXX webdata requires that content be returned selectively 38 | # XXX if it matches specified URL globbing and/or mime type. 39 | # The LOCKSS daemon's SOAP-y API does not support this. 40 | message = { 41 | 'status': '400', 42 | 'message': 'Content filtering not supported' 43 | } 44 | # flask's jsonify is not imported in this module, so return a JSON body 45 | # and let the framework apply the 400 status from the (body, status) tuple. 46 | return json.dumps(message), 400 47 | 48 | def jobs_get() -> str: 49 | ret = {} 50 | jobs = getJobs() 51 | ret = json.dumps(jobs) 52 | return ret 53 | 54 | def jobs_job_token_get(jobToken) -> str: 55 | ret = {} # dict, not list: string keys are assigned below 56 | job = getJob(jobToken) 57 | if job != None: 58 | params = job['params'] 59 | ret['function'] = 'auid' # the only function jobs_post accepts 60 | ret['jobtoken'] = jobToken 61 | ret['query'] = params['auid'] # stored params (see encodeParams) keep the AUID under 'auid' 62 | ret['state'] = 'complete' 63 | ret['submit-time'] = job['submit-time'] 64 | return json.dumps(ret) 65 | 66 | # Persistent state - i.e.
jobs database 67 | def putJob(jobId, params, results = None): 68 | ret = None 69 | state = {'params':encodeParams(params), 70 | 'results':encodeResults(results), 71 | 'submit-time':encodeTime()} 72 | try: 73 | fileState = os.stat(persistentFile) 74 | except FileNotFoundError as ex: 75 | jobs = {} 76 | else: 77 | if fileState.st_size > 0: 78 | with open(persistentFile, 'r') as f: 79 | jobs = json.load(f) 80 | else: 81 | jobs = {} 82 | with open(persistentFile, 'w') as f: 83 | jobs[jobId] = state 84 | json.dump(jobs, f) 85 | ret = jobId 86 | return ret 87 | 88 | def getJob(jobId): 89 | try: 90 | with open(persistentFile, 'r') as f: 91 | # XXX need to lock file 92 | jobs = json.load(f) 93 | return jobs[jobId] 94 | except FileNotFoundError: 95 | return None 96 | def getJobs(): 97 | try: 98 | with open(persistentFile, 'r') as f: 99 | # XXX need to lock file 100 | jobs = json.load(f) 101 | return jobs 102 | except FileNotFoundError: 103 | return [] 104 | 105 | # LOCKSS SOAP-y export web service 106 | def makeParams(client, auid): 107 | typeEnum = client.factory.create(u'typeEnum') 108 | filenameTranslationEnum = client.factory.create(u'filenameTranslationEnum') 109 | params = client.factory.create(u'exportServiceParams') 110 | params.auid = auid 111 | params.compress = 1 112 | params.excludeDirNodes = 0 113 | params.fileType = typeEnum.WARC_RESOURCE 114 | params.filePrefix = 'lockss-export-' 115 | params.maxSize = -1 116 | params.maxVersions = -1 117 | params.xlateFilenames = filenameTranslationEnum.XLATE_NONE 118 | return params 119 | 120 | # Encode exportServiceParams as a Dictionary 121 | def encodeParams(params): 122 | ret = {} 123 | ret['auid'] = params.auid 124 | ret['compress'] = params.compress 125 | ret['excludeDirNodes'] = params.excludeDirNodes 126 | ret['fileType'] = params.fileType 127 | ret['filePrefix'] = params.filePrefix 128 | ret['maxSize'] = params.maxSize 129 | ret['maxVersions'] = params.maxVersions 130 | ret['xlateFilenames'] = params.xlateFilenames 131 | return ret 132 | 133 | # Encode exportServiceWsResult as Dictionary 134 | # XXX there is currently no option to the SOAP-y API to select 135 | # XXX between streaming the WARC and placing it in the export 136 | # XXX directory - it does both. 137 | def encodeResults(results): 138 | ret = {} 139 | ret['auid'] = results.auId 140 | ret['name'] = results.dataHandlerWrappers[0].name 141 | ret['size'] = results.dataHandlerWrappers[0].size 142 | # XXX we need the LOCKSS daemon to compute the 143 | # XXX checksum of the files it puts in the export 144 | # XXX directory so that exports can be validated. 145 | # XXX We would fetch the checksum file here to get 146 | # XXX the checksum. 147 | # XXX Instead we compute the checksum of the streamed 148 | # XXX WARC content but this isn't an end-to-end check. 
149 | b = bytes(results.dataHandlerWrappers[0].dataHandler, "utf-8") 150 | warc = b64decode(b) 151 | m = hashlib.sha1() 152 | m.update(warc) 153 | ret['sha1'] = m.hexdigest() 154 | return ret 155 | 156 | def encodeTime(): 157 | local_time = datetime.now(timezone.utc).astimezone() 158 | return local_time.isoformat() 159 | 160 | def makeJobId(): 161 | # XXX for now, jobId is submission time 162 | return encodeTime() 163 | 164 | # Return the body of the POST response 165 | def makePostResponse(jobId, params, results): 166 | ret = { 167 | "includes-extras":False, 168 | "files":[ 169 | { 170 | "checksum":"sha1:" + results['sha1'], 171 | "content-type":"application/warc", 172 | "filename":results['name'], 173 | "locations":[ 174 | "http://"+host+"/export/"+results['name'] 175 | ], 176 | "size":results['size'] 177 | } 178 | ] 179 | } 180 | return json.dumps(ret) 181 | -------------------------------------------------------------------------------- /lockss-implementation/wasapi-test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # A minimal client for the WASAPI transfer API, with 4 | # defaults matching the LOCKSS implementation 5 | # 6 | # Arguments: 7 | # -f [function] or --function [function] default auid 8 | # -q [query] or --query [query] default AUID for SimulatedContent 9 | 10 | from argparse import ArgumentParser 11 | import requests 12 | import json 13 | 14 | # URL prefix for WSAPI service 15 | service = "http://localhost:8080/v0" 16 | testAUID = 'org|lockss|plugin|simulated|SimulatedPlugin&root~%2Ftmp%2F' 17 | err = "Error: " 18 | 19 | # Return a Dictionary with the params for the WSAPI request 20 | def makeWsapiParams(): 21 | parser = ArgumentParser() 22 | parser.add_argument("-f", "--function", dest="myFunction", help="WASAPI function", default="auid") 23 | parser.add_argument("-q", "--query", dest="myQuery", default=testAUID, help="WASAPI query") 24 | args = parser.parse_args() 25 | ret = None 26 | if (args.myQuery != None and args.myFunction != None): 27 | ret = {} 28 | ret['query'] = args.myQuery 29 | ret['function'] = args.myFunction 30 | return ret 31 | 32 | params1 = makeWsapiParams() 33 | if params1 != None: 34 | # query the service 35 | wasapiResponse = requests.post(service + "/jobs", data=params1) 36 | status = wasapiResponse.status_code 37 | if(status == 200): 38 | # WASAPI request successful 39 | err = "" 40 | # parse the JSON we got back 41 | wasapiData = wasapiResponse.json() 42 | message = json.dumps(wasapiData) 43 | else: 44 | # WASAPI request unsuccessful 45 | message = "WSAPI request error: {0}\n{1}".format(status,wasapiResponse) 46 | else: 47 | message = "Usage: wasapi-test -q [query] -f [function]" 48 | print('WASAPI test') 49 | print("{0}{1}".format(err,message)) 50 | print() 51 | -------------------------------------------------------------------------------- /utilities/README.md: -------------------------------------------------------------------------------- 1 | # WASAPI Utilities # 2 | 3 | A list of open-source utilities, downloaders, and processing tools that make use of the WASAPI APIs for data transfer into local systems. Utilities by Stanford University Libraries, University of North Texas Libraries, and Rutgers University were developed as part of the main WASAPI grant project. 
4 | 5 | ## Stanford University Libraries ## 6 | 7 | https://github.com/sul-dlss/wasapi-downloader 8 | 9 | ## University of North Texas Libraries ## 10 | 11 | https://github.com/unt-libraries/py-wasapi-client 12 | --------------------------------------------------------------------------------