├── .gitignore
├── .travis.yml
├── LICENSE
├── NOTES.md
├── README.md
├── archivekit
│   ├── __init__.py
│   ├── archive.py
│   ├── collection.py
│   ├── ext.py
│   ├── ingest.py
│   ├── manifest.py
│   ├── package.py
│   ├── resource.py
│   ├── store
│   │   ├── __init__.py
│   │   ├── common.py
│   │   ├── file.py
│   │   └── s3.py
│   ├── types
│   │   ├── __init__.py
│   │   └── source.py
│   └── util.py
├── setup.py
└── tests
    ├── fixtures
    │   └── test.csv
    ├── helpers.py
    ├── test_file.py
    └── test_s3.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.stash/*
*.pyc
*.DS_Store
*.egg-info

dist/*
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - "2.7"
before_install:
  - virtualenv ./pyenv --distribute
  - source ./pyenv/bin/activate
install:
  - python setup.py develop
  - pip install coveralls moto nose
script:
  - nosetests
  - coverage run --source=archivekit setup.py test
after_success:
  - coveralls
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2015 Friedrich Lindenberg

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/NOTES.md:
--------------------------------------------------------------------------------
## How to implement de-duped collections in archivekit?

* Store a content hash in the manifest and traverse to find
  duplicates (too slow, too mutable).
* Make local copies of the data, generate a content hash, upload
  by hash ID.

Sources:

* From file name
* From URL
* From fileobj

Options:

* Content hashed
* Move or copy?
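
The second option is what ``Collection.ingest`` implements; a rough sketch
of the flow (cf. ``collection.py`` and ``ingest.py``):

```python
from archivekit.ingest import Ingestor

def dedup_ingest(collection, something):
    # every input is normalized into one or more Ingestor objects
    for ingestor in Ingestor.analyze(something):
        try:
            # hash() caches the payload locally and SHA1s it; using the
            # hash as the package ID collapses identical content
            package = collection.get(ingestor.hash())
            package.ingest(ingestor)
        finally:
            ingestor.dispose()
```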
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# archivekit

[![Build Status](https://travis-ci.org/pudo/archivekit.png?branch=master)](https://travis-ci.org/pudo/archivekit) [![Coverage Status](https://coveralls.io/repos/pudo/archivekit/badge.svg)](https://coveralls.io/r/pudo/archivekit)

``archivekit`` provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a JSON manifest file.

This library is inspired by [OFS](https://github.com/okfn/ofs), [BagIt](https://github.com/LibraryOfCongress/bagit-python) and [Pairtree](https://pythonhosted.org/Pairtree/). It replaces a previous project, [docstash](https://github.com/pudo/docstash).


## Installation

The easiest way of using ``archivekit`` is via PyPI:

```bash
$ pip install archivekit
```

Alternatively, check out the repository from GitHub and install it locally:

```bash
$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop
```


## Example

``archivekit`` manages ``Packages`` which contain one or several ``Resources`` and their associated metadata. Each ``Package`` is part of a named ``Collection``.

```python
from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('data', 'file', path='/tmp')

# or via S3:
collection = open_collection('data', 's3', aws_key_id='..',
                             aws_secret='..', bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
    package.save()
```

The code for this library is very compact, go check it out.


## Configuration

If AWS credentials are not supplied for an S3-based collection, ``archivekit`` will fall back to the ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` environment variables; ``AWS_BUCKET_NAME`` is used in the same way when no bucket name is given.

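## Extending

Store backends and resource types are discovered through setuptools entry points (see ``archivekit/ext.py`` and the ``entry_points`` section of ``setup.py``), so third-party packages can register additional stores. The sketch below is illustrative: ``archivekit-redis`` and ``RedisStore`` are made-up names.

```python
# setup.py of a hypothetical extension package
from setuptools import setup

setup(
    name='archivekit-redis',
    packages=['archivekit_redis'],
    entry_points={
        'archivekit.stores': [
            'redis = archivekit_redis:RedisStore',
        ],
    },
)
```

The registered class should subclass ``archivekit.store.common.Store`` and implement ``get_object``, ``list_collections``, ``list_packages`` and ``list_resources``. Once installed, it can be used as ``open_collection('name', 'redis', ...)``.
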
## License

``archivekit`` is open source, licensed under a standard MIT license (included in this repository as ``LICENSE``).
--------------------------------------------------------------------------------
/archivekit/__init__.py:
--------------------------------------------------------------------------------
from archivekit.collection import Collection  # noqa
from archivekit.archive import Archive  # noqa
from archivekit.resource import Resource  # noqa
from archivekit.types.source import Source  # noqa

from archivekit.ext import get_stores


def _open_store(store_type, **kwargs):
    store_cls = get_stores().get(store_type)
    if store_cls is None:
        raise TypeError("No such store type: %s" % store_type)
    return store_cls(**kwargs)


def open_collection(name, store_type, **kwargs):
    """ Create an ``archivekit.Collection`` of the given store type,
    passing along any arguments. The valid store types at the moment
    are: ``s3`` and ``file``. """
    return Collection(name, _open_store(store_type, **kwargs))


def open_archive(store_type, **kwargs):
    """ Create an ``archivekit.Archive`` of the given store type,
    passing along any arguments. The valid store types at the moment
    are: ``s3`` and ``file``. """
    return Archive(_open_store(store_type, **kwargs))
--------------------------------------------------------------------------------
/archivekit/archive.py:
--------------------------------------------------------------------------------
from archivekit.collection import Collection


class Archive(object):
    """ An archive is composed of collections. """

    def __init__(self, store):
        self.store = store

    def get(self, name):
        """ Get a collection of packages. """
        return Collection(name, self.store)

    def __iter__(self):
        for name in self.store.list_collections():
            yield Collection(name, self.store)

    def __contains__(self, name):
        for collection in self:
            if collection == name or collection.name == name:
                return True
        return False

    def __repr__(self):
        return '<Archive(%s)>' % self.store
--------------------------------------------------------------------------------
/archivekit/collection.py:
--------------------------------------------------------------------------------
from archivekit.package import Package
from archivekit.ingest import Ingestor


class Collection(object):
    """ The set of all packages with an existing manifest in the
    given store. """

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def create(self, id=None, manifest=None):
        """ Create a package and save a manifest. If ``manifest`` is
        given, the values are saved to the manifest. """
        package = Package(self.store, self, id=id)
        if manifest is not None:
            package.manifest.update(manifest)
        package.save()
        return package

    def get(self, id):
        """ Get a ``Package`` identified by the ``id``. """
        return Package(self.store, self, id=id)

    def ingest(self, something, meta=None):
        """ Import a given object into the collection. The object can
        be either a URL, a file or folder name, an open file handle or
        an HTTP response object from urllib, urllib2 or requests.

        Before importing it, a SHA1 hash is generated and used as the
        package ID. If a package with that ID already exists, it will
        be overwritten. If you do not desire SHA1 de-duplication,
        create a package directly and ingest from there. """
        for ingestor in Ingestor.analyze(something):
            try:
                # the content hash doubles as the package ID, which is
                # what de-duplicates repeated imports of the same data
                package = self.get(ingestor.hash())
                package.ingest(ingestor, meta=meta)
            finally:
                ingestor.dispose()

    def __iter__(self):
        for package_id in self.store.list_packages(self.name):
            yield Package(self.store, self, id=package_id)

    def __contains__(self, id):
        for package in self:
            if package == id or package.id == id:
                return True
        return False

    def __repr__(self):
        return '<Collection(%s)>' % self.name

    def __eq__(self, other):
        if not hasattr(other, 'name'):
            return False
        return self.name == other.name
--------------------------------------------------------------------------------
/archivekit/ext.py:
--------------------------------------------------------------------------------
from pkg_resources import iter_entry_points


def get_stores():
    stores = {}
    for ep in iter_entry_points('archivekit.stores'):
        stores[ep.name] = ep.load()
    return stores


def get_resource_types():
    types = {}
    for ep in iter_entry_points('archivekit.resource_types'):
        types[ep.name] = ep.load()
    return types
--------------------------------------------------------------------------------
/archivekit/ingest.py:
--------------------------------------------------------------------------------
from os import path, walk, unlink
from os import name as osname
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
from httplib import HTTPResponse
from StringIO import StringIO
from urlparse import urlparse
import mimetypes

import requests

from archivekit.util import clean_headers, checksum, fullpath
from archivekit.util import make_secure_filename, slugify


def directory_files(fpath):
    for (dir_name, _, files) in walk(fpath):
        for file_name in files:
            yield path.join(dir_name, file_name)


class Ingestor(object):
    """ An ingestor is an intermediate object used when importing data.
    Since the source types (URLs, file names or file handles) are very
    diverse, this may require data to be cached locally, e.g. to
    generate a SHA1 hash signature. """

    def __init__(self, file_name=None, file_obj=None, meta=None):
        self._file_name = file_name
        self._file_obj = file_obj
        self._file_cache = None
        self._hash = None
        self.is_local = file_name is not None
        self.meta = meta or {}

    def local(self):
        if self.is_local:
            return self._file_name
        if self._file_cache is None:
            tempfile = NamedTemporaryFile(delete=False)
            copyfileobj(self._file_obj, tempfile)
            self._file_cache = tempfile.name
            tempfile.close()
        return self._file_cache

    def has_local(self):
        cached = self._file_cache and path.exists(self._file_cache)
        return self.is_local or cached

    def hash(self):
        """ Generate a SHA1 hash of the given ingested object. """
        if self._hash is None:
            self._hash = checksum(self.local())
        return self._hash

    def generate_meta(self, meta):
        """ Set up some generic metadata for the resource, based on
        the available file name, HTTP headers etc. """
        meta = meta or {}
        for key, value in self.meta.items():
            if key not in meta:
                meta[key] = value
        if 'source_file' not in meta and self._file_name:
            meta['source_file'] = self._file_name

        if not meta.get('name'):
            if meta.get('source_url') and len(meta.get('source_url')):
                url_path = meta['source_url']
                try:
                    url_path = urlparse(url_path).path
                except:
                    pass
                meta['name'] = url_path
            elif meta.get('source_file') and len(meta.get('source_file')):
                meta['name'] = meta['source_file']
        name, slug, ext = make_secure_filename(meta.get('name'))
        meta['name'] = name
        if not meta.get('slug'):
            meta['slug'] = slug
        if not meta.get('extension'):
            meta['extension'] = ext
        if not meta.get('mime_type') and 'http_headers' in meta:
            mime_type = meta.get('http_headers').get('content_type')
            if mime_type and mime_type not in ['application/octet-stream',
                                               'text/plain']:
                meta['mime_type'] = mime_type
                ext = mimetypes.guess_extension(mime_type)
                if ext is not None:
                    meta['extension'] = ext.strip('.')
        elif not meta.get('mime_type') and meta.get('name'):
            mime_type, encoding = mimetypes.guess_type(meta.get('name'))
            meta['mime_type'] = mime_type

        if meta.get('extension'):
            meta['extension'] = slugify(meta.get('extension'))
        return meta

    def store(self, source):
        if not self.has_local():
            source.save_fileobj(self._file_obj)
        elif self.is_local:
            source.save_file(self._file_name)
        elif self._file_cache:
            source.save_file(self._file_cache, destructive=True)

    def dispose(self):
        if self._file_cache is not None and path.exists(self._file_cache):
            unlink(self._file_cache)

    @classmethod
    def analyze(cls, something):
        """ Accept a given input (e.g. a URL, file path, or file
        handle) and determine how to normalize it into an ``Ingestor``
        while generating metadata. """
        if isinstance(something, cls):
            return (something, )

        if isinstance(something, basestring):
            # Treat strings as paths or URLs
            url = urlparse(something)
            if url.scheme.lower() in ['http', 'https']:
                something = requests.get(something)
            elif url.scheme.lower() in ['file', '']:
                finalpath = url.path
                if osname == 'nt':
                    finalpath = finalpath[1:]
                upath = fullpath(finalpath)
                if path.isdir(upath):
                    return (cls(file_name=f) for f in directory_files(upath))
                return (cls(file_name=upath),)

        # Python requests
        if isinstance(something, requests.Response):
            fd = StringIO(something.content)
            return (cls(file_obj=fd, meta={
                'http_status': something.status_code,
                'http_headers': clean_headers(something.headers),
                'source_url': something.url
            }), )

        if isinstance(something, HTTPResponse):
            # httplib responses don't reliably expose the request URL
            return (cls(file_obj=something, meta={
                'http_status': something.status,
                'http_headers': clean_headers(something.getheaders()),
                'source_url': getattr(something, 'url', None)
            }), )

        elif hasattr(something, 'geturl') and hasattr(something, 'info'):
            # assume urllib or urllib2
            return (cls(file_obj=something, meta={
                'http_status': something.getcode(),
                'http_headers': clean_headers(something.headers),
                'source_url': something.url
            }), )

        elif hasattr(something, 'read'):
            # Fileobj will be a bit bland
            return (cls(file_obj=something), )

        return []
--------------------------------------------------------------------------------
/archivekit/manifest.py:
--------------------------------------------------------------------------------
import json
import collections
from datetime import datetime

from archivekit.util import json_default, json_hook


class Manifest(dict):
    """ A manifest holds the metadata of a package. """

    def __init__(self, obj):
        self.object = obj
        self.load()

    def load(self):
        if self.object.exists():
            data = self.object.load_data()
            self.update(json.loads(data, object_hook=json_hook))
        else:
            self['created_at'] = datetime.utcnow()
            self.update({'resources': {}})

    def save(self):
        # serialized as manifest.json, roughly:
        #   {"created_at": ..., "updated_at": ...,
        #    "resources": {"source/test.csv": {"name": ..., ...}}}
        self['updated_at'] = datetime.utcnow()
        content = json.dumps(self, default=json_default, indent=2)
        self.object.save_data(content)

    def __repr__(self):
        return '<Manifest(%r)>' % self.object


class ResourceMetaData(collections.MutableMapping):
    """ Metadata for a resource is derived from the main manifest. """

    def __init__(self, resource):
        self.resource = resource
        self.manifest = resource.package.manifest
        if not isinstance(self.manifest.get('resources'), dict):
            self.manifest['resources'] = {}
        existing = self.manifest['resources'].get(self.resource.path)
        if not isinstance(existing, dict):
            self.manifest['resources'][self.resource.path] = {
                'created_at': datetime.utcnow()
            }

    def touch(self):
        self.manifest['resources'][self.resource.path]['updated_at'] = \
            datetime.utcnow()

    def __getitem__(self, key):
        return self.manifest['resources'][self.resource.path][key]

    def __setitem__(self, key, value):
        self.manifest['resources'][self.resource.path][key] = value
        self.touch()

    def __delitem__(self, key):
        del self.manifest['resources'][self.resource.path][key]
        self.touch()

    def __iter__(self):
        return iter(self.manifest['resources'][self.resource.path])

    def __len__(self):
        return len(self.manifest['resources'][self.resource.path])

    def __keytransform__(self, key):
        return key

    def save(self):
        self.touch()
        self.resource.package.save()

    def __repr__(self):
        return '<ResourceMetaData(%s)>' % self.resource.path
--------------------------------------------------------------------------------
/archivekit/package.py:
--------------------------------------------------------------------------------
import os
from uuid import uuid4
from itertools import count

from archivekit.manifest import Manifest
from archivekit.ext import get_resource_types
from archivekit.ingest import Ingestor
from archivekit.types.source import Source
from archivekit.store.common import MANIFEST


class Package(object):
    """ A package is an entry within a collection. It consists of a
    source file, a manifest metadata file and one or more processed
    versions. """

    def __init__(self, store, collection, id=None):
        self.store = store
        self.collection = collection.name
        self.id = id or uuid4().hex

    def has(self, cls, name):
        """ Check if a resource of a given type and name exists. """
        return cls(self, name).exists()

    def all(self, cls, *extra):
        """ Iterate over all resources of a given type. """
        prefix = os.path.join(cls.GROUP, *extra)
        for path in self.store.list_resources(self.collection, self.id):
            if path.startswith(prefix):
                yield cls.from_path(self, path)

    def exists(self):
        """ Check if the package identified by the given ID exists. """
        obj = self.store.get_object(self.collection, self.id, MANIFEST)
        return obj.exists()

    @property
    def manifest(self):
        if not hasattr(self, '_manifest'):
            obj = self.store.get_object(self.collection, self.id, MANIFEST)
            self._manifest = Manifest(obj)
        return self._manifest

    def get_resource(self, path):
        """ Get a typed resource by its path. """
        for resource_type in get_resource_types().values():
            prefix = os.path.join(resource_type.GROUP, '')
            if path.startswith(prefix):
                return resource_type.from_path(self, path)

    def save(self):
        """ Save the package metadata (manifest). """
        self.manifest.save()

    @property
    def source(self):
        """ Return the sole source of this package if present, or
        ``None`` if there is no source. If several sources exist,
        the first one is returned. """
        sources = list(self.all(Source))
        # TODO: should this raise for multiple sources instead?
        if not len(sources):
            return None
        return sources[0]

    def ingest(self, something, meta=None, overwrite=True):
        """ Import a given object into the package as a source. The
        object can be either a URL, a file or folder name, an open
        file handle or an HTTP response object from urllib, urllib2
        or requests. If ``overwrite`` is ``False``, the source file
        will be renamed until the name is not taken. """
        ingestors = list(Ingestor.analyze(something))

        if len(ingestors) != 1:
            raise ValueError("Can't ingest: %r" % something)
        ingestor = ingestors[0]

        try:
            meta = ingestor.generate_meta(meta)
            name = None
            for i in count(1):
                # build 'slug.ext', then 'slug-2.ext', 'slug-3.ext', ...
                # until a free name is found (or overwriting is allowed)
                suffix = '-%s' % i if i > 1 else ''
                name = '%s%s.%s' % (meta['slug'], suffix, meta['extension'])
                if overwrite or not self.has(Source, name):
                    break

            source = Source(self, name)
            source.meta.update(meta)
            ingestor.store(source)
            self.save()
            return source
        finally:
            ingestor.dispose()

    def __eq__(self, other):
        if not hasattr(other, 'id'):
            return False
        return self.id == other.id

    def __repr__(self):
        return '<Package(%s/%s)>' % (self.collection, self.id)
--------------------------------------------------------------------------------
/archivekit/resource.py:
--------------------------------------------------------------------------------
import os
import shutil
import tempfile
from contextlib import contextmanager

from archivekit.manifest import ResourceMetaData


class Resource(object):
    """ Any file within the prefix of the given package, except the
    manifest. """

    GROUP = None

    def __init__(self, package, name):
        self.package = package
        self.name = name
        self.path = os.path.join(self.GROUP, name)
        self._obj = package.store.get_object(package.collection, package.id,
                                             self.path)
        self.meta = ResourceMetaData(self)

    @classmethod
    def from_path(cls, package, path):
        """ Instantiate a resource class with a path which is relative
        to the root of the package. """
        if path.startswith(cls.GROUP):
            _, name = path.split(os.path.join(cls.GROUP, ''))
            return cls(package, name)

    def exists(self):
        return self._obj.exists()

    def save(self):
        """ Save the metadata. """
        return self.package.save()

    def save_data(self, data):
        """ Save a string to the given resource. Overwrites any existing
        data in the resource. """
        return self._obj.save_data(data)

    def save_file(self, file_name, destructive=False):
        """ Update the contents of this resource from the given file
        name. If ``destructive`` is set, the original file may be
        lost (i.e. it will be moved, not copied). """
        return self._obj.save_file(file_name, destructive=destructive)

    def save_fileobj(self, fh):
        """ Save the contents of the given file handle to the given
        resource. Overwrites any existing data in the resource. """
        return self._obj.save_fileobj(fh)

    def fh(self):
        """ Read the contents of this resource as a file handle. """
        return self._obj.load_fileobj()

    def data(self):
        """ Read the contents of this resource as a string. """
        return self._obj.load_data()

    @property
    def url(self):
        """ Return the public URL of the resource, if it exists.
        If no public url is available, returns ``None``. """
        if not hasattr(self, '_url'):
            try:
                self._url = self._obj.public_url()
            except ValueError:
                self._url = None
        return self._url

    @contextmanager
    def local(self):
        """ This will make a local file version of a given resource
        available for read analysis (e.g. for passing to external
        programs). """
        local_path = self._obj.local_path()
        if local_path:
            yield local_path
        else:
            path = tempfile.mkdtemp()
            local_path = os.path.join(path, self.name)
            with open(local_path, 'wb') as fh:
                shutil.copyfileobj(self.fh(), fh)
            try:
                yield local_path
            finally:
                shutil.rmtree(path)

    def __repr__(self):
        return '<Resource(%s)>' % self.path
--------------------------------------------------------------------------------
/archivekit/store/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pudo-attic/archivekit/9de8c7f31b9b7b57e3701ce528672e980fd8aa1b/archivekit/store/__init__.py
--------------------------------------------------------------------------------
/archivekit/store/common.py:
--------------------------------------------------------------------------------

MANIFEST = 'manifest.json'


class Store(object):
    """ A host object to represent a specific type of storage,
    in which objects are managed. """

    def __init__(self):
        pass

    def get_object(self, collection, package_id, path):
        raise NotImplementedError()

    def list_collections(self):
        raise NotImplementedError()

    def list_packages(self, collection):
        raise NotImplementedError()

    def list_resources(self, collection, package_id):
        raise NotImplementedError()


class StoreObject(object):
    """ An abstraction over the on-disk representation of a
    stored object. This can be subclassed for specific storage
    mechanisms. """

    def exists(self):
        raise NotImplementedError()

    def save_fileobj(self, fileobj):
        raise NotImplementedError()

    def save_file(self, file_name, destructive=False):
        """ Update the contents of this resource from the given file
        name. If ``destructive`` is set, the original file may be
        lost (i.e. it will be moved, not copied). """
        raise NotImplementedError()

    def save_data(self, data):
        raise NotImplementedError()

    def load_fileobj(self):
        raise NotImplementedError()

    def load_data(self):
        raise NotImplementedError()

    def public_url(self):
        return None

    def local_path(self):
        return None
--------------------------------------------------------------------------------
/archivekit/store/file.py:
--------------------------------------------------------------------------------
import os
import shutil
from lockfile import LockFile

from archivekit.store.common import Store, StoreObject, MANIFEST
from archivekit.util import safe_id, fullpath

LENGTH = 2


class FileStore(Store):

    def __init__(self, path=None, **kwargs):
        self.path = fullpath(path)
        if self.path and os.path.exists(self.path) \
                and not os.path.isdir(self.path):
            raise ValueError('Not a directory: %s' % self.path)

    def get_object(self, collection, package_id, path):
        return FileStoreObject(self, collection, package_id, path)

    def list_collections(self):
        if self.path is None:
            return
        for collection in os.listdir(self.path):
            if os.path.isdir(os.path.join(self.path, collection)):
                yield collection

    def list_packages(self, collection):
        if self.path is None:
            return
        coll_path = os.path.join(self.path, collection)
        if not os.path.exists(coll_path):
            return
        for (dirpath, dirnames, filenames) in os.walk(coll_path):
            if MANIFEST not in filenames:
                continue
            _, id = os.path.split(dirpath)
            if self._make_path(collection, id) == dirpath:
                yield id

    def _make_path(self, collection, package_id):
        # shard packages by the first LENGTH characters of the ID,
        # e.g. <path>/test/d/e/deadbeef/ for package 'deadbeef'
        id = safe_id(package_id, len=LENGTH)
        path = os.path.join(self.path, collection, *id[:LENGTH])
        return os.path.join(path, id)

    def list_resources(self, collection, package_id):
        prefix = self._make_path(collection, package_id)
        if not os.path.exists(prefix):
            return
        skip = os.path.join(prefix, MANIFEST)
        # walk only the directory of this particular package
        for (dirpath, dirnames, filenames) in os.walk(prefix):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                if path == skip:
                    continue
                yield os.path.relpath(path, start=prefix)

    def __repr__(self):
        return '<FileStore(%s)>' % self.path

    def __unicode__(self):
        return self.path


class FileStoreObject(StoreObject):

    def __init__(self, store, collection, package_id, path):
        self.store = store
        self.package_id = package_id
        self.path = path
        pkg_path = self.store._make_path(collection, package_id)
        self._abs_path = os.path.join(pkg_path, path)
        self._abs_dir = os.path.dirname(self._abs_path)
        self._lock = LockFile(self._abs_path)

    def exists(self):
        return os.path.exists(self._abs_path)

    def _prepare(self):
        try:
            os.makedirs(self._abs_dir)
        except OSError:
            pass

    def save_fileobj(self, fileobj):
        self._prepare()
        with self._lock:
            with open(self._abs_path, 'wb') as fh:
                shutil.copyfileobj(fileobj, fh)

    def save_file(self, file_name, destructive=False):
        self._prepare()
        with self._lock:
            if destructive:
                shutil.move(file_name, self._abs_path)
            else:
                shutil.copy(file_name, self._abs_path)

    def save_data(self, data):
        self._prepare()
        with self._lock:
            with open(self._abs_path, 'wb') as fh:
                fh.write(data)
    def load_fileobj(self):
        if not self.exists():
            return
        with self._lock:
            return open(self._abs_path, 'rb')

    def load_data(self):
        if not self.exists():
            return
        with self._lock:
            with open(self._abs_path, 'rb') as fh:
                return fh.read()

    def local_path(self):
        return self._abs_path

    def __repr__(self):
        return '<FileStoreObject(%s)>' % self._abs_path

    def __unicode__(self):
        return self._abs_path
--------------------------------------------------------------------------------
/archivekit/store/s3.py:
--------------------------------------------------------------------------------
import os
from urllib2 import urlopen

from boto.s3.connection import S3Connection, S3ResponseError
from boto.s3.connection import Location

from archivekit.store.common import Store, StoreObject, MANIFEST

# evaluates to the OS path separator; it doubles as the S3 key delimiter
DELIM = os.path.join(' ', ' ').strip()
ALL_USERS = 'http://acs.amazonaws.com/groups/global/AllUsers'


class S3Store(Store):

    def __init__(self, aws_key_id=None, aws_secret=None, bucket_name=None,
                 prefix=None, location=Location.EU, **kwargs):
        if aws_key_id is None:
            aws_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
            aws_secret = os.environ.get('AWS_SECRET_ACCESS_KEY')
        self.aws_key_id = aws_key_id
        self.aws_secret = aws_secret
        if bucket_name is None:
            bucket_name = os.environ.get('AWS_BUCKET_NAME')
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.location = location
        self._bucket = None

    @property
    def bucket(self):
        if self._bucket is None:
            self.conn = S3Connection(self.aws_key_id, self.aws_secret)
            try:
                self._bucket = self.conn.get_bucket(self.bucket_name)
            except S3ResponseError, se:
                if se.status != 404:
                    raise
                self._bucket = self.conn.create_bucket(self.bucket_name,
                                                       location=self.location)
        return self._bucket

    def get_object(self, collection, package_id, path):
        return S3StoreObject(self, collection, package_id, path)

    def _get_prefix(self, collection):
        prefix = collection
        if self.prefix:
            prefix = os.path.join(self.prefix, prefix)
        return os.path.join(prefix, '')

    def list_collections(self):
        prefix = os.path.join(self.prefix, '') if self.prefix else None
        for sub_prefix in self.bucket.list(prefix=prefix, delimiter=DELIM):
            yield sub_prefix.name.rsplit(DELIM, 2)[-2]

    def list_packages(self, collection):
        prefix = self._get_prefix(collection)
        for sub_prefix in self.bucket.list(prefix=prefix, delimiter=DELIM):
            yield sub_prefix.name.rsplit(DELIM, 2)[-2]

    def list_resources(self, collection, package_id):
        prefix = os.path.join(self._get_prefix(collection), package_id)
        skip = os.path.join(prefix, MANIFEST)
        offset = len(skip) - len(MANIFEST)
        for key in self.bucket.get_all_keys(prefix=prefix):
            if key.name == skip:
                continue
            yield key.name[offset:]

    def __repr__(self):
        return '<S3Store(%s/%s)>' % (self.bucket_name, self.prefix)

    def __unicode__(self):
        return os.path.join(self.bucket_name, self.prefix)


class S3StoreObject(StoreObject):

    def __init__(self, store, collection, package_id, path):
        self.store = store
        self.package_id = package_id
        self.path = path
        self._key = None
        self._key_name = os.path.join(collection, package_id, path)
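        # Resulting key layout (sketch): [prefix/]collection/package_id/path,
        # e.g. bar/test/abc123/manifest.json for a store with prefix='bar'.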
        if store.prefix:
            self._key_name = os.path.join(store.prefix, self._key_name)

    @property
    def key(self):
        if self._key is None:
            self._key = self.store.bucket.get_key(self._key_name)
            if self._key is None:
                self._key = self.store.bucket.new_key(self._key_name)
        return self._key

    def exists(self):
        if self._key is None:
            self._key = self.store.bucket.get_key(self._key_name)
        return self._key is not None

    def save_fileobj(self, fileobj):
        self.key.set_contents_from_file(fileobj)

    def save_file(self, file_name, destructive=False):
        with open(file_name, 'rb') as fh:
            self.save_fileobj(fh)

    def save_data(self, data):
        self.key.set_contents_from_string(data)

    def load_fileobj(self):
        return urlopen(self.public_url())

    def load_data(self):
        return self.key.get_contents_as_string()

    def _is_public(self):
        try:
            for grant in self.key.get_acl().acl.grants:
                if grant.permission == 'READ':
                    if grant.uri == ALL_USERS:
                        return True
        except S3ResponseError:
            pass
        return False

    def public_url(self):
        if not self.exists():
            return
        # Welcome to the world of open data:
        if not self._is_public():
            self.key.make_public()
        return self.key.generate_url(expires_in=0,
                                     force_http=True,
                                     query_auth=False)

    def __repr__(self):
        return '<S3StoreObject(%s/%s/%s)>' % (self.store, self.package_id,
                                              self.path)

    def __unicode__(self):
        return self.public_url()
--------------------------------------------------------------------------------
/archivekit/types/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pudo-attic/archivekit/9de8c7f31b9b7b57e3701ce528672e980fd8aa1b/archivekit/types/__init__.py
--------------------------------------------------------------------------------
/archivekit/types/source.py:
--------------------------------------------------------------------------------
from archivekit.resource import Resource


class Source(Resource):
    """ A source file, as initially submitted to the ``archivekit``. """

    GROUP = 'source'

    def __repr__(self):
        return '<Source(%s)>' % self.name
--------------------------------------------------------------------------------
/archivekit/util.py:
--------------------------------------------------------------------------------
import os
import six
from hashlib import sha1
from decimal import Decimal
from slugify import slugify
from datetime import datetime, date


def safe_id(name, len=5):
    """ Remove potential path escapes from a content ID. """
    if name is None:
        return None
    name = slugify(os.path.basename(name)).strip('-')
    name = name.ljust(len, '_')
    return name


def make_secure_filename(source):
    # TODO: don't let users create files called ``manifest.json``.
    if source:
        source = os.path.basename(source).strip()
    source = source or 'source'
    fn, ext = os.path.splitext(source)
    ext = ext or '.raw'
    ext = ext.lower().strip().replace('.', '')
    return source, slugify(fn), ext


def fullpath(filename):
    """ Perform normalization of the source file name. """
    if filename is None:
        return
    # a happy tour through stdlib
    filename = os.path.expanduser(filename)
    filename = os.path.expandvars(filename)
    filename = os.path.normpath(filename)
    return os.path.abspath(filename)


def clean_headers(headers):
    """ Convert HTTP response headers into a common format
    for storing them in the resource meta data. """
    result = {}
    for k, v in dict(headers).items():
        k = k.lower().replace('-', '_')
        result[k] = v
    return result


def checksum(filename):
    hash = sha1()
    with open(filename, 'rb') as fh:
        while True:
            block = fh.read(2 ** 10)
            if not block:
                break
            hash.update(block)
    return hash.hexdigest()


def encode_text(text):
    if isinstance(text, six.text_type):
        return text.encode('utf-8')
    try:
        return text.decode('utf-8').encode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text.encode('ascii', 'replace')


def json_default(obj):
    if isinstance(obj, datetime):
        obj = obj.isoformat()
    if isinstance(obj, Decimal):
        obj = float(obj)
    if isinstance(obj, date):
        return 'new Date(%s)' % obj.isoformat()
    return obj


def json_hook(obj):
    for k, v in obj.items():
        if isinstance(v, basestring):
            try:
                obj[k] = datetime.strptime(v, "new Date(%Y-%m-%d)").date()
            except ValueError:
                pass
            # isoformat() emits microseconds unless they are zero
            for fmt in ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S"):
                try:
                    obj[k] = datetime.strptime(v, fmt)
                    break
                except ValueError:
                    pass
    return obj
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name='archivekit',
    version='0.5.3',
    description="Store a set of files and metadata in an organized way",
    long_description="",
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Operating System :: OS Independent",
        "Programming Language :: Python",
    ],
    keywords='storage filesystem archive bagit bag etl data processing',
    author='Friedrich Lindenberg',
    author_email='friedrich@pudo.org',
    url='http://pudo.org',
    license='MIT',
    packages=find_packages(exclude=['ez_setup', 'examples', 'tests']),
    namespace_packages=[],
    include_package_data=True,
    zip_safe=False,
    test_suite='nose.collector',
    install_requires=[
        "Werkzeug>=0.9.6",
        "lockfile>=0.9.1",
        "python-slugify>=0.0.6",
        "requests>=2.2.0",
        "boto>=2.33",
        "six"
    ],
    entry_points={
        'archivekit.stores': [
            's3 = archivekit.store.s3:S3Store',
            'file = archivekit.store.file:FileStore'
        ],
        'archivekit.resource_types': [
            'source = archivekit.types.source:Source'
        ],
    },
    tests_require=[]
)
--------------------------------------------------------------------------------
/tests/fixtures/test.csv:
--------------------------------------------------------------------------------
company,opencorp_matches
ABB,250
AED Oil Limited,0
AGR,250
AMNI International,1
Addax Petroleum,18
Afren,21
Agip,117
Agip Energy & Natural Resources,0
Agip Nigeria,0
Aibel,42
Aker Borgestad Operations,0
Aker Floating Production,1
Alliance Marine Services,9
Amerada / Hess Shell,0
Amerada Hess,94
Anadarko,250
Anzon,36
Apache,250
Arco,250
Atlantica Tender Drilling Ltd.,0
Atwood Oceanics,8
Australian Worldwide Exploration,0
Axxis Petroconsultants,0
BHP,250
BHP Billiton,180
BLT,250
BP,250
BW Offshore,16
Berlian Laju Tanker (BLT),0
Bluewater,250
Bourbon Offshore,15
Burmi Armada,0
CACT,8
CAMAC Energy,4
CDC,250
CNOOC,29
CNOOC (NOC),0
CNR International,25
CNRL International,0
Cairn Energy,77
Cardiff,250
Chevron,250
ConocoPhillips,250
ConocoPhillips/CNOOC,0
Conoil,7
Coogee,47
Cuu Long JV,0
DPI,250
Devon Energy,99
Diamond Offshore,28
Discovery Offshore S.A.,4
Dolphin Drilling,10
Dryships,2
Dynamic Producer,0
ENI (NOC),0
ENI Agip,0
ENSCO,250
Egyptian Drilling,0
Emas,223
Energean Oil & Gas SA,0
Esso,250
Exmar,54
ExxonMobil,250
ExxonMobil/GEPetrol,0
FPS Ocean,1
FVSB,1
Fred Olsen Energy,1
Fred.Olsen,76
Fred.Olsen Production,2
Frontier Drilling do Brazil,0
GFI O & G,0
GGPC,4
GSP,250
Galoc Production Co.,0
HUnited Kingdom,0
Hercules Assets LLC,0
Hercules Offshore,9
Hess Corp.,298
Ikdam,3
Ikdam Production SA,0
JV - AGR/ Helix,0
JV - North West Shelf,0
JV - Prosafe/ Fred Olsen,0
JV - SBMO/ Partner,0
Jasper Explorer Pte Ltd.,0
Jasper Offshore,1
KCA Deutag,43
Kerr McGee,205
Lonestar Drilling Nigeria Ltd.,0
Lundin Petroleum,3
M3nergy,0
MISC,174
MODEC,63
MODEC T,0
Maersk,250
Maersk Drilling,15
Marathon,250
Matrix Oil,10
Megadrill Services,0
Mitsui Oil Exploration,0
Murphy Oil Sabah,0
Murphy West Africa,1
NNPC,15
Nabors International,5
Nabors Offshore,3
Nexen,154
Nexus Floating Production,1
Nigerian Agip Exploration,0
Noble Drilling,64
North Sea Production,1
Northern Offshore Ltd,22
OMV,111
Oando,20
Ocean Rig Asa,1
Oceaneering,77
Odfjell,133
Oilexco,7
PA Resources,20
PEMEX,32
PGS,250
PTSC,14
PTTEP,8
Pacific Drilling Limited,17
Paragon Offshore,3
Pearl Energy,9
Pemex,32
Perenco,80
Pertamina/ PetroChina,0
Petro-Canada,144
PetroSA/Pioneer Natural Resources,0
PetroViet Nam E&P,0
Petrobras,38
Petronas,78
Petronas Carigali,23
Petroserv SA,25
Premier Oil,140
Premuda,13
Prosafe,93
Rasmussen,250
Reliance,250
Repsol,164
Rowan,250
Rubicon Offshore,5
SBM,250
SBM (First 3 years),0
SBMO,2
SBMO+JV Partner,0
Saipem,44
Santos,250
SapuraKencana,8
Sea Production Ltd,6
SeaWolf Oil Services Limited,0
Seadrill Ltd,118
Secunda Marine,3
SembCorp Marine,2
Sevan Marine,1
Shebah,16
Shebah E&P,0
Shelf Drilling,12
Shell,250
Shell Nigeria,0
Shell Todd,2
Sonangol,41
Sonangol (NOC),0
Songa Floating Production,1
South Atlantic Petroleum,2
Star Deepwater JV,0
Statoil,250
StatoilHydro,3
StatoilHydro/Saga Petroleum,0
Stena Drilling,20
TSJOC,0
Talisman,250
Talisman (Blake f. operated by BG),0
Tanker Pacifi,0
Tanker Pacific,8
Teekay Petrojarl,10
Toisa Horizon,0
Total,250
Transocean Ltd.,250
Tullow,150
Vaalco Energy,4
Vantage Drilling,4
Venture Production,10
Viet Nam Offshore Floating Terminal,0
Vietsovpetro,0
Wood Group,250
Woodside,250
Woodside Energy,11
Zaafarana Oil Company,0
ullow,0
--------------------------------------------------------------------------------
/tests/helpers.py:
--------------------------------------------------------------------------------
import os

FIXTURES = os.path.join(os.path.dirname(__file__), 'fixtures')
DATA_FILE = os.path.join(FIXTURES, 'test.csv')
DATA_URL = 'https://raw.githubusercontent.com/okfn/dpkg-barnet/master/barnet-2009.csv'
--------------------------------------------------------------------------------
/tests/test_file.py:
--------------------------------------------------------------------------------
from tempfile import mkdtemp
from shutil import rmtree
import urllib

from helpers import DATA_FILE, DATA_URL
from archivekit import Collection, open_archive
from archivekit.store.file import FileStore
from archivekit.types.source import Source
from archivekit.util import checksum


def test_basic_package():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)

    assert len(list(coll)) == 0, list(coll)

    pkg = coll.create()
    assert pkg.id is not None, pkg
    assert pkg.exists(), pkg

    pkg = coll.get(None)
    assert not pkg.exists(), pkg

    rmtree(path)


def test_basic_manifest():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    pkg.manifest['foo'] = 'bar'
    pkg.save()

    npkg = coll.get(pkg.id)
    assert npkg.id == pkg.id, npkg
    assert npkg.manifest['foo'] == 'bar', npkg.manifest.items()

    rmtree(path)


def test_archive():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)

    archive = open_archive('file', path=path)
    assert archive.get('test') == coll, archive.get('test')
    colls = list(archive)
    assert len(colls) == 1, colls

    rmtree(path)


def test_collection_ingest():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    pkgs = list(coll)
    assert len(pkgs) == 1, pkgs
    pkg0 = pkgs[0]
    assert pkg0.id == checksum(DATA_FILE), pkg0.id
    sources = list(pkg0.all(Source))
    assert len(sources) == 1, sources
    assert sources[0].name == 'test.csv', sources[0].name
    rmtree(path)


def test_package_ingest_file():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    assert source.meta.get('name') == 'test.csv', source.meta
    assert source.meta.get('extension') == 'csv', source.meta
    assert source.meta.get('slug') == 'test', source.meta
    rmtree(path)


def test_package_get_resource():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    other = pkg.get_resource(source.path)
    assert isinstance(other, Source), other.__class__
    assert other.path == source.path, other
    rmtree(path)


def test_resource_local():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    with source.local() as file_name:
        assert file_name.endswith(source.name), file_name
    rmtree(path)


def test_package_source():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    assert pkg.source is None, pkg.source
    source = pkg.ingest(DATA_FILE)
    other = pkg.source
    assert isinstance(other, Source), other.__class__
    assert other.path == source.path, other
    rmtree(path)


def test_package_ingest_url():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_URL)
    assert source.name == 'barnet-2009.csv', source.name
    assert source.meta['source_url'] == DATA_URL, source.meta

    source = pkg.ingest(urllib.urlopen(DATA_URL))
    assert source.name == 'barnet-2009.csv', source.name
    assert source.meta['source_url'] == DATA_URL, source.meta
    rmtree(path)


def test_package_ingest_fileobj():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    with open(DATA_FILE, 'rb') as fh:
        source = pkg.ingest(fh)
    assert source.name == 'source.raw', source.name
    rmtree(path)
--------------------------------------------------------------------------------
/tests/test_s3.py:
--------------------------------------------------------------------------------
from moto import mock_s3
from StringIO import StringIO

from helpers import DATA_FILE
from archivekit import Collection
from archivekit.store.s3 import S3Store
from archivekit.types.source import Source
from archivekit.util import checksum


@mock_s3
def test_store_loader():
    from archivekit.ext import get_stores
    stores = get_stores()
    assert 's3' in stores, stores
    assert stores['s3'] == S3Store, stores


@mock_s3
def test_open_collection():
    from archivekit import open_collection
    coll = open_collection('test', 's3', bucket_name='foo')
    assert isinstance(coll.store, S3Store), coll.store
    assert coll.store.bucket.name == 'foo', coll.store.bucket


@mock_s3
def test_list_collections():
    store = S3Store(bucket_name='foo', prefix='bar')
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    colls = list(store.list_collections())
    assert len(colls) == 1, colls
    assert colls[0] == coll.name, colls


@mock_s3
def test_basic_package():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)

    assert len(list(coll)) == 0, list(coll)

    pkg = coll.create()
    assert pkg.id is not None, pkg
    assert pkg.exists(), pkg

    pkg = coll.get(None)
    assert not pkg.exists(), pkg


@mock_s3
def test_basic_manifest():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    pkg.manifest['foo'] = 'bar'
    pkg.save()

    npkg = coll.get(pkg.id)
    assert npkg.id == pkg.id, npkg
    assert npkg.manifest['foo'] == 'bar', npkg.manifest.items()


@mock_s3
def test_collection_ingest():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    pkgs = list(coll)
    assert len(pkgs) == 1, pkgs
    pkg0 = pkgs[0]
    assert pkg0.id == checksum(DATA_FILE), pkg0.id
    sources = list(pkg0.all(Source))
    assert len(sources) == 1, sources
    assert sources[0].name == 'test.csv', sources[0].name


@mock_s3
def test_package_ingest_file():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    assert source.meta.get('name') == 'test.csv', source.meta
    assert source.meta.get('extension') == 'csv', source.meta
    assert source.meta.get('slug') == 'test', source.meta


@mock_s3
def test_package_local_file():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    with source.local() as file_name:
        assert file_name != DATA_FILE, file_name
        assert file_name.endswith('test.csv'), file_name


@mock_s3
def test_package_save_data():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    src = Source(pkg, 'foo.csv')
    src.save_data('huhu!')

    src2 = Source(pkg, 'bar.csv')
    sio = StringIO("bahfhkkjdf")
    src2.save_fileobj(sio)
--------------------------------------------------------------------------------