├── .gitignore
├── .travis.yml
├── LICENSE
├── NOTES.md
├── README.md
├── archivekit
│   ├── __init__.py
│   ├── archive.py
│   ├── collection.py
│   ├── ext.py
│   ├── ingest.py
│   ├── manifest.py
│   ├── package.py
│   ├── resource.py
│   ├── store
│   │   ├── __init__.py
│   │   ├── common.py
│   │   ├── file.py
│   │   └── s3.py
│   ├── types
│   │   ├── __init__.py
│   │   └── source.py
│   └── util.py
├── setup.py
└── tests
    ├── fixtures
    │   └── test.csv
    ├── helpers.py
    ├── test_file.py
    └── test_s3.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.stash/*
*.pyc
*.DS_Store
*.egg-info

dist/*
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - "2.7"
before_install:
  - virtualenv ./pyenv --distribute
  - source ./pyenv/bin/activate
install:
  - python setup.py develop
  - pip install coveralls moto nose
script:
  - nosetests
  - coverage run --source=archivekit setup.py test
after_success:
  - coveralls
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2015 Friedrich Lindenberg

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/NOTES.md:
--------------------------------------------------------------------------------
## How to implement de-duped collections in archivekit?

* Store a content hash in the manifest and traverse to find
  duplicates (too slow, too mutable).
* Make local copies of the data, generate a content hash, upload
  by hash ID.

Sources:

* From file name
* From URL
* From fileobj

Options:

* Content hashed
* Move or copy?
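
The second option is what ``Collection.ingest`` implements; a rough sketch
of the flow (cf. ``collection.py`` and ``ingest.py``):

```python
from archivekit.ingest import Ingestor

def dedup_ingest(collection, something):
    # every input is normalized into one or more Ingestor objects
    for ingestor in Ingestor.analyze(something):
        try:
            # hash() caches the payload locally and SHA1s it; using the
            # hash as the package ID collapses identical content
            package = collection.get(ingestor.hash())
            package.ingest(ingestor)
        finally:
            ingestor.dispose()
```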
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# archivekit

[![Build Status](https://travis-ci.org/pudo/archivekit.png?branch=master)](https://travis-ci.org/pudo/archivekit) [![Coverage Status](https://coveralls.io/repos/pudo/archivekit/badge.svg)](https://coveralls.io/r/pudo/archivekit)

``archivekit`` provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a JSON manifest file.

This library is inspired by [OFS](https://github.com/okfn/ofs), [BagIt](https://github.com/LibraryOfCongress/bagit-python) and [Pairtree](https://pythonhosted.org/Pairtree/). It replaces a previous project, [docstash](https://github.com/pudo/docstash).


## Installation

The easiest way of using ``archivekit`` is via PyPI:

```bash
$ pip install archivekit
```

Alternatively, check out the repository from GitHub and install it locally:

```bash
$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop
```


## Example

``archivekit`` manages ``Packages`` which contain one or several ``Resources`` and their associated metadata. Each ``Package`` is part of a named ``Collection``.

```python
from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('data', 'file', path='/tmp')

# or via S3:
collection = open_collection('data', 's3', aws_key_id='..',
                             aws_secret='..', bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
    package.save()
```

The code for this library is very compact, go check it out.


## Configuration

If AWS credentials are not supplied for an S3-based collection, ``archivekit`` will fall back to the ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` environment variables; ``AWS_BUCKET_NAME`` is used in the same way when no bucket name is given.

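## Extending

Store backends and resource types are discovered through setuptools entry points (see ``archivekit/ext.py`` and the ``entry_points`` section of ``setup.py``), so third-party packages can register additional stores. The sketch below is illustrative: ``archivekit-redis`` and ``RedisStore`` are made-up names.

```python
# setup.py of a hypothetical extension package
from setuptools import setup

setup(
    name='archivekit-redis',
    packages=['archivekit_redis'],
    entry_points={
        'archivekit.stores': [
            'redis = archivekit_redis:RedisStore',
        ],
    },
)
```

The registered class should subclass ``archivekit.store.common.Store`` and implement ``get_object``, ``list_collections``, ``list_packages`` and ``list_resources``. Once installed, it can be used as ``open_collection('name', 'redis', ...)``.
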
## License

``archivekit`` is open source, licensed under a standard MIT license (included in this repository as ``LICENSE``).
--------------------------------------------------------------------------------
/archivekit/__init__.py:
--------------------------------------------------------------------------------
from archivekit.collection import Collection  # noqa
from archivekit.archive import Archive  # noqa
from archivekit.resource import Resource  # noqa
from archivekit.types.source import Source  # noqa

from archivekit.ext import get_stores


def _open_store(store_type, **kwargs):
    store_cls = get_stores().get(store_type)
    if store_cls is None:
        raise TypeError("No such store type: %s" % store_type)
    return store_cls(**kwargs)


def open_collection(name, store_type, **kwargs):
    """ Create an ``archivekit.Collection`` of the given store type,
    passing along any arguments. The valid store types at the moment
    are: ``s3`` and ``file``. """
    return Collection(name, _open_store(store_type, **kwargs))


def open_archive(store_type, **kwargs):
    """ Create an ``archivekit.Archive`` of the given store type,
    passing along any arguments. The valid store types at the moment
    are: ``s3`` and ``file``. """
    return Archive(_open_store(store_type, **kwargs))
--------------------------------------------------------------------------------
/archivekit/archive.py:
--------------------------------------------------------------------------------
from archivekit.collection import Collection


class Archive(object):
    """ An archive is composed of collections. """

    def __init__(self, store):
        self.store = store

    def get(self, name):
        """ Get a collection of packages. """
        return Collection(name, self.store)

    def __iter__(self):
        for name in self.store.list_collections():
            yield Collection(name, self.store)

    def __contains__(self, name):
        for collection in self:
            if collection == name or collection.name == name:
                return True
        return False

    def __repr__(self):
        return '<Archive(%s)>' % self.store
--------------------------------------------------------------------------------
/archivekit/collection.py:
--------------------------------------------------------------------------------
from archivekit.package import Package
from archivekit.ingest import Ingestor


class Collection(object):
    """ The set of all packages with an existing manifest in the
    given store. """

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def create(self, id=None, manifest=None):
        """ Create a package and save a manifest. If ``manifest`` is
        given, the values are saved to the manifest. """
        package = Package(self.store, self, id=id)
        if manifest is not None:
            package.manifest.update(manifest)
        package.save()
        return package

    def get(self, id):
        """ Get a ``Package`` identified by the ``id``. """
        return Package(self.store, self, id=id)

    def ingest(self, something, meta=None):
        """ Import a given object into the collection. The object can
        be either a URL, a file or folder name, an open file handle or
        an HTTP response object from urllib, urllib2 or requests.

        Before importing it, a SHA1 hash is generated and used as the
        package ID. If a package with that ID already exists, it will
        be overwritten. If you do not desire SHA1 de-duplication,
        create a package directly and ingest from there. """
        for ingestor in Ingestor.analyze(something):
            try:
                # the content hash doubles as the package ID, which is
                # what de-duplicates repeated imports of the same data
                package = self.get(ingestor.hash())
                package.ingest(ingestor, meta=meta)
            finally:
                ingestor.dispose()

    def __iter__(self):
        for package_id in self.store.list_packages(self.name):
            yield Package(self.store, self, id=package_id)

    def __contains__(self, id):
        for package in self:
            if package == id or package.id == id:
                return True
        return False

    def __repr__(self):
        return '<Collection(%s)>' % self.name

    def __eq__(self, other):
        if not hasattr(other, 'name'):
            return False
        return self.name == other.name
--------------------------------------------------------------------------------
/archivekit/ext.py:
--------------------------------------------------------------------------------
from pkg_resources import iter_entry_points


def get_stores():
    stores = {}
    for ep in iter_entry_points('archivekit.stores'):
        stores[ep.name] = ep.load()
    return stores


def get_resource_types():
    types = {}
    for ep in iter_entry_points('archivekit.resource_types'):
        types[ep.name] = ep.load()
    return types
--------------------------------------------------------------------------------
/archivekit/ingest.py:
--------------------------------------------------------------------------------
from os import path, walk, unlink
from os import name as osname
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
from httplib import HTTPResponse
from StringIO import StringIO
from urlparse import urlparse
import mimetypes

import requests

from archivekit.util import clean_headers, checksum, fullpath
from archivekit.util import make_secure_filename, slugify


def directory_files(fpath):
    for (dir_name, _, files) in walk(fpath):
        for file_name in files:
            yield path.join(dir_name, file_name)


class Ingestor(object):
    """ An ingestor is an intermediate object used when importing data.
    Since the source types (URLs, file names or file handles) are very
    diverse, this may require data to be cached locally, e.g. to
    generate a SHA1 hash signature. """

    def __init__(self, file_name=None, file_obj=None, meta=None):
        self._file_name = file_name
        self._file_obj = file_obj
        self._file_cache = None
        self._hash = None
        self.is_local = file_name is not None
        self.meta = meta or {}

    def local(self):
        if self.is_local:
            return self._file_name
        if self._file_cache is None:
            tempfile = NamedTemporaryFile(delete=False)
            copyfileobj(self._file_obj, tempfile)
            self._file_cache = tempfile.name
            tempfile.close()
        return self._file_cache

    def has_local(self):
        cached = self._file_cache and path.exists(self._file_cache)
        return self.is_local or cached

    def hash(self):
        """ Generate a SHA1 hash of the given ingested object. """
        if self._hash is None:
            self._hash = checksum(self.local())
        return self._hash

    def generate_meta(self, meta):
        """ Set up some generic metadata for the resource, based on
        the available file name, HTTP headers etc. """
        meta = meta or {}
        for key, value in self.meta.items():
            if key not in meta:
                meta[key] = value
        if 'source_file' not in meta and self._file_name:
            meta['source_file'] = self._file_name

        if not meta.get('name'):
            if meta.get('source_url') and len(meta.get('source_url')):
                url_path = meta['source_url']
                try:
                    url_path = urlparse(url_path).path
                except:
                    pass
                meta['name'] = url_path
            elif meta.get('source_file') and len(meta.get('source_file')):
                meta['name'] = meta['source_file']
        name, slug, ext = make_secure_filename(meta.get('name'))
        meta['name'] = name
        if not meta.get('slug'):
            meta['slug'] = slug
        if not meta.get('extension'):
            meta['extension'] = ext
        if not meta.get('mime_type') and 'http_headers' in meta:
            mime_type = meta.get('http_headers').get('content_type')
            if mime_type and mime_type not in ['application/octet-stream',
                                               'text/plain']:
                meta['mime_type'] = mime_type
                ext = mimetypes.guess_extension(mime_type)
                if ext is not None:
                    meta['extension'] = ext.strip('.')
        elif not meta.get('mime_type') and meta.get('name'):
            mime_type, encoding = mimetypes.guess_type(meta.get('name'))
            meta['mime_type'] = mime_type

        if meta.get('extension'):
            meta['extension'] = slugify(meta.get('extension'))
        return meta

    def store(self, source):
        if not self.has_local():
            source.save_fileobj(self._file_obj)
        elif self.is_local:
            source.save_file(self._file_name)
        elif self._file_cache:
            source.save_file(self._file_cache, destructive=True)

    def dispose(self):
        if self._file_cache is not None and path.exists(self._file_cache):
            unlink(self._file_cache)

    @classmethod
    def analyze(cls, something):
        """ Accept a given input (e.g. a URL, file path, or file
        handle) and determine how to normalize it into an ``Ingestor``
        while generating metadata. """
        if isinstance(something, cls):
            return (something, )

        if isinstance(something, basestring):
            # Treat strings as paths or URLs
            url = urlparse(something)
            if url.scheme.lower() in ['http', 'https']:
                something = requests.get(something)
            elif url.scheme.lower() in ['file', '']:
                finalpath = url.path
                if osname == 'nt':
                    finalpath = finalpath[1:]
                upath = fullpath(finalpath)
                if path.isdir(upath):
                    return (cls(file_name=f) for f in directory_files(upath))
                return (cls(file_name=upath),)

        # Python requests
        if isinstance(something, requests.Response):
            fd = StringIO(something.content)
            return (cls(file_obj=fd, meta={
                'http_status': something.status_code,
                'http_headers': clean_headers(something.headers),
                'source_url': something.url
            }), )

        if isinstance(something, HTTPResponse):
            # httplib responses don't reliably expose the request URL
            return (cls(file_obj=something, meta={
                'http_status': something.status,
                'http_headers': clean_headers(something.getheaders()),
                'source_url': getattr(something, 'url', None)
            }), )

        elif hasattr(something, 'geturl') and hasattr(something, 'info'):
            # assume urllib or urllib2
            return (cls(file_obj=something, meta={
                'http_status': something.getcode(),
                'http_headers': clean_headers(something.headers),
                'source_url': something.url
            }), )

        elif hasattr(something, 'read'):
            # Fileobj will be a bit bland
            return (cls(file_obj=something), )

        return []
--------------------------------------------------------------------------------
/archivekit/manifest.py:
--------------------------------------------------------------------------------
import json
import collections
from datetime import datetime

from archivekit.util import json_default, json_hook


class Manifest(dict):
    """ A manifest holds the metadata of a package. """

    def __init__(self, obj):
        self.object = obj
        self.load()

    def load(self):
        if self.object.exists():
            data = self.object.load_data()
            self.update(json.loads(data, object_hook=json_hook))
        else:
            self['created_at'] = datetime.utcnow()
            self.update({'resources': {}})

    def save(self):
        # serialized as manifest.json, roughly:
        #   {"created_at": ..., "updated_at": ...,
        #    "resources": {"source/test.csv": {"name": ..., ...}}}
        self['updated_at'] = datetime.utcnow()
        content = json.dumps(self, default=json_default, indent=2)
        self.object.save_data(content)

    def __repr__(self):
        return '<Manifest(%r)>' % self.object


class ResourceMetaData(collections.MutableMapping):
    """ Metadata for a resource is derived from the main manifest. """

    def __init__(self, resource):
        self.resource = resource
        self.manifest = resource.package.manifest
        if not isinstance(self.manifest.get('resources'), dict):
            self.manifest['resources'] = {}
        existing = self.manifest['resources'].get(self.resource.path)
        if not isinstance(existing, dict):
            self.manifest['resources'][self.resource.path] = {
                'created_at': datetime.utcnow()
            }

    def touch(self):
        self.manifest['resources'][self.resource.path]['updated_at'] = \
            datetime.utcnow()

    def __getitem__(self, key):
        return self.manifest['resources'][self.resource.path][key]

    def __setitem__(self, key, value):
        self.manifest['resources'][self.resource.path][key] = value
        self.touch()

    def __delitem__(self, key):
        del self.manifest['resources'][self.resource.path][key]
        self.touch()

    def __iter__(self):
        return iter(self.manifest['resources'][self.resource.path])

    def __len__(self):
        return len(self.manifest['resources'][self.resource.path])

    def __keytransform__(self, key):
        return key

    def save(self):
        self.touch()
        self.resource.package.save()

    def __repr__(self):
        return '<ResourceMetaData(%s)>' % self.resource.path
--------------------------------------------------------------------------------
/archivekit/package.py:
--------------------------------------------------------------------------------
import os
from uuid import uuid4
from itertools import count

from archivekit.manifest import Manifest
from archivekit.ext import get_resource_types
from archivekit.ingest import Ingestor
from archivekit.types.source import Source
from archivekit.store.common import MANIFEST


class Package(object):
    """ A package is an entry within a collection. It consists of a
    source file, a manifest metadata file and one or more processed
    versions. """

    def __init__(self, store, collection, id=None):
        self.store = store
        self.collection = collection.name
        self.id = id or uuid4().hex

    def has(self, cls, name):
        """ Check if a resource of a given type and name exists. """
        return cls(self, name).exists()

    def all(self, cls, *extra):
        """ Iterate over all resources of a given type. """
        prefix = os.path.join(cls.GROUP, *extra)
        for path in self.store.list_resources(self.collection, self.id):
            if path.startswith(prefix):
                yield cls.from_path(self, path)

    def exists(self):
        """ Check if the package identified by the given ID exists. """
        obj = self.store.get_object(self.collection, self.id, MANIFEST)
        return obj.exists()

    @property
    def manifest(self):
        if not hasattr(self, '_manifest'):
            obj = self.store.get_object(self.collection, self.id, MANIFEST)
            self._manifest = Manifest(obj)
        return self._manifest

    def get_resource(self, path):
        """ Get a typed resource by its path. """
        for resource_type in get_resource_types().values():
            prefix = os.path.join(resource_type.GROUP, '')
            if path.startswith(prefix):
                return resource_type.from_path(self, path)

    def save(self):
        """ Save the package metadata (manifest). """
        self.manifest.save()

    @property
    def source(self):
        """ Return the sole source of this package if present, or
        ``None`` if there is no source. If several sources exist,
        the first one is returned. """
        sources = list(self.all(Source))
        # TODO: should this raise for multiple sources instead?
        if not len(sources):
            return None
        return sources[0]

    def ingest(self, something, meta=None, overwrite=True):
        """ Import a given object into the package as a source. The
        object can be either a URL, a file or folder name, an open
        file handle or an HTTP response object from urllib, urllib2
        or requests. If ``overwrite`` is ``False``, the source file
        will be renamed until the name is not taken. """
        ingestors = list(Ingestor.analyze(something))

        if len(ingestors) != 1:
            raise ValueError("Can't ingest: %r" % something)
        ingestor = ingestors[0]

        try:
            meta = ingestor.generate_meta(meta)
            name = None
            for i in count(1):
                # build 'slug.ext', then 'slug-2.ext', 'slug-3.ext', ...
                # until a free name is found (or overwriting is allowed)
                suffix = '-%s' % i if i > 1 else ''
                name = '%s%s.%s' % (meta['slug'], suffix, meta['extension'])
                if overwrite or not self.has(Source, name):
                    break

            source = Source(self, name)
            source.meta.update(meta)
            ingestor.store(source)
            self.save()
            return source
        finally:
            ingestor.dispose()

    def __eq__(self, other):
        if not hasattr(other, 'id'):
            return False
        return self.id == other.id

    def __repr__(self):
        return '<Package(%s/%s)>' % (self.collection, self.id)
--------------------------------------------------------------------------------
/archivekit/resource.py:
--------------------------------------------------------------------------------
import os
import shutil
import tempfile
from contextlib import contextmanager

from archivekit.manifest import ResourceMetaData


class Resource(object):
    """ Any file within the prefix of the given package, except the
    manifest. """

    GROUP = None

    def __init__(self, package, name):
        self.package = package
        self.name = name
        self.path = os.path.join(self.GROUP, name)
        self._obj = package.store.get_object(package.collection, package.id,
                                             self.path)
        self.meta = ResourceMetaData(self)

    @classmethod
    def from_path(cls, package, path):
        """ Instantiate a resource class with a path which is relative
        to the root of the package. """
        if path.startswith(cls.GROUP):
            _, name = path.split(os.path.join(cls.GROUP, ''))
            return cls(package, name)

    def exists(self):
        return self._obj.exists()

    def save(self):
        """ Save the metadata. """
        return self.package.save()

    def save_data(self, data):
        """ Save a string to the given resource. Overwrites any existing
        data in the resource. """
        return self._obj.save_data(data)

    def save_file(self, file_name, destructive=False):
        """ Update the contents of this resource from the given file
        name. If ``destructive`` is set, the original file may be
        lost (i.e. it will be moved, not copied). """
        return self._obj.save_file(file_name, destructive=destructive)

    def save_fileobj(self, fh):
        """ Save the contents of the given file handle to the given
        resource. Overwrites any existing data in the resource. """
        return self._obj.save_fileobj(fh)

    def fh(self):
        """ Read the contents of this resource as a file handle. """
        return self._obj.load_fileobj()

    def data(self):
        """ Read the contents of this resource as a string. """
        return self._obj.load_data()

    @property
    def url(self):
        """ Return the public URL of the resource, if it exists.
        If no public url is available, returns ``None``. """
        if not hasattr(self, '_url'):
            try:
                self._url = self._obj.public_url()
            except ValueError:
                self._url = None
        return self._url

    @contextmanager
    def local(self):
        """ This will make a local file version of a given resource
        available for read analysis (e.g. for passing to external
        programs). """
        local_path = self._obj.local_path()
        if local_path:
            yield local_path
        else:
            path = tempfile.mkdtemp()
            local_path = os.path.join(path, self.name)
            with open(local_path, 'wb') as fh:
                shutil.copyfileobj(self.fh(), fh)
            try:
                yield local_path
            finally:
                shutil.rmtree(path)

    def __repr__(self):
        return '<Resource(%s)>' % self.path
--------------------------------------------------------------------------------
/archivekit/store/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pudo-attic/archivekit/9de8c7f31b9b7b57e3701ce528672e980fd8aa1b/archivekit/store/__init__.py
--------------------------------------------------------------------------------
/archivekit/store/common.py:
--------------------------------------------------------------------------------

MANIFEST = 'manifest.json'


class Store(object):
    """ A host object to represent a specific type of storage,
    in which objects are managed. """

    def __init__(self):
        pass

    def get_object(self, collection, package_id, path):
        raise NotImplementedError()

    def list_collections(self):
        raise NotImplementedError()

    def list_packages(self, collection):
        raise NotImplementedError()

    def list_resources(self, collection, package_id):
        raise NotImplementedError()


class StoreObject(object):
    """ An abstraction over the on-disk representation of a
    stored object. This can be subclassed for specific storage
    mechanisms. """

    def exists(self):
        raise NotImplementedError()

    def save_fileobj(self, fileobj):
        raise NotImplementedError()

    def save_file(self, file_name, destructive=False):
        """ Update the contents of this resource from the given file
        name. If ``destructive`` is set, the original file may be
        lost (i.e. it will be moved, not copied). """
        raise NotImplementedError()

    def save_data(self, data):
        raise NotImplementedError()

    def load_fileobj(self):
        raise NotImplementedError()

    def load_data(self):
        raise NotImplementedError()

    def public_url(self):
        return None

    def local_path(self):
        return None
--------------------------------------------------------------------------------
/archivekit/store/file.py:
--------------------------------------------------------------------------------
import os
import shutil
from lockfile import LockFile

from archivekit.store.common import Store, StoreObject, MANIFEST
from archivekit.util import safe_id, fullpath

LENGTH = 2


class FileStore(Store):

    def __init__(self, path=None, **kwargs):
        self.path = fullpath(path)
        if self.path and os.path.exists(self.path) \
                and not os.path.isdir(self.path):
            raise ValueError('Not a directory: %s' % self.path)

    def get_object(self, collection, package_id, path):
        return FileStoreObject(self, collection, package_id, path)

    def list_collections(self):
        if self.path is None:
            return
        for collection in os.listdir(self.path):
            if os.path.isdir(os.path.join(self.path, collection)):
                yield collection

    def list_packages(self, collection):
        if self.path is None:
            return
        coll_path = os.path.join(self.path, collection)
        if not os.path.exists(coll_path):
            return
        for (dirpath, dirnames, filenames) in os.walk(coll_path):
            if MANIFEST not in filenames:
                continue
            _, id = os.path.split(dirpath)
            if self._make_path(collection, id) == dirpath:
                yield id

    def _make_path(self, collection, package_id):
        # shard packages by the first LENGTH characters of the ID,
        # e.g. <path>/test/d/e/deadbeef/ for package 'deadbeef'
        id = safe_id(package_id, len=LENGTH)
        path = os.path.join(self.path, collection, *id[:LENGTH])
        return os.path.join(path, id)

    def list_resources(self, collection, package_id):
        prefix = self._make_path(collection, package_id)
        if not os.path.exists(prefix):
            return
        skip = os.path.join(prefix, MANIFEST)
        # walk only the directory of this particular package
        for (dirpath, dirnames, filenames) in os.walk(prefix):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                if path == skip:
                    continue
                yield os.path.relpath(path, start=prefix)

    def __repr__(self):
        return '<FileStore(%s)>' % self.path

    def __unicode__(self):
        return self.path


class FileStoreObject(StoreObject):

    def __init__(self, store, collection, package_id, path):
        self.store = store
        self.package_id = package_id
        self.path = path
        pkg_path = self.store._make_path(collection, package_id)
        self._abs_path = os.path.join(pkg_path, path)
        self._abs_dir = os.path.dirname(self._abs_path)
        self._lock = LockFile(self._abs_path)

    def exists(self):
        return os.path.exists(self._abs_path)

    def _prepare(self):
        try:
            os.makedirs(self._abs_dir)
        except OSError:
            pass

    def save_fileobj(self, fileobj):
        self._prepare()
        with self._lock:
            with open(self._abs_path, 'wb') as fh:
                shutil.copyfileobj(fileobj, fh)

    def save_file(self, file_name, destructive=False):
        self._prepare()
        with self._lock:
            if destructive:
                shutil.move(file_name, self._abs_path)
            else:
                shutil.copy(file_name, self._abs_path)

    def save_data(self, data):
        self._prepare()
        with self._lock:
            with open(self._abs_path, 'wb') as fh:
                fh.write(data)
    def load_fileobj(self):
        if not self.exists():
            return
        with self._lock:
            return open(self._abs_path, 'rb')

    def load_data(self):
        if not self.exists():
            return
        with self._lock:
            with open(self._abs_path, 'rb') as fh:
                return fh.read()

    def local_path(self):
        return self._abs_path

    def __repr__(self):
        return '<FileStoreObject(%s)>' % self._abs_path

    def __unicode__(self):
        return self._abs_path
--------------------------------------------------------------------------------
/archivekit/store/s3.py:
--------------------------------------------------------------------------------
import os
from urllib2 import urlopen

from boto.s3.connection import S3Connection, S3ResponseError
from boto.s3.connection import Location

from archivekit.store.common import Store, StoreObject, MANIFEST

# evaluates to the OS path separator; it doubles as the S3 key delimiter
DELIM = os.path.join(' ', ' ').strip()
ALL_USERS = 'http://acs.amazonaws.com/groups/global/AllUsers'


class S3Store(Store):

    def __init__(self, aws_key_id=None, aws_secret=None, bucket_name=None,
                 prefix=None, location=Location.EU, **kwargs):
        if aws_key_id is None:
            aws_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
            aws_secret = os.environ.get('AWS_SECRET_ACCESS_KEY')
        self.aws_key_id = aws_key_id
        self.aws_secret = aws_secret
        if bucket_name is None:
            bucket_name = os.environ.get('AWS_BUCKET_NAME')
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.location = location
        self._bucket = None

    @property
    def bucket(self):
        if self._bucket is None:
            self.conn = S3Connection(self.aws_key_id, self.aws_secret)
            try:
                self._bucket = self.conn.get_bucket(self.bucket_name)
            except S3ResponseError, se:
                if se.status != 404:
                    raise
                self._bucket = self.conn.create_bucket(self.bucket_name,
                                                       location=self.location)
        return self._bucket

    def get_object(self, collection, package_id, path):
        return S3StoreObject(self, collection, package_id, path)

    def _get_prefix(self, collection):
        prefix = collection
        if self.prefix:
            prefix = os.path.join(self.prefix, prefix)
        return os.path.join(prefix, '')

    def list_collections(self):
        prefix = os.path.join(self.prefix, '') if self.prefix else None
        for sub_prefix in self.bucket.list(prefix=prefix, delimiter=DELIM):
            yield sub_prefix.name.rsplit(DELIM, 2)[-2]

    def list_packages(self, collection):
        prefix = self._get_prefix(collection)
        for sub_prefix in self.bucket.list(prefix=prefix, delimiter=DELIM):
            yield sub_prefix.name.rsplit(DELIM, 2)[-2]

    def list_resources(self, collection, package_id):
        prefix = os.path.join(self._get_prefix(collection), package_id)
        skip = os.path.join(prefix, MANIFEST)
        offset = len(skip) - len(MANIFEST)
        for key in self.bucket.get_all_keys(prefix=prefix):
            if key.name == skip:
                continue
            yield key.name[offset:]

    def __repr__(self):
        return '<S3Store(%s/%s)>' % (self.bucket_name, self.prefix)

    def __unicode__(self):
        return os.path.join(self.bucket_name, self.prefix)


class S3StoreObject(StoreObject):

    def __init__(self, store, collection, package_id, path):
        self.store = store
        self.package_id = package_id
        self.path = path
        self._key = None
        self._key_name = os.path.join(collection, package_id, path)
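        # Resulting key layout (sketch): [prefix/]collection/package_id/path,
        # e.g. bar/test/abc123/manifest.json for a store with prefix='bar'.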
        if store.prefix:
            self._key_name = os.path.join(store.prefix, self._key_name)

    @property
    def key(self):
        if self._key is None:
            self._key = self.store.bucket.get_key(self._key_name)
            if self._key is None:
                self._key = self.store.bucket.new_key(self._key_name)
        return self._key

    def exists(self):
        if self._key is None:
            self._key = self.store.bucket.get_key(self._key_name)
        return self._key is not None

    def save_fileobj(self, fileobj):
        self.key.set_contents_from_file(fileobj)

    def save_file(self, file_name, destructive=False):
        with open(file_name, 'rb') as fh:
            self.save_fileobj(fh)

    def save_data(self, data):
        self.key.set_contents_from_string(data)

    def load_fileobj(self):
        return urlopen(self.public_url())

    def load_data(self):
        return self.key.get_contents_as_string()

    def _is_public(self):
        try:
            for grant in self.key.get_acl().acl.grants:
                if grant.permission == 'READ':
                    if grant.uri == ALL_USERS:
                        return True
        except S3ResponseError:
            pass
        return False

    def public_url(self):
        if not self.exists():
            return
        # Welcome to the world of open data:
        if not self._is_public():
            self.key.make_public()
        return self.key.generate_url(expires_in=0,
                                     force_http=True,
                                     query_auth=False)

    def __repr__(self):
        return '<S3StoreObject(%s/%s/%s)>' % (self.store, self.package_id,
                                              self.path)

    def __unicode__(self):
        return self.public_url()
--------------------------------------------------------------------------------
/archivekit/types/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pudo-attic/archivekit/9de8c7f31b9b7b57e3701ce528672e980fd8aa1b/archivekit/types/__init__.py
--------------------------------------------------------------------------------
/archivekit/types/source.py:
--------------------------------------------------------------------------------
from archivekit.resource import Resource


class Source(Resource):
    """ A source file, as initially submitted to the ``archivekit``. """

    GROUP = 'source'

    def __repr__(self):
        return '<Source(%s)>' % self.name
--------------------------------------------------------------------------------
/archivekit/util.py:
--------------------------------------------------------------------------------
import os
import six
from hashlib import sha1
from decimal import Decimal
from slugify import slugify
from datetime import datetime, date


def safe_id(name, len=5):
    """ Remove potential path escapes from a content ID. """
    if name is None:
        return None
    name = slugify(os.path.basename(name)).strip('-')
    name = name.ljust(len, '_')
    return name


def make_secure_filename(source):
    # TODO: don't let users create files called ``manifest.json``.
    if source:
        source = os.path.basename(source).strip()
    source = source or 'source'
    fn, ext = os.path.splitext(source)
    ext = ext or '.raw'
    ext = ext.lower().strip().replace('.', '')
    return source, slugify(fn), ext


def fullpath(filename):
    """ Perform normalization of the source file name. """
    if filename is None:
        return
    # a happy tour through stdlib
    filename = os.path.expanduser(filename)
    filename = os.path.expandvars(filename)
    filename = os.path.normpath(filename)
    return os.path.abspath(filename)


def clean_headers(headers):
    """ Convert HTTP response headers into a common format
    for storing them in the resource meta data. """
    result = {}
    for k, v in dict(headers).items():
        k = k.lower().replace('-', '_')
        result[k] = v
    return result


def checksum(filename):
    hash = sha1()
    with open(filename, 'rb') as fh:
        while True:
            block = fh.read(2 ** 10)
            if not block:
                break
            hash.update(block)
    return hash.hexdigest()


def encode_text(text):
    if isinstance(text, six.text_type):
        return text.encode('utf-8')
    try:
        return text.decode('utf-8').encode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text.encode('ascii', 'replace')


def json_default(obj):
    if isinstance(obj, datetime):
        obj = obj.isoformat()
    if isinstance(obj, Decimal):
        obj = float(obj)
    if isinstance(obj, date):
        return 'new Date(%s)' % obj.isoformat()
    return obj


def json_hook(obj):
    for k, v in obj.items():
        if isinstance(v, basestring):
            try:
                obj[k] = datetime.strptime(v, "new Date(%Y-%m-%d)").date()
            except ValueError:
                pass
            # isoformat() emits microseconds unless they are zero
            for fmt in ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S"):
                try:
                    obj[k] = datetime.strptime(v, fmt)
                    break
                except ValueError:
                    pass
    return obj
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name='archivekit',
    version='0.5.3',
    description="Store a set of files and metadata in an organized way",
    long_description="",
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Operating System :: OS Independent",
        "Programming Language :: Python",
    ],
    keywords='storage filesystem archive bagit bag etl data processing',
    author='Friedrich Lindenberg',
    author_email='friedrich@pudo.org',
    url='http://pudo.org',
    license='MIT',
    packages=find_packages(exclude=['ez_setup', 'examples', 'tests']),
    namespace_packages=[],
    include_package_data=True,
    zip_safe=False,
    test_suite='nose.collector',
    install_requires=[
        "Werkzeug>=0.9.6",
        "lockfile>=0.9.1",
        "python-slugify>=0.0.6",
        "requests>=2.2.0",
        "boto>=2.33",
        "six"
    ],
    entry_points={
        'archivekit.stores': [
            's3 = archivekit.store.s3:S3Store',
            'file = archivekit.store.file:FileStore'
        ],
        'archivekit.resource_types': [
            'source = archivekit.types.source:Source'
        ],
    },
    tests_require=[]
)
--------------------------------------------------------------------------------
/tests/fixtures/test.csv:
--------------------------------------------------------------------------------
company,opencorp_matches
ABB,250
AED Oil Limited,0
AGR,250
AMNI International,1
Addax Petroleum,18
Afren,21
Agip,117
Agip Energy & Natural Resources,0
Agip Nigeria,0
Aibel,42
Aker Borgestad Operations,0
Aker Floating Production,1
Alliance Marine Services,9
Amerada / Hess Shell,0
Amerada Hess,94
Anadarko,250
Anzon,36
Apache,250
Arco,250
Atlantica Tender Drilling Ltd.,0
Atwood Oceanics,8
Australian Worldwide Exploration,0
Axxis Petroconsultants,0
BHP,250
BHP Billiton,180
BLT,250
BP,250
BW Offshore,16
Berlian Laju Tanker (BLT),0
Bluewater,250
Bourbon Offshore,15
Burmi Armada,0
CACT,8
CAMAC Energy,4
CDC,250
CNOOC,29
CNOOC (NOC),0
CNR International,25
CNRL International,0
Cairn Energy,77
Cardiff,250
Chevron,250
ConocoPhillips,250
ConocoPhillips/CNOOC,0
Conoil,7
Coogee,47
Cuu Long JV,0
DPI,250
Devon Energy,99
Diamond Offshore,28
Discovery Offshore S.A.,4
Dolphin Drilling,10
Dryships,2
Dynamic Producer,0
ENI (NOC),0
ENI Agip,0
ENSCO,250
Egyptian Drilling,0
Emas,223
Energean Oil & Gas SA,0
Esso,250
Exmar,54
ExxonMobil,250
ExxonMobil/GEPetrol,0
FPS Ocean,1
FVSB,1
Fred Olsen Energy,1
Fred.Olsen,76
Fred.Olsen Production,2
Frontier Drilling do Brazil,0
GFI O & G,0
GGPC,4
GSP,250
Galoc Production Co.,0
HUnited Kingdom,0
Hercules Assets LLC,0
Hercules Offshore,9
Hess Corp.,298
Ikdam,3
Ikdam Production SA,0
JV - AGR/ Helix,0
JV - North West Shelf,0
JV - Prosafe/ Fred Olsen,0
JV - SBMO/ Partner,0
Jasper Explorer Pte Ltd.,0
Jasper Offshore,1
KCA Deutag,43
Kerr McGee,205
Lonestar Drilling Nigeria Ltd.,0
Lundin Petroleum,3
M3nergy,0
MISC,174
MODEC,63
MODEC T,0
Maersk,250
Maersk Drilling,15
Marathon,250
Matrix Oil,10
Megadrill Services,0
Mitsui Oil Exploration,0
Murphy Oil Sabah,0
Murphy West Africa,1
NNPC,15
Nabors International,5
Nabors Offshore,3
Nexen,154
Nexus Floating Production,1
Nigerian Agip Exploration,0
Noble Drilling,64
North Sea Production,1
Northern Offshore Ltd,22
OMV,111
Oando,20
Ocean Rig Asa,1
Oceaneering,77
Odfjell,133
Oilexco,7
PA Resources,20
PEMEX,32
PGS,250
PTSC,14
PTTEP,8
Pacific Drilling Limited,17
Paragon Offshore,3
Pearl Energy,9
Pemex,32
Perenco,80
Pertamina/ PetroChina,0
Petro-Canada,144
PetroSA/Pioneer Natural Resources,0
PetroViet Nam E&P,0
Petrobras,38
Petronas,78
Petronas Carigali,23
Petroserv SA,25
Premier Oil,140
Premuda,13
Prosafe,93
Rasmussen,250
Reliance,250
Repsol,164
Rowan,250
Rubicon Offshore,5
SBM,250
SBM (First 3 years),0
SBMO,2
SBMO+JV Partner,0
Saipem,44
Santos,250
SapuraKencana,8
Sea Production Ltd,6
SeaWolf Oil Services Limited,0
Seadrill Ltd,118
Secunda Marine,3
SembCorp Marine,2
Sevan Marine,1
Shebah,16
Shebah E&P,0
Shelf Drilling,12
Shell,250
Shell Nigeria,0
Shell Todd,2
Sonangol,41
Sonangol (NOC),0
Songa Floating Production,1
South Atlantic Petroleum,2
Star Deepwater JV,0
Statoil,250
StatoilHydro,3
StatoilHydro/Saga Petroleum,0
Stena Drilling,20
TSJOC,0
Talisman,250
Talisman (Blake f. operated by BG),0
Tanker Pacifi,0
Tanker Pacific,8
Teekay Petrojarl,10
Toisa Horizon,0
Total,250
Transocean Ltd.,250
Tullow,150
Vaalco Energy,4
Vantage Drilling,4
Venture Production,10
Viet Nam Offshore Floating Terminal,0
Vietsovpetro,0
Wood Group,250
Woodside,250
Woodside Energy,11
Zaafarana Oil Company,0
ullow,0
--------------------------------------------------------------------------------
/tests/helpers.py:
--------------------------------------------------------------------------------
import os

FIXTURES = os.path.join(os.path.dirname(__file__), 'fixtures')
DATA_FILE = os.path.join(FIXTURES, 'test.csv')
DATA_URL = 'https://raw.githubusercontent.com/okfn/dpkg-barnet/master/barnet-2009.csv'
--------------------------------------------------------------------------------
/tests/test_file.py:
--------------------------------------------------------------------------------
from tempfile import mkdtemp
from shutil import rmtree
import urllib

from helpers import DATA_FILE, DATA_URL
from archivekit import Collection, open_archive
from archivekit.store.file import FileStore
from archivekit.types.source import Source
from archivekit.util import checksum


def test_basic_package():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)

    assert len(list(coll)) == 0, list(coll)

    pkg = coll.create()
    assert pkg.id is not None, pkg
    assert pkg.exists(), pkg

    pkg = coll.get(None)
    assert not pkg.exists(), pkg

    rmtree(path)


def test_basic_manifest():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    pkg.manifest['foo'] = 'bar'
    pkg.save()

    npkg = coll.get(pkg.id)
    assert npkg.id == pkg.id, npkg
    assert npkg.manifest['foo'] == 'bar', npkg.manifest.items()

    rmtree(path)


def test_archive():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)

    archive = open_archive('file', path=path)
    assert archive.get('test') == coll, archive.get('test')
    colls = list(archive)
    assert len(colls) == 1, colls

    rmtree(path)


def test_collection_ingest():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    pkgs = list(coll)
    assert len(pkgs) == 1, pkgs
    pkg0 = pkgs[0]
    assert pkg0.id == checksum(DATA_FILE), pkg0.id
    sources = list(pkg0.all(Source))
    assert len(sources) == 1, sources
    assert sources[0].name == 'test.csv', sources[0].name
    rmtree(path)


def test_package_ingest_file():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    assert source.meta.get('name') == 'test.csv', source.meta
    assert source.meta.get('extension') == 'csv', source.meta
    assert source.meta.get('slug') == 'test', source.meta
    rmtree(path)


def test_package_get_resource():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    other = pkg.get_resource(source.path)
    assert isinstance(other, Source), other.__class__
    assert other.path == source.path, other
    rmtree(path)


def test_resource_local():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    with source.local() as file_name:
        assert file_name.endswith(source.name), file_name
    rmtree(path)


def test_package_source():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    assert pkg.source is None, pkg.source
    source = pkg.ingest(DATA_FILE)
    other = pkg.source
    assert isinstance(other, Source), other.__class__
    assert other.path == source.path, other
    rmtree(path)


def test_package_ingest_url():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_URL)
    assert source.name == 'barnet-2009.csv', source.name
    assert source.meta['source_url'] == DATA_URL, source.meta

    source = pkg.ingest(urllib.urlopen(DATA_URL))
    assert source.name == 'barnet-2009.csv', source.name
    assert source.meta['source_url'] == DATA_URL, source.meta
    rmtree(path)


def test_package_ingest_fileobj():
    path = mkdtemp()
    store = FileStore(path=path)
    coll = Collection('test', store)
    pkg = coll.create()
    with open(DATA_FILE, 'rb') as fh:
        source = pkg.ingest(fh)
    assert source.name == 'source.raw', source.name
    rmtree(path)
--------------------------------------------------------------------------------
/tests/test_s3.py:
--------------------------------------------------------------------------------
from moto import mock_s3
from StringIO import StringIO

from helpers import DATA_FILE
from archivekit import Collection
from archivekit.store.s3 import S3Store
from archivekit.types.source import Source
from archivekit.util import checksum


@mock_s3
def test_store_loader():
    from archivekit.ext import get_stores
    stores = get_stores()
    assert 's3' in stores, stores
    assert stores['s3'] == S3Store, stores


@mock_s3
def test_open_collection():
    from archivekit import open_collection
    coll = open_collection('test', 's3', bucket_name='foo')
    assert isinstance(coll.store, S3Store), coll.store
    assert coll.store.bucket.name == 'foo', coll.store.bucket


@mock_s3
def test_list_collections():
    store = S3Store(bucket_name='foo', prefix='bar')
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    colls = list(store.list_collections())
    assert len(colls) == 1, colls
    assert colls[0] == coll.name, colls


@mock_s3
def test_basic_package():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)

    assert len(list(coll)) == 0, list(coll)

    pkg = coll.create()
    assert pkg.id is not None, pkg
    assert pkg.exists(), pkg

    pkg = coll.get(None)
    assert not pkg.exists(), pkg


@mock_s3
def test_basic_manifest():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    pkg.manifest['foo'] = 'bar'
    pkg.save()

    npkg = coll.get(pkg.id)
    assert npkg.id == pkg.id, npkg
    assert npkg.manifest['foo'] == 'bar', npkg.manifest.items()


@mock_s3
def test_collection_ingest():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    coll.ingest(DATA_FILE)
    pkgs = list(coll)
    assert len(pkgs) == 1, pkgs
    pkg0 = pkgs[0]
    assert pkg0.id == checksum(DATA_FILE), pkg0.id
    sources = list(pkg0.all(Source))
    assert len(sources) == 1, sources
    assert sources[0].name == 'test.csv', sources[0].name


@mock_s3
def test_package_ingest_file():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    assert source.meta.get('name') == 'test.csv', source.meta
    assert source.meta.get('extension') == 'csv', source.meta
    assert source.meta.get('slug') == 'test', source.meta


@mock_s3
def test_package_local_file():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    source = pkg.ingest(DATA_FILE)
    with source.local() as file_name:
        assert file_name != DATA_FILE, file_name
        assert file_name.endswith('test.csv'), file_name


@mock_s3
def test_package_save_data():
    store = S3Store(bucket_name='test_bucket')
    coll = Collection('test', store)
    pkg = coll.create()
    src = Source(pkg, 'foo.csv')
    src.save_data('huhu!')

    src2 = Source(pkg, 'bar.csv')
    sio = StringIO("bahfhkkjdf")
    src2.save_fileobj(sio)
--------------------------------------------------------------------------------