├── .gitignore ├── .travis.yml ├── DESIGN.md ├── LICENSE ├── README.md ├── example.yaml ├── loadkit ├── __init__.py ├── cli.py ├── extract.py ├── logger.py ├── node.py ├── operators │ ├── __init__.py │ ├── common.py │ ├── ingest.py │ ├── normalize.py │ ├── regex.py │ ├── table.py │ └── text.py ├── pipeline.py ├── tests │ ├── __init__.py │ ├── fixtures │ │ ├── barnet-2009.csv │ │ └── gpc-july-2014.csv │ ├── test_etl.py │ └── util.py ├── types │ ├── __init__.py │ ├── logfile.py │ ├── stage.py │ └── table.py └── util.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | .env 3 | *.sqlite3 4 | *.egg-info 5 | dist/* 6 | demo-data/* 7 | .coverage 8 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | before_install: 5 | - virtualenv ./pyenv --distribute 6 | - source ./pyenv/bin/activate 7 | install: 8 | - python setup.py develop 9 | - pip install coveralls nose moto 10 | script: 11 | - nosetests 12 | - coverage run --source=loadkit setup.py test 13 | after_success: 14 | - coveralls 15 | -------------------------------------------------------------------------------- /DESIGN.md: -------------------------------------------------------------------------------- 1 | # LoadKit 2 | 3 | ** This is the original README for LoadKit. Its purpose has changed significantly since then. ** 4 | 5 | LoadKit is a simple Python-based ETL framework inspired by a discussion about the [OpenSpending](http://openspending.org) data warehouse platform. 6 | 7 | It is intended to accept tabular input files, such as CSV files, Excel spreadsheets and [other formats](https://messytables.readthedocs.org/). The data is kept in a managed file structure locally or uploaded to an S3 bucket together with a JSON metadata file. 
8 | 9 | Once data has been ingested, it can be processed and turned into a series of ``Artifacts``, which are transformed versions of the initial resource. 10 | 11 | Finally, an ``Artifact`` can be loaded into an automatically generated SQL database table in order to be queried for analytical purposes. 12 | 13 | ### Usage 14 | 15 | See ``demo.py`` in the project root. 16 | 17 | ### What is to be done 18 | 19 | * Decide which bits of the ``datapackage`` specification this needs to adhere to. 20 | * Allow passing in some metadata to aid interpretation of the table. 21 | * Include much more data quality assessment tooling and data validation options. 22 | * Does metadata (e.g. on fields) need to be per-resource instead of package-wide? 23 | * Set up custom exceptions and error handling (invalid URLs and file names, too large, parsing failures, loading failures). 24 | * Think about whether the resulting DB must be denormalized. 25 | * Create a Postgres FTS index when loading the data with [sqlalchemy-searchable](https://github.com/kvesteri/sqlalchemy-searchable/). 26 | 27 | ### References 28 | 29 | * [OpenSpending Enhancement Protocol 2](https://github.com/openspending/osep/blob/gh-pages/02-data-storage-and-data-pipeline.md). 
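The final step described above — loading an ``Artifact`` into an automatically generated SQL database table — can be sketched roughly like this, using the stdlib ``sqlite3`` module as a stand-in for the real database backend (the `load_artifact` function and table name are illustrative, not LoadKit's API):

```python
import sqlite3


def load_artifact(rows, table='artifact'):
    """Create a table matching the rows' keys and insert the records."""
    conn = sqlite3.connect(':memory:')
    fields = sorted(rows[0].keys())
    conn.execute('CREATE TABLE %s (%s)' % (table, ', '.join(fields)))
    conn.executemany(
        'INSERT INTO %s (%s) VALUES (%s)' %
        (table, ', '.join(fields), ', '.join('?' for _ in fields)),
        [tuple(row[f] for f in fields) for row in rows])
    conn.commit()
    return conn


conn = load_artifact([{'amount': 117500000, 'year': 2009},
                      {'amount': 112700000, 'year': 2009}])
print(conn.execute('SELECT SUM(amount) FROM artifact').fetchone()[0])
# prints 230200000
```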
30 | 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014-2015 Friedrich Lindenberg 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LoadKit 2 | 3 | [![Build Status](https://travis-ci.org/pudo/loadkit.png?branch=master)](https://travis-ci.org/pudo/loadkit) [![Coverage Status](https://coveralls.io/repos/pudo/loadkit/badge.svg)](https://coveralls.io/r/pudo/loadkit) 4 | 5 | ``loadkit`` is a data and document processing tool. It can be used to 6 | construct multi-stage processing pipelines and to monitor the 7 | execution of tasks through these pipelines. 
8 | 9 | ``loadkit`` will traverse a collection of ``archivekit`` packages, which 10 | contain source documents or data files. The stages of the processing 11 | pipeline will consume these sources and transform them into a series of 12 | derived artifacts. 13 | 14 | ## Installation 15 | 16 | The easiest way of using ``loadkit`` is via PyPI: 17 | 18 | ```bash 19 | $ pip install loadkit 20 | ``` 21 | 22 | Alternatively, check out the repository from GitHub and install it locally: 23 | 24 | ```bash 25 | $ git clone https://github.com/pudo/loadkit.git 26 | $ cd loadkit 27 | $ python setup.py develop 28 | ``` 29 | 30 | 31 | ## Usage 32 | 33 | Each data processing pipeline is defined as a set of operations, divided into two phases: ``generate`` (extraction) and ``process`` (transformation). Operations defined in the ``generate`` phase will be executed once (to import a set of packages), while operations defined in the ``process`` phase will be executed once for each package. 34 | 35 | A pipeline is defined through a YAML file, such as this: 36 | 37 | ```yaml 38 | config: 39 | collections: 40 | my-project: 41 | type: file 42 | path: /srv/my-project 43 | 44 | generate: 45 | docs: 46 | operator: 'ingest' 47 | source: '~/tmp/incoming' 48 | meta: 49 | source: 'Freshly scraped' 50 | 51 | process: 52 | mime: 53 | operator: 'mime_type' 54 | 55 | text: 56 | requires: 'mime' 57 | operator: 'textract' 58 | 59 | index: 60 | requires: ['text', 'mime'] 61 | operator: 'elasticsearch' 62 | url: 'http://bonsai.io/...' 63 | ``` 64 | 65 | As you can see, each operation node is named and can be referenced by others as a required precondition.
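For illustration, the way these ``requires`` preconditions constrain execution order can be resolved with a simple topological pass. This is a simplified stand-in written for this README, not loadkit's internal resolver:

```python
def resolve_order(nodes):
    """Order node names so each runs only after all of its requirements."""
    done, order = set(), []
    while len(done) < len(nodes):
        # Nodes whose requirements have all been satisfied already.
        runnable = [n for n, reqs in sorted(nodes.items())
                    if n not in done and done.issuperset(reqs)]
        if not runnable:
            raise ValueError('Invalid requirements in pipeline!')
        done.add(runnable[0])
        order.append(runnable[0])
    return order


# The transform nodes from the example pipeline above:
nodes = {'mime': set(), 'text': {'mime'}, 'index': {'text', 'mime'}}
print(resolve_order(nodes))  # ['mime', 'text', 'index']
```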
66 | 67 | Such a pipeline can be executed using the following command: 68 | 69 | ```bash 70 | $ loadkit run pipeline.yaml 71 | ``` 72 | 73 | Alternatively, each phase of the process can be executed individually: 74 | 75 | ```bash 76 | $ loadkit extract pipeline.yaml 77 | $ loadkit transform pipeline.yaml 78 | ``` 79 | 80 | ### Available operators 81 | 82 | The library includes a small set of pre-defined operators for document processing. Other operators can also be defined via entry points in Python packages; they will be picked up automatically once installed in the same Python environment. 83 | 84 | * ``ingest``, the default document ingester. It accepts a ``source`` option, which can be a URL, file path or directory name, and an optional ``meta`` mapping of metadata to attach to each package. 85 | 86 | ### Adding new operators 87 | 88 | ``loadkit`` is easily extensible, allowing for the seamless addition of domain-specific or other complex operators in a processing pipeline. Each ``operator`` is a simple Python class that inherits from ``loadkit.Operator``: 89 | 90 | ```python 91 | from loadkit import Operator 92 | 93 | class FileSizeOperator(Operator): 94 | 95 | def process(self, package): 96 | # config is set in the pipeline for each task. 97 | field = self.config.get('field', 'file_size') 98 | 99 | # For details on the package and resource objects, see archivekit. 100 | source = package.source 101 | source.meta[field] = len(source.fh().read()) 102 | source.meta.save() 103 | 104 | # Alternatively, operators can also implement the ``generate(self)`` method. 105 | ``` 106 | 107 | To become available in processing pipelines, the operator must also be registered as an entry point in the Python package's ``setup.py`` like this: 108 | 109 | ```python 110 | ... 111 | setup( 112 | ... 113 | entry_points={ 114 | 'loadkit.operators': [ 115 | 'my_op = my_package:FileSizeOperator' 116 | ] 117 | }, 118 | ...
119 | ) 120 | ``` 121 | 122 | Note that changes to ``setup.py`` only come into effect after the package has been re-installed, or the following command has been executed: 123 | 124 | ```bash 125 | $ python setup.py develop 126 | ``` 127 | 128 | ## License 129 | 130 | ``loadkit`` is open source, licensed under a standard MIT license (included in this repository as ``LICENSE``). 131 | -------------------------------------------------------------------------------- /example.yaml: -------------------------------------------------------------------------------- 1 | config: 2 | collections: 3 | dev: 4 | type: file 5 | path: /Users/fl/Code/loadkit/demo-data 6 | 7 | generate: 8 | docs: 9 | operator: ingest 10 | source: /Users/fl/Code/icij-asx/data/ann 11 | 12 | process: 13 | text: 14 | operator: text_extract 15 | 16 | normalize: 17 | operator: normalize 18 | requires: text 19 | lowercase: True 20 | collapse: True 21 | -------------------------------------------------------------------------------- /loadkit/__init__.py: -------------------------------------------------------------------------------- 1 | from loadkit.operators.common import Operator, TransformOperator 2 | 3 | __all__ = ['Operator', 'TransformOperator'] 4 | -------------------------------------------------------------------------------- /loadkit/cli.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import yaml 4 | import click 5 | from archivekit import open_collection 6 | 7 | from loadkit.util import LoadKitException 8 | from loadkit.pipeline import Pipeline 9 | 10 | 11 | log = logging.getLogger(__name__) 12 | 13 | 14 | def execute_pipeline(ctx, fh, operation): 15 | try: 16 | config = yaml.load(fh.read()) 17 | fh.close() 18 | except Exception, e: 19 | raise click.ClickException("Cannot parse pipeline: %s" % e) 20 | if 'config' not in config: 21 | config['config'] = {} 22 | 23 | collections = ctx.pop('collections', []) 24 | 
config['config'].update(ctx) 25 | config['config']['threads'] = ctx.pop('threads', None) 26 | 27 | collection_configs = config['config'].pop('collections', {}) 28 | if not len(collections): 29 | collections = collection_configs.keys() 30 | collections = [c for c in collections if c in collection_configs] 31 | 32 | for cname in collections: 33 | cconfig = collection_configs.get(cname) 34 | coll = open_collection(cname, cconfig.pop('type'), **cconfig) 35 | try: 36 | pipeline = Pipeline(coll, fh.name, config=config) 37 | getattr(pipeline, operation)() 38 | except LoadKitException, de: 39 | raise click.ClickException(unicode(de)) 40 | 41 | 42 | @click.group() 43 | @click.option('-c', '--collections', multiple=True, 44 | help='The configured collection name to use.') 45 | @click.option('-t', '--threads', default=None, type=int, 46 | help='Number of threads to process data') 47 | @click.option('-d', '--debug', default=False, is_flag=True, 48 | help='Verbose output for debugging') 49 | @click.pass_context 50 | def cli(ctx, collections, threads, debug): 51 | """ A configurable data and document processing tool. """ 52 | ctx.obj = { 53 | 'collections': collections, 54 | 'debug': debug, 55 | 'threads': threads 56 | } 57 | if debug: 58 | logging.basicConfig(level=logging.DEBUG) 59 | else: 60 | logging.basicConfig(level=logging.INFO) 61 | 62 | 63 | @cli.command() 64 | @click.argument('pipeline', type=click.File('rb')) 65 | @click.pass_obj 66 | def run(ctx, pipeline): 67 | """ Execute the given PIPELINE. """ 68 | execute_pipeline(ctx, pipeline, 'run') 69 | 70 | 71 | @cli.command() 72 | @click.argument('pipeline', type=click.File('rb')) 73 | @click.pass_obj 74 | def extract(ctx, pipeline): 75 | """ Execute the extractors in PIPELINE. """ 76 | execute_pipeline(ctx, pipeline, 'generate') 77 | 78 | 79 | @cli.command() 80 | @click.argument('pipeline', type=click.File('rb')) 81 | @click.pass_obj 82 | def transform(ctx, pipeline): 83 | """ Execute the transformers in PIPELINE.
""" 84 | execute_pipeline(ctx, pipeline, 'transform') 85 | -------------------------------------------------------------------------------- /loadkit/extract.py: -------------------------------------------------------------------------------- 1 | 2 | def from_file(package, source_file): 3 | return package.ingest(source_file) 4 | 5 | 6 | def from_fileobj(package, fileobj, source_name=None): 7 | meta = {'source_file': source_name} 8 | return package.ingest(from_fileobj, meta=meta) 9 | 10 | 11 | def from_url(package, source_url): 12 | return package.ingest(source_url) 13 | -------------------------------------------------------------------------------- /loadkit/logger.py: -------------------------------------------------------------------------------- 1 | import time 2 | import logging 3 | import tempfile 4 | import shutil 5 | 6 | from loadkit.types.logfile import LogFile 7 | 8 | SEP = '-||-' 9 | FORMAT = '%%(asctime)s %s %%(name)s %s %%(levelname)s %s %%(message)s' % \ 10 | (SEP, SEP, SEP) 11 | 12 | 13 | class LogFileHandler(logging.FileHandler): 14 | """ Log to a temporary local file, then dump to a bucket. """ 15 | 16 | def __init__(self, package, prefix): 17 | self.package = package 18 | self.prefix = prefix 19 | self.tmp = tempfile.NamedTemporaryFile() 20 | # to be reopened by the super class 21 | self.tmp.close() 22 | super(LogFileHandler, self).__init__(self.tmp.name) 23 | 24 | def archive(self): 25 | self.close() 26 | name = '%s/%s.log' % (self.prefix, int(time.time() * 1000)) 27 | logfile = LogFile(self.package, name) 28 | with open(self.tmp.name, 'rb') as fh: 29 | logfile.save_fileobj(fh) 30 | 31 | 32 | def capture(package, prefix, modules=[], level=logging.DEBUG): 33 | """ Capture log messages for the given modules and archive 34 | them to a ``LogFile`` resource. 
""" 35 | handler = LogFileHandler(package, prefix) 36 | formatter = logging.Formatter(FORMAT) 37 | handler.setFormatter(formatter) 38 | modules = set(modules + ['loadkit']) 39 | 40 | for logger in modules: 41 | if not hasattr(logger, 'addHandler'): 42 | logger = logging.getLogger(logger) 43 | logger.setLevel(level=level) 44 | logger.addHandler(handler) 45 | 46 | return handler 47 | 48 | 49 | def load(package, prefix, offset=0, limit=1000): 50 | """ Load lines from the log file with pagination support. """ 51 | logs = package.all(LogFile, unicode(prefix)) 52 | logs = sorted(logs, key=lambda l: l.name, reverse=True) 53 | seen = 0 54 | record = None 55 | tmp = tempfile.NamedTemporaryFile(suffix='.log') 56 | for log in logs: 57 | shutil.copyfileobj(log.fh(), tmp) 58 | tmp.seek(0) 59 | for line in reversed(list(tmp)): 60 | seen += 1 61 | if seen < offset: 62 | continue 63 | if seen > limit: 64 | tmp.close() 65 | return 66 | try: 67 | d, mo, l, m = line.split(' %s ' % SEP, 4) 68 | if record is not None: 69 | yield record 70 | record = {'time': d, 'module': mo, 'level': l, 'message': m} 71 | except ValueError: 72 | if record is not None: 73 | record['message'] += '\n' + line 74 | tmp.seek(0) 75 | tmp.close() 76 | if record is not None: 77 | yield record 78 | -------------------------------------------------------------------------------- /loadkit/node.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from loadkit.util import ConfigException 4 | from loadkit.operators import load_operator 5 | 6 | log = logging.getLogger(__name__) 7 | 8 | 9 | def resolve_dependencies(nodes): 10 | """ Figure out which order the nodes in the graph can be executed 11 | in to satisfy all requirements. 
""" 12 | done = set() 13 | while True: 14 | if len(done) == len(nodes): 15 | break 16 | for node in nodes: 17 | if node.name not in done: 18 | match = done.intersection(node.requires) 19 | if len(match) == len(node.requires): 20 | done.add(node.name) 21 | yield node 22 | break 23 | else: 24 | raise ConfigException('Invalid requirements in pipeline!') 25 | 26 | 27 | class Node(object): 28 | 29 | def __init__(self, pipeline, name, config): 30 | self.name = name 31 | self.config = config 32 | self.pipeline = pipeline 33 | self._operator = None 34 | self._requires = None 35 | 36 | @property 37 | def operator(self): 38 | if self._operator is None: 39 | op_name = self.config.get('operator') 40 | operator = load_operator(op_name) 41 | self._operator = operator(self.pipeline, self.name, self.config) 42 | return self._operator 43 | 44 | def generate(self): 45 | log.debug("Running extract: %s (%s)" % (self.name, self.operator.type)) 46 | self.operator.generate() 47 | 48 | def process(self, package): 49 | log.debug("Running transform: %s (%s) on %r" % 50 | (self.name, self.operator.type, package)) 51 | self.operator.process(package) 52 | 53 | def finalize(self): 54 | self.operator.finalize() 55 | 56 | @property 57 | def requires(self): 58 | if self._requires is None: 59 | reqs = self.config.get('requires', []) 60 | if reqs is None: 61 | reqs = [] 62 | if not isinstance(reqs, (list, tuple, set)): 63 | reqs = [unicode(reqs)] 64 | self._requires = set(reqs) 65 | return self._requires 66 | 67 | def __repr__(self): 68 | return '' % (self.name, self.operator.type) 69 | -------------------------------------------------------------------------------- /loadkit/operators/__init__.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import logging 4 | import importlib 5 | 6 | from pkg_resources import iter_entry_points 7 | 8 | from loadkit.util import ConfigException 9 | 10 | NAMESPACE = 'loadkit.operators' 11 | OPERATORS 
= {} 12 | 13 | log = logging.getLogger(__name__) 14 | 15 | 16 | def load(): 17 | if not len(OPERATORS): 18 | for ep in iter_entry_points(NAMESPACE): 19 | OPERATORS[ep.name] = ep.load() 20 | log.info('Available operators: %s', ', '.join(OPERATORS.keys())) 21 | return OPERATORS 22 | 23 | 24 | def load_operator(name): 25 | operators = load() 26 | if name in operators: 27 | return operators.get(name) 28 | try: 29 | if ':' in name: 30 | if os.getcwd() not in sys.path: 31 | sys.path.append(os.getcwd()) 32 | module, cls = name.split(':', 1) 33 | return getattr(importlib.import_module(module), cls) 34 | except ImportError: 35 | pass 36 | raise ConfigException('Invalid operator: %r' % name) 37 | -------------------------------------------------------------------------------- /loadkit/operators/common.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | log = logging.getLogger(__name__) 4 | 5 | 6 | class Operator(object): 7 | """ A simple operator, working on a particular package or 8 | generating packages through an ingest process. """ 9 | 10 | def __init__(self, pipeline, name, config): 11 | self.pipeline = pipeline 12 | self.name = name 13 | self.config = config 14 | 15 | def generate(self): 16 | pass 17 | 18 | def process(self, package): 19 | pass 20 | 21 | def finalize(self): 22 | pass 23 | 24 | @property 25 | def type(self): 26 | return self.__class__.__name__ 27 | 28 | def __repr__(self): 29 | return '<%s(%r)>' % (self.type, self.name) 30 | 31 | 32 | class SourceOperator(Operator): 33 | """ An operator with an input resource that must exist 34 | in order for it to run. The resource name is given as a class 35 | constant, and when transforming, the resource is passed into 36 | an ``analyze`` function which must be overridden.
""" 37 | 38 | DEFAULT_SOURCE = None 39 | 40 | def analyze(self, source): 41 | raise NotImplemented() 42 | 43 | def process(self, package): 44 | source_path = self.config.get('source', self.DEFAULT_SOURCE) 45 | if source_path is not None: 46 | source = package.get_resource(source_path) 47 | else: 48 | source = package.source 49 | 50 | if source is None: 51 | log.warn("No source configured for operator %r", self.type) 52 | return 53 | 54 | if not source.exists(): 55 | log.debug("Missing source for operator %r: %r", self.type, source) 56 | return 57 | 58 | return self.analyze(source) 59 | 60 | 61 | class TransformOperator(SourceOperator): 62 | """ Similar to the ``SourceOperator``, this operator transforms 63 | a given resource into another resource. Both resource names are 64 | given as constants to subclasses, and passed into the 65 | ``transform()`` method, which must be sub-classed. """ 66 | 67 | DEFAULT_TARGET = None 68 | 69 | def transform(self, source, target): 70 | raise NotImplemented() 71 | 72 | def analyze(self, source): 73 | target_path = self.config.get('target', self.DEFAULT_TARGET) 74 | if target_path is None: 75 | log.error("No target for operator %r", self.type) 76 | return 77 | 78 | target = source.package.get_resource(target_path) 79 | 80 | if target.exists() and not self.config.get('overwrite', False): 81 | log.debug("Skipping operator %r: %r exists", self.type, target) 82 | return 83 | 84 | return self.transform(source, target) 85 | -------------------------------------------------------------------------------- /loadkit/operators/ingest.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from loadkit import Operator 4 | 5 | log = logging.getLogger(__name__) 6 | 7 | 8 | class IngestOperator(Operator): 9 | 10 | def generate(self): 11 | source = self.config.get('source') 12 | if source is None or not len(source.strip()): 13 | log.error('Invalid source for %s: %r', self.name, source) 14 | 
else: 15 | log.info('Ingesting content from %s...', source) 16 | meta = self.config.get('meta', {}) 17 | self.pipeline.collection.ingest(source, meta=meta) 18 | -------------------------------------------------------------------------------- /loadkit/operators/normalize.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | from normality import normalize 4 | 5 | from loadkit.types.stage import Stage 6 | from loadkit.operators.common import TransformOperator 7 | 8 | log = logging.getLogger(__name__) 9 | 10 | 11 | class NormalizeOperator(TransformOperator): 12 | """ Simplify a piece of text to generate a more canonical 13 | representation. This involves lowercasing, stripping trailing 14 | spaces, removing symbols, diacritical marks (umlauts) and 15 | converting all newlines etc. to single spaces. 16 | """ 17 | 18 | DEFAULT_SOURCE = os.path.join(Stage.GROUP, 'plain.txt') 19 | DEFAULT_TARGET = os.path.join(Stage.GROUP, 'normalized.txt') 20 | 21 | def transform(self, source, target): 22 | text = source.data() 23 | text = normalize(text, lowercase=self.config.get('lowercase', True), 24 | transliterate=self.config.get('transliterate', False), 25 | collapse=self.config.get('collapse', True)) 26 | target.save_data(text.encode('utf-8')) 27 | -------------------------------------------------------------------------------- /loadkit/operators/regex.py: -------------------------------------------------------------------------------- 1 | import re 2 | import logging 3 | 4 | from loadkit.operators.common import SourceOperator 5 | 6 | log = logging.getLogger(__name__) 7 | 8 | 9 | class RegExOperator(SourceOperator): 10 | 11 | @property 12 | def re(self): 13 | if not hasattr(self, '_re'): 14 | term = self.config.get('term', []) 15 | terms = self.config.get('terms', [term]) 16 | self._re = re.compile('(%s)' % '|'.join(terms)) 17 | return self._re 18 | 19 | def analyze(self, source): 20 | matches_field = 
self.config.get('field_matches', 'matches') 21 | total_field = self.config.get('field_total') 22 | 23 | matches = {} 24 | for match in self.re.findall(source.fh().read()): 25 | score = matches.get(match, 0) 26 | matches[match] = score + 1 27 | 28 | if total_field is not None: 29 | source.meta[total_field] = sum(matches.values()) 30 | 31 | source.meta[matches_field] = matches 32 | source.meta.save() 33 | -------------------------------------------------------------------------------- /loadkit/operators/table.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import random 4 | from decimal import Decimal 5 | from datetime import datetime 6 | 7 | from normality import slugify 8 | from messytables import any_tableset, type_guess 9 | from messytables import types_processor, headers_guess 10 | from messytables import headers_processor, offset_processor 11 | 12 | from loadkit.types.table import Table 13 | from loadkit.operators.common import TransformOperator 14 | 15 | log = logging.getLogger(__name__) 16 | 17 | 18 | def resource_row_set(package, resource): 19 | """ Generate an iterator over all the rows in this resource's 20 | source data. """ 21 | # This is a work-around because messytables hangs on boto file 22 | # handles, so we're doing it via plain old HTTP. 
23 | table_set = any_tableset(resource.fh(), 24 | extension=resource.meta.get('extension'), 25 | mimetype=resource.meta.get('mime_type')) 26 | tables = list(table_set.tables) 27 | if not len(tables): 28 | log.error("No tables were found in the source file.") 29 | return 30 | 31 | row_set = tables[0] 32 | offset, headers = headers_guess(row_set.sample) 33 | row_set.register_processor(headers_processor(headers)) 34 | row_set.register_processor(offset_processor(offset + 1)) 35 | types = type_guess(row_set.sample, strict=True) 36 | row_set.register_processor(types_processor(types)) 37 | return row_set 38 | 39 | 40 | def column_alias(cell, names): 41 | """ Generate a normalized version of the column name. """ 42 | column = slugify(cell.column or '', sep='_') 43 | column = column.strip('_') 44 | column = 'column' if not len(column) else column 45 | name, i = column, 2 46 | # de-dupe: column, column_2, column_3, ... 47 | while name in names: 48 | name = '%s_%s' % (name, i) 49 | i += 1 50 | return name 51 | 52 | 53 | def generate_field_spec(row): 54 | """ Generate a set of metadata for each field/column in 55 | the data. This is loosely based on jsontableschema. """ 56 | names = set() 57 | fields = [] 58 | for cell in row: 59 | name = column_alias(cell, names) 60 | field = { 61 | 'name': name, 62 | 'title': cell.column, 63 | 'type': unicode(cell.type).lower(), 64 | 'has_nulls': False, 65 | 'has_empty': False, 66 | 'samples': [] 67 | } 68 | if hasattr(cell.type, 'format'): 69 | field['type'] = 'date' 70 | field['format'] = cell.type.format 71 | fields.append(field) 72 | return fields 73 | 74 | 75 | def random_sample(value, field, row, num=10): 76 | """ Collect a random sample of the values in a particular 77 | field based on the reservoir sampling technique. """ 78 | # TODO: Could become a more general DQ piece. 
79 | if value is None: 80 | field['has_nulls'] = True 81 | return 82 | if value in field['samples']: 83 | return 84 | if isinstance(value, basestring) and not len(value.strip()): 85 | field['has_empty'] = True 86 | return 87 | if len(field['samples']) < num: 88 | field['samples'].append(value) 89 | return 90 | j = random.randint(0, row) 91 | if j < (num - 1): 92 | field['samples'][j] = value 93 | 94 | 95 | def parse_table(row_set, save_func): 96 | num_rows = 0 97 | fields = {} 98 | 99 | for i, row in enumerate(row_set): 100 | if not len(fields): 101 | fields = generate_field_spec(row) 102 | 103 | data = {} 104 | for cell, field in zip(row, fields): 105 | value = cell.value 106 | if isinstance(value, datetime): 107 | value = value.date() 108 | if isinstance(value, Decimal): 109 | # Baby jesus forgive me. 110 | value = float(value) 111 | if isinstance(value, basestring) and not len(value.strip()): 112 | value = None 113 | data[field['name']] = value 114 | random_sample(value, field, i) 115 | 116 | check_empty = set(data.values()) 117 | if None in check_empty and len(check_empty) == 1: 118 | continue 119 | 120 | save_func(data) 121 | num_rows = i 122 | 123 | fields = {f.get('name'): f for f in fields} 124 | return num_rows, fields 125 | 126 | 127 | class TableExtractOperator(TransformOperator): 128 | """ This operator will extract tabular data from the source 129 | file in a package. It recognizes a variety of source formats, 130 | including CSV, Excel, etc. The operator will convert them to 131 | a line-based JSON format which can be easily serialized and 132 | deserialized. 
""" 133 | 134 | DEFAULT_TARGET = os.path.join(Table.GROUP, 'table.json') 135 | 136 | def transform(self, source, target): 137 | target.meta.update(source.meta) 138 | 139 | with target.store() as save: 140 | row_set = resource_row_set(source.package, source) 141 | num_rows, fields = parse_table(row_set, save) 142 | 143 | log.info("Converted %s rows with %s columns.", num_rows, len(fields)) 144 | target.meta['fields'] = fields 145 | target.meta['num_records'] = num_rows 146 | target.meta.save() 147 | -------------------------------------------------------------------------------- /loadkit/operators/text.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | 4 | from archivekit.util import encode_text 5 | 6 | from loadkit.types.stage import Stage 7 | from loadkit.operators.common import TransformOperator 8 | 9 | log = logging.getLogger(__name__) 10 | 11 | HTML_EXT = ['html', 'htm'] 12 | HTML_MIME = 'text/html' 13 | OCR_EXT = ['pdf', 'png', 'jpg', 'jpeg', 'bmp', 'gif'] 14 | 15 | 16 | def text_empty(text): 17 | if text is None: 18 | return True 19 | return len(text.strip()) <= 0 20 | 21 | 22 | def extract_content(resource): 23 | if resource.meta.get('extract_article'): 24 | try: 25 | from newspaper import Article 26 | article = Article(resource.meta.get('source_url')) 27 | article.download(html=resource.data()) 28 | article.parse() 29 | if article.title and not resource.meta.get('title'): 30 | resource.meta['title'] = article.title 31 | if article.text: 32 | return article.text 33 | except ImportError: 34 | log.error('Newspaper is not installed.') 35 | except Exception, e: 36 | log.exception(e) 37 | 38 | try: 39 | from textract.parsers import process 40 | with resource.local() as file_name: 41 | text = process(file_name) 42 | if resource.meta.get('extension') in OCR_EXT and text_empty(text): 43 | log.info("Using OCR for: %r", resource) 44 | text = process(file_name, method='tesseract') 45 | return text 
46 | except ImportError: 47 | log.error('Textract is not installed.') 48 | except Exception, e: 49 | err = unicode(e).split('\n')[0] 50 | log.error('Textract failed: %s', err) 51 | 52 | 53 | class TextExtractOperator(TransformOperator): 54 | 55 | DEFAULT_TARGET = os.path.join(Stage.GROUP, 'plain.txt') 56 | 57 | def transform(self, source, target): 58 | text = extract_content(source) 59 | if not text_empty(text): 60 | # TODO: copy metadata? 61 | target.save_data(encode_text(text)) 62 | source.package.save() 63 | -------------------------------------------------------------------------------- /loadkit/pipeline.py: -------------------------------------------------------------------------------- 1 | import time 2 | import logging 3 | import multiprocessing 4 | try: 5 | from queue import Queue 6 | except ImportError: 7 | from Queue import Queue 8 | from threading import Thread 9 | 10 | from loadkit.node import Node, resolve_dependencies 11 | 12 | GENERATE = 'generate' 13 | PROCESS = 'process' 14 | 15 | log = logging.getLogger(__name__) 16 | 17 | 18 | class Pipeline(object): 19 | """ A pipeline is defined by a set of operators which are 20 | executed in a given sequence based on their mutual 21 | dependencies. The whole pipeline consists of three main 22 | phases: one, in which packages are generated, the second, 23 | in which packages are transformed, and the third, in which 24 | final tasks are performed. 
""" 25 | 26 | def __init__(self, collection, name, config=None): 27 | self.config = dict() 28 | self.collection = collection 29 | self.name = name 30 | 31 | if config is not None: 32 | self.config.update(config) 33 | 34 | self.threads = config.get('config', {}).get('threads') 35 | if self.threads is None: 36 | self.threads = multiprocessing.cpu_count() * 2 37 | 38 | self._nodes = None 39 | 40 | self._queue = None 41 | 42 | @property 43 | def queue(self): 44 | if self._queue is None: 45 | self._queue = Queue(maxsize=self.threads * 100) 46 | for i in range(self.threads): 47 | thread = Thread(target=self._process_packages) 48 | thread.daemon = True 49 | thread.start() 50 | return self._queue 51 | 52 | @property 53 | def nodes(self): 54 | if self._nodes is None: 55 | self._nodes = {GENERATE: [], PROCESS: []} 56 | 57 | for phase in self._nodes.keys(): 58 | for name, config in self.config.get(phase, {}).items(): 59 | base = self.config.get('config', {}).copy() 60 | base.update(config) 61 | node = Node(self, name, base) 62 | self._nodes[phase].append(node) 63 | return self._nodes 64 | 65 | def generate(self): 66 | for node in self.nodes[GENERATE]: 67 | node.generate() 68 | 69 | def process(self): 70 | for package in self.collection: 71 | self.queue.put(package) 72 | 73 | try: 74 | while True: 75 | if self.queue.empty(): 76 | break 77 | time.sleep(0.1) 78 | except KeyboardInterrupt: 79 | pass 80 | 81 | def process_sync(self): 82 | for package in self.collection: 83 | self.process_package(package) 84 | 85 | def process_package(self, package): 86 | try: 87 | for node in resolve_dependencies(self.nodes[PROCESS]): 88 | node.process(package) 89 | except Exception, e: 90 | log.exception(e) 91 | 92 | def _process_packages(self): 93 | while True: 94 | try: 95 | package = self.queue.get(True) 96 | self.process_package(package) 97 | finally: 98 | self.queue.task_done() 99 | 100 | def finalize(self): 101 | for phase, nodes in self.nodes.items(): 102 | for node in nodes: 103 | 
                node.finalize()
104 | 
105 |     def run(self):
106 |         self.generate()
107 |         self.process()
108 |         self.finalize()
109 | 
110 |     def __repr__(self):
111 |         return "<Pipeline(%r)>" % self.name
112 | 
--------------------------------------------------------------------------------
/loadkit/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pudo-attic/loadkit/1fb17e69e2ffaf3dac4f40b574c3b7afb2198b7c/loadkit/tests/__init__.py
--------------------------------------------------------------------------------
/loadkit/tests/fixtures/barnet-2009.csv:
--------------------------------------------------------------------------------
1 | Level 1,Level 2,Level 3,Amount,Year,Detailed Description,unique_id
2 | Schools,Primary Schools,,117500000,2009,This is provided by the government for the 87 primary schools in the borough. The funding is set by central government and calculated based on the number of pupils and other factors. The funding goes via the local authority directly to the schools.,1
3 | Schools,Secondary Schools,,112700000,2009,Provided by the government for the 20 secondary schools in the borough (not including academies). The funding is set by central government and calculated based on the number of pupils and other factors. The funding goes via the local authority directly to the schools.,2
4 | Schools,Centrally Retained Schools Budget,,21500000,2009,"Support services to schools.
Includes: 5 | 6 | - additional support for students with special educational needs 7 | - pupil referral units for children who are educated outside of a mainstream school 8 | - teaching for children who cannot attend school due to illness 9 | - specialist services for children with disabilities (including support for children with autism, hearing impairment, visual impairment etc) 10 | - co-ordination of admissions to primary and secondary schools.",3 11 | Schools,Special Schools,,7400000,2009,"This pays for the running of Barnet's four special schools, for children with a range of complex needs.",4 12 | Schools,Pre School,,7400000,2009,"Provided by the government for early years and nursery education in private, voluntary and independent settings, as well as children's centres. All three and four year olds are entitled to a limited amount of free nursery provision (15 hours per week from September 2010). Parents can choose to take this up in a Barnet school or a private, voluntary or independent setting.",5 13 | Keeping Children Safe,Children in Care,,19800000,2009,"Supports, on average, 320 children in the council's care. Includes accommodation (only those placed with independent fostering agencies or other external placements are included in this budget), food, clothing, transport and other living expenses, as well as social care support. Support is also provided to young people who have left care, up to the age of 25, to help them find suitable accommodation and to enter education, employment or training.",6 14 | Keeping Children Safe,Services for Schools,,11100000,2009,"A wide range of services that the council provides to schools. Includes: 15 | 16 | - support for school improvement, advice and guidance on the curriculum 17 | - monitoring of performance. Training for school staff with extra support schools causing concern. 
Educational welfare service to improve school attendance and prevent exclusions 18 | - support for school governors 19 | - music service 20 | - sports development service.",7 21 | Keeping Children Safe,Support for Vulnerable Families,,18700000,2009,"Includes: 22 | 23 | - 21 Children's Centres providing advice, support, childcare and a range of services to parents across the borough 24 | - parenting programmes to support families to develop their parenting skills 25 | - short breaks for disabled children 26 | - extended services in and around schools 27 | - training for early years practitioners and childminders 28 | - 'FYI' information service, giving information for families about childcare, holiday activities and other support services.",8 29 | Keeping Children Safe,Disabled Children's Services,,9500000,2009,"Includes: 30 | 31 | - support to families with disabled children to enable children to stay at home (e.g. short breaks, respite care, childcare). Social work teams supporting families with disabled children 32 | - school transport costs for children with Special Educational Needs to travel to school 33 | - occupational therapy services for disabled children 34 | - this budget also covers the costs associated with statements of Special Educational Needs.",9 35 | Keeping Children Safe,Safeguarding,,6600000,2009,"The safeguarding budget covers the cost of social work support for children at risk of harm which includes: 36 | 37 | - the assessment of children referred to social care 38 | - support and monitoring of children subject to child protection plans 39 | - investigation of allegations against adults working with children or young people 40 | - training on safeguarding and child protection for education professionals across Barnet 41 | - mental health support for children and young people",10 42 | Keeping Children Safe,Fostering and Adoption - Internal,,7100000,2009,"Includes: 43 | 44 | - the placement of children for adoption, and finding and supporting 
adoptive families 45 | - recruitment and training of foster carers 46 | - fostering and adoption allowances paid to those carers 47 | - two residential children's homes for children in care.",11 48 | Keeping Children Safe,Youth Activities,,7200000,2009,"Services provided for young people. 49 | Includes: 50 | 51 | - Connexions service - employment and career advice to young people through schools and colleges 52 | - youth projects and youth centres 53 | - holiday activities accessible to all young people in the borough. Arts and music activities including Finchley Youth Theatre, the Rithmik music project and other projects 54 | - after school schemes 55 | - the costs of running the Barnet Youth Board 56 | - work with young offenders and preventative work with young people at risk of offending.",12 57 | Housing Benefits,Housing Benefits,,216400000,2009,"This is funded from central government - the council processes it locally and there is no scope for local reduction. 58 | Includes: 59 | 60 | - Housing Benefit paid to Council Tenants 61 | - Housing Benefit paid to Private and Social Sector Tenants 62 | - Council Tax Benefit. 63 | - Currently, there are 32,227 people receiving these benefits in the borough.",13 64 | Environment and operations,Street Cleaning,,4900000,2009,"The Street Scene service is based at Mill Hill Depot and is responsible for refuse collection, trade waste and street cleaning services. 
65 | 66 | - the annual cost for six vehicular street sweepers: £161,000 67 | - annual cost to run each of the refuse crews (truck, loaders and driver) plying Barnet's roads: £115,000 68 | - annual cost of clearing up illegal fly tipping on Barnet's roads: £155,000 69 | - annual cost of the regular four-weekly residential sweep: £1.9m",14 70 | Environment and operations,Transport,,1200000,2009,"The service: 71 | 72 | - manages and arranges home to school transport for 634 pupils with Special Educational Needs (SEN) 73 | - manages the transport of 300 elderly and vulnerable adults to day centres 74 | - has responsibility for procuring and maintaining the council's fleet of nearly 300 vehicles.",15 75 | Environment and operations,Waste Collection (plus NLWA),,10600000,2009,"North London Waste Authority (NLWA) -- this is the joint waste disposal authority for seven London boroughs -- Barnet, Camden, Enfield, Hackney, Haringey, Islington and Waltham Forest. NLWA is responsible for the disposal of all waste collected by the boroughs. 76 | 77 | - the annual contract cost to provide recycling services for houses, flats, schools, bring banks and the civic amenity site: £3.9m 78 | - the levy paid to NLWA in 2009/10 for household and other waste: £8.74m",16 79 | Environment and operations,Highways and Pavement Maintenance,,6700000,2009,"We implement improvement and maintenance schemes to ensure that Barnet's highways network is fit for purpose and serves the local community. This includes maintenance of the carriageway and footway network.. We also manage works carried out to the highway by utility companies. 
The team also manages the issuing of licences to allow various operations to take place on the highway network including the placing of skips, scaffolding, hoardings, builders materials etc and regular inspection of the highway network to identify the general condition and, in particular, to identify and deal with any safety hazards.",17 80 | Environment and operations,Street Lighting,,4900000,2009,"This pays for the installations, running costs and maintenance of 25,977 street lights in Barnet.",18 81 | Environment and operations,Greenspaces,,6400000,2009,"The council manages over 200 parks and open spaces across the borough including day to day maintenance and the booking/letting of sport facilities. The service is also responsible for tree management across the borough (at a cost of £640,000 each year) and maintenance issues around the public highway (e.g. grass cutting, flowerbeds, hedges and shrubs etc) and 46 allotment sites.",19 82 | Environment and operations,Parking,,6400000,2009,"The service is responsible for the day-to-day management of car and motorcycle parking, on-street parking and parking permits. The team also enforces parking legislation in the borough.",20 83 | Environment and operations,Trading Standards and Community Safety (Community Protection Group),,3000000,2009,"The work of the Community Protection Group (CPG) covers six functional teams: Community Safety, Drugs and Alcohol, Priority Intervention Team (PIT), Trading Standards and Licensing, CCTV and Crime Intelligence. 84 | The CPG works with the police to reduce crime and anti-social behaviour in Barnet. 
They work in partnerships with community groups and police to make sure our neighbourhoods are clean and safe and to encourage residents to take greater pride in their local area by tackling anti-social behaviour such as littering, fly-tipping and graffiti (it currently costs £240,000 per year to clear up graffiti).",21 85 | Adult Social Services,Registered Care Homes,,50300000,2009,"Around 1,100 Barnet residents are supported in long-term registered care homes. 770 are people over 65 with dementia and other age-related social care needs. 220 are individuals under 65 with learning disabilities. The remainder are those under 65 with physical, sensory and mental health disabilities. 86 | Residents of registered care homes make a means-tested contribution to their care costs. This currently contributes £8.9m to the £50.3m cost of care provision",22 87 | Adult Social Services,Support in the Home,,26900000,2009,"The council aims to promote personal independence, supporting people to remain in their own homes. Including: 88 | 89 | - Delivering personal care in the home to 2,857 people. 82% are over 65, with the remaining 18% being younger adults with physical, sensory, learning or mental health disabilities 90 | - Providing intensive rehabilitation to enable 586 people to live at home more independently. Of these, 316 are helped to live independently without the need of ongoing care to support them 91 | - Providing 339,000 hours of housing-related support to vulnerable people through the Supporting People Programme to enable them to live more independently. 92 | 93 | People who receive home care services make a means tested contribution to the costs of their support. In 2010/11, individuals contributed £1.1m to the costs of their homecare.",23 94 | Adult Social Services,Day Services,,13000000,2009,"As part of meeting individuals' social care needs, the council provides a variety of day opportunity services. 
Overall, 1,045 individuals received day social care services last year. Services include: 95 | 96 | - centres provide stimulating care and support for elderly residents, in particular those with dementia 97 | - services for younger adults with learning disabilities, physical disabilities, sensory impairment and mental health problems - aiming to improve their quality of life, letting them maximise their potential to live independently by providing a wide range of services from peer support to preparation for getting paid employment 98 | - hubs providing daytime opportunities for adults with profound and multiple learning and physical disabilities. 99 | 100 | Individuals are currently not required to make a means-tested contribution to the cost of day services they receive.",24 101 | Adult Social Services,Social Work and Safeguarding,,11800000,2009,"The council spends £11.8 million meeting responsibilities in relation to social care and safeguarding of adults. Includes: 102 | 103 | - ensuring that vulnerable adults are properly safeguarded from harm and abuse 104 | - assessing those with social care needs 105 | - determining, in conjunction with the individual concerned, appropriate social care for those with 'substantial and critical' social care needs 106 | - regularly reviewing those currently receiving social care to ensure the support provided is appropriate to their needs 107 | - providing expert, professional support to those with complex social care needs. 
108 | 109 | In a typical year this involves: 110 | 111 | - dealing with over 20,000 contacts in relation to adult social care matters 112 | - carrying out 2,250 assessments for those presenting with social care needs 113 | - reviewing the needs of 6,000 people with existing social care packages 114 | - managing 420 safeguarding referrals 115 | - delivering expert, professional support to 1,200 people with complex social care needs.",25 116 | Adult Social Services,Direct Payments,,6900000,2009,"Residents have the option of choosing to manage their own care needs through the provision by the council of a Direct Payment. Direct Payments are closely monitored by the council to ensure they are used by individuals to meet their social care needs. For instance, they could be used to hire a personal assistant who would support them in the provision of their personal care needs or to purchase items of equipment that help to maintain their independence. Last year, 783 people received Direct Payments from the council to purchase social care.",26 117 | Adult Social Services,Prevention - Keeping people Independent and Carer Support,,3700000,2009,"The council provide a range of services designed to enable people to maintain their independence and prevent their conditions and situations from deteriorating to the point they become dependent and require intensive social care interventions. Includes 118 | 119 | - support (such as advice and respite care) to the estimated 30,000 unpaid carers within Barnet who provide informal care to family members with social care needs 120 | - 41 voluntary sector organisations received grants to deliver a range of support, advocacy and advice to vulnerable people within the borough 121 | - 1,343 residents received equipment and home adaptations during 2009/10 to enable them to keep their independence.",27 122 | Corporate Services,Central Expenses,,55300000,2009,"Central Expenses include a range of costs that cover the council as a whole. 
It includes: 123 | 124 | - employers cost of supporting Unison, the main trade union for council staff 125 | - corporate subscriptions to national and local organisations such as the Local Government Association and other networks, research and support institutions 126 | - levies and payments to a range of cross-council/London-wide organisations such as Transport for London, the North London Waste Authority, and Coroners Court 127 | - financing costs of the council's investment portfolio 128 | - contributions to the pension fund of early retirements made to improve the on-going efficiency of the organisation 129 | - external audit costs.",28 130 | Corporate Services,IT,,7000000,2009,"This includes the IT infrastructure which is the council's network of connected data centres, computers, telephones and printers. Over recent years this has been extended to enable staff to work more flexibly and efficiently using wireless networks in the main council offices and secure remote access from non-council locations. This infrastructure cost also includes the cost of security measures such as anti-virus software and internet filtering, testing to ensure that the network cannot be compromised by hackers. The majority of infrastructure services are delivered and supported by 2e2, whose staff work as part of the council's IS team. 131 | 132 | Business Systems - the council's core business system is SAP, a single integrated system supporting financial accounting, procurement, human resources, asset maintenance and customer relationship management. Our SAP system is provided through a contract with Logica who host and maintain the service. A number of other systems support specific council services, such as Revenues and Benefits, Adults and Children's Services, Planning, and Libraries. 
IT also provides day to day in house support and IT helpdesk to council staff (£1.6m).",29 133 | Corporate Services,Tax Collection and Finance,,5100000,2009,"Processes and pays Housing and Council Tax benefits and deals with all enquiries about payments. Includes: 134 | Student Finance (to be moved to the Student Finance Company) 135 | Debt collection (chases non-payment of council tax and business rates). At the end of each financial year, Barnet has collected over 96% of Council Tax payments. Debt collection eventually raises this to around 99% of tax payments. 136 | 137 | The service also includes security vans for emptying parking meters and cashiers.",30 138 | Corporate Services,HR,,3300000,2009,"- provides traded support to schools in the borough and Barnet Homes. Income generated supports the overall HR budget, offsetting the costs of service delivery to the core council; ensuring HR remains a cost-efficient service function. 139 | 140 | - employee relations: support for managers on employment law, policies and procedures, supports informal meetings, manages formal hearings and engages with trade union representatives on individual cases 141 | - payroll - the council HR service manages payroll for around 17,000 people 142 | - manages third-party deductions from employee pay, e.g. PAYE, union subscriptions etc., including prompt payment and reconciliation 143 | - pensions service. Administers pension fund 144 | - recruitment.",31 145 | Corporate Services,Finance,,2600000,2009,"Includes central accountancy staff, financial monitoring, administering pension fund, payments for services and invoicing.",32 146 | Corporate Services,Management and Business Support,,900000,2009,"Management and business support covers the strategic management costs of Corporate Services including directorate covering HR, IS, Customer Services and Revenues and Benefits. 
This includes the performance management and strategic planning costs of those services.",33 147 | Corporate Services,Customer Services,,2000000,2009,"Corporate Customer Services (£1.1m) provides a professional customer service for the residents of Barnet. It manages all the main corporate reception points across the borough and telephone contact centres for Street Based Services, Planning and the council's switchboard. The team works across telephone, email, letters, web and face to face meetings as appropriate. 148 | 149 | This service will expand over the next year to take in more customer-facing functions. 150 | 151 | Customer Services also manages certain cross-council telephone costs including Sign Video to ensure our deaf customers have equal access to information and services. 152 | 153 | Burnt Oak Customer Access Centre (£0.4m), opened in October 2008, offers a number of blended services to our customers, including Housing Benefits, Libraries, CAB and now HM Revenue and Customs and Job Centre Plus. 154 | 155 | The Barnet Registration and Nationality Service (£0.5m) provides customers with the following services: the registration of all births, deaths and still births occurring within the borough of Barnet. 156 | Safe custody of all historic records of births, deaths and marriages dating back to 1837 and issues of certified copies from these records on demand. 157 | Citizenship ceremonies for 3,000 residents each year. 158 | 159 | Conducting and registering all civil marriages and civil partnerships taking place in the borough.",34 160 | Chief Executive's Service,,,2800000,2009,"Supports the Chief Executive, develops policy and provides support across council departments on: 161 | 162 | - research and insight into changes in the borough 163 | - media and communications 164 | - performance monitoring 165 | - design and publications 166 | - the council website 167 | - equal opportunities 168 | - internal communications. 
169 | 170 | The service also includes the Mayor's Office and civic events.",35 171 | Corporate Governance,Democratic Services and Elections,,3100000,2009,"The team co-ordinates and is responsible for the production and publication (including online) of all committee and other decision-making papers and providing all necessary support at formal meetings including preparation and publication of the minutes. It also plays a key role in supporting the scrutiny of council services and policies They also support and co-ordinate arrangements for residents forums and a range of regulatory and other hearings. 172 | The Electoral Registration Office compiles the register that shows who is eligible to vote in council, Parliamentary, European, GLA and London Mayoral elections. The Elections Project Team supports the Returning Officer in organising elections within Barnet.",36 173 | Corporate Governance,Legal Services,,2400000,2009,"The Legal Service is the Council's in-house solicitors' practice, providing legal advice and support to Members and other internal services. This includes child and adult protection, community care, prosecutions, employment, planning, contracts, procurement and property entries.",37 174 | Corporate Governance,Internal Assurance,,1300000,2009,"Internal Assurance includes: 175 | 176 | - the risk management processes of the council, assessing the potential financial risks of the authority and identifying which can be mitigated through external insurance and what can be covered internally. 
The team processes claims against external insurance cover and deals with the handling of all claims made by third parties against the council 177 | - the Corporate Anti Fraud Team (CAFT), headed by a specialist investigative unit which investigates allegations of Housing Benefit, Council Tax Benefit and general fraud within the borough 178 | - the Emergency Planning team which ensures that the council meets local needs in the event of significant national or local incidents.",38 179 | Corporate Governance,Management and Business Support,,600000,2009,"Includes: the management team of Corporate Governance , the Governance and the Service Development team which takes the corporate lead on information governance, ethical governance and complaints.",39 180 | Corporate Governance,Tax collection and Finance,,100000,2009,Insurance which includes the cost of managing and maintaining the Corporate Risk Register (a list of risks to the council and how it is working to prevent them from occuring or mitgate their impact).,40 181 | "Planning, Housing and Regeneration",Planning,,4600000,2009,"The Planning Service manages the use of land and buildings in the borough, the size and appearance of new development, their impact on neighbours and the local environment including conservation areas, listed buildings and trees. 182 | The planning authority processes over 4,000 planning applications each year and it is required to prepare a development plan, known as the Local Development Framework, which is used to guide development decisions about whether or not new uses and buildings are allowed. 183 | The planning department also manages an enforcement service which investigates breaches of planning control. Last year, the enforcement service investigated nearly 1,700 complaints. 
184 | 185 | Regeneration is responsible for delivering the regeneration of the West Hendon, Stonegrove and Grahame Park estates all of which are now on site and are now securing development partnerships to deliver the regeneration of Brent Cross Cricklewood, Dollis Valley estate and Granville Road estate, leveraging an estimated £7billion of private sector investment in the borough over the lifetime of these projects. 186 | 187 | Regeneration is also responsible for securing funding to deliver affordable housing across the borough. Since 2008, £169million has been awarded - enough to deliver 1,365 new affordable homes.",41 188 | "Planning, Housing and Regeneration",Assisting Homeless People,,11900000,2009,"The Housing Service works with people who need help with their housing. Last year, the service helped 1,421 households find new homes in the private rented sector, council housing and housing association properties.",42 189 | "Planning, Housing and Regeneration",Environmental Health,,3400000,2009,"The Environmental Health service works to protect local people from dangers to their health. In 2009/10, staff visited 1,208 food premises in the borough to make sure that the food offered there was safe. The service concentrates its efforts on the places that serve food to vulnerable people (children in nurseries and older people in residential care homes).",43 190 | "Planning, Housing and Regeneration",Leisure,,1900000,2009,"The majority of the leisure budgets are used to pay contract management fees to GLL. This contract is for the management of Barnet Copthall Leisure Centre, Finchley Lido, Burnt Oak Leisure Centre, Hendon Leisure Centre, Compton Leisure Centre, Queen Elizabeth's Sports Centre and Church Farm Swimming Pool. A second contract with GLL is for the management of the Barnet Copthall Stadium. 191 | 192 | The council also works in partnership with ACB (Age Concern Barnet). 
ACB delivers the Instructor-Led Walks programme which is the remaining function from the Fitness for Life scheme.",44 193 | "Planning, Housing and Regeneration",Procurement and Grants,,1900000,2009,"Corporate Procurement: This team works to understand spend activity across the council; to manage the vendor base of approximately 9,720 vendors; to ensure appropriate contracts are developed to support organisational needs; and to ensure effective contract monitoring and management across the organisation. It provides support and process quality assurance across the council and to drive out organisational inefficiency relating to procurement and strategic contract management. 194 | 195 | Corporate lead on commissioning services from the voluntary sector and managing grants.",45 196 | Commercial Services,Management and Business Suppor,,100000,2009,"Corporate Projects and Programmes: The council's corporate project and programmes themselves are capital funded however, the Corporate Programme Office and the responsibility for the delivery of all corporate projects and programmes sit within this area of Commercial Services. A pool of project and programme managers and project consultants work to ensure robust project governance and management, and the timely delivery of all corporate project and programme objectives to a high standard and within the agreed budget. 197 | 198 | The council's major transformational programme, One Barnet, is also being delivered through Commercial Services.",46 199 | Commercial Services,"Property, Public Office and Facilities",,10800000,2009,"The London Borough of Barnet owns an extremely varied portfolio of property assets valued in the region of £1.8 billion, including schools, housing, libraries, agricultural land, industrial units and museums. 
200 | 201 | The Council's Estates Department controls and manages these assets, working to ensure the best use of our assets for the people in Barnet.",47 202 | Libraries,,,6900000,2009,"Branch libraries - 2.7m visits a year. 203 | 16 branch libraries across the borough, including a library and customer service centre (Burnt Oak), and five Phase Three Children's Centres. 204 | 205 | Includes: 206 | 207 | - 492,158 books, 1,000 e-book, DVD, and audio loans (and advice) - 1,577,240 books loaned last year (annual spend on stock £835,000) 208 | - independent information and reference services, including on-line - 340,000 enquiries last year 209 | - early years literacy sessions including rhymetime and storytime 210 | - class visits, homework clubs; and host youth and younger people's activities 211 | - social/learning initiatives including bookclubs, reading groups, and coffee mornings 212 | - Bookstart schemes with Children's Centres in Chipping Barnet, Mill Hill, Church End, North Finchley, and Edgware 213 | 214 | Home and mobile library service: 215 | Includes: a housebound delivery service for residents who are helped to live at home; and a mobile library which has a number of scheduled stops across the borough. (£250,000) 216 | 217 | Museums and local studies. 
218 | Includes Church Farmhouse Museum, the Barnet Museum and local archives.",48 219 | -------------------------------------------------------------------------------- /loadkit/tests/fixtures/gpc-july-2014.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pudo-attic/loadkit/1fb17e69e2ffaf3dac4f40b574c3b7afb2198b7c/loadkit/tests/fixtures/gpc-july-2014.csv -------------------------------------------------------------------------------- /loadkit/tests/test_etl.py: -------------------------------------------------------------------------------- 1 | from moto import mock_s3 2 | from datetime import date 3 | from tempfile import mkdtemp 4 | 5 | from archivekit import open_collection, Source 6 | from loadkit import extract 7 | from loadkit.pipeline import Pipeline 8 | from loadkit.types.table import Table 9 | from loadkit.tests.util import CSV_FIXTURE, CSV_URL 10 | from loadkit.tests.util import GPC_FIXTURE 11 | 12 | 13 | @mock_s3 14 | def test_basic_api(): 15 | index = open_collection('test', 's3', bucket_name='test.mapthemoney.org') 16 | assert not len(list(index)), len(list(index)) 17 | 18 | package = index.create(manifest={'test': 'value'}) 19 | assert len(list(index)) == 1, len(list(index)) 20 | assert package.id is not None, package.id 21 | 22 | assert package.manifest['test'] == 'value' 23 | 24 | assert index.get(package.id) == package, index.get(package.id) 25 | 26 | 27 | @mock_s3 28 | def test_extract_file(): 29 | index = open_collection('test', 's3', bucket_name='test.mapthemoney.org') 30 | package = index.create() 31 | src = extract.from_file(package, CSV_FIXTURE) 32 | assert src is not None, src 33 | 34 | sources = list(package.all(Source)) 35 | assert len(sources) == 1, sources 36 | 37 | artifacts = list(package.all(Table)) 38 | assert len(artifacts) == 0, artifacts 39 | 40 | assert 'barnet-2009.csv' in src.path, src 41 | 42 | 43 | def test_extract_url(): 44 | path = mkdtemp() 45 | index = 
open_collection('test', 'file', path=path) 46 | package = index.create() 47 | src = extract.from_url(package, CSV_URL) 48 | assert src is not None, src 49 | 50 | assert 'barnet-2009.csv' in src.path, src 51 | 52 | 53 | @mock_s3 54 | def test_parse_with_dates(): 55 | index = open_collection('test', 's3', bucket_name='test.mapthemoney.org') 56 | package = index.create() 57 | extract.from_file(package, GPC_FIXTURE) 58 | pipeline = Pipeline(index, 'foo', { 59 | 'process': { 60 | 'table': { 61 | 'operator': 'table_extract' 62 | } 63 | } 64 | }) 65 | pipeline.process_package(package) 66 | 67 | artifacts = list(package.all(Table)) 68 | assert len(artifacts) == 1, artifacts 69 | artifact = artifacts[0] 70 | assert artifact.name == 'table.json' 71 | recs = list(artifact.records()) 72 | assert len(recs) == 23, len(recs) 73 | assert isinstance(recs[0]['transaction_date'], date) 74 | -------------------------------------------------------------------------------- /loadkit/tests/util.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | 4 | from boto.s3.connection import S3Connection, S3ResponseError 5 | from boto.s3.connection import Location 6 | 7 | logging.basicConfig(level=logging.INFO) 8 | 9 | FIXTURES = os.path.join(os.path.dirname(__file__), 'fixtures') 10 | CSV_FIXTURE = os.path.join(FIXTURES, 'barnet-2009.csv') 11 | CSV_URL = 'https://raw.githubusercontent.com/okfn/dpkg-barnet/master/barnet-2009.csv' 12 | GPC_FIXTURE = os.path.join(FIXTURES, 'gpc-july-2014.csv') 13 | 14 | 15 | def get_bucket(bucket_name='tests.mapthemoney.org'): 16 | conn = S3Connection(os.environ.get('AWS_KEY_ID'), 17 | os.environ.get('AWS_SECRET')) 18 | 19 | try: 20 | return conn.get_bucket(bucket_name) 21 | except S3ResponseError, se: 22 | if se.status == 404: 23 | return conn.create_bucket(bucket_name, 24 | location=Location.EU) 25 | -------------------------------------------------------------------------------- 
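The nested dict handed to `Pipeline` in `test_parse_with_dates` above names one stage (`table`) and the operator it runs (`table_extract`, registered under the `loadkit.operators` entry point in `setup.py`). The CLI reads the same structure from a YAML file (compare `example.yaml` in the repository root); assuming a direct dict-to-YAML mapping, the equivalent file would look like:

```yaml
# Hypothetical YAML form of the config dict from test_parse_with_dates;
# 'table' is the stage name, 'table_extract' the registered operator.
process:
  table:
    operator: table_extract
```

Further stages (e.g. `normalize` or `regex` operators) would be added as additional keys under `process`.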
/loadkit/types/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pudo-attic/loadkit/1fb17e69e2ffaf3dac4f40b574c3b7afb2198b7c/loadkit/types/__init__.py -------------------------------------------------------------------------------- /loadkit/types/logfile.py: -------------------------------------------------------------------------------- 1 | from archivekit import Resource 2 | 3 | 4 | class LogFile(Resource): 5 | """ A log file is a snippet of Python logging, preserved in the 6 | bucket. """ 7 | 8 | GROUP = 'logs' 9 | 10 | def __repr__(self): 11 | return '<LogFile(%r)>' % self.name 12 | -------------------------------------------------------------------------------- /loadkit/types/stage.py: -------------------------------------------------------------------------------- 1 | from archivekit import Resource 2 | 3 | 4 | class Stage(Resource): 5 | """ Stages are the intermediate document types produced by 6 | a processing pipeline. """ 7 | 8 | GROUP = 'stages' 9 | 10 | def __repr__(self): 11 | return '<Stage(%r)>' % self.name 12 | -------------------------------------------------------------------------------- /loadkit/types/table.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import json 3 | import tempfile 4 | import shutil 5 | from contextlib import contextmanager 6 | 7 | from archivekit import Resource 8 | from loadkit.util import json_default, json_hook 9 | 10 | log = logging.getLogger(__name__) 11 | 12 | 13 | class Table(Resource): 14 | """ The table holds a temporary, cleaned representation of the 15 | package resource (as a newline-separated set of JSON 16 | documents). """ 17 | 18 | GROUP = 'tables' 19 | 20 | @contextmanager 21 | def store(self): 22 | """ Create a context manager to store records in the cleaned 23 | table. 
""" 24 | output = tempfile.NamedTemporaryFile(suffix='.json') 25 | try: 26 | 27 | def write(o): 28 | line = json.dumps(o, default=json_default) 29 | return output.write(line + '\n') 30 | 31 | yield write 32 | 33 | output.seek(0) 34 | log.info("Uploading generated table (%s)...", self._obj) 35 | self.save_file(output.name, destructive=True) 36 | finally: 37 | try: 38 | output.close() 39 | except: 40 | pass 41 | 42 | def records(self): 43 | """ Get each record that has been stored in the table. """ 44 | output = tempfile.NamedTemporaryFile(suffix='.json') 45 | try: 46 | log.info("Loading table from (%s)...", self._obj) 47 | shutil.copyfileobj(self.fh(), output) 48 | output.seek(0) 49 | 50 | for line in output.file: 51 | yield json.loads(line, object_hook=json_hook) 52 | 53 | finally: 54 | try: 55 | output.close() 56 | except: 57 | pass 58 | 59 | def __repr__(self): 60 | return '' % self.name 61 | -------------------------------------------------------------------------------- /loadkit/util.py: -------------------------------------------------------------------------------- 1 | from decimal import Decimal 2 | from datetime import datetime, date 3 | 4 | 5 | class LoadKitException(Exception): 6 | pass 7 | 8 | 9 | class ConfigException(LoadKitException): 10 | pass 11 | 12 | 13 | def json_default(obj): 14 | if isinstance(obj, datetime): 15 | obj = obj.isoformat() 16 | if isinstance(obj, Decimal): 17 | obj = float(obj) 18 | if isinstance(obj, date): 19 | return 'loadKitDate(%s)' % obj.isoformat() 20 | return obj 21 | 22 | 23 | def json_hook(obj): 24 | for k, v in obj.items(): 25 | if isinstance(v, basestring): 26 | try: 27 | obj[k] = datetime.strptime(v, "loadKitDate(%Y-%m-%d)").date() 28 | except ValueError: 29 | pass 30 | try: 31 | obj[k] = datetime.strptime(v, "%Y-%m-%dT%H:%M:%S") 32 | except ValueError: 33 | pass 34 | return obj 35 | -------------------------------------------------------------------------------- /setup.py: 
-------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | setup( 5 | name='loadkit', 6 | version='0.3.4', 7 | description="Light-weight tools for ETL", 8 | long_description="", 9 | classifiers=[ 10 | "Development Status :: 3 - Alpha", 11 | "Intended Audience :: Developers", 12 | "License :: OSI Approved :: MIT License", 13 | "Operating System :: OS Independent", 14 | 'Programming Language :: Python :: 2.6', 15 | 'Programming Language :: Python :: 2.7' 16 | ], 17 | keywords='etl data pipeline tabular csv', 18 | author='Friedrich Lindenberg', 19 | author_email='friedrich@pudo.org', 20 | url='http://github.com/pudo/loadkit', 21 | license='MIT', 22 | packages=find_packages(exclude=['ez_setup', 'examples', 'test']), 23 | namespace_packages=[], 24 | package_data={}, 25 | include_package_data=True, 26 | zip_safe=False, 27 | test_suite='nose.collector', 28 | install_requires=[ 29 | 'SQLAlchemy>=0.9.8', 30 | 'messytables>=0.2.1', 31 | 'requests>=2.5.1', 32 | 'archivekit>=0.5', 33 | 'click>=3.2', 34 | 'normality>=0.1' 35 | ], 36 | tests_require=[], 37 | entry_points={ 38 | 'archivekit.resource_types': [ 39 | 'table = loadkit.types.table:Table', 40 | 'logfile = loadkit.types.logfile:LogFile', 41 | 'stage = loadkit.types.stage:Stage' 42 | ], 43 | 'loadkit.operators': [ 44 | 'text_extract = loadkit.operators.text:TextExtractOperator', 45 | 'table_extract = loadkit.operators.table:TableExtractOperator', 46 | 'normalize = loadkit.operators.normalize:NormalizeOperator', 47 | 'regex = loadkit.operators.regex:RegExOperator', 48 | 'ingest = loadkit.operators.ingest:IngestOperator' 49 | ], 50 | 'console_scripts': [ 51 | 'loadkit = loadkit.cli:cli' 52 | ] 53 | } 54 | ) 55 | --------------------------------------------------------------------------------
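The `json_default`/`json_hook` pair in `loadkit/util.py` is what lets `Table.store()` and `Table.records()` round-trip `date` values through the newline-delimited JSON table: bare dates are serialized as a `loadKitDate(...)` sentinel string and recognized again on load, while datetimes become plain ISO strings. A self-contained sketch of that round-trip, ported to Python 3 (`basestring` replaced by `str`; the `'paid'`/`'created'` keys are invented for illustration):

```python
import json
from decimal import Decimal
from datetime import datetime, date


def json_default(obj):
    # Called by json.dumps for values it cannot serialize. datetimes are
    # converted to ISO strings first; since datetime subclasses date, this
    # keeps them out of the sentinel branch below.
    if isinstance(obj, datetime):
        obj = obj.isoformat()
    if isinstance(obj, Decimal):
        obj = float(obj)
    if isinstance(obj, date):
        # Bare dates get a sentinel wrapper so json_hook can spot them.
        return 'loadKitDate(%s)' % obj.isoformat()
    return obj


def json_hook(obj):
    # Called by json.loads for each decoded object: undo the encoding by
    # probing every string value against both formats.
    for k, v in obj.items():
        if isinstance(v, str):
            try:
                obj[k] = datetime.strptime(v, "loadKitDate(%Y-%m-%d)").date()
            except ValueError:
                pass
            try:
                obj[k] = datetime.strptime(v, "%Y-%m-%dT%H:%M:%S")
            except ValueError:
                pass
    return obj


# Round-trip one record the way Table.store()/Table.records() do.
row = {'paid': date(2014, 7, 1), 'created': datetime(2014, 7, 1, 12, 30)}
line = json.dumps(row, default=json_default)
record = json.loads(line, object_hook=json_hook)
assert record['paid'] == date(2014, 7, 1)
assert record['created'] == datetime(2014, 7, 1, 12, 30)
```

One limitation worth noting: the parse format carries no `%f` field, so a datetime with non-zero microseconds serializes fine but comes back from `json_hook` as a plain string.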