├── .gitignore ├── LICENSE ├── README.md ├── import_all.py ├── import_zip.py ├── ncd ├── __init__.py ├── athena.py ├── athena_mock.py ├── data_zip.py ├── global_file.py ├── lookup_table.py └── normal_table.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | desktop.ini 3 | __pycache__ 4 | source/ 5 | tables/ 6 | ddl/ 7 | tmp/ 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | ISC License 2 | 3 | Copyright (c) 2017-2018, Associated Press 4 | 5 | Permission to use, copy, modify, and/or distribute this software for any 6 | purpose with or without fee is hereby granted, provided that the above 7 | copyright notice and this permission notice appear in all copies. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH 10 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY 11 | AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, 12 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM 13 | LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE 14 | OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR 15 | PERFORMANCE OF THIS SOFTWARE. 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # National Caseload Data # 2 | 3 | Ingest script to take the Department of Justice's 4 | [National Caseload Data][ncd], which covers cases handled by U.S. Attorneys, 5 | and load it into [Athena][] for querying. 6 | 7 | (Want to use a more typical database, such as PostgreSQL? Look in the 8 | `sqlalchemy` branch.) 9 | 10 | [Athena]: https://aws.amazon.com/athena/ 11 | [ncd]: https://www.justice.gov/usao/resources/foia-library/national-caseload-data 12 | 13 | ## The files ## 14 | 15 | The DOJ website provides data dumps of the entire database as of the end of 16 | each month and as of the end of each fiscal year (ends Sept. 30). DOJ only 17 | retains the three most recent monthly dumps, and there's usually a lag of a 18 | month or so. 19 | 20 | Each dump is cumulative, so the November 2017 dump includes all of the cases 21 | from the October 2017 dump, plus whatever was added in November. 22 | 23 | Each dump is split into a number of zip files (24 as of the December 2017 24 | release). Each zip file has one of three structures: 25 | 26 | * **Normal tables:** 27 | 28 | Most of the NCD database tables are stored as fixed-width text files, with 29 | their schemas described in the zip file's `README.TXT`. 30 | 31 | For example, the `GS_COURT_HIST` table's contents are in a file called 32 | `gs_court_hist.txt` within one of the zip files, and that zip file's 33 | `README.TXT` will contain the schema for that table (and others contained 34 | in the zip). 35 | 36 | If a table is particularly large, its contents will be split into several 37 | text files--one for each [district][]--and distributed among several zip 38 | files. 
39 | 40 | For example, the `GS_PARTICIPANT` table's contents are split into files 41 | such as `gs_participant_FLM.txt` for the Middle District of Florida and 42 | `gs_participant_CT` for the District of Connecticut, and these might live 43 | in separate zip files, each with the `GS_PARTICIPANT` schema in its 44 | `README.TXT`. 45 | 46 | * **Codebooks:** 47 | 48 | The last file in a given dump will contain codebooks that can be useful in 49 | interpreting the contents of the normal tables. These come in two forms, 50 | which this code describes as: 51 | 52 | * **Lookup tables:** 53 | 54 | A lookup table consists of one fixed-width text file containing some 55 | metadata followed by a row of column headers and a separator row of 56 | hyphens (useful for determining column widths). 57 | 58 | These filenames start with `table_`; for example, the `GS_POSITION` 59 | table is in a file called `table_gs_position.txt`. 60 | 61 | * **Global tables:** 62 | 63 | One file called `global_LIONS.txt` contains several distinct 64 | fixed-width tables all stacked on top of one another. 65 | 66 | This is as painful as it sounds. 67 | 68 | * **Documentation:** 69 | 70 | The second-to-last file in a given dump usually has no specific data in it; 71 | it just contains logs, statistics files and other semi-documentation. 72 | 73 | [district]: https://en.wikipedia.org/wiki/United_States_federal_judicial_district 74 | 75 | ## Ingest architecture ## 76 | 77 | There are two scripts in the root of this repo: 78 | 79 | * `import_zip.py`, which takes one already-downloaded NCD component zip file, 80 | converts it to [gzipped][athena-compression] [JSON][athena-json] for 81 | Athena, uploads it to S3 and creates the appropriate Athena tables. 82 | 83 | * `import_all.py` is an experimental script that takes a URL to a dump's 84 | landing page ([such as this one][dump_fy_2017]) and asynchronously 85 | processes the zip files listed there. (This is meant to make it easier to 86 | invoke automatically and to allow one file to be processed while another 87 | downloads. Still working out some of the kinks there, though.) 
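For reference, a single run of `import_zip.py` boils down to the following steps, shown
here as a minimal Python sketch (the bucket names, prefix and zip filename are
hypothetical placeholders):

```python
from ncd.athena import Athena
from ncd.data_zip import DataZip

# Hypothetical AWS resource names -- substitute your own buckets and prefix.
athena = Athena(
    data_bucket='my-ncd-data', results_bucket='my-ncd-results',
    s3_prefix='ncd', db_name='ncd')
athena.create_db()

# Convert one downloaded NCD zip to gzipped JSON, upload it to S3 and
# create the matching Athena tables.
DataZip('gs_case.zip', athena).load()
```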
88 | 89 | [athena-compression]: https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html 90 | [athena-json]: https://docs.aws.amazon.com/athena/latest/ug/json.html 91 | [dump_fy_2017]: https://www.justice.gov/usao/resources/foia-library/national-caseload-data/2017 92 | -------------------------------------------------------------------------------- /import_all.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from argparse import ArgumentParser 3 | import logging 4 | import sys 5 | from tempfile import NamedTemporaryFile 6 | from urllib.parse import urlsplit, urlunsplit 7 | 8 | from lxml import etree 9 | import requests 10 | 11 | from ncd.athena import Athena as Athena 12 | from ncd.data_zip import DataZip 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.DEBUG) 17 | ch = logging.StreamHandler() 18 | ch.setLevel(logging.DEBUG) 19 | formatter = logging.Formatter( 20 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 21 | ch.setFormatter(formatter) 22 | logger.addHandler(ch) 23 | 24 | 25 | parser = ArgumentParser(description='Load a month of National Caseload Data.') 26 | parser.add_argument( 27 | '--data-bucket', help='S3 bucket name for data files', required=True) 28 | parser.add_argument( 29 | '--results-bucket', help='S3 bucket name for query results', required=True) 30 | parser.add_argument( 31 | '--s3-prefix', help='Prefix for data file paths on S3', required=True) 32 | parser.add_argument('--db-name', help='Database name on Athena', required=True) 33 | parser.add_argument( 34 | 'file_listing_url', 35 | help='URL to a DOJ page of yearly or monthly data files') 36 | 37 | 38 | def change_url_scheme(url, new_scheme): 39 | """Change a URL from, say, HTTP to HTTPS. 40 | 41 | Args: 42 | url: A string URL to modify. 43 | new_scheme: A string with which to replace the original URL's scheme 44 | (everything before the :// portion). 45 | 46 | Returns: 47 | A string URL. 48 | """ 49 | url_parts = urlsplit(url) 50 | return urlunsplit((new_scheme, *url_parts[1:])) 51 | 52 | 53 | def get_file_urls(file_listing_url): 54 | """Determine which URLs need to be downloaded. 55 | 56 | Args: 57 | file_listing_url: A string URL to a page on the DOJ site. 58 | 59 | Returns: 60 | A tuple of string URLs to individual zip files. 61 | """ 62 | r = requests.get(file_listing_url) 63 | raw_html = r.text 64 | html = etree.HTML(raw_html) 65 | links = html.cssselect('a[href$=".zip"]') 66 | return tuple(map( 67 | lambda link: change_url_scheme(link.attrib['href'], 'https'), 68 | links)) 69 | 70 | 71 | def load_file_from_url(zip_file_url, athena): 72 | """Download a data file and load it into a database. 73 | 74 | Args: 75 | zip_file_url: A string URL to an NCD data file. 76 | athena: An ncd.Athena to use when storing the file. 
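
    Example (illustrative only; the URL and AWS resource names are
    hypothetical):

        athena = Athena(
            data_bucket='my-ncd-data', results_bucket='my-ncd-results',
            s3_prefix='ncd', db_name='ncd')
        load_file_from_url(
            'https://www.justice.gov/.../gs_case.zip', athena)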
77 | """ 78 | zip_file_basename = zip_file_url.split('/')[-1] 79 | logger.debug('About to download {0}'.format(zip_file_basename)) 80 | with NamedTemporaryFile() as zip_file: 81 | chunk_size = 32768 82 | r = requests.get(zip_file_url, stream=True) 83 | logger.debug('Saving {0} to {1}'.format( 84 | zip_file_basename, zip_file.name)) 85 | for chunk in r.iter_content(chunk_size=chunk_size): 86 | zip_file.write(chunk) 87 | logger.debug('Finished saving {0} to {1}'.format( 88 | zip_file_basename, zip_file.name)) 89 | zip_file.seek(0) 90 | 91 | logger.debug('Saving {0} to Athena'.format(zip_file_basename)) 92 | DataZip(zip_file.name, athena).load() 93 | logger.debug('Completed {0}'.format(zip_file_basename)) 94 | 95 | 96 | def main(raw_args): 97 | args = parser.parse_args(raw_args) 98 | 99 | athena = Athena( 100 | data_bucket=args.data_bucket, results_bucket=args.results_bucket, 101 | s3_prefix=args.s3_prefix, db_name=args.db_name) 102 | athena.create_db() 103 | 104 | file_urls = get_file_urls(args.file_listing_url) 105 | logger.info('Found {0} files to download'.format(len(file_urls))) 106 | 107 | for file_url in file_urls: 108 | load_file_from_url(file_url, athena) 109 | 110 | 111 | if __name__ == '__main__': 112 | main(sys.argv[1:]) 113 | -------------------------------------------------------------------------------- /import_zip.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from argparse import ArgumentParser 3 | import logging 4 | import sys 5 | 6 | from ncd.athena import Athena 7 | from ncd.data_zip import DataZip 8 | 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.DEBUG) 12 | ch = logging.StreamHandler() 13 | ch.setLevel(logging.DEBUG) 14 | formatter = logging.Formatter( 15 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 16 | ch.setFormatter(formatter) 17 | logger.addHandler(ch) 18 | 19 | 20 | parser = ArgumentParser(description='Load a file of National Caseload Data.') 21 | parser.add_argument( 22 | '--data-bucket', help='S3 bucket name for data files', required=True) 23 | parser.add_argument( 24 | '--results-bucket', help='S3 bucket name for query results', required=True) 25 | parser.add_argument( 26 | '--s3-prefix', help='Prefix for data file paths on S3', required=True) 27 | parser.add_argument('--db-name', help='Database name on Athena', required=True) 28 | parser.add_argument('zip_path', help='Path to a zip file from NCD') 29 | 30 | 31 | def main(raw_args): 32 | args = parser.parse_args(raw_args) 33 | athena = Athena( 34 | data_bucket=args.data_bucket, results_bucket=args.results_bucket, 35 | s3_prefix=args.s3_prefix, db_name=args.db_name) 36 | athena.create_db() 37 | DataZip(args.zip_path, athena).load() 38 | 39 | 40 | if __name__ == '__main__': 41 | main(sys.argv[1:]) 42 | -------------------------------------------------------------------------------- /ncd/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/associatedpress/national-caseload-data-ingest/f657719e96129dc578ec7c4c5e484781df34b213/ncd/__init__.py -------------------------------------------------------------------------------- /ncd/athena.py: -------------------------------------------------------------------------------- 1 | from io import BytesIO, TextIOWrapper 2 | import logging 3 | import posixpath 4 | from time import sleep 5 | 6 | import boto3 7 | 8 | 9 | logger = logging.getLogger(__name__) 10 | logger.setLevel(logging.DEBUG) 11 | ch = 
logging.StreamHandler() 12 | ch.setLevel(logging.DEBUG) 13 | formatter = logging.Formatter( 14 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 15 | ch.setFormatter(formatter) 16 | logger.addHandler(ch) 17 | 18 | 19 | class Athena(object): 20 | """Helper for Athena I/O. 21 | 22 | Args: 23 | data_bucket: A string name of an S3 bucket where we should store data 24 | files. 25 | results_bucket: A string name of an S3 bucket where Athena should store 26 | CSVs of query results. 27 | s3_prefix: A string prefix to use for data files' S3 keys. 28 | db_name: A string database name to use when creating and querying 29 | tables. 30 | """ 31 | def __init__( 32 | self, data_bucket=None, results_bucket=None, s3_prefix=None, 33 | db_name=None): 34 | self._athena = boto3.client('athena') 35 | self._s3 = boto3.resource('s3') 36 | 37 | self.data_bucket = data_bucket 38 | self.results_bucket = results_bucket 39 | self.s3_prefix = s3_prefix 40 | self.db_name = db_name 41 | 42 | self.logger = logger.getChild('Athena') 43 | 44 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 45 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 46 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 47 | 48 | def create_db(self): 49 | """Create the database we want to use. 50 | """ 51 | self.logger.debug('Ensuring Athena database exists') 52 | self.execute_query( 53 | 'CREATE DATABASE IF NOT EXISTS {0};'.format(self.db_name), 54 | 'default') 55 | self.logger.debug('CREATE DATABASE query completed') 56 | 57 | def execute_query(self, sql_string, db_name=None): 58 | """Execute a query on Athena. 59 | 60 | Args: 61 | sql_string: A string SQL query to execute. 62 | 63 | Returns: 64 | A StringIO of CSV output from Athena. 65 | """ 66 | db_for_query = self.db_name if db_name is None else db_name 67 | start_response = self._athena.start_query_execution( 68 | QueryString=sql_string, 69 | QueryExecutionContext={'Database': db_for_query}, 70 | ResultConfiguration={ 71 | 'OutputLocation': 's3://{results_bucket}/{s3_prefix}'.format( 72 | results_bucket=self.results_bucket, 73 | s3_prefix=self.s3_prefix) 74 | }) 75 | 76 | query_execution_id = start_response['QueryExecutionId'] 77 | self.logger.debug('Started query ID {0}'.format(query_execution_id)) 78 | 79 | return self._results_for_query(query_execution_id) 80 | 81 | def prefix_for_table(self, table_name): 82 | """Create a full prefix for S3 keys for a given table. 83 | 84 | Args: 85 | table_name: A string table name. 86 | 87 | Returns: 88 | A string S3 prefix. 89 | """ 90 | return posixpath.join(self.s3_prefix, self.db_name, table_name) 91 | 92 | def upload_data(self, table_name, file_obj, district=None): 93 | """Upload the given data to S3. 94 | 95 | Args: 96 | table_name: A string table name. 97 | file_obj: A binary file-like object. 98 | district: An optional string code for a federal judicial district; 99 | provide this when DOJ splits up a table by district. 
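
        Example key layout (hypothetical names, with s3_prefix='ncd' and
        db_name='ncd'):

            upload_data('GS_PARTICIPANT', fh, district='FLM') uploads to
                s3://<data_bucket>/ncd/ncd/GS_PARTICIPANT/filename_district=FLM/GS_PARTICIPANT-FLM.json.gz
            upload_data('GS_COURT_HIST', fh) uploads to
                s3://<data_bucket>/ncd/ncd/GS_COURT_HIST/GS_COURT_HIST.json.gz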
100 | """ 101 | if district: 102 | s3_key = posixpath.join( 103 | self.prefix_for_table(table_name), 104 | 'filename_district={0}'.format(district), 105 | '{0}-{1}.json.gz'.format(table_name, district)) 106 | else: 107 | s3_key = posixpath.join( 108 | self.prefix_for_table(table_name), 109 | '{0}.json.gz'.format(table_name)) 110 | file_obj.seek(0) 111 | self._s3.Bucket(self.data_bucket).upload_fileobj(file_obj, s3_key) 112 | self.logger.debug('Uploaded file to s3://{0}/{1}'.format( 113 | self.data_bucket, s3_key)) 114 | 115 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 116 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 117 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 118 | 119 | def _results_for_query(self, query_execution_id): 120 | """Retrieve the results for the given query. 121 | 122 | Args: 123 | query_execution_id: A string execution ID. 124 | 125 | Returns: 126 | A text file-like object of CSV output from Athena. 127 | """ 128 | result_bucket, result_key = self._wait_for_result(query_execution_id) 129 | results_bytes = BytesIO() 130 | self.logger.debug('Downloading results for query ID {0}'.format( 131 | query_execution_id)) 132 | self._s3.Bucket(result_bucket).download_fileobj( 133 | result_key, results_bytes) 134 | self.logger.debug('Downloaded results for query ID {0}'.format( 135 | query_execution_id)) 136 | results_bytes.seek(0) 137 | results_text = TextIOWrapper(results_bytes, encoding='utf-8') 138 | return results_text 139 | 140 | def _wait_for_result(self, query_execution_id): 141 | """Wait for a query to complete. 142 | 143 | This method will block until the query has completed. 144 | 145 | Args: 146 | query_execution_id: A string execution ID. 147 | 148 | Returns: 149 | A tuple with two elements: 150 | * A string S3 bucket name where results are stored. 151 | * A string S3 key for the object containing results. 152 | """ 153 | try: 154 | sleep(0.5) 155 | while True: 156 | query_execution = self._athena.get_query_execution( 157 | QueryExecutionId=query_execution_id) 158 | query_state = query_execution['QueryExecution'][ 159 | 'Status']['State'] 160 | if query_state != 'RUNNING': 161 | break 162 | self.logger.debug('Waiting for results for query {0}'.format( 163 | query_execution_id)) 164 | sleep(5) 165 | 166 | output_location = query_execution['QueryExecution'][ 167 | 'ResultConfiguration']['OutputLocation'] 168 | location_components = output_location.split('/', maxsplit=3) 169 | 170 | return (location_components[2], location_components[3]) 171 | except BaseException as e: # Yes, really. 172 | self._athena.stop_query_execution( 173 | QueryExecutionId=query_execution_id) 174 | raise e 175 | -------------------------------------------------------------------------------- /ncd/athena_mock.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | import posixpath 3 | from pathlib import Path 4 | from shutil import copyfileobj 5 | 6 | 7 | class AthenaMock(object): 8 | """Mock for ncd.athena.Athena that saves to disk. 9 | 10 | Args: 11 | data_bucket: Ignored. 12 | results_bucket: Ignored. 13 | s3_prefix: A string base directory into which table data and queries 14 | will be saved. 15 | db_name: Ignored. 
16 | """ 17 | def __init__( 18 | self, data_bucket=None, results_bucket=None, s3_prefix=None, 19 | db_name=None): 20 | self.data_bucket = data_bucket 21 | self.results_bucket = results_bucket 22 | self.s3_prefix = s3_prefix 23 | self.db_name = db_name 24 | 25 | self._base_dir = Path(s3_prefix) 26 | 27 | self._query_dir = self._base_dir / 'queries' 28 | self._table_dir = self._base_dir / 'tables' 29 | 30 | self._query_dir.mkdir(parents=True, exist_ok=True) 31 | self._table_dir.mkdir(parents=True, exist_ok=True) 32 | 33 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 34 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 35 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 36 | 37 | def create_db(self): 38 | """No-op. 39 | """ 40 | return 41 | 42 | def execute_query(self, sql_string): 43 | """Save the given query to disk. 44 | 45 | Args: 46 | sql_string: A string SQL query to save. 47 | """ 48 | timestamp = datetime.now().isoformat() 49 | output_path = self._query_dir / '{0}.sql'.format(timestamp) 50 | with output_path.open('w') as output_file: 51 | output_file.write(sql_string) 52 | 53 | def prefix_for_table(self, table_name): 54 | """Create a full prefix for S3 keys for a given table. 55 | 56 | Args: 57 | table_name: A string table name. 58 | 59 | Returns: 60 | A string S3 prefix. 61 | """ 62 | return posixpath.join(str(self._base_dir), table_name) 63 | 64 | def upload_data(self, table_name, file_obj, district=None): 65 | """Save the given data to disk. 66 | 67 | Args: 68 | table_name: A string table name. 69 | file_obj: A binary file-like object. 70 | district: An optional string code for a federal judicial district; 71 | provide this when DOJ splits up a table by district. 72 | """ 73 | if district: 74 | table_dir = self._table_dir / table_name 75 | district_dir = table_dir / 'filename_district={0}'.format(district) 76 | output_path = district_dir / '{0}-{1}.json.gz'.format( 77 | table_name, district) 78 | else: 79 | output_path = self._table_dir / table_name / '{0}.json.gz'.format( 80 | table_name) 81 | output_path.parent.mkdir(parents=True, exist_ok=True) 82 | file_obj.seek(0) 83 | with output_path.open('wb') as output_file: 84 | copyfileobj(file_obj, output_file) 85 | 86 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 87 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 88 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 89 | 90 | # TODO: Add this. 91 | -------------------------------------------------------------------------------- /ncd/data_zip.py: -------------------------------------------------------------------------------- 1 | from csv import DictWriter 2 | from io import StringIO 3 | from itertools import starmap 4 | import logging 5 | import re 6 | from zipfile import ZipFile 7 | 8 | from ncd.global_file import GlobalFile 9 | from ncd.lookup_table import LookupTable 10 | from ncd.normal_table import NormalTable 11 | 12 | 13 | logger = logging.getLogger(__name__) 14 | logger.setLevel(logging.DEBUG) 15 | ch = logging.StreamHandler() 16 | ch.setLevel(logging.DEBUG) 17 | formatter = logging.Formatter( 18 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 19 | ch.setFormatter(formatter) 20 | logger.addHandler(ch) 21 | 22 | 23 | class DataZip(object): 24 | """Load all of the data from an NCD zip file to Athena. 25 | 26 | Args: 27 | zip_path: A string path to a zip file from NCD. 
28 |         athena: An ncd.Athena to use when accessing AWS.
29 |     """
30 | 
31 |     def __init__(self, zip_path=None, athena=None):
32 |         self._zip_path = zip_path
33 |         self._athena = athena
34 |         self.logger = logger.getChild('DataZip')
35 | 
36 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
37 |     # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=-
38 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
39 | 
40 |     def load(self):
41 |         """Load this file's tables into Athena."""
42 |         with ZipFile(self._zip_path, 'r') as zip_file:
43 |             logger.info('Opened input file {0}'.format(self._zip_path))
44 |             self._zip_file = zip_file
45 | 
46 |             normal_schemas = self._extract_normal_schemas()
47 | 
48 |             self._process_normal_tables(normal_schemas)
49 |             self._process_global_tables()
50 |             self._process_lookup_tables()
51 | 
52 |         self.logger.info('Done')
53 | 
54 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
55 |     # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=-
56 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
57 | 
58 |     def _extract_normal_schema(self, raw_fragment):
59 |         """Extract a normal table's schema from a README fragment.
60 | 
61 |         Args:
62 |             raw_fragment: A string fragment of the zip file's README.
63 | 
64 |         Returns:
65 |             A text file-like object with CSV schema information in the format
66 |             expected by csvkit's in2csv utility.
67 |         """
68 |         raw_field_specs = re.finditer(
69 |             (
70 |                 r'^(?P<field_name>[A-Z][^\s]+)\s+(?:NOT NULL)?\s+' +
71 |                 r'(?P<field_type>[A-Z][^\s]+)\s+' +
72 |                 r'\((?P<start_column>\d+):(?P<end_column>\d+)\)'),
73 |             raw_fragment, re.MULTILINE)
74 | 
75 |         def make_row(row):
76 |             start_column = int(row.group('start_column'))
77 |             end_column = int(row.group('end_column'))
78 |             return {
79 |                 'column': row.group('field_name'),
80 |                 'start': str(start_column),
81 |                 'length': str(end_column - start_column + 1),
82 |                 'field_type': row.group('field_type')
83 |             }
84 |         rows = map(make_row, raw_field_specs)
85 | 
86 |         field_names = ('column', 'start', 'length', 'field_type')
87 |         output_io = StringIO()
88 | 
89 |         writer = DictWriter(output_io, field_names)
90 |         writer.writeheader()
91 |         writer.writerows(rows)
92 | 
93 |         output_io.seek(0)
94 |         return output_io
95 | 
96 |     def _extract_normal_schemas(self):
97 |         """Extract schemas for normal data tables.
98 | 
99 |         Returns:
100 |             A dict with string table names as keys and text file-like objects
101 |             as values. Each value contains a CSV with schema information in the
102 |             format expected by csvkit's in2csv utility.
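
        Example (hypothetical table and field names): a README.TXT fragment
        such as

            GS_CASE - One record per case.
            CASEID       NOT NULL   NUMBER(10)    (1:10)
            DISTRICT     NOT NULL   VARCHAR2(3)   (11:13)
            FILED_DATE              DATE          (14:24)

        yields a dict mapping 'GS_CASE' to a file-like object whose CSV
        contents are:

            column,start,length,field_type
            CASEID,1,10,NUMBER(10)
            DISTRICT,11,3,VARCHAR2(3)
            FILED_DATE,14,11,DATE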
103 | """ 104 | with self._zip_file.open('README.TXT', 'r') as readme_file: 105 | readme = readme_file.read().decode('latin-1') 106 | 107 | schemas = {} 108 | 109 | table_names = re.findall(r'^([A-Z][^ ]+) - ', readme, re.MULTILINE) 110 | if not table_names: 111 | return schemas 112 | 113 | def get_table_start(table_name): 114 | start_match = re.search( 115 | r'^' + table_name + ' - ', readme, re.MULTILINE) 116 | return (table_name, start_match.start()) 117 | table_starts = tuple(map(get_table_start, table_names)) 118 | last_table_name = table_names[-1] 119 | 120 | def get_table_end(i, table_info): 121 | table_name, table_start = table_info 122 | if table_name == last_table_name: 123 | table_end = None 124 | else: 125 | table_end = table_starts[i + 1][1] 126 | return (table_name, table_start, table_end) 127 | table_info = tuple(starmap(get_table_end, enumerate(table_starts))) 128 | 129 | for table_name, table_start, table_end in table_info: 130 | readme_fragment = readme[table_start:table_end] 131 | schema = self._extract_normal_schema(readme_fragment) 132 | schemas[table_name] = schema 133 | 134 | return schemas 135 | 136 | def _process_global_tables(self): 137 | """Load this file's global (schemaless) tables into Athena.""" 138 | GlobalFile(self._zip_file, self._athena).load() 139 | 140 | def _process_lookup_tables(self): 141 | """Load this file's separate lookup tables into Athena.""" 142 | table_file_names = sorted(tuple(filter( 143 | lambda name: name.startswith('table_gs_'), 144 | self._zip_file.namelist()))) 145 | for file_name in table_file_names: 146 | with self._zip_file.open(file_name, 'r') as input_file: 147 | raw_content = input_file.read().decode('latin-1') 148 | LookupTable(raw_content, self._athena).load() 149 | 150 | def _process_normal_tables(self, schemas): 151 | """Load this file's normal tables into Athena. 152 | 153 | Args: 154 | schemas: A dict as returned by _extract_table_schemas. 155 | """ 156 | table_names = sorted(schemas.keys()) 157 | self.logger.info('Found {0} table schemas: {1}'.format( 158 | len(table_names), ', '.join(table_names))) 159 | for table_name in table_names: 160 | normal_table = NormalTable( 161 | name=table_name, zip_file=self._zip_file, 162 | schema_io=schemas[table_name], athena=self._athena) 163 | normal_table.load() 164 | -------------------------------------------------------------------------------- /ncd/global_file.py: -------------------------------------------------------------------------------- 1 | from csv import DictReader, reader, writer 2 | import gzip 3 | from io import StringIO, TextIOWrapper 4 | from itertools import starmap 5 | import json 6 | import logging 7 | import re 8 | from tempfile import NamedTemporaryFile 9 | from textwrap import dedent 10 | 11 | 12 | logger = logging.getLogger(__name__) 13 | logger.setLevel(logging.DEBUG) 14 | ch = logging.StreamHandler() 15 | ch.setLevel(logging.DEBUG) 16 | formatter = logging.Formatter( 17 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 18 | ch.setFormatter(formatter) 19 | logger.addHandler(ch) 20 | 21 | 22 | class GlobalFile(object): 23 | """Helper to import from global_LIONS.txt to Athena. 24 | 25 | Args: 26 | zip_file: A zipfile.ZipFile of NCD data. 27 | athena: An ncd.Athena to use when accessing AWS. 
28 | """ 29 | 30 | def __init__(self, zip_file=None, athena=None): 31 | self._zip = zip_file 32 | self._athena = athena 33 | self.logger = logger.getChild('GlobalFile') 34 | 35 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 36 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 37 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 38 | 39 | def load(self): 40 | """Load all tables from this file into Athena.""" 41 | try: 42 | raw_content = self._get_raw_content() 43 | except KeyError: 44 | return 45 | tables = self._extract_global_tables(raw_content) 46 | table_names = sorted(tables.keys()) 47 | for table_name in table_names: 48 | self._load_table(table_name, tables[table_name]) 49 | self.logger.info('Loaded global table {0}'.format(table_name)) 50 | 51 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 52 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 53 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 54 | 55 | def _convert_raw_file(self, table, output_file): 56 | """Convert a raw data file for Athena and add it to a .gz. 57 | 58 | Args: 59 | table: A text file-like object with table data. 60 | output_file: A text file-like object to which our newly converted 61 | data should be appended. 62 | """ 63 | table.seek(0) 64 | reader = DictReader(table) 65 | 66 | for input_row in reader: 67 | output_row = {} 68 | for key, value in input_row.items(): 69 | if key.startswith('redacted_'): 70 | output_row[key] = bool(value) 71 | else: 72 | output_row[key] = value 73 | output_file.write(json.dumps(output_row)) 74 | output_file.write('\n') 75 | 76 | def _extract_global_table(self, raw_fragment): 77 | """Extract a CSV of data for one table. 78 | 79 | Args: 80 | raw_fragment: A string containing fixed-width data for one table 81 | from within global_LIONS.txt. 82 | 83 | Returns: 84 | A text file-like object containing CSV data from the given table. 
85 | """ 86 | header, divider, *fixed_rows = raw_fragment.split('\n') 87 | field_width_matches = tuple(re.finditer(r'-+', divider)) 88 | 89 | def split_row(row, is_header=False): 90 | def extract_field(match): 91 | return row[match.start():match.end()].strip() 92 | raw_cells = tuple(map(extract_field, field_width_matches)) 93 | if is_header: 94 | data_cells = list(raw_cells) 95 | redaction_cells = [ 96 | 'redacted_{0}'.format(cell) for cell in raw_cells] 97 | else: 98 | data_cells = [ 99 | (cell if cell != '*' else '') for cell in raw_cells] 100 | redaction_cells = [ 101 | ('t' if cell == '*' else '') for cell in raw_cells] 102 | return data_cells + redaction_cells 103 | 104 | def convert_camel_case_field_name(field_name): 105 | def add_underscore(match): 106 | return '_' + match.group(1) 107 | converted = re.sub( 108 | r'(?[^(]+)(?:\((?P.+)\))?', field_type_text) 120 | field_type_component = field_components.group('type') 121 | if field_type_component in ('VARCHAR', 'VARCHAR2'): 122 | return converter_with_nulls(str) 123 | if field_type_component == 'NUMBER': 124 | return converter_with_nulls(int) 125 | if field_type_component == 'DATE': 126 | return converter_with_nulls(_parse_oracle_date) 127 | if field_type_component == 'FLOAT': 128 | return converter_with_nulls(float) 129 | raise NotImplementedError( 130 | 'Unsure how to handle a {0}'.format(field_type_text)) 131 | 132 | def build_column(row): 133 | return (row['column'], get_python_type(row['field_type'])) 134 | 135 | return dict(map(build_column, schema_reader)) 136 | 137 | def _generate_ddl(self, is_partitioned=False): 138 | """Generate a CREATE EXTERNAL TABLE query to run on Athena. 139 | 140 | Args: 141 | is_partitioned: A boolean specifying whether a table is to be split 142 | into multiple files by federal judicial district (True) or 143 | consists of only one file covering all districts (False). 144 | 145 | Returns: 146 | A string SQL query to execute. 
147 | """ 148 | self._schema.seek(0) 149 | reader = DictReader(self._schema) 150 | 151 | def get_athena_type(field_type_text): 152 | field_components = re.match( 153 | r'(?P[^(]+)(?:\((?P.+)\))?', field_type_text) 154 | field_type_component = field_components.group('type') 155 | if field_type_component in ('VARCHAR', 'VARCHAR2'): 156 | return 'STRING' 157 | if field_type_component == 'NUMBER': 158 | return 'BIGINT' 159 | if field_type_component == 'DATE': 160 | return 'DATE' # Actually a date in strftime format '%d-%b-%Y' 161 | if field_type_component == 'FLOAT': 162 | return 'DOUBLE' 163 | raise NotImplementedError( 164 | 'Unsure how to handle a {0}'.format(field_type_text)) 165 | 166 | def build_column(row): 167 | data_column = '{0} {1}'.format( 168 | row['column'], get_athena_type(row['field_type'])) 169 | redaction_column = 'redacted_{0} BOOLEAN'.format(row['column']) 170 | return (data_column, redaction_column) 171 | 172 | column_pairs = tuple(map(build_column, reader)) 173 | data_columns = map(itemgetter(0), column_pairs) 174 | redaction_columns = map(itemgetter(1), column_pairs) 175 | columns = tuple(chain(data_columns, redaction_columns)) 176 | column_specs = ',\n '.join(columns) 177 | 178 | if is_partitioned: 179 | partition_clause = ( 180 | '\n PARTITIONED BY (filename_district STRING)') 181 | else: 182 | partition_clause = '' 183 | 184 | query = """ 185 | CREATE EXTERNAL TABLE IF NOT EXISTS {name} ( 186 | {columns} 187 | ){partition_clause} 188 | ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 189 | STORED AS TEXTFILE 190 | LOCATION 's3://{bucket}/{table_prefix}'; 191 | """.format( 192 | name=self.name, columns=column_specs, 193 | partition_clause=partition_clause, 194 | bucket=self._athena.data_bucket, 195 | table_prefix=self._athena.prefix_for_table(self.name) 196 | ) 197 | 198 | return dedent(query) 199 | 200 | def _generate_rows(self, csv_data, gzip_file): 201 | """Convert rows of a CSV and append the results to a .gz. 202 | 203 | Args: 204 | csv_data: A text file-like object containing CSV data. 205 | gzip_file: A file-like object to which our newly converted data 206 | should be appended. 207 | """ 208 | field_converters = self._gather_python_types() 209 | reader = DictReader(csv_data) 210 | for input_row in reader: 211 | output_obj = {} 212 | for field_name, field_raw_value in input_row.items(): 213 | if field_raw_value == '*': 214 | field_value = None 215 | redacted_value = True 216 | else: 217 | field_value = field_converters[field_name](field_raw_value) 218 | redacted_value = False 219 | output_obj[field_name] = field_value 220 | output_obj['redacted_{0}'.format(field_name)] = redacted_value 221 | output_json = json.dumps(output_obj) 222 | gzip_file.write('{0}\n'.format(output_json)) 223 | 224 | def _get_file_names(self): 225 | """Determine which contents to use from our zip file. 226 | 227 | Returns: 228 | A dict. Each key specifies the federal judicial district covered by 229 | a given data file; this is a string unless the file covers all 230 | districts, which case it is None. Each value is a string filename 231 | for the given data file within self._zip. 
232 | """ 233 | lowercase_name = self.name.lower() 234 | file_name_pattern = re.compile(''.join([ 235 | r'^', lowercase_name, r'(?:_(?P[A-Z]+))?\.txt$'])) 236 | 237 | def file_is_for_table(file_name): 238 | match = file_name_pattern.match(file_name) 239 | if not match: 240 | return None 241 | return (match.group('district'), file_name) 242 | data_file_names = dict( 243 | filter(None, map(file_is_for_table, self._zip.namelist()))) 244 | 245 | return data_file_names 246 | 247 | def _make_csv(self, fixed_width_data): 248 | """Convert a fixed-width data file to a CSV. 249 | 250 | Args: 251 | fixed_width_data: A text file-like object containing fixed-width 252 | data, following the format described in self._schema. 253 | 254 | Returns: 255 | A text file-like object containing CSV data. 256 | """ 257 | self._schema.seek(0) 258 | fixed_width_data.seek(0) 259 | fixed_width_text = TextIOWrapper(fixed_width_data, encoding='latin-1') 260 | 261 | csv_file = TemporaryFile(mode='w+') 262 | fixed2csv(fixed_width_text, self._schema, output=csv_file) 263 | 264 | fixed_width_text.close() 265 | csv_file.seek(0) 266 | 267 | self.logger.debug('Converted fixed-width data to CSV') 268 | return csv_file 269 | 270 | def _remove_crs(self, raw_data): 271 | """Remove carriage returns from a file. 272 | 273 | Args: 274 | raw_data: A file-like object. 275 | 276 | Returns: 277 | A file-like object with most of the same content. 278 | """ 279 | no_cr_file = TemporaryFile(mode='w+b') 280 | while True: 281 | raw_chunk = raw_data.read(4096) 282 | if not raw_chunk: 283 | break 284 | fixed_chunk = raw_chunk.replace(b'\r', b' ') 285 | no_cr_file.write(fixed_chunk) 286 | 287 | no_cr_file.seek(0) 288 | raw_data.close() 289 | 290 | self.logger.debug('Removed carriage returns') 291 | return no_cr_file 292 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cssselect==1.0.3 2 | csvkit==1.0.2 3 | lxml==4.1.1 4 | requests==2.21.0 5 | --------------------------------------------------------------------------------