├── .gitignore ├── LICENSE ├── README.md ├── import_all.py ├── import_zip.py ├── ncd ├── __init__.py ├── athena.py ├── athena_mock.py ├── data_zip.py ├── global_file.py ├── lookup_table.py └── normal_table.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | desktop.ini 3 | __pycache__ 4 | source/ 5 | tables/ 6 | ddl/ 7 | tmp/ 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | ISC License 2 | 3 | Copyright (c) 2017-2018, Associated Press 4 | 5 | Permission to use, copy, modify, and/or distribute this software for any 6 | purpose with or without fee is hereby granted, provided that the above 7 | copyright notice and this permission notice appear in all copies. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH 10 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY 11 | AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, 12 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM 13 | LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE 14 | OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR 15 | PERFORMANCE OF THIS SOFTWARE. 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # National Caseload Data # 2 | 3 | Ingest script to take the Department of Justice's 4 | [National Caseload Data][ncd], which covers cases handled by U.S. Attorneys, 5 | and load it into [Athena][] for querying. 6 | 7 | (Want to use a more typical database, such as PostgreSQL? Look in the 8 | `sqlalchemy` branch.) 9 | 10 | [Athena]: https://aws.amazon.com/athena/ 11 | [ncd]: https://www.justice.gov/usao/resources/foia-library/national-caseload-data 12 | 13 | ## The files ## 14 | 15 | The DOJ website provides data dumps of the entire database as of the end of 16 | each month and as of the end of each fiscal year (ends Sept. 30). DOJ only 17 | retains the three most recent monthly dumps, and there's usually a lag of a 18 | month or so. 19 | 20 | Each dump is cumulative, so the November 2017 dump includes all of the cases 21 | from the October 2017 dump, plus whatever was added in November. 22 | 23 | Each dump is split into a number of zip files (24 as of the December 2017 24 | release). Each zip file has one of three structures: 25 | 26 | * **Normal tables:** 27 | 28 | Most of the NCD database tables are stored as fixed-width text files, with 29 | their schemas described in the zip file's `README.TXT`. 30 | 31 | For example, the `GS_COURT_HIST` table's contents are in a file called 32 | `gs_court_hist.txt` within one of the zip files, and that zip file's 33 | `README.TXT` will contain the schema for that table (and others contained 34 | in the zip). 35 | 36 | If a table is particularly large, its contents will be split into several 37 | text files--one for each [district][]--and distributed among several zip 38 | files. 
39 | 40 | For example, the `GS_PARTICIPANT` table's contents are split into files 41 | such as `gs_participant_FLM.txt` for the Middle District of Florida and 42 | `gs_participant_CT` for the District of Connecticut, and these might live 43 | in separate zip files, each with the `GS_PARTICIPANT` schema in its 44 | `README.TXT`. 45 | 46 | * **Codebooks:** 47 | 48 | The last file in a given dump will contain codebooks that can be useful in 49 | interpreting the contents of the normal tables. These come in two forms, 50 | which this code describes as: 51 | 52 | * **Lookup tables:** 53 | 54 | A lookup table consists of one fixed-width text file containing some 55 | metadata followed by a row of column headers and a separator row of 56 | hyphens (useful for determining column widths). 57 | 58 | These filenames start with `table_`; for example, the `GS_POSITION` 59 | table is in a file called `table_gs_position.txt`. 60 | 61 | * **Global tables:** 62 | 63 | One file called `global_LIONS.txt` contains several distinct 64 | fixed-width tables all stacked on top of one another. 65 | 66 | This is as painful as it sounds. 67 | 68 | * **Documentation:** 69 | 70 | The second-to-last file in a given dump usually has no specific data in it; 71 | it just contains logs, statistics files and other semi-documentation. 72 | 73 | [district]: https://en.wikipedia.org/wiki/United_States_federal_judicial_district 74 | 75 | ## Ingest architecture ## 76 | 77 | There are two scripts in the root of this repo: 78 | 79 | * `import_zip.py`, which takes one already-downloaded NCD component zip file, 80 | converts it to [gzipped][athena-compression] [JSON][athena-json] for 81 | Athena, uploads it to S3 and creates the appropriate Athena tables. 82 | 83 | * `import_all.py` is an experimental script that takes a URL to a dump's 84 | landing page ([such as this one][dump_fy_2017]) and asynchronously 85 | processes the zip files listed there. (This is meant to make it easier to 86 | invoke automatically and to allow one file to be processed while another 87 | downloads. Still working out some of the kinks there, though.) 
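For reference, a single run of `import_zip.py` boils down to the following steps, shown
here as a minimal Python sketch (the bucket names, prefix and zip filename are
hypothetical placeholders):

```python
from ncd.athena import Athena
from ncd.data_zip import DataZip

# Hypothetical AWS resource names -- substitute your own buckets and prefix.
athena = Athena(
    data_bucket='my-ncd-data', results_bucket='my-ncd-results',
    s3_prefix='ncd', db_name='ncd')
athena.create_db()

# Convert one downloaded NCD zip to gzipped JSON, upload it to S3 and
# create the matching Athena tables.
DataZip('gs_case.zip', athena).load()
```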
88 | 89 | [athena-compression]: https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html 90 | [athena-json]: https://docs.aws.amazon.com/athena/latest/ug/json.html 91 | [dump_fy_2017]: https://www.justice.gov/usao/resources/foia-library/national-caseload-data/2017 92 | -------------------------------------------------------------------------------- /import_all.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from argparse import ArgumentParser 3 | import logging 4 | import sys 5 | from tempfile import NamedTemporaryFile 6 | from urllib.parse import urlsplit, urlunsplit 7 | 8 | from lxml import etree 9 | import requests 10 | 11 | from ncd.athena import Athena as Athena 12 | from ncd.data_zip import DataZip 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | logger.setLevel(logging.DEBUG) 17 | ch = logging.StreamHandler() 18 | ch.setLevel(logging.DEBUG) 19 | formatter = logging.Formatter( 20 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 21 | ch.setFormatter(formatter) 22 | logger.addHandler(ch) 23 | 24 | 25 | parser = ArgumentParser(description='Load a month of National Caseload Data.') 26 | parser.add_argument( 27 | '--data-bucket', help='S3 bucket name for data files', required=True) 28 | parser.add_argument( 29 | '--results-bucket', help='S3 bucket name for query results', required=True) 30 | parser.add_argument( 31 | '--s3-prefix', help='Prefix for data file paths on S3', required=True) 32 | parser.add_argument('--db-name', help='Database name on Athena', required=True) 33 | parser.add_argument( 34 | 'file_listing_url', 35 | help='URL to a DOJ page of yearly or monthly data files') 36 | 37 | 38 | def change_url_scheme(url, new_scheme): 39 | """Change a URL from, say, HTTP to HTTPS. 40 | 41 | Args: 42 | url: A string URL to modify. 43 | new_scheme: A string with which to replace the original URL's scheme 44 | (everything before the :// portion). 45 | 46 | Returns: 47 | A string URL. 48 | """ 49 | url_parts = urlsplit(url) 50 | return urlunsplit((new_scheme, *url_parts[1:])) 51 | 52 | 53 | def get_file_urls(file_listing_url): 54 | """Determine which URLs need to be downloaded. 55 | 56 | Args: 57 | file_listing_url: A string URL to a page on the DOJ site. 58 | 59 | Returns: 60 | A tuple of string URLs to individual zip files. 61 | """ 62 | r = requests.get(file_listing_url) 63 | raw_html = r.text 64 | html = etree.HTML(raw_html) 65 | links = html.cssselect('a[href$=".zip"]') 66 | return tuple(map( 67 | lambda link: change_url_scheme(link.attrib['href'], 'https'), 68 | links)) 69 | 70 | 71 | def load_file_from_url(zip_file_url, athena): 72 | """Download a data file and load it into a database. 73 | 74 | Args: 75 | zip_file_url: A string URL to an NCD data file. 76 | athena: An ncd.Athena to use when storing the file. 
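
    Example (illustrative only; the URL and AWS resource names are
    hypothetical):

        athena = Athena(
            data_bucket='my-ncd-data', results_bucket='my-ncd-results',
            s3_prefix='ncd', db_name='ncd')
        load_file_from_url(
            'https://www.justice.gov/.../gs_case.zip', athena)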
77 | """ 78 | zip_file_basename = zip_file_url.split('/')[-1] 79 | logger.debug('About to download {0}'.format(zip_file_basename)) 80 | with NamedTemporaryFile() as zip_file: 81 | chunk_size = 32768 82 | r = requests.get(zip_file_url, stream=True) 83 | logger.debug('Saving {0} to {1}'.format( 84 | zip_file_basename, zip_file.name)) 85 | for chunk in r.iter_content(chunk_size=chunk_size): 86 | zip_file.write(chunk) 87 | logger.debug('Finished saving {0} to {1}'.format( 88 | zip_file_basename, zip_file.name)) 89 | zip_file.seek(0) 90 | 91 | logger.debug('Saving {0} to Athena'.format(zip_file_basename)) 92 | DataZip(zip_file.name, athena).load() 93 | logger.debug('Completed {0}'.format(zip_file_basename)) 94 | 95 | 96 | def main(raw_args): 97 | args = parser.parse_args(raw_args) 98 | 99 | athena = Athena( 100 | data_bucket=args.data_bucket, results_bucket=args.results_bucket, 101 | s3_prefix=args.s3_prefix, db_name=args.db_name) 102 | athena.create_db() 103 | 104 | file_urls = get_file_urls(args.file_listing_url) 105 | logger.info('Found {0} files to download'.format(len(file_urls))) 106 | 107 | for file_url in file_urls: 108 | load_file_from_url(file_url, athena) 109 | 110 | 111 | if __name__ == '__main__': 112 | main(sys.argv[1:]) 113 | -------------------------------------------------------------------------------- /import_zip.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from argparse import ArgumentParser 3 | import logging 4 | import sys 5 | 6 | from ncd.athena import Athena 7 | from ncd.data_zip import DataZip 8 | 9 | 10 | logger = logging.getLogger(__name__) 11 | logger.setLevel(logging.DEBUG) 12 | ch = logging.StreamHandler() 13 | ch.setLevel(logging.DEBUG) 14 | formatter = logging.Formatter( 15 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 16 | ch.setFormatter(formatter) 17 | logger.addHandler(ch) 18 | 19 | 20 | parser = ArgumentParser(description='Load a file of National Caseload Data.') 21 | parser.add_argument( 22 | '--data-bucket', help='S3 bucket name for data files', required=True) 23 | parser.add_argument( 24 | '--results-bucket', help='S3 bucket name for query results', required=True) 25 | parser.add_argument( 26 | '--s3-prefix', help='Prefix for data file paths on S3', required=True) 27 | parser.add_argument('--db-name', help='Database name on Athena', required=True) 28 | parser.add_argument('zip_path', help='Path to a zip file from NCD') 29 | 30 | 31 | def main(raw_args): 32 | args = parser.parse_args(raw_args) 33 | athena = Athena( 34 | data_bucket=args.data_bucket, results_bucket=args.results_bucket, 35 | s3_prefix=args.s3_prefix, db_name=args.db_name) 36 | athena.create_db() 37 | DataZip(args.zip_path, athena).load() 38 | 39 | 40 | if __name__ == '__main__': 41 | main(sys.argv[1:]) 42 | -------------------------------------------------------------------------------- /ncd/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/associatedpress/national-caseload-data-ingest/f657719e96129dc578ec7c4c5e484781df34b213/ncd/__init__.py -------------------------------------------------------------------------------- /ncd/athena.py: -------------------------------------------------------------------------------- 1 | from io import BytesIO, TextIOWrapper 2 | import logging 3 | import posixpath 4 | from time import sleep 5 | 6 | import boto3 7 | 8 | 9 | logger = logging.getLogger(__name__) 10 | logger.setLevel(logging.DEBUG) 11 | ch = 
logging.StreamHandler() 12 | ch.setLevel(logging.DEBUG) 13 | formatter = logging.Formatter( 14 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 15 | ch.setFormatter(formatter) 16 | logger.addHandler(ch) 17 | 18 | 19 | class Athena(object): 20 | """Helper for Athena I/O. 21 | 22 | Args: 23 | data_bucket: A string name of an S3 bucket where we should store data 24 | files. 25 | results_bucket: A string name of an S3 bucket where Athena should store 26 | CSVs of query results. 27 | s3_prefix: A string prefix to use for data files' S3 keys. 28 | db_name: A string database name to use when creating and querying 29 | tables. 30 | """ 31 | def __init__( 32 | self, data_bucket=None, results_bucket=None, s3_prefix=None, 33 | db_name=None): 34 | self._athena = boto3.client('athena') 35 | self._s3 = boto3.resource('s3') 36 | 37 | self.data_bucket = data_bucket 38 | self.results_bucket = results_bucket 39 | self.s3_prefix = s3_prefix 40 | self.db_name = db_name 41 | 42 | self.logger = logger.getChild('Athena') 43 | 44 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 45 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 46 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 47 | 48 | def create_db(self): 49 | """Create the database we want to use. 50 | """ 51 | self.logger.debug('Ensuring Athena database exists') 52 | self.execute_query( 53 | 'CREATE DATABASE IF NOT EXISTS {0};'.format(self.db_name), 54 | 'default') 55 | self.logger.debug('CREATE DATABASE query completed') 56 | 57 | def execute_query(self, sql_string, db_name=None): 58 | """Execute a query on Athena. 59 | 60 | Args: 61 | sql_string: A string SQL query to execute. 62 | 63 | Returns: 64 | A StringIO of CSV output from Athena. 65 | """ 66 | db_for_query = self.db_name if db_name is None else db_name 67 | start_response = self._athena.start_query_execution( 68 | QueryString=sql_string, 69 | QueryExecutionContext={'Database': db_for_query}, 70 | ResultConfiguration={ 71 | 'OutputLocation': 's3://{results_bucket}/{s3_prefix}'.format( 72 | results_bucket=self.results_bucket, 73 | s3_prefix=self.s3_prefix) 74 | }) 75 | 76 | query_execution_id = start_response['QueryExecutionId'] 77 | self.logger.debug('Started query ID {0}'.format(query_execution_id)) 78 | 79 | return self._results_for_query(query_execution_id) 80 | 81 | def prefix_for_table(self, table_name): 82 | """Create a full prefix for S3 keys for a given table. 83 | 84 | Args: 85 | table_name: A string table name. 86 | 87 | Returns: 88 | A string S3 prefix. 89 | """ 90 | return posixpath.join(self.s3_prefix, self.db_name, table_name) 91 | 92 | def upload_data(self, table_name, file_obj, district=None): 93 | """Upload the given data to S3. 94 | 95 | Args: 96 | table_name: A string table name. 97 | file_obj: A binary file-like object. 98 | district: An optional string code for a federal judicial district; 99 | provide this when DOJ splits up a table by district. 
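
        Example key layout (hypothetical names, with s3_prefix='ncd' and
        db_name='ncd'):

            upload_data('GS_PARTICIPANT', fh, district='FLM') uploads to
                s3://<data_bucket>/ncd/ncd/GS_PARTICIPANT/filename_district=FLM/GS_PARTICIPANT-FLM.json.gz
            upload_data('GS_COURT_HIST', fh) uploads to
                s3://<data_bucket>/ncd/ncd/GS_COURT_HIST/GS_COURT_HIST.json.gz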
100 | """ 101 | if district: 102 | s3_key = posixpath.join( 103 | self.prefix_for_table(table_name), 104 | 'filename_district={0}'.format(district), 105 | '{0}-{1}.json.gz'.format(table_name, district)) 106 | else: 107 | s3_key = posixpath.join( 108 | self.prefix_for_table(table_name), 109 | '{0}.json.gz'.format(table_name)) 110 | file_obj.seek(0) 111 | self._s3.Bucket(self.data_bucket).upload_fileobj(file_obj, s3_key) 112 | self.logger.debug('Uploaded file to s3://{0}/{1}'.format( 113 | self.data_bucket, s3_key)) 114 | 115 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 116 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 117 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 118 | 119 | def _results_for_query(self, query_execution_id): 120 | """Retrieve the results for the given query. 121 | 122 | Args: 123 | query_execution_id: A string execution ID. 124 | 125 | Returns: 126 | A text file-like object of CSV output from Athena. 127 | """ 128 | result_bucket, result_key = self._wait_for_result(query_execution_id) 129 | results_bytes = BytesIO() 130 | self.logger.debug('Downloading results for query ID {0}'.format( 131 | query_execution_id)) 132 | self._s3.Bucket(result_bucket).download_fileobj( 133 | result_key, results_bytes) 134 | self.logger.debug('Downloaded results for query ID {0}'.format( 135 | query_execution_id)) 136 | results_bytes.seek(0) 137 | results_text = TextIOWrapper(results_bytes, encoding='utf-8') 138 | return results_text 139 | 140 | def _wait_for_result(self, query_execution_id): 141 | """Wait for a query to complete. 142 | 143 | This method will block until the query has completed. 144 | 145 | Args: 146 | query_execution_id: A string execution ID. 147 | 148 | Returns: 149 | A tuple with two elements: 150 | * A string S3 bucket name where results are stored. 151 | * A string S3 key for the object containing results. 152 | """ 153 | try: 154 | sleep(0.5) 155 | while True: 156 | query_execution = self._athena.get_query_execution( 157 | QueryExecutionId=query_execution_id) 158 | query_state = query_execution['QueryExecution'][ 159 | 'Status']['State'] 160 | if query_state != 'RUNNING': 161 | break 162 | self.logger.debug('Waiting for results for query {0}'.format( 163 | query_execution_id)) 164 | sleep(5) 165 | 166 | output_location = query_execution['QueryExecution'][ 167 | 'ResultConfiguration']['OutputLocation'] 168 | location_components = output_location.split('/', maxsplit=3) 169 | 170 | return (location_components[2], location_components[3]) 171 | except BaseException as e: # Yes, really. 172 | self._athena.stop_query_execution( 173 | QueryExecutionId=query_execution_id) 174 | raise e 175 | -------------------------------------------------------------------------------- /ncd/athena_mock.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | import posixpath 3 | from pathlib import Path 4 | from shutil import copyfileobj 5 | 6 | 7 | class AthenaMock(object): 8 | """Mock for ncd.athena.Athena that saves to disk. 9 | 10 | Args: 11 | data_bucket: Ignored. 12 | results_bucket: Ignored. 13 | s3_prefix: A string base directory into which table data and queries 14 | will be saved. 15 | db_name: Ignored. 
16 | """ 17 | def __init__( 18 | self, data_bucket=None, results_bucket=None, s3_prefix=None, 19 | db_name=None): 20 | self.data_bucket = data_bucket 21 | self.results_bucket = results_bucket 22 | self.s3_prefix = s3_prefix 23 | self.db_name = db_name 24 | 25 | self._base_dir = Path(s3_prefix) 26 | 27 | self._query_dir = self._base_dir / 'queries' 28 | self._table_dir = self._base_dir / 'tables' 29 | 30 | self._query_dir.mkdir(parents=True, exist_ok=True) 31 | self._table_dir.mkdir(parents=True, exist_ok=True) 32 | 33 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 34 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 35 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 36 | 37 | def create_db(self): 38 | """No-op. 39 | """ 40 | return 41 | 42 | def execute_query(self, sql_string): 43 | """Save the given query to disk. 44 | 45 | Args: 46 | sql_string: A string SQL query to save. 47 | """ 48 | timestamp = datetime.now().isoformat() 49 | output_path = self._query_dir / '{0}.sql'.format(timestamp) 50 | with output_path.open('w') as output_file: 51 | output_file.write(sql_string) 52 | 53 | def prefix_for_table(self, table_name): 54 | """Create a full prefix for S3 keys for a given table. 55 | 56 | Args: 57 | table_name: A string table name. 58 | 59 | Returns: 60 | A string S3 prefix. 61 | """ 62 | return posixpath.join(str(self._base_dir), table_name) 63 | 64 | def upload_data(self, table_name, file_obj, district=None): 65 | """Save the given data to disk. 66 | 67 | Args: 68 | table_name: A string table name. 69 | file_obj: A binary file-like object. 70 | district: An optional string code for a federal judicial district; 71 | provide this when DOJ splits up a table by district. 72 | """ 73 | if district: 74 | table_dir = self._table_dir / table_name 75 | district_dir = table_dir / 'filename_district={0}'.format(district) 76 | output_path = district_dir / '{0}-{1}.json.gz'.format( 77 | table_name, district) 78 | else: 79 | output_path = self._table_dir / table_name / '{0}.json.gz'.format( 80 | table_name) 81 | output_path.parent.mkdir(parents=True, exist_ok=True) 82 | file_obj.seek(0) 83 | with output_path.open('wb') as output_file: 84 | copyfileobj(file_obj, output_file) 85 | 86 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 87 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 88 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 89 | 90 | # TODO: Add this. 91 | -------------------------------------------------------------------------------- /ncd/data_zip.py: -------------------------------------------------------------------------------- 1 | from csv import DictWriter 2 | from io import StringIO 3 | from itertools import starmap 4 | import logging 5 | import re 6 | from zipfile import ZipFile 7 | 8 | from ncd.global_file import GlobalFile 9 | from ncd.lookup_table import LookupTable 10 | from ncd.normal_table import NormalTable 11 | 12 | 13 | logger = logging.getLogger(__name__) 14 | logger.setLevel(logging.DEBUG) 15 | ch = logging.StreamHandler() 16 | ch.setLevel(logging.DEBUG) 17 | formatter = logging.Formatter( 18 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 19 | ch.setFormatter(formatter) 20 | logger.addHandler(ch) 21 | 22 | 23 | class DataZip(object): 24 | """Load all of the data from an NCD zip file to Athena. 25 | 26 | Args: 27 | zip_path: A string path to a zip file from NCD. 
28 |         athena: An ncd.Athena to use when accessing AWS.
29 |     """
30 | 
31 |     def __init__(self, zip_path=None, athena=None):
32 |         self._zip_path = zip_path
33 |         self._athena = athena
34 |         self.logger = logger.getChild('DataZip')
35 | 
36 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
37 |     # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=-
38 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
39 | 
40 |     def load(self):
41 |         """Load this file's tables into Athena."""
42 |         with ZipFile(self._zip_path, 'r') as zip_file:
43 |             logger.info('Opened input file {0}'.format(self._zip_path))
44 |             self._zip_file = zip_file
45 | 
46 |             normal_schemas = self._extract_normal_schemas()
47 | 
48 |             self._process_normal_tables(normal_schemas)
49 |             self._process_global_tables()
50 |             self._process_lookup_tables()
51 | 
52 |         self.logger.info('Done')
53 | 
54 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
55 |     # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=-
56 |     # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
57 | 
58 |     def _extract_normal_schema(self, raw_fragment):
59 |         """Extract a normal table's schema from a README fragment.
60 | 
61 |         Args:
62 |             raw_fragment: A string fragment of the zip file's README.
63 | 
64 |         Returns:
65 |             A text file-like object with CSV schema information in the format
66 |             expected by csvkit's in2csv utility.
67 |         """
68 |         raw_field_specs = re.finditer(
69 |             (
70 |                 r'^(?P<field_name>[A-Z][^\s]+)\s+(?:NOT NULL)?\s+' +
71 |                 r'(?P<field_type>[A-Z][^\s]+)\s+' +
72 |                 r'\((?P<start_column>\d+):(?P<end_column>\d+)\)'),
73 |             raw_fragment, re.MULTILINE)
74 | 
75 |         def make_row(row):
76 |             start_column = int(row.group('start_column'))
77 |             end_column = int(row.group('end_column'))
78 |             return {
79 |                 'column': row.group('field_name'),
80 |                 'start': str(start_column),
81 |                 'length': str(end_column - start_column + 1),
82 |                 'field_type': row.group('field_type')
83 |             }
84 |         rows = map(make_row, raw_field_specs)
85 | 
86 |         field_names = ('column', 'start', 'length', 'field_type')
87 |         output_io = StringIO()
88 | 
89 |         writer = DictWriter(output_io, field_names)
90 |         writer.writeheader()
91 |         writer.writerows(rows)
92 | 
93 |         output_io.seek(0)
94 |         return output_io
95 | 
96 |     def _extract_normal_schemas(self):
97 |         """Extract schemas for normal data tables.
98 | 
99 |         Returns:
100 |             A dict with string table names as keys and text file-like objects
101 |             as values. Each value contains a CSV with schema information in the
102 |             format expected by csvkit's in2csv utility.
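
        Example (hypothetical table and field names): a README.TXT fragment
        such as

            GS_CASE - One record per case.
            CASEID       NOT NULL   NUMBER(10)    (1:10)
            DISTRICT     NOT NULL   VARCHAR2(3)   (11:13)
            FILED_DATE              DATE          (14:24)

        yields a dict mapping 'GS_CASE' to a file-like object whose CSV
        contents are:

            column,start,length,field_type
            CASEID,1,10,NUMBER(10)
            DISTRICT,11,3,VARCHAR2(3)
            FILED_DATE,14,11,DATE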
103 | """ 104 | with self._zip_file.open('README.TXT', 'r') as readme_file: 105 | readme = readme_file.read().decode('latin-1') 106 | 107 | schemas = {} 108 | 109 | table_names = re.findall(r'^([A-Z][^ ]+) - ', readme, re.MULTILINE) 110 | if not table_names: 111 | return schemas 112 | 113 | def get_table_start(table_name): 114 | start_match = re.search( 115 | r'^' + table_name + ' - ', readme, re.MULTILINE) 116 | return (table_name, start_match.start()) 117 | table_starts = tuple(map(get_table_start, table_names)) 118 | last_table_name = table_names[-1] 119 | 120 | def get_table_end(i, table_info): 121 | table_name, table_start = table_info 122 | if table_name == last_table_name: 123 | table_end = None 124 | else: 125 | table_end = table_starts[i + 1][1] 126 | return (table_name, table_start, table_end) 127 | table_info = tuple(starmap(get_table_end, enumerate(table_starts))) 128 | 129 | for table_name, table_start, table_end in table_info: 130 | readme_fragment = readme[table_start:table_end] 131 | schema = self._extract_normal_schema(readme_fragment) 132 | schemas[table_name] = schema 133 | 134 | return schemas 135 | 136 | def _process_global_tables(self): 137 | """Load this file's global (schemaless) tables into Athena.""" 138 | GlobalFile(self._zip_file, self._athena).load() 139 | 140 | def _process_lookup_tables(self): 141 | """Load this file's separate lookup tables into Athena.""" 142 | table_file_names = sorted(tuple(filter( 143 | lambda name: name.startswith('table_gs_'), 144 | self._zip_file.namelist()))) 145 | for file_name in table_file_names: 146 | with self._zip_file.open(file_name, 'r') as input_file: 147 | raw_content = input_file.read().decode('latin-1') 148 | LookupTable(raw_content, self._athena).load() 149 | 150 | def _process_normal_tables(self, schemas): 151 | """Load this file's normal tables into Athena. 152 | 153 | Args: 154 | schemas: A dict as returned by _extract_table_schemas. 155 | """ 156 | table_names = sorted(schemas.keys()) 157 | self.logger.info('Found {0} table schemas: {1}'.format( 158 | len(table_names), ', '.join(table_names))) 159 | for table_name in table_names: 160 | normal_table = NormalTable( 161 | name=table_name, zip_file=self._zip_file, 162 | schema_io=schemas[table_name], athena=self._athena) 163 | normal_table.load() 164 | -------------------------------------------------------------------------------- /ncd/global_file.py: -------------------------------------------------------------------------------- 1 | from csv import DictReader, reader, writer 2 | import gzip 3 | from io import StringIO, TextIOWrapper 4 | from itertools import starmap 5 | import json 6 | import logging 7 | import re 8 | from tempfile import NamedTemporaryFile 9 | from textwrap import dedent 10 | 11 | 12 | logger = logging.getLogger(__name__) 13 | logger.setLevel(logging.DEBUG) 14 | ch = logging.StreamHandler() 15 | ch.setLevel(logging.DEBUG) 16 | formatter = logging.Formatter( 17 | '%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s') 18 | ch.setFormatter(formatter) 19 | logger.addHandler(ch) 20 | 21 | 22 | class GlobalFile(object): 23 | """Helper to import from global_LIONS.txt to Athena. 24 | 25 | Args: 26 | zip_file: A zipfile.ZipFile of NCD data. 27 | athena: An ncd.Athena to use when accessing AWS. 
28 | """ 29 | 30 | def __init__(self, zip_file=None, athena=None): 31 | self._zip = zip_file 32 | self._athena = athena 33 | self.logger = logger.getChild('GlobalFile') 34 | 35 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 36 | # -=-=-=-=-=-=-=-=-=-=-= PUBLIC METHODS FOLLOW =-=-=-=-=-=-=-=-=-=-=-=-=-=- 37 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 38 | 39 | def load(self): 40 | """Load all tables from this file into Athena.""" 41 | try: 42 | raw_content = self._get_raw_content() 43 | except KeyError: 44 | return 45 | tables = self._extract_global_tables(raw_content) 46 | table_names = sorted(tables.keys()) 47 | for table_name in table_names: 48 | self._load_table(table_name, tables[table_name]) 49 | self.logger.info('Loaded global table {0}'.format(table_name)) 50 | 51 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 52 | # -=-=-=-=-=-=-=-=-=-=- INTERNAL METHODS FOLLOW -=-=-=-=-=-=-=-=-=-=-=-=-=- 53 | # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 54 | 55 | def _convert_raw_file(self, table, output_file): 56 | """Convert a raw data file for Athena and add it to a .gz. 57 | 58 | Args: 59 | table: A text file-like object with table data. 60 | output_file: A text file-like object to which our newly converted 61 | data should be appended. 62 | """ 63 | table.seek(0) 64 | reader = DictReader(table) 65 | 66 | for input_row in reader: 67 | output_row = {} 68 | for key, value in input_row.items(): 69 | if key.startswith('redacted_'): 70 | output_row[key] = bool(value) 71 | else: 72 | output_row[key] = value 73 | output_file.write(json.dumps(output_row)) 74 | output_file.write('\n') 75 | 76 | def _extract_global_table(self, raw_fragment): 77 | """Extract a CSV of data for one table. 78 | 79 | Args: 80 | raw_fragment: A string containing fixed-width data for one table 81 | from within global_LIONS.txt. 82 | 83 | Returns: 84 | A text file-like object containing CSV data from the given table. 
85 | """ 86 | header, divider, *fixed_rows = raw_fragment.split('\n') 87 | field_width_matches = tuple(re.finditer(r'-+', divider)) 88 | 89 | def split_row(row, is_header=False): 90 | def extract_field(match): 91 | return row[match.start():match.end()].strip() 92 | raw_cells = tuple(map(extract_field, field_width_matches)) 93 | if is_header: 94 | data_cells = list(raw_cells) 95 | redaction_cells = [ 96 | 'redacted_{0}'.format(cell) for cell in raw_cells] 97 | else: 98 | data_cells = [ 99 | (cell if cell != '*' else '') for cell in raw_cells] 100 | redaction_cells = [ 101 | ('t' if cell == '*' else '') for cell in raw_cells] 102 | return data_cells + redaction_cells 103 | 104 | def convert_camel_case_field_name(field_name): 105 | def add_underscore(match): 106 | return '_' + match.group(1) 107 | converted = re.sub( 108 | r'(?[^(]+)(?:\((?P.+)\))?', field_type_text) 120 | field_type_component = field_components.group('type') 121 | if field_type_component in ('VARCHAR', 'VARCHAR2'): 122 | return converter_with_nulls(str) 123 | if field_type_component == 'NUMBER': 124 | return converter_with_nulls(int) 125 | if field_type_component == 'DATE': 126 | return converter_with_nulls(_parse_oracle_date) 127 | if field_type_component == 'FLOAT': 128 | return converter_with_nulls(float) 129 | raise NotImplementedError( 130 | 'Unsure how to handle a {0}'.format(field_type_text)) 131 | 132 | def build_column(row): 133 | return (row['column'], get_python_type(row['field_type'])) 134 | 135 | return dict(map(build_column, schema_reader)) 136 | 137 | def _generate_ddl(self, is_partitioned=False): 138 | """Generate a CREATE EXTERNAL TABLE query to run on Athena. 139 | 140 | Args: 141 | is_partitioned: A boolean specifying whether a table is to be split 142 | into multiple files by federal judicial district (True) or 143 | consists of only one file covering all districts (False). 144 | 145 | Returns: 146 | A string SQL query to execute. 
147 | """ 148 | self._schema.seek(0) 149 | reader = DictReader(self._schema) 150 | 151 | def get_athena_type(field_type_text): 152 | field_components = re.match( 153 | r'(?P[^(]+)(?:\((?P.+)\))?', field_type_text) 154 | field_type_component = field_components.group('type') 155 | if field_type_component in ('VARCHAR', 'VARCHAR2'): 156 | return 'STRING' 157 | if field_type_component == 'NUMBER': 158 | return 'BIGINT' 159 | if field_type_component == 'DATE': 160 | return 'DATE' # Actually a date in strftime format '%d-%b-%Y' 161 | if field_type_component == 'FLOAT': 162 | return 'DOUBLE' 163 | raise NotImplementedError( 164 | 'Unsure how to handle a {0}'.format(field_type_text)) 165 | 166 | def build_column(row): 167 | data_column = '{0} {1}'.format( 168 | row['column'], get_athena_type(row['field_type'])) 169 | redaction_column = 'redacted_{0} BOOLEAN'.format(row['column']) 170 | return (data_column, redaction_column) 171 | 172 | column_pairs = tuple(map(build_column, reader)) 173 | data_columns = map(itemgetter(0), column_pairs) 174 | redaction_columns = map(itemgetter(1), column_pairs) 175 | columns = tuple(chain(data_columns, redaction_columns)) 176 | column_specs = ',\n '.join(columns) 177 | 178 | if is_partitioned: 179 | partition_clause = ( 180 | '\n PARTITIONED BY (filename_district STRING)') 181 | else: 182 | partition_clause = '' 183 | 184 | query = """ 185 | CREATE EXTERNAL TABLE IF NOT EXISTS {name} ( 186 | {columns} 187 | ){partition_clause} 188 | ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 189 | STORED AS TEXTFILE 190 | LOCATION 's3://{bucket}/{table_prefix}'; 191 | """.format( 192 | name=self.name, columns=column_specs, 193 | partition_clause=partition_clause, 194 | bucket=self._athena.data_bucket, 195 | table_prefix=self._athena.prefix_for_table(self.name) 196 | ) 197 | 198 | return dedent(query) 199 | 200 | def _generate_rows(self, csv_data, gzip_file): 201 | """Convert rows of a CSV and append the results to a .gz. 202 | 203 | Args: 204 | csv_data: A text file-like object containing CSV data. 205 | gzip_file: A file-like object to which our newly converted data 206 | should be appended. 207 | """ 208 | field_converters = self._gather_python_types() 209 | reader = DictReader(csv_data) 210 | for input_row in reader: 211 | output_obj = {} 212 | for field_name, field_raw_value in input_row.items(): 213 | if field_raw_value == '*': 214 | field_value = None 215 | redacted_value = True 216 | else: 217 | field_value = field_converters[field_name](field_raw_value) 218 | redacted_value = False 219 | output_obj[field_name] = field_value 220 | output_obj['redacted_{0}'.format(field_name)] = redacted_value 221 | output_json = json.dumps(output_obj) 222 | gzip_file.write('{0}\n'.format(output_json)) 223 | 224 | def _get_file_names(self): 225 | """Determine which contents to use from our zip file. 226 | 227 | Returns: 228 | A dict. Each key specifies the federal judicial district covered by 229 | a given data file; this is a string unless the file covers all 230 | districts, which case it is None. Each value is a string filename 231 | for the given data file within self._zip. 
232 | """ 233 | lowercase_name = self.name.lower() 234 | file_name_pattern = re.compile(''.join([ 235 | r'^', lowercase_name, r'(?:_(?P[A-Z]+))?\.txt$'])) 236 | 237 | def file_is_for_table(file_name): 238 | match = file_name_pattern.match(file_name) 239 | if not match: 240 | return None 241 | return (match.group('district'), file_name) 242 | data_file_names = dict( 243 | filter(None, map(file_is_for_table, self._zip.namelist()))) 244 | 245 | return data_file_names 246 | 247 | def _make_csv(self, fixed_width_data): 248 | """Convert a fixed-width data file to a CSV. 249 | 250 | Args: 251 | fixed_width_data: A text file-like object containing fixed-width 252 | data, following the format described in self._schema. 253 | 254 | Returns: 255 | A text file-like object containing CSV data. 256 | """ 257 | self._schema.seek(0) 258 | fixed_width_data.seek(0) 259 | fixed_width_text = TextIOWrapper(fixed_width_data, encoding='latin-1') 260 | 261 | csv_file = TemporaryFile(mode='w+') 262 | fixed2csv(fixed_width_text, self._schema, output=csv_file) 263 | 264 | fixed_width_text.close() 265 | csv_file.seek(0) 266 | 267 | self.logger.debug('Converted fixed-width data to CSV') 268 | return csv_file 269 | 270 | def _remove_crs(self, raw_data): 271 | """Remove carriage returns from a file. 272 | 273 | Args: 274 | raw_data: A file-like object. 275 | 276 | Returns: 277 | A file-like object with most of the same content. 278 | """ 279 | no_cr_file = TemporaryFile(mode='w+b') 280 | while True: 281 | raw_chunk = raw_data.read(4096) 282 | if not raw_chunk: 283 | break 284 | fixed_chunk = raw_chunk.replace(b'\r', b' ') 285 | no_cr_file.write(fixed_chunk) 286 | 287 | no_cr_file.seek(0) 288 | raw_data.close() 289 | 290 | self.logger.debug('Removed carriage returns') 291 | return no_cr_file 292 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cssselect==1.0.3 2 | csvkit==1.0.2 3 | lxml==4.1.1 4 | requests==2.21.0 5 | --------------------------------------------------------------------------------