├── .gitignore ├── LICENSE ├── README.md ├── athena-sqlite.yaml ├── lambda-function ├── s3qlite.py ├── sqlite_db.py └── vfs.py ├── lambda-layer ├── Dockerfile ├── Dockerfile.pyarrow ├── build-pyarrow.sh └── build.sh └── sample-data └── sample_data.sqlite /.gitignore: -------------------------------------------------------------------------------- 1 | venv/ 2 | target/ 3 | *.pyc 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. 
For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Athena SQLite Driver 2 | 3 | Using Athena's new [Query Federation](https://github.com/awslabs/aws-athena-query-federation/) functionality, read SQLite databases from S3. 4 | 5 | Install it from the Serverless Application Repository: [AthenaSQLiteConnector](https://serverlessrepo.aws.amazon.com/#/applications/arn:aws:serverlessrepo:us-east-1:689449560910:applications~AthenaSQLITEConnector). 
6 | 7 | ## Why? 8 | 9 | I occasionally like to put together fun side projects over Thanksgiving and Christmas holidays. 10 | 11 | I'd always joked it would be a crazy idea to be able to read SQLite using Athena, so...here we are! 12 | 13 | ## How? 14 | 15 | - I decided to use Python as I'm most familiar with it and because of the next point 16 | - Using [APSW](https://rogerbinns.github.io/apsw/), we can implement a [Virtual File System](https://rogerbinns.github.io/apsw/vfs.html) (VFS) for S3 17 | - Using the [Athena query federation example](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-example/), we can see what calls need to be implemented 18 | 19 | The PyArrow library unfortunately weighs in at over 250MB, so we have to use a custom compilation step to build a Lambda Layer. 20 | 21 | ## What? 22 | 23 | Drop SQLite databases in a single prefix in S3, and Athena will list each file as a database and automatically detect tables and schemas. 24 | 25 | Currently, all data types are strings. I'll fix this eventually. All good things in time. 26 | 27 | ## Status 28 | 29 | This project is under active development and very much in its infancy. 30 | 31 | Many things are hard-coded or broken into various pieces as I experiment and figure out how everything works. 32 | 33 | ## Building 34 | 35 | The documentation for this is a work in progress. It's currently in between me creating the resources manually and building the assets for the AWS SAR, 36 | and most of the docs will eventually be automated away. 37 | 38 | ### Requirements 39 | 40 | - Docker 41 | - Python 3.7 42 | 43 | ### Lambda layer 44 | 45 | First you need to build the Lambda layer. There are two Dockerfiles and build scripts in the `lambda-layer/` directory. 46 | 47 | We'll execute each of the build scripts and copy the results to the target directory. This is referenced by the SAR template, [`athena-sqlite.yaml`](athena-sqlite.yaml). 48 | 49 | ``` 50 | cd lambda-layer 51 | ./build.sh 52 | ./build-pyarrow.sh 53 | cp -R layer/ ../target/ 54 | ``` 55 | 56 | ### Upload sample data 57 | 58 | For the purpose of this test, we just have a sample SQLite database you can upload. 59 | 60 | `aws s3 cp sample-data/sample_data.sqlite s3://<bucket>/<prefix>/` 61 | 62 | Feel free to upload your own SQLite databases as well! 63 | 64 | ### Lambda function 65 | 66 | There are three components to the Lambda code: 67 | 68 | - `vfs.py` - A SQLite Virtual File System implementation for S3 69 | - `s3qlite.py` - The actual Lambda function that handles Athena metadata/data requests 70 | - `sqlite_db.py` - Helper functions for accessing SQLite databases on S3 71 | 72 | Create a function with the code in [lambda-function/s3qlite.py](lambda-function/s3qlite.py) that uses the previously created layer. 73 | The handler will be `s3qlite.lambda_handler`. 74 | Also include the `vfs.py` and `sqlite_db.py` files in your Lambda function. 75 | 76 | Configure two environment variables for your Lambda function: 77 | - `TARGET_BUCKET` - The name of your S3 bucket where SQLite files live 78 | - `TARGET_PREFIX` - The prefix (e.g. `data/sqlite`) that you uploaded the sample SQLite database to 79 | 80 | Note that the IAM role you associate the function with will also need `s3:GetObject` and `s3:ListBucket` access to wherever your lovely SQLite databases are stored. 81 | 82 | ### Configure Athena 83 | 84 | Follow the Athena documentation for [Connecting to a data source](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html). 
85 | The primary thing to note here is that you need to create a workgroup named `AmazonAthenaPreviewFunctionality` and use that for your testing. 86 | Some functionality will work in the primary workgroup, but you'll get weird errors when you try to query data. 87 | 88 | I named my function `s3qlite` :) 89 | 90 | ### Run queries! 91 | 92 | Here are a couple of basic queries that should work: 93 | 94 | ```sql 95 | SELECT * FROM "s3qlite"."sample_data"."records" limit 10; 96 | 97 | SELECT COUNT(*) FROM "s3qlite"."sample_data"."records"; 98 | ``` 99 | 100 | If you deploy the SAR app, the data catalog isn't registered automatically, but you can still run queries by using the special `lambda:` schema: 101 | 102 | ```sql 103 | SELECT * FROM "lambda:s3qlite".sample_data.records LIMIT 10; 104 | ``` 105 | 106 | Where `s3qlite` is the value you provided for the `AthenaCatalogName` parameter. 107 | 108 | ## TODO 109 | 110 | - Move these into issues :) 111 | - Move vfs.py into its own module 112 | - Maybe add write support to it someday :scream: 113 | - Publish to SAR 114 | - Add tests...always tests 115 | - struct types, probably 116 | - Don't read the entire file every time :) 117 | - Escape column names with invalid characters 118 | - Implement recursive listing 119 | 120 | ## Serverless App Repo 121 | 122 | These are mostly notes I made while figuring out how to get SAR working. 123 | 124 | You need to grant SAR access to the bucket: 125 | 126 | ```shell 127 | aws s3api put-bucket-policy --bucket <bucket> --region us-east-1 --policy '{ 128 | "Version": "2012-10-17", 129 | "Statement": [ 130 | { 131 | "Effect": "Allow", 132 | "Principal": { 133 | "Service": "serverlessrepo.amazonaws.com" 134 | }, 135 | "Action": "s3:GetObject", 136 | "Resource": "arn:aws:s3:::<bucket>/*" 137 | } 138 | ] 139 | }' 140 | ``` 141 | 142 | For publishing to the SAR, we just execute two commands: 143 | 144 | ```shell 145 | sam package --template-file athena-sqlite.yaml --s3-bucket <bucket> --output-template-file target/out.yaml 146 | sam publish --template target/out.yaml --region us-east-1 147 | ``` 148 | 149 | If you want to deploy using CloudFormation, use this command: 150 | 151 | ```shell 152 | sam deploy --template-file ./target/out.yaml --stack-name athena-sqlite --capabilities CAPABILITY_IAM --parameter-overrides 'DataBucket=<bucket> DataPrefix=tmp/sqlite' --region us-east-1 153 | ``` 154 | -------------------------------------------------------------------------------- /athena-sqlite.yaml: -------------------------------------------------------------------------------- 1 | Transform: 'AWS::Serverless-2016-10-31' 2 | 3 | Metadata: 4 | AWS::ServerlessRepo::Application: 5 | Name: AthenaSQLITEConnector 6 | Description: Use Amazon Athena to query SQLite databases on Amazon S3 7 | Author: 'Damon Cortesi' 8 | SpdxLicenseId: Apache-2.0 9 | LicenseUrl: LICENSE 10 | ReadmeUrl: README.md 11 | Labels: ['athena-federation', 'sqlite'] 12 | HomePageUrl: https://github.com/dacort/athena-sqlite 13 | SemanticVersion: 0.0.3 14 | SourceCodeUrl: https://github.com/dacort/athena-sqlite 15 | 16 | Parameters: 17 | AthenaCatalogName: 18 | Description: "The name you will give to this catalog in Athena will also be used as your Lambda function name." 19 | Type: String 20 | Default: s3qlite 21 | DataBucket: 22 | Description: "The bucket where this tutorial's data lives." 23 | Type: String 24 | DataPrefix: 25 | Description: "The prefix where SQLite databases are stored." 26 | Type: String 27 | LambdaTimeout: 28 | Description: "Maximum Lambda invocation runtime in seconds. 
(min 1 - 900 max)" 29 | Default: 300 30 | Type: Number 31 | LambdaMemory: 32 | Description: "Lambda memory in MB (min 128 - 3008 max)." 33 | Default: 512 34 | Type: Number 35 | 36 | Resources: 37 | ConnectorConfig: 38 | Type: 'AWS::Serverless::Function' 39 | Properties: 40 | Environment: 41 | Variables: 42 | TARGET_BUCKET: !Ref DataBucket 43 | TARGET_PREFIX: !Ref DataPrefix 44 | FunctionName: !Sub "${AthenaCatalogName}" 45 | Handler: "s3qlite.lambda_handler" 46 | Layers: 47 | - !Ref SQLiteDependencyLayer 48 | CodeUri: "./target/athena-sqlite-0.0.1.zip" 49 | Description: "Use Amazon Athena to query SQLite databases on Amazon S3." 50 | Runtime: python3.7 51 | Timeout: !Ref LambdaTimeout 52 | MemorySize: !Ref LambdaMemory 53 | Policies: 54 | - S3ReadPolicy: 55 | BucketName: !Ref DataBucket 56 | 57 | SQLiteDependencyLayer: 58 | Type: 'AWS::Serverless::LayerVersion' 59 | Properties: 60 | LayerName: sam-app-dependencies 61 | Description: Dependencies for sam app athena-sqlite 62 | ContentUri: "./target/layer/" 63 | CompatibleRuntimes: 64 | - python3.7 65 | LicenseInfo: 'MIT' 66 | RetentionPolicy: Retain 67 | -------------------------------------------------------------------------------- /lambda-function/s3qlite.py: -------------------------------------------------------------------------------- 1 | import pyarrow as pa 2 | import apsw 3 | import boto3 4 | from vfs import S3VFS, S3VFSFile 5 | from sqlite_db import SQLiteDB 6 | 7 | import os 8 | from uuid import uuid4 9 | import base64 10 | 11 | S3_BUCKET = os.environ['TARGET_BUCKET'] 12 | S3_PREFIX = os.environ['TARGET_PREFIX'].rstrip('/') # Ensure that the prefix does *not* have a slash at the end 13 | 14 | S3_CLIENT = boto3.client('s3') 15 | S3FS = S3VFS() 16 | 17 | # https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-sdk/src/main/java/com/amazonaws/athena/connector/lambda/handlers/FederationCapabilities.java#L33 18 | CAPABILITIES = 23 19 | 20 | 21 | class ListSchemasRequest: 22 | """List sqlite files in the defined prefix, do not recurse""" 23 | def execute(self, event): 24 | return { 25 | "@type": "ListSchemasResponse", 26 | "catalogName": event['catalogName'], 27 | "schemas": self._list_sqlite_objects(), 28 | "requestType": "LIST_SCHEMAS" 29 | } 30 | 31 | def _list_sqlite_objects(self): 32 | # We don't yet support recursive listing - everything must be in the prefix 33 | params = { 34 | 'Bucket': S3_BUCKET, 35 | 'Prefix': S3_PREFIX + '/', 36 | 'Delimiter': '/' 37 | } 38 | sqlite_filenames = [] 39 | while True: 40 | response = S3_CLIENT.list_objects_v2(**params) 41 | for data in response.get('Contents', []): 42 | sqlite_basename = data['Key'].replace(S3_PREFIX + '/', '').replace('.sqlite', '') 43 | sqlite_filenames.append(sqlite_basename) 44 | if 'NextContinuationToken' in response: 45 | params['ContinuationToken'] = response['NextContinuationToken'] 46 | else: 47 | break 48 | return sqlite_filenames 49 | 50 | 51 | class ListTablesRequest: 52 | """Given a sqlite schema (filename), return the tables of the database""" 53 | def execute(self, event): 54 | sqlite_dbname = event.get('schemaName') 55 | 56 | return { 57 | "@type": "ListTablesResponse", 58 | "catalogName": event['catalogName'], 59 | "tables": self._fetch_table_list(sqlite_dbname), 60 | "requestType": "LIST_TABLES" 61 | } 62 | 63 | def _fetch_table_list(self, sqlite_dbname): 64 | tables = [] 65 | s3db = SQLiteDB(S3_BUCKET, S3_PREFIX, sqlite_dbname) 66 | for row in s3db.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"): 67 | 
print("Found table: ", row[0]) 68 | tables.append({'schemaName': sqlite_dbname, 'tableName': row[0]}) 69 | return tables 70 | 71 | 72 | class GetTableRequest: 73 | """Given a SQLite schema (filename) and table, return the schema""" 74 | def execute(self, event): 75 | databaseName = event['tableName']['schemaName'] 76 | tableName = event['tableName']['tableName'] 77 | columns = self._fetch_schema_for_table(databaseName, tableName) 78 | schema = self._build_pyarrow_schema(columns) 79 | print({ 80 | "@type": "GetTableResponse", 81 | "catalogName": event['catalogName'], 82 | "tableName": {'schemaName': databaseName, 'tableName': tableName}, 83 | "schema": {"schema": base64.b64encode(schema.serialize().slice(4)).decode("utf-8")}, 84 | "partitionColumns": [], 85 | "requestType": "GET_TABLE" 86 | }) 87 | return { 88 | "@type": "GetTableResponse", 89 | "catalogName": event['catalogName'], 90 | "tableName": {'schemaName': databaseName, 'tableName': tableName}, 91 | "schema": {"schema": base64.b64encode(schema.serialize().slice(4)).decode("utf-8")}, 92 | "partitionColumns": [], 93 | "requestType": "GET_TABLE" 94 | } 95 | 96 | def _fetch_schema_for_table(self, databaseName, tableName): 97 | columns = [] 98 | s3db = SQLiteDB(S3_BUCKET, S3_PREFIX, databaseName) 99 | for row in s3db.execute("SELECT cid, name, type FROM pragma_table_info('{}')".format(tableName)): 100 | columns.append([row[1], row[2]]) 101 | 102 | return columns 103 | 104 | def _build_pyarrow_schema(self, columns): 105 | """Return a pyarrow schema based on the SQLite data types, but for now ... everything is a string :)""" 106 | return pa.schema( 107 | [(col[0], pa.string()) for col in columns] 108 | ) 109 | 110 | 111 | class ReadRecordsRequest: 112 | def execute(self, event): 113 | schema = self._parse_schema(event['schema']['schema']) 114 | records = {k: [] for k in schema.names} 115 | sqlite_dbname = event['tableName']['schemaName'] 116 | sqlite_tablename = event['tableName']['tableName'] 117 | s3db = SQLiteDB(S3_BUCKET, S3_PREFIX, sqlite_dbname) 118 | 119 | # TODO: How to select field names? 
120 | for row in s3db.execute("SELECT {} FROM {}".format(','.join(schema.names), sqlite_tablename)): 121 | for i, name in enumerate(schema.names): 122 | records[name].append(str(row[i])) 123 | 124 | pa_records = pa.RecordBatch.from_arrays([pa.array(records[name]) for name in schema.names], schema=schema) 125 | return { 126 | "@type": "ReadRecordsResponse", 127 | "catalogName": event['catalogName'], 128 | "records": { 129 | "aId": str(uuid4()), 130 | "schema": base64.b64encode(schema.serialize().slice(4)).decode("utf-8"), 131 | "records": base64.b64encode(pa_records.serialize().slice(4)).decode("utf-8") 132 | }, 133 | "requestType": "READ_RECORDS" 134 | } 135 | 136 | def _parse_schema(self, encoded_schema): 137 | return pa.read_schema(pa.BufferReader(base64.b64decode(encoded_schema))) 138 | 139 | class PingRequest: 140 | """Simple ping request that just returns some metadata""" 141 | def execute(self, event): 142 | return { 143 | "@type": "PingResponse", 144 | "catalogName": event['catalogName'], 145 | "queryId": event['queryId'], 146 | "sourceType": "sqlite", 147 | "capabilities": CAPABILITIES 148 | } 149 | 150 | 151 | def lambda_handler(event, context): 152 | print(event) 153 | # request_type = event['requestType'] 154 | request_type = event['@type'] 155 | if request_type == 'ListSchemasRequest': 156 | return ListSchemasRequest().execute(event) 157 | elif request_type == 'ListTablesRequest': 158 | return ListTablesRequest().execute(event) 159 | elif request_type == 'GetTableRequest': 160 | return GetTableRequest().execute(event) 161 | elif request_type == 'PingRequest': 162 | return PingRequest().execute(event) 163 | elif request_type == 'GetTableLayoutRequest': 164 | databaseName = event['tableName']['schemaName'] 165 | tableName = event['tableName']['tableName'] 166 | # If the data is partitioned, this sends back the partition schema 167 | # Block schema is defined in BlockSerializer in the Athena Federation SDK 168 | block = { 169 | 'aId': str(uuid4()), 170 | 'schema': base64.b64encode(pa.schema({}).serialize().slice(4)).decode("utf-8"), 171 | 'records': base64.b64encode(pa.RecordBatch.from_arrays([]).serialize().slice(4)).decode("utf-8") 172 | } 173 | # Unsure how to do this with an "empty" block. 
174 | # Used this response from the cloudwatch example and it worked: 175 | # >>> schema 176 | # partitionId: int32 177 | # metadata 178 | # -------- 179 | # {} 180 | # 181 | # >>> batch.columns 182 | # [ 183 | # [ 184 | # 1 185 | # ]] 186 | cloudwatch = { 187 | "aId": str(uuid4()), 188 | "schema": "nAAAABAAAAAAAAoADgAGAA0ACAAKAAAAAAADABAAAAAAAQoADAAAAAgABAAKAAAACAAAAAgAAAAAAAAAAQAAABgAAAAAABIAGAAUABMAEgAMAAAACAAEABIAAAAUAAAAFAAAABwAAAAAAAIBIAAAAAAAAAAAAAAACAAMAAgABwAIAAAAAAAAASAAAAALAAAAcGFydGl0aW9uSWQAAAAAAA==", 189 | "records": "jAAAABQAAAAAAAAADAAWAA4AFQAQAAQADAAAABAAAAAAAAAAAAADABAAAAAAAwoAGAAMAAgABAAKAAAAFAAAADgAAAABAAAAAAAAAAAAAAACAAAAAAAAAAAAAAABAAAAAAAAAAgAAAAAAAAABAAAAAAAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAAAA==" 190 | } 191 | # Let's use cloudwatch for now, that gets me to GetSplitsRequest 192 | return { 193 | "@type": "GetTableLayoutResponse", 194 | "catalogName": event['catalogName'], 195 | "tableName": {'schemaName': databaseName, 'tableName': tableName}, 196 | "partitions": cloudwatch, 197 | "requestType": "GET_TABLE_LAYOUT" 198 | } 199 | elif request_type == 'GetSplitsRequest': 200 | # The splits don't matter to Athena, it's mostly hints to pass on to ReadRecordsRequest 201 | return { 202 | "@type": "GetSplitsResponse", 203 | "catalogName": event['catalogName'], 204 | "splits": [ 205 | { 206 | "spillLocation": { 207 | "@type": "S3SpillLocation", 208 | "bucket": S3_BUCKET, 209 | "key": "athena-spill/7b2b96c9-1be5-4810-ac2a-163f754e132c/1a50edb8-c4c7-41d7-8a0d-1ce8e510755f", 210 | "directory": True 211 | }, 212 | "properties": {} 213 | } 214 | ], 215 | "continuationToken": None, 216 | "requestType": "GET_SPLITS" 217 | } 218 | elif request_type == 'ReadRecordsRequest': 219 | return ReadRecordsRequest().execute(event) 220 | -------------------------------------------------------------------------------- /lambda-function/sqlite_db.py: -------------------------------------------------------------------------------- 1 | import apsw 2 | from vfs import S3VFS, S3VFSFile 3 | 4 | S3FS = S3VFS() 5 | 6 | 7 | class SQLiteDB: 8 | def __init__(self, bucket, prefix, database_name): 9 | self.bucket = bucket 10 | self.prefix = prefix 11 | self.database_name = database_name 12 | 13 | self.connection = self._build_connection() 14 | self.cursor = self.connection.cursor() 15 | 16 | def execute(self, query): 17 | return self.cursor.execute(query) 18 | 19 | def _build_connection(self): 20 | return apsw.Connection(self._build_sqlite_s3_uri(), 21 | flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI, 22 | vfs=S3FS.vfsname) 23 | 24 | def _build_sqlite_s3_uri(self): 25 | """Build a SQLite-compatible URI 26 | 27 | If this wasn't a `file:` identifier, SQLite would throw an error. 28 | We also include the `immutable` flag or SQLite will try to create a `-journal` file and fail. 
29 | """ 30 | return "file:/{}/{}.sqlite?bucket={}&immutable=1".format(self.prefix, self.database_name, self.bucket) -------------------------------------------------------------------------------- /lambda-function/vfs.py: -------------------------------------------------------------------------------- 1 | import apsw 2 | import sys 3 | import boto3 4 | 5 | VFS_S3_CLIENT = boto3.client('s3') 6 | 7 | 8 | class S3VFS(apsw.VFS): 9 | def __init__(self, vfsname="s3", basevfs=""): 10 | self.vfsname=vfsname 11 | self.basevfs=basevfs 12 | apsw.VFS.__init__(self, self.vfsname, self.basevfs) 13 | 14 | def xOpen(self, name, flags): 15 | return S3VFSFile(self.basevfs, name, flags) 16 | 17 | class S3VFSFile(): 18 | def __init__(self, inheritfromvfsname, filename, flags): 19 | self.bucket = filename.uri_parameter("bucket") 20 | self.key = filename.filename().lstrip("/") 21 | print("Initiated S3 VFS for file: {}".format(self._get_s3_url())) 22 | 23 | def xRead(self, amount, offset): 24 | response = VFS_S3_CLIENT.get_object(Bucket=self.bucket, Key=self.key, Range='bytes={}-{}'.format(offset, offset + amount)) 25 | response_data = response['Body'].read() 26 | return response_data 27 | 28 | def xFileSize(self): 29 | client = boto3.client('s3') 30 | response = client.head_object( Bucket=self.bucket, Key=self.key) 31 | return response['ContentLength'] 32 | 33 | def xClose(self): 34 | pass 35 | 36 | def xFileControl(self, op, ptr): 37 | return False 38 | 39 | def _get_s3_url(self): 40 | return "s3://{}/{}".format(self.bucket, self.key) -------------------------------------------------------------------------------- /lambda-layer/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM lambci/lambda:build-python3.7 2 | 3 | ENV PYTHON_VERSION=3.7.5 4 | RUN curl -LO https://github.com/rogerbinns/apsw/releases/download/3.30.1-r1/apsw-3.30.1-r1.zip 5 | RUN unzip apsw-3.30.1-r1.zip 6 | WORKDIR apsw-3.30.1-r1 7 | RUN python setup.py fetch --all build --enable-all-extensions install 8 | # set workdir back 9 | WORKDIR /var/task 10 | -------------------------------------------------------------------------------- /lambda-layer/Dockerfile.pyarrow: -------------------------------------------------------------------------------- 1 | FROM lambci/lambda:build-python3.7 2 | 3 | ENV PYTHON_VERSION=3.7.5 4 | ENV APACHE_ARROW_VERSION=0.15.1 5 | 6 | # Clone desired Arrow version 7 | RUN git clone \ 8 | --branch apache-arrow-$APACHE_ARROW_VERSION \ 9 | --single-branch \ 10 | https://github.com/apache/arrow.git 11 | 12 | # Install dependencies 13 | RUN yum install -y \ 14 | autoconf \ 15 | bison \ 16 | boost-devel \ 17 | flex \ 18 | jemalloc-devel \ 19 | python36-devel 20 | RUN pip install --upgrade six numpy pandas cython pytest cmake wheel 21 | 22 | # Build Arrow 23 | ENV ARROW_HOME=/var/task/dist 24 | ENV LD_LIBRARY_PATH=/var/task/dist/lib:$LD_LIBRARY_PATH 25 | RUN mkdir dist 26 | RUN mkdir arrow/cpp/build 27 | WORKDIR arrow/cpp/build 28 | RUN cmake \ 29 | -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ 30 | -DCMAKE_INSTALL_LIBDIR=lib \ 31 | -DARROW_FLIGHT=OFF \ 32 | -DARROW_GANDIVA=OFF \ 33 | -DARROW_ORC=OFF \ 34 | -DARROW_PARQUET=ON \ 35 | -DARROW_PYTHON=ON \ 36 | -DARROW_PLASMA=OFF \ 37 | -DARROW_BUILD_TESTS=ON \ 38 | .. 
39 | RUN make -j4 40 | RUN make install 41 | WORKDIR /var/task 42 | # Done building Arrow 43 | 44 | # Build Pyarrow 45 | ENV ARROW_PRE_0_15_IPC_FORMAT=1 46 | ENV PYARROW_WITH_FLIGHT=0 47 | ENV PYARROW_WITH_GANDIVA=0 48 | ENV PYARROW_WITH_ORC=0 49 | ENV PYARROW_WITH_PARQUET=1 50 | WORKDIR arrow/python 51 | RUN python setup.py build_ext \ 52 | --build-type=release \ 53 | --bundle-arrow-cpp \ 54 | bdist_wheel 55 | RUN cp dist/pyarrow-*.whl /var/task 56 | WORKDIR /var/task 57 | # Done building PyArrow 58 | 59 | # Extracting files 60 | RUN pip install pyarrow-*whl -t python/ 61 | RUN zip -r9 "pyarrow_lite.zip" ./python 62 | 63 | # Building 64 | RUN ls -alF 65 | -------------------------------------------------------------------------------- /lambda-layer/build-pyarrow.sh: -------------------------------------------------------------------------------- 1 | docker build -t py37-pyarrow-builder -f Dockerfile.pyarrow . 2 | CONTAINER=$(docker run -d py37-pyarrow-builder false) 3 | docker cp \ 4 | $CONTAINER:/var/task/pyarrow_lite.zip \ 5 | layer/. 6 | pushd layer 7 | unzip pyarrow_lite.zip 8 | rm pyarrow_lite.zip 9 | popd 10 | docker rm $CONTAINER -------------------------------------------------------------------------------- /lambda-layer/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -x 2 | 3 | set -e 4 | 5 | rm -rf layer && mkdir -p layer/python 6 | docker build -t py37-apsw-builder -f Dockerfile . 7 | CONTAINER=$(docker run -d py37-apsw-builder false) 8 | docker cp \ 9 | $CONTAINER:/var/lang/lib/python3.7/site-packages/apsw-3.30.1.post1-py3.7-linux-x86_64.egg/apsw.cpython-37m-x86_64-linux-gnu.so \ 10 | layer/python/. 11 | docker cp \ 12 | $CONTAINER:/var/lang/lib/python3.7/site-packages/apsw-3.30.1.post1-py3.7-linux-x86_64.egg/apsw.py \ 13 | layer/python/. 14 | docker rm $CONTAINER -------------------------------------------------------------------------------- /sample-data/sample_data.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dacort/athena-sqlite/eff71802b31c4fc0eb708e7c9930dc94d7533222/sample-data/sample_data.sqlite --------------------------------------------------------------------------------
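
If you want to test with your own data (per the "Upload sample data" section in the README), here's a minimal sketch for generating a small SQLite file locally before copying it to S3. The `records` table name matches the sample queries above, but the column names and values are made up for illustration and aren't necessarily the schema of the bundled `sample_data.sqlite`.

```python
# Hypothetical helper for generating a small SQLite database to upload for testing.
# The columns below are assumptions for illustration; the bundled sample_data.sqlite
# defines its own schema (it does contain a `records` table).
import sqlite3

def build_test_db(path="my_test_data.sqlite"):
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, name TEXT, value TEXT)")
        conn.executemany(
            "INSERT INTO records (id, name, value) VALUES (?, ?, ?)",
            [(1, "alpha", "a"), (2, "beta", "b"), (3, "gamma", "c")],
        )
    conn.close()

if __name__ == "__main__":
    build_test_db()
    # Then upload it next to the sample database, e.g.:
    #   aws s3 cp my_test_data.sqlite s3://<bucket>/<prefix>/
```

The connector strips the `.sqlite` suffix when listing schemas, so the file above would show up in Athena as a schema named `my_test_data`.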
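Since the connector is just a plain Python handler that dispatches on the event's `@type`, you can also sanity-check it outside of Athena by calling `lambda_handler` directly. This is only a sketch: it assumes you run it from the `lambda-function/` directory with `apsw`, `pyarrow`, and `boto3` importable locally, and the bucket/prefix values are placeholders you'd swap for real ones. `PingRequest` and `ListSchemasRequest` are cheap calls that exercise the metadata and S3 listing paths without pulling table data.

```python
# Minimal local smoke test for the connector handler (a sketch, not part of the repo).
# Assumes apsw, pyarrow, and boto3 are importable locally and that
# TARGET_BUCKET / TARGET_PREFIX point at S3 locations your credentials can read.
import os

# Set the env vars before importing, since s3qlite reads them at module load time.
os.environ.setdefault("TARGET_BUCKET", "<bucket>")
os.environ.setdefault("TARGET_PREFIX", "data/sqlite")

from s3qlite import lambda_handler  # from lambda-function/

# PingRequest only echoes metadata back, so it's a cheap first check.
print(lambda_handler({
    "@type": "PingRequest",
    "catalogName": "s3qlite",
    "queryId": "local-test",
}, None))

# ListSchemasRequest actually lists the .sqlite objects under the prefix.
print(lambda_handler({
    "@type": "ListSchemasRequest",
    "catalogName": "s3qlite",
}, None))
```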