├── .gitignore ├── LICENSE ├── README.md ├── create_table_from_sheet.py ├── db.json.example └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | /env/ 2 | db.json 3 | service-account.json 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2019, Roman Health Ventures Inc. 2 | 3 | Permission to use, copy, modify, and/or distribute this software for any purpose 4 | with or without fee is hereby granted, provided that the above copyright notice 5 | and this permission notice appear in all copies. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH 8 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND 9 | FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, 10 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS 11 | OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER 12 | TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF 13 | THIS SOFTWARE. 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Import Semi-Structured Data from Google Sheets to Snowflake 2 | 3 | ## Requirements 4 | 5 | - [`snowflake-connector-python`](https://pypi.org/project/snowflake-connector-python/) 6 | - [`python-dateutil`](https://pypi.org/project/python-dateutil/) 7 | - [`pygsheets`](https://pypi.org/project/pygsheets/) 8 | 9 | ## Background 10 | 11 | Suppose you have a Google Sheet, accessible via a [service account][], that 12 | looks something like this: 13 | 14 | | id | code | date | cost | 15 | | -- | ---- | ---------- | ------ | 16 | | 1 | abc | 03/01/2019 | 100.00 | 17 | | 2 | xyz | 04/01/2019 | 200.00 | 18 | 19 | And you want to import this data into your data warehouse on a regular basis. 20 | Further suppose that this sheet's structure is likely to change frequently, with 21 | fields being added and removed, so you don't want to use a rigid schema. 22 | 23 | On Snowflake, one possibility is to take advantage of [`variant`][variant], a 24 | [semi-structured data type][semi-structured] data type. 25 | 26 | [service account]: https://cloud.google.com/iam/docs/service-accounts 27 | [variant]: https://docs.snowflake.net/manuals/sql-reference/data-types-semistructured.html 28 | [semi-structured]: https://docs.snowflake.net/manuals/user-guide/semistructured-concepts.html 29 | 30 | Using this module, the result of importing the sheet above would look like this: 31 | 32 | | source | imported_at | data | 33 | | -----------------| ----------- | -------------------------------------------------------------- | 34 | | [worksheet name] | [timestamp] | {"id": 1, "code": "abc", "date": "2019-03-01", "cost": 100.00} | 35 | | [worksheet name] | [timestamp] | {"id": 2, "code": "xyz", "date": "2019-04-01", "cost": 200.00} | 36 | 37 | ## Usage 38 | 39 | To get set up: 40 | 41 | - Copy `db.json.example` to `db.json` 42 | - Edit `db.json` to contain your Snowflake connection information and 43 | credentials 44 | - Download your Google service account file 45 | - Find the ID of a sheet you'd like to import (and to which your service account 46 | has access) 47 | 48 | You can then invoke this script: 49 | 50 | python create_table_from_sheet.py 51 | --schema [destination_schema] --table [destination_table] 52 | --sheet [sheet_id] 53 | --service-account-file [path_to_service_file] 54 | --db-config [path_to_db_config_file] 55 | 56 | If omitted, `./service-account.json` and `./db.json` are used as the default 57 | values for the service account file and DB config file respectively. 58 | 59 | However, this will import `id`, `date`, and `cost` as strings containing the 60 | contents reflected in the sheet. You can use `--coercions` to specify that they 61 | should be interpreted specially: 62 | 63 | python create_table_from_sheet.py 64 | # ... same as above ... 65 | --coercions '{"id": "int", "date": "date", "cost": "float"}' 66 | 67 | This says that the column `id` should be interpreted as an integer, `date` as a 68 | date, and `cost` as a float. 69 | 70 | By default, the first worksheet is imported, but you can specify a worksheet by 71 | name with the `--worksheet` argument. 72 | 73 | There are also options `--verbose` (which will print the SQL generated) and 74 | `--dry-run` (which will read the sheet and generate the SQL, but not execute it). 75 | 76 | ## Limitations 77 | 78 | This script replaces the full table in the database every time it is run, so if 79 | historical information is removed from the sheet, it will be removed from the 80 | database too. 81 | -------------------------------------------------------------------------------- /create_table_from_sheet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from __future__ import print_function, unicode_literals 4 | 5 | import os 6 | import io 7 | import re 8 | import sys 9 | import json 10 | 11 | import dateutil.parser 12 | import snowflake.connector 13 | import pygsheets 14 | 15 | DEFAULT_DB_CONFIG_FILENAME = os.path.abspath('db.json') 16 | DEFAULT_SERVICE_ACCOUNT_FILE = os.path.abspath('service-account.json') 17 | 18 | DEFAULT_SCOPES = [ 19 | 'https://www.googleapis.com/auth/drive', 20 | 'https://www.googleapis.com/auth/spreadsheets' 21 | ] 22 | 23 | 24 | def read_db_config(filename=None): 25 | """Read and return a JSON object from `filename`.""" 26 | filename = filename or DEFAULT_DB_CONFIG_FILENAME 27 | with open(filename, 'r') as infile: 28 | return json.load(infile) 29 | 30 | 31 | def chop_at_blank(row): 32 | """Chop `row` off at its first empty element.""" 33 | result = [] 34 | for item in row: 35 | if item == '': 36 | break 37 | result.append(item) 38 | return result 39 | 40 | 41 | def drop_empty_rows(rows): 42 | """Return `rows` with all empty rows removed.""" 43 | return [row for row in rows if any(val.strip() for val in row)] 44 | 45 | 46 | def _read_worksheet(sheet_id, worksheet_id=None, service_account_file=None, 47 | scopes=None): 48 | service_account_file = service_account_file or DEFAULT_SERVICE_ACCOUNT_FILE 49 | scopes = scopes or DEFAULT_SCOPES 50 | api = pygsheets.authorize(service_account_file=service_account_file, 51 | scopes=scopes) 52 | sheet = api.open_by_key(sheet_id) 53 | worksheet_id = worksheet_id or 0 54 | if isinstance(worksheet_id, int): 55 | worksheet = sheet[worksheet_id] 56 | elif isinstance(worksheet_id, str): 57 | worksheet = sheet.worksheet_by_title(worksheet_id) 58 | else: 59 | raise Exception('Invalid ID for worksheet: {!r}'.format(worksheet_id)) 60 | title = worksheet.title 61 | rows = list(worksheet) 62 | headers = chop_at_blank(rows[0]) 63 | data = drop_empty_rows(rows[1:]) 64 | return {'title': title, 'headers': headers, 'data': data} 65 | 66 | 67 | def headers_to_keys(headers): 68 | """Convert row headers to object keys.""" 69 | regex = re.compile(r'[^a-z0-9_]+') 70 | return [regex.sub('_', header.lower()) for header in headers] 71 | 72 | 73 | def apply_coercions_1(obj, coercions): 74 | """Return `obj` with `coercions` applied.""" 75 | result = {} 76 | for key, val in obj.items(): 77 | target = coercions.get(key) 78 | if target in ('int', 'integer'): 79 | val = re.sub(r'[,$]', '', val) 80 | val = int(val) if val else None 81 | elif target == 'float': 82 | val = re.sub(r'[,$]', '', val) 83 | val = float(val) if val else None 84 | elif target == 'date': 85 | val = dateutil.parser.parse(val) if val.strip() else None 86 | val = val.strftime('%Y-%m-%d') 87 | elif target in ('datetime', 'timestamp'): 88 | val = dateutil.parser.parse(val) if val.strip() else None 89 | val = val.strftime('%Y-%m-%d %H:%M:%S') 90 | elif target is not None: 91 | print('Unknown coercion target {!r}'.format(target), 92 | file=sys.stderr) 93 | result[key] = val 94 | return result 95 | 96 | 97 | def apply_coercions(data, coercions): 98 | """Return `data` with `coercions` applied to each object.""" 99 | return [apply_coercions_1(obj, coercions) for obj in data] 100 | 101 | 102 | def read_worksheet(sheet_id, worksheet_id=None, coercions=None, 103 | service_account_file=None, scopes=None): 104 | """Read a worksheet and return a dict. 105 | 106 | The dict will have two keys: `title` (the title of the worksheet) and 107 | `data` (a list of dicts, one for each row, mapping column names to values). 108 | 109 | The `sheet_id` should be the ID as used by Google Sheets, not the title. 110 | The `worksheet_id` can be either an integer (the ordinal position of the 111 | worksheet) or a string (its title). 112 | """ 113 | objects = [] 114 | payload = _read_worksheet(sheet_id, worksheet_id=worksheet_id, 115 | service_account_file=service_account_file, 116 | scopes=scopes) 117 | headers = payload['headers'] 118 | keys = headers_to_keys(headers) 119 | for row in payload['data']: 120 | objects.append(dict(zip(keys, row))) 121 | if coercions: 122 | objects = apply_coercions(objects, coercions) 123 | return {'title': payload['title'], 'data': objects} 124 | 125 | 126 | def build_create_table(schema, table): 127 | """Return the CREATE TABLE statement as a string.""" 128 | return """CREATE OR REPLACE TABLE {}.{} ( 129 | source string, 130 | imported_at timestamp_tz, 131 | data variant 132 | ); 133 | """.format(schema, table) 134 | 135 | 136 | def build_insert_rows(schema, table, payload): 137 | """Return the INSERT INTO statement as a string.""" 138 | out = io.StringIO() 139 | 140 | out.write('INSERT INTO {}.{}\n'.format(schema, table)) 141 | out.write('SELECT column1, column2, parse_json(column3)\n') 142 | out.write('FROM VALUES\n') 143 | 144 | title = payload['title'] 145 | data = payload['data'] 146 | count = len(data) 147 | for i, obj in enumerate(data): 148 | out.write("('{}', current_timestamp, '{}')".format( 149 | title, json.dumps(obj) 150 | )) 151 | if i != count - 1: 152 | out.write(',') 153 | out.write('\n') 154 | 155 | return out.getvalue() 156 | 157 | 158 | def load_sheet(schema, table, sheet_id, worksheet=None, coercions=None, 159 | service_account_file=None, config_file=None, 160 | verbose=False, dry_run=False): 161 | """Load ``schema.table`` from `sheet_id`.""" 162 | if isinstance(coercions, str): 163 | coercions = json.loads(coercions) 164 | config_file = config_file or DEFAULT_DB_CONFIG_FILENAME 165 | config = read_db_config(config_file) 166 | payload = read_worksheet(sheet_id, worksheet_id=worksheet, 167 | service_account_file=service_account_file, 168 | coercions=coercions) 169 | create_table = build_create_table(schema, table) 170 | insert_rows = build_insert_rows(schema, table, payload) 171 | with snowflake.connector.connect(**config) as connection: 172 | cursor = connection.cursor() 173 | for statement in create_table, insert_rows: 174 | if verbose: 175 | print(statement) 176 | if not dry_run: 177 | cursor.execute(statement) 178 | 179 | 180 | if __name__ == '__main__': 181 | import argparse 182 | 183 | parser = argparse.ArgumentParser() 184 | parser.add_argument('--schema', required=True) 185 | parser.add_argument('--table', required=True) 186 | parser.add_argument('--sheet', required=True) 187 | parser.add_argument('--worksheet') 188 | parser.add_argument('--coercions') 189 | parser.add_argument('--db-config') 190 | parser.add_argument('--service-account-file') 191 | parser.add_argument('--verbose', action='store_true') 192 | parser.add_argument('--dry-run', action='store_true') 193 | args = parser.parse_args() 194 | 195 | load_sheet(args.schema, args.table, args.sheet, 196 | worksheet=args.worksheet, 197 | coercions=args.coercions, 198 | service_account_file=args.service_account_file, 199 | config_file=args.db_config, 200 | verbose=args.verbose, 201 | dry_run=args.dry_run) 202 | -------------------------------------------------------------------------------- /db.json.example: -------------------------------------------------------------------------------- 1 | { 2 | "user": "your_user", 3 | "password": "your_password", 4 | "account": "your_account", 5 | "database": "your_database", 6 | "schema": "your_schema", 7 | "warehouse": "your_warehouse" 8 | } 9 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | asn1crypto==0.24.0 2 | azure-common==1.1.18 3 | azure-storage-blob==1.5.0 4 | azure-storage-common==1.4.0 5 | boto3==1.9.130 6 | botocore==1.12.134 7 | cachetools==3.1.0 8 | certifi==2019.3.9 9 | cffi==1.12.2 10 | chardet==3.0.4 11 | cryptography==2.6.1 12 | docutils==0.14 13 | future==0.17.1 14 | google-api-python-client==1.7.8 15 | google-auth==1.6.3 16 | google-auth-httplib2==0.0.3 17 | google-auth-oauthlib==0.3.0 18 | httplib2==0.12.1 19 | idna==2.8 20 | ijson==2.3 21 | jmespath==0.9.4 22 | oauthlib==3.0.1 23 | pyasn1==0.4.5 24 | pyasn1-modules==0.2.4 25 | pycparser==2.19 26 | pycryptodomex==3.8.1 27 | pygsheets==2.0.1 28 | PyJWT==1.7.1 29 | pyOpenSSL==19.0.0 30 | python-dateutil==2.8.0 31 | pytz==2018.9 32 | requests==2.21.0 33 | requests-oauthlib==1.2.0 34 | rsa==4.0 35 | s3transfer==0.2.0 36 | six==1.12.0 37 | snowflake-connector-python==1.7.9 38 | uritemplate==3.0.0 39 | urllib3==1.24.2 40 | --------------------------------------------------------------------------------