├── .gitignore
├── LICENSE
├── README.md
├── create_table_from_sheet.py
├── db.json.example
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
1 | /env/
2 | db.json
3 | service-account.json
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright 2019, Roman Health Ventures Inc.
 2 | 
 3 | Permission to use, copy, modify, and/or distribute this software for any purpose
 4 | with or without fee is hereby granted, provided that the above copyright notice
 5 | and this permission notice appear in all copies.
 6 | 
 7 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
 8 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
 9 | FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
10 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
11 | OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
12 | TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
13 | THIS SOFTWARE.
14 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Import Semi-Structured Data from Google Sheets to Snowflake
 2 | 
 3 | ## Requirements
 4 | 
 5 | - [`snowflake-connector-python`](https://pypi.org/project/snowflake-connector-python/)
 6 | - [`python-dateutil`](https://pypi.org/project/python-dateutil/)
 7 | - [`pygsheets`](https://pypi.org/project/pygsheets/)
 8 | 
 9 | ## Background
10 | 
11 | Suppose you have a Google Sheet, accessible via a [service account][], that
12 | looks something like this:
13 | 
14 | | id | code | date       | cost   |
15 | | -- | ---- | ---------- | ------ |
16 | | 1  | abc  | 03/01/2019 | 100.00 |
17 | | 2  | xyz  | 04/01/2019 | 200.00 |
18 | 
19 | And you want to import this data into your data warehouse on a regular basis.
20 | Further suppose that this sheet's structure is likely to change frequently, with
21 | fields being added and removed, so you don't want to use a rigid schema.
22 | 
23 | On Snowflake, one possibility is to take advantage of [`variant`][variant], a
24 | [semi-structured data type][semi-structured] data type.
25 | 
26 | [service account]: https://cloud.google.com/iam/docs/service-accounts
27 | [variant]: https://docs.snowflake.net/manuals/sql-reference/data-types-semistructured.html
28 | [semi-structured]: https://docs.snowflake.net/manuals/user-guide/semistructured-concepts.html
29 | 
30 | Using this module, the result of importing the sheet above would look like this:
31 | 
32 | | source           | imported_at | data                                                           |
33 | | -----------------| ----------- | -------------------------------------------------------------- |
34 | | [worksheet name] | [timestamp] | {"id": 1, "code": "abc", "date": "2019-03-01", "cost": 100.00} |
35 | | [worksheet name] | [timestamp] | {"id": 2, "code": "xyz", "date": "2019-04-01", "cost": 200.00} |
36 | 
37 | ## Usage
38 | 
39 | To get set up:
40 | 
41 | - Copy `db.json.example` to `db.json`
42 | - Edit `db.json` to contain your Snowflake connection information and
43 |   credentials
44 | - Download your Google service account file
45 | - Find the ID of a sheet you'd like to import (and to which your service account
46 |   has access)
47 | 
48 | You can then invoke this script:
49 | 
50 |     python create_table_from_sheet.py
51 |         --schema [destination_schema] --table [destination_table]
52 |         --sheet [sheet_id]
53 |         --service-account-file [path_to_service_file]
54 |         --db-config [path_to_db_config_file]
55 | 
56 | If omitted, `./service-account.json` and `./db.json` are used as the default
57 | values for the service account file and DB config file respectively.
58 | 
59 | However, this will import `id`, `date`, and `cost` as strings containing the
60 | contents reflected in the sheet. You can use `--coercions` to specify that they
61 | should be interpreted specially:
62 | 
63 |     python create_table_from_sheet.py
64 |         # ... same as above ...
65 |         --coercions '{"id": "int", "date": "date", "cost": "float"}'
66 | 
67 | This says that the column `id` should be interpreted as an integer, `date` as a
68 | date, and `cost` as a float.
69 | 
70 | By default, the first worksheet is imported, but you can specify a worksheet by
71 | name with the `--worksheet` argument.
72 | 
73 | There are also options `--verbose` (which will print the SQL generated) and
74 | `--dry-run` (which will read the sheet and generate the SQL, but not execute it).
75 | 
76 | ## Limitations
77 | 
78 | This script replaces the full table in the database every time it is run, so if
79 | historical information is removed from the sheet, it will be removed from the
80 | database too.
81 | 


--------------------------------------------------------------------------------
/create_table_from_sheet.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | 
  3 | from __future__ import print_function, unicode_literals
  4 | 
  5 | import os
  6 | import io
  7 | import re
  8 | import sys
  9 | import json
 10 | 
 11 | import dateutil.parser
 12 | import snowflake.connector
 13 | import pygsheets
 14 | 
 15 | DEFAULT_DB_CONFIG_FILENAME = os.path.abspath('db.json')
 16 | DEFAULT_SERVICE_ACCOUNT_FILE = os.path.abspath('service-account.json')
 17 | 
 18 | DEFAULT_SCOPES = [
 19 |     'https://www.googleapis.com/auth/drive',
 20 |     'https://www.googleapis.com/auth/spreadsheets'
 21 | ]
 22 | 
 23 | 
 24 | def read_db_config(filename=None):
 25 |     """Read and return a JSON object from `filename`."""
 26 |     filename = filename or DEFAULT_DB_CONFIG_FILENAME
 27 |     with open(filename, 'r') as infile:
 28 |         return json.load(infile)
 29 | 
 30 | 
 31 | def chop_at_blank(row):
 32 |     """Chop `row` off at its first empty element."""
 33 |     result = []
 34 |     for item in row:
 35 |         if item == '':
 36 |             break
 37 |         result.append(item)
 38 |     return result
 39 | 
 40 | 
 41 | def drop_empty_rows(rows):
 42 |     """Return `rows` with all empty rows removed."""
 43 |     return [row for row in rows if any(val.strip() for val in row)]
 44 | 
 45 | 
 46 | def _read_worksheet(sheet_id, worksheet_id=None, service_account_file=None,
 47 |                     scopes=None):
 48 |     service_account_file = service_account_file or DEFAULT_SERVICE_ACCOUNT_FILE
 49 |     scopes = scopes or DEFAULT_SCOPES
 50 |     api = pygsheets.authorize(service_account_file=service_account_file,
 51 |                               scopes=scopes)
 52 |     sheet = api.open_by_key(sheet_id)
 53 |     worksheet_id = worksheet_id or 0
 54 |     if isinstance(worksheet_id, int):
 55 |         worksheet = sheet[worksheet_id]
 56 |     elif isinstance(worksheet_id, str):
 57 |         worksheet = sheet.worksheet_by_title(worksheet_id)
 58 |     else:
 59 |         raise Exception('Invalid ID for worksheet: {!r}'.format(worksheet_id))
 60 |     title = worksheet.title
 61 |     rows = list(worksheet)
 62 |     headers = chop_at_blank(rows[0])
 63 |     data = drop_empty_rows(rows[1:])
 64 |     return {'title': title, 'headers': headers, 'data': data}
 65 | 
 66 | 
 67 | def headers_to_keys(headers):
 68 |     """Convert row headers to object keys."""
 69 |     regex = re.compile(r'[^a-z0-9_]+')
 70 |     return [regex.sub('_', header.lower()) for header in headers]
 71 | 
 72 | 
 73 | def apply_coercions_1(obj, coercions):
 74 |     """Return `obj` with `coercions` applied."""
 75 |     result = {}
 76 |     for key, val in obj.items():
 77 |         target = coercions.get(key)
 78 |         if target in ('int', 'integer'):
 79 |             val = re.sub(r'[,$]', '', val)
 80 |             val = int(val) if val else None
 81 |         elif target == 'float':
 82 |             val = re.sub(r'[,$]', '', val)
 83 |             val = float(val) if val else None
 84 |         elif target == 'date':
 85 |             val = dateutil.parser.parse(val) if val.strip() else None
 86 |             val = val.strftime('%Y-%m-%d')
 87 |         elif target in ('datetime', 'timestamp'):
 88 |             val = dateutil.parser.parse(val) if val.strip() else None
 89 |             val = val.strftime('%Y-%m-%d %H:%M:%S')
 90 |         elif target is not None:
 91 |             print('Unknown coercion target {!r}'.format(target),
 92 |                   file=sys.stderr)
 93 |         result[key] = val
 94 |     return result
 95 | 
 96 | 
 97 | def apply_coercions(data, coercions):
 98 |     """Return `data` with `coercions` applied to each object."""
 99 |     return [apply_coercions_1(obj, coercions) for obj in data]
100 | 
101 | 
102 | def read_worksheet(sheet_id, worksheet_id=None, coercions=None,
103 |                    service_account_file=None, scopes=None):
104 |     """Read a worksheet and return a dict.
105 | 
106 |     The dict will have two keys: `title` (the title of the worksheet) and
107 |     `data` (a list of dicts, one for each row, mapping column names to values).
108 | 
109 |     The `sheet_id` should be the ID as used by Google Sheets, not the title.
110 |     The `worksheet_id` can be either an integer (the ordinal position of the
111 |     worksheet) or a string (its title).
112 |     """
113 |     objects = []
114 |     payload = _read_worksheet(sheet_id, worksheet_id=worksheet_id,
115 |                               service_account_file=service_account_file,
116 |                               scopes=scopes)
117 |     headers = payload['headers']
118 |     keys = headers_to_keys(headers)
119 |     for row in payload['data']:
120 |         objects.append(dict(zip(keys, row)))
121 |     if coercions:
122 |         objects = apply_coercions(objects, coercions)
123 |     return {'title': payload['title'], 'data': objects}
124 | 
125 | 
126 | def build_create_table(schema, table):
127 |     """Return the CREATE TABLE statement as a string."""
128 |     return """CREATE OR REPLACE TABLE {}.{} (
129 |         source string,
130 |         imported_at timestamp_tz,
131 |         data variant
132 |     );
133 |     """.format(schema, table)
134 | 
135 | 
136 | def build_insert_rows(schema, table, payload):
137 |     """Return the INSERT INTO statement as a string."""
138 |     out = io.StringIO()
139 | 
140 |     out.write('INSERT INTO {}.{}\n'.format(schema, table))
141 |     out.write('SELECT column1, column2, parse_json(column3)\n')
142 |     out.write('FROM VALUES\n')
143 | 
144 |     title = payload['title']
145 |     data = payload['data']
146 |     count = len(data)
147 |     for i, obj in enumerate(data):
148 |         out.write("('{}', current_timestamp, '{}')".format(
149 |             title, json.dumps(obj)
150 |         ))
151 |         if i != count - 1:
152 |             out.write(',')
153 |         out.write('\n')
154 | 
155 |     return out.getvalue()
156 | 
157 | 
158 | def load_sheet(schema, table, sheet_id, worksheet=None, coercions=None,
159 |                service_account_file=None, config_file=None,
160 |                verbose=False, dry_run=False):
161 |     """Load ``schema.table`` from `sheet_id`."""
162 |     if isinstance(coercions, str):
163 |         coercions = json.loads(coercions)
164 |     config_file = config_file or DEFAULT_DB_CONFIG_FILENAME
165 |     config = read_db_config(config_file)
166 |     payload = read_worksheet(sheet_id, worksheet_id=worksheet,
167 |                              service_account_file=service_account_file,
168 |                              coercions=coercions)
169 |     create_table = build_create_table(schema, table)
170 |     insert_rows = build_insert_rows(schema, table, payload)
171 |     with snowflake.connector.connect(**config) as connection:
172 |         cursor = connection.cursor()
173 |         for statement in create_table, insert_rows:
174 |             if verbose:
175 |                 print(statement)
176 |             if not dry_run:
177 |                 cursor.execute(statement)
178 | 
179 | 
180 | if __name__ == '__main__':
181 |     import argparse
182 | 
183 |     parser = argparse.ArgumentParser()
184 |     parser.add_argument('--schema', required=True)
185 |     parser.add_argument('--table', required=True)
186 |     parser.add_argument('--sheet', required=True)
187 |     parser.add_argument('--worksheet')
188 |     parser.add_argument('--coercions')
189 |     parser.add_argument('--db-config')
190 |     parser.add_argument('--service-account-file')
191 |     parser.add_argument('--verbose', action='store_true')
192 |     parser.add_argument('--dry-run', action='store_true')
193 |     args = parser.parse_args()
194 | 
195 |     load_sheet(args.schema, args.table, args.sheet,
196 |                worksheet=args.worksheet,
197 |                coercions=args.coercions,
198 |                service_account_file=args.service_account_file,
199 |                config_file=args.db_config,
200 |                verbose=args.verbose,
201 |                dry_run=args.dry_run)
202 | 


--------------------------------------------------------------------------------
/db.json.example:
--------------------------------------------------------------------------------
1 | {
2 |   "user": "your_user",
3 |   "password": "your_password",
4 |   "account": "your_account",
5 |   "database": "your_database",
6 |   "schema": "your_schema",
7 |   "warehouse": "your_warehouse"
8 | }
9 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | asn1crypto==0.24.0
 2 | azure-common==1.1.18
 3 | azure-storage-blob==1.5.0
 4 | azure-storage-common==1.4.0
 5 | boto3==1.9.130
 6 | botocore==1.12.134
 7 | cachetools==3.1.0
 8 | certifi==2019.3.9
 9 | cffi==1.12.2
10 | chardet==3.0.4
11 | cryptography==2.6.1
12 | docutils==0.14
13 | future==0.17.1
14 | google-api-python-client==1.7.8
15 | google-auth==1.6.3
16 | google-auth-httplib2==0.0.3
17 | google-auth-oauthlib==0.3.0
18 | httplib2==0.12.1
19 | idna==2.8
20 | ijson==2.3
21 | jmespath==0.9.4
22 | oauthlib==3.0.1
23 | pyasn1==0.4.5
24 | pyasn1-modules==0.2.4
25 | pycparser==2.19
26 | pycryptodomex==3.8.1
27 | pygsheets==2.0.1
28 | PyJWT==1.7.1
29 | pyOpenSSL==19.0.0
30 | python-dateutil==2.8.0
31 | pytz==2018.9
32 | requests==2.21.0
33 | requests-oauthlib==1.2.0
34 | rsa==4.0
35 | s3transfer==0.2.0
36 | six==1.12.0
37 | snowflake-connector-python==1.7.9
38 | uritemplate==3.0.0
39 | urllib3==1.24.2
40 | 


--------------------------------------------------------------------------------