├── .gitignore
├── LICENSE
├── README.md
├── aws_glue_etl_docker
│   ├── __init__.py
│   ├── __main__.py
│   └── glueshim.py
├── createdeployzip.sh
├── examples
│   ├── TestWorkbook.ipynb
│   └── data
│       └── data.json
├── install.sh
└── setup.py

/.gitignore:
--------------------------------------------------------------------------------
/env/
/*.egg-info
/examples/.ipynb_checkpoints
/examples/parquet*
/examples/csv
/examples/exampleoutput/*
/out/
startDevEnv.sh
*.zip
.vscode
aws_glue_etl_docker_deploy
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Genesys Cloud Services, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AWS Glue ETL in Docker and Jupyter
This project is a helper for creating scripts that run in [AWS Glue](https://aws.amazon.com/glue/), in [Jupyter](http://jupyter.org/) notebooks, and in Docker containers with spark-submit. Glue supports running [Zeppelin notebooks](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-EC2-notebook.html) against a dev endpoint, but for quick development you often just want to run locally against a subset of data without paying to keep the dev endpoints running.

## Glue Shim
Glue has specific methods to load and save data to S3 which won't work when running in a Jupyter notebook. The glueshim module provides a higher-level API that works in all of these environments.

```python
from pprint import pprint
from aws_glue_etl_docker import glueshim
shim = glueshim.GlueShim()

params = shim.arguments({'data_bucket': "examples"})
pprint(params)


files = shim.get_all_files_with_prefix(params['data_bucket'], "data/")
print(files)

data = shim.load_data(files, 'example_data')
data.printSchema()
data.show()

shim.write_parquet(data, params['data_bucket'], "parquet", None, 'parquetdata')
shim.write_parquet(data, params['data_bucket'], "parquetpartition", ["car"], 'partitioneddata')

shim.write_csv(data, params['data_bucket'], "csv", 'csvdata')

shim.finish()
```

## Local environment
Running locally is easiest in a Docker container:

1. Copy data locally, and map that folder into your Docker container at the ```/data/<bucket name>/``` path.
2. Start the Docker container, mapping your local notebook directory to ```/home/jovyan/work```

*Example Docker command*
```docker run -p 8888:8888 -v "$PWD/examples":/home/jovyan/work -v "$PWD":/data jupyter/pyspark-notebook```

### Installing package in Jupyter

```python
import sys
!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker
```

## AWS Deployment
For deployment to AWS, this library must be packaged and uploaded to S3. You can use the helper script createdeployzip.sh to package and copy it.

Usage: ```./createdeployzip.sh s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip```

Then, when starting the Glue job, use your S3 zip path in the _Python library path_ configuration.
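
If you would rather wire this up from code than through the console, a minimal boto3 sketch is shown below. The job name, role ARN, and script location are placeholders; the console's _Python library path_ field corresponds to the ```--extra-py-files``` default argument.

```python
import boto3

glue = boto3.client("glue")

# Attach the packaged library to an existing Glue job. The "--extra-py-files"
# default argument is what the console's "Python library path" field sets.
# The job name, role ARN, and script location below are placeholders.
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/TestWorkbook.py",
        },
        "DefaultArguments": {
            "--extra-py-files": "s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip",
        },
    },
)
```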

## Bookmarks
The shim is currently set up to delete any existing data under the output prefix when a job runs with bookmarks disabled, so that if you normally run with bookmarks enabled and later need to reprocess the entire dataset, you can rerun the job with bookmarks disabled and the stale output will be cleared before it is rewritten.

## Converting Workbook to Python Script

aws_glue_etl_docker can also be used as a CLI tool to clean up Jupyter metadata from a workbook or convert it to a Python script.

### Clean

The clean command will open all workbooks in a given path and remove any metadata, output and execution information. This keeps the workbooks cleaner in source control.

``` aws_glue_etl_docker clean --path /dir/to/workbooks ```

### Build

The build command will open all workbooks in a given path and convert them to Python scripts. Build will convert any markdown cells to multiline comments. This command will not convert any cells that contain ```#LOCALDEV``` or lines that start with ```!```, as in ```!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker```

``` aws_glue_etl_docker build --path /dir/to/workbooks --outdir /dir/for/output ```
--------------------------------------------------------------------------------
/aws_glue_etl_docker/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/purecloudlabs/aws_glue_etl_docker/b5c9130b8f8c0e133f89ab0742415040f37cade0/aws_glue_etl_docker/__init__.py
--------------------------------------------------------------------------------
/aws_glue_etl_docker/__main__.py:
--------------------------------------------------------------------------------
import json
import os
import argparse
import glob
from pprint import pprint

def cleanWorkbooks(path):
    print('cleaning workbooks ' + path)
    print(glob.glob(path + "/*.ipynb"))
    for workbook in glob.glob(path + "/*.ipynb"):
        input_file = open(workbook)
        notebookContents = json.load(input_file)

        for cell in notebookContents['cells']:
            if 'metadata' in cell:
                cell['metadata'] = {}

            if 'outputs' in cell:
                cell['outputs'] = []

            if 'execution_count' in cell:
                cell['execution_count'] = 0

        with open(workbook, 'w') as out:
            json.dump(notebookContents, out, indent=1, separators=(',', ': '))


def buildWorkbooks(path, outputdir):
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)

    for workbook in glob.glob(path + "/*.ipynb"):
        print(os.path.basename(workbook))

        # with open(workbook) as fp:
        #     for i, line in enumerate(fp):
        #         if "\xe2" in line:
        #             print i, repr(line)

        input_file = open(workbook)
        notebookContents = json.load(input_file)

        out = open('{}/{}'.format(outputdir, workbook.replace(path, "").replace("ipynb", "py")), 'w+')

        for cell in notebookContents['cells']:
            if cell['cell_type'] == "code" and len(cell['source']) > 0 and "#LOCALDEV" not in cell['source'][0]:
                for line in cell['source']:
                    if line[0] != '!':
                        out.write(line)
            elif cell['cell_type'] == "markdown":

                out.write('\n\'\'\'\n')
                for line in cell['source']:
                    out.write("#" + line)
                out.write('\n\'\'\'\n')


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('command', help='Command to run (clean, build)')
    parser.add_argument('--path', help='Path to workbooks')
    parser.add_argument('--outdir', help='Output path')

    args = parser.parse_args()

    if args.command == "clean":
        cleanWorkbooks(args.path)
    elif args.command == "build":
        buildWorkbooks(args.path, args.outdir)

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/aws_glue_etl_docker/glueshim.py:
--------------------------------------------------------------------------------
import sys
import os.path
import shutil
import glob
import pyspark
from pyspark import SparkConf, SparkContext, SQLContext
from pprint import pprint

def _load_data(filePaths, dataset_name, spark_context, groupfiles, groupsize):
    sqlContext = SQLContext(spark_context)
    return sqlContext.read.json(filePaths)

def _write_csv(dataframe, bucket, location, dataset_name, spark_context):
    output_path = '/data/' + bucket + '/' + location

    shutil.rmtree(output_path, True)
    dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header","true").save(output_path)

def _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, spark_context):

    output_path = '/data/' + bucket + '/' + location

    shutil.rmtree(output_path, True)

    if partition_columns != None and len(partition_columns) > 0:
        dataframe.write.partitionBy(partition_columns).parquet(output_path)
    else:
        dataframe.write.parquet(output_path)

def _get_spark_context():
    return (pyspark.SparkContext.getOrCreate(), None)

def _get_all_files_with_prefix(bucket, prefix, spark_context):
    pathToWalk = '/data/' + bucket + '/' + prefix + '**/*.*'
    return glob.glob(pathToWalk, recursive=True)

def _is_in_aws():
    return False

def _get_arguments(default):
    return default

def _finish(self):
    return None

try:
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    import boto3

    def _load_data(file_paths, dataset_name, context, groupfiles, groupsize):

        connection_options = {'paths': file_paths}

        if groupfiles != None:
            connection_options["groupFiles"] = groupfiles

        if groupsize != None:
            connection_options["groupSize"] = groupsize

        glue0 = context.create_dynamic_frame.from_options(connection_type='s3',
                                                          connection_options=connection_options,
                                                          format='json',
                                                          transformation_ctx=dataset_name)

        return glue0.toDF()

    def _write_csv(dataframe, bucket, location, dataset_name, spark_context):
        output_path = "s3://" + bucket + "/" + location
        df_tmp = DynamicFrame.fromDF(dataframe.repartition(1), spark_context, dataset_name)
        spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path}, format = "csv")


    def _delete_files_with_prefix(bucket, prefix):
        if not prefix.endswith('/'):
            prefix = prefix + "/"

        delete_keys = {'Objects' : []}
        s3 = boto3.client('s3')

        paginator = s3.get_paginator('list_objects')
        pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
        for page in pages:
            if page.get('Contents'):
                for obj in page['Contents']:
                    if not obj['Key'].endswith('/'):
                        delete_keys['Objects'].append({'Key': str(obj['Key'])})

                s3.delete_objects(Bucket=bucket, Delete=delete_keys)
                delete_keys = {'Objects' : []}


    def _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, spark_context):
        if "job-bookmark-disable" in sys.argv:
            _delete_files_with_prefix(bucket, location)

        output_path = "s3://" + bucket + "/" + location

        df_tmp = DynamicFrame.fromDF(dataframe, spark_context, dataset_name)

        print("Writing to {} ".format(output_path))

        if partition_columns != None and len(partition_columns) > 0:
            spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path, "partitionKeys": partition_columns }, format = "parquet")
        else:
            spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path }, format = "parquet")
connection_type = "s3", connection_options = {"path": output_path }, format = "parquet") 110 | 111 | 112 | 113 | def _get_spark_context(): 114 | spark_context = GlueContext(SparkContext.getOrCreate()) 115 | job = Job(spark_context) 116 | args = _get_arguments({}) 117 | job.init(args['JOB_NAME'], args) 118 | 119 | return (spark_context, job) 120 | 121 | def _get_all_files_with_prefix(bucket, prefix, spark_context): 122 | prefixes = set() 123 | s3 = boto3.client('s3') 124 | paginator = s3.get_paginator('list_objects') 125 | pages = paginator.paginate(Bucket=bucket, Prefix=prefix) 126 | for page in pages: 127 | if 'Contents' in page and page['Contents']: 128 | for obj in page['Contents']: 129 | if not obj['Key'].endswith('/') and '/' in obj['Key']: 130 | idx = obj['Key'].rfind('/') 131 | prefixes.add('s3://{}/{}'.format(bucket, obj['Key'][0:idx])) 132 | 133 | return list(prefixes) 134 | 135 | def _get_arguments(defaults): 136 | return getResolvedOptions(sys.argv, ['JOB_NAME'] + defaults.keys()) 137 | 138 | def _is_in_aws(): 139 | return True 140 | 141 | def _finish(self): 142 | if self.job: 143 | try: 144 | self.job.commit() 145 | except NameError: 146 | print("unable to commit job") 147 | 148 | 149 | except Exception as e: 150 | print('local dev') 151 | 152 | class GlueShim: 153 | def __init__(self): 154 | c = _get_spark_context() 155 | self.spark_context = c[0] 156 | self.job = c[1] 157 | self._groupfiles = None 158 | self._groupsize = None 159 | 160 | def arguments(self, defaults): 161 | """Gets the arguments for a job. When running in glue, the response is pulled form sys.argv 162 | 163 | Keyword arguments: 164 | defaults -- default dictionary of options 165 | """ 166 | return _get_arguments(defaults) 167 | 168 | def load_data(self, file_paths, dataset_name): 169 | """Loads data into a dataframe 170 | 171 | Keyword arguments: 172 | file_paths -- list of file paths to pull from, either absolute paths or s3:// uris 173 | dataset_name -- name of this dataset, used for glue bookmarking 174 | """ 175 | return _load_data(file_paths, dataset_name, self.spark_context, self._groupfiles, self._groupsize) 176 | 177 | def get_all_files_with_prefix(self, bucket, prefix): 178 | """Given a bucket and file prefix, this method will return a list of all files with that prefix 179 | 180 | Keyword arguments: 181 | bucket -- bucket name 182 | prefix -- filename prefix 183 | """ 184 | return _get_all_files_with_prefix(bucket, prefix, self.spark_context) 185 | 186 | def write_parquet(self, dataframe, bucket, location, partition_columns, dataset_name): 187 | """Writes a dataframe in parquet format 188 | 189 | Keyword arguments: 190 | dataframe -- dataframe to write out 191 | bucket -- Output bucket name 192 | location -- Output filename prefix 193 | partition_columns -- list of strings to partition by, None for default partitions 194 | dataset_name - dataset name, will be appended to location 195 | 196 | """ 197 | _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, self.spark_context) 198 | 199 | def write_csv(self, dataframe, bucket, location, dataset_name): 200 | """Writes a dataframe in csv format with a partition count of 1 201 | 202 | Keyword arguments: 203 | dataframe -- dataframe to write out 204 | bucket -- Output bucket name 205 | location -- Output filename prefix 206 | dataset_name - dataset name, will be appended to location 207 | 208 | """ 209 | _write_csv(dataframe, bucket, location, dataset_name, self.spark_context) 210 | 211 | def get_spark_context(self): 212 | """ Gets 
/createdeployzip.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
rm -rf aws_glue_etl_docker_deploy
mkdir aws_glue_etl_docker_deploy
cd aws_glue_etl_docker_deploy
mkdir deps
virtualenv -p python2.7 .
pip install -t deps git+https://github.com/purecloudlabs/aws_glue_etl_docker
cd deps && zip -r ../aws_glue_etl_docker_deploy.zip . && cd ..
aws s3 cp ./aws_glue_etl_docker_deploy.zip $1
--------------------------------------------------------------------------------
/examples/TestWorkbook.ipynb:
--------------------------------------------------------------------------------
{
 "nbformat_minor": 1,
 "nbformat": 4,
 "cells": [
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker.git\n",
    "# Because the previous line starts with a !, it will be removed from the output py file"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "source": [
    "#This is a markdown cell which will get converted into an inline comment"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [
    "from aws_glue_etl_docker import glueshim\n",
    "\n",
    "print(\"starting\")\n",
    "print(sys.argv)\n",
    "\n",
    "shim = glueshim.GlueShim()\n",
    "\n",
    "params = {'data_bucket': \"examples\"}\n",
    "\n",
    "files = shim.get_all_files_with_prefix(params['data_bucket'], \"data/\")\n",
    "print(files)\n",
    "\n",
    "\n",
    "data = shim.load_data(files, 'example_data')\n",
    "data.printSchema()\n",
    "data.show()\n",
    "\n",
    "shim.write_parquet(data, params['data_bucket'], \"exampleoutput/parquet/\", None, 'parquetdata' )\n",
    "\n",
    "shim.write_csv(data, params['data_bucket'],\"exampleoutput/csv/\", 'csvdata')",
    "\n",
    "shim.finish()"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3",
   "language": "python"
  },
  "language_info": {
   "mimetype": "text/x-python",
   "nbconvert_exporter": "python",
   "name": "python",
   "file_extension": ".py",
   "version": "3.6.5",
   "pygments_lexer": "ipython3",
   "codemirror_mode": {
    "version": 3,
    "name": "ipython"
   }
  }
 }
}
--------------------------------------------------------------------------------
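For reference, running ``` aws_glue_etl_docker build --path examples --outdir out ``` against the notebook above would emit roughly the following script; this is reconstructed from the converter logic in __main__.py rather than captured from an actual run. The ```!pip install``` line is dropped and the markdown cell becomes a commented block.

```python
import sys
# Because the previous line starts with a !, it will be removed from the output py file
'''
##This is a markdown cell which will get converted into an inline comment
'''
from aws_glue_etl_docker import glueshim

print("starting")
print(sys.argv)

shim = glueshim.GlueShim()

params = {'data_bucket': "examples"}

files = shim.get_all_files_with_prefix(params['data_bucket'], "data/")
print(files)


data = shim.load_data(files, 'example_data')
data.printSchema()
data.show()

shim.write_parquet(data, params['data_bucket'], "exampleoutput/parquet/", None, 'parquetdata' )

shim.write_csv(data, params['data_bucket'],"exampleoutput/csv/", 'csvdata')
shim.finish()
```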
"car":null } 2 | { "name":"Kevin", "age":36, "car":"tesla" } -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | pip3 install -e . -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | setup( 3 | name = 'aws_glue_etl_docker', 4 | version = '0.6.0', 5 | packages = ['aws_glue_etl_docker'], 6 | license='mit', 7 | 8 | entry_points = { 9 | 'console_scripts': [ 10 | 'aws_glue_etl_docker = aws_glue_etl_docker.__main__:main' 11 | ] 12 | }) 13 | --------------------------------------------------------------------------------