├── .gitignore
├── LICENSE
├── README.md
├── aws_glue_etl_docker
│   ├── __init__.py
│   ├── __main__.py
│   └── glueshim.py
├── createdeployzip.sh
├── examples
│   ├── TestWorkbook.ipynb
│   └── data
│       └── data.json
├── install.sh
└── setup.py

/.gitignore:
--------------------------------------------------------------------------------
/env/
/*.egg-info
/examples/.ipynb_checkpoints
/examples/parquet*
/examples/csv
/examples/exampleoutput/*
/out/
startDevEnv.sh
*.zip
.vscode
aws_glue_etl_docker_deploy
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Genesys Cloud Services, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AWS Glue ETL in Docker and Jupyter
This project is a helper for creating scripts that run in [AWS Glue](https://aws.amazon.com/glue/), in [Jupyter](http://jupyter.org/) notebooks, and in Docker containers with spark-submit. Glue supports running [Zeppelin notebooks](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-EC2-notebook.html) against a dev endpoint, but for quick development you often just want to run locally against a subset of data without paying to keep the dev endpoints running.

## Glue Shim
Glue has specific methods to load and save data to S3 which won't work when running in a Jupyter notebook. The glueshim module provides a higher-level API that works in all of these environments.

```python
from pprint import pprint
from aws_glue_etl_docker import glueshim
shim = glueshim.GlueShim()

params = shim.arguments({'data_bucket': "examples"})
pprint(params)


files = shim.get_all_files_with_prefix(params['data_bucket'], "data/")
print(files)

data = shim.load_data(files, 'example_data')
data.printSchema()
data.show()

shim.write_parquet(data, params['data_bucket'], "parquet", None, 'parquetdata')
shim.write_parquet(data, params['data_bucket'], "parquetpartition", ["car"], 'partitioneddata')

shim.write_csv(data, params['data_bucket'], "csv", 'csvdata')

shim.finish()
```

## Local environment
Running locally is easiest in a Docker container:

1. Copy data locally, and map that folder into your Docker container at the ```/data/<bucket name>/``` path.
2. Start the Docker container, mapping your local notebook directory to ```/home/jovyan/work```

*Example Docker command*
```docker run -p 8888:8888 -v "$PWD/examples":/home/jovyan/work -v "$PWD":/data jupyter/pyspark-notebook```

### Installing package in Jupyter

```python
import sys
!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker
```

## AWS Deployment
For deployment to AWS, this library must be packaged and uploaded to S3. You can use the helper script createdeployzip.sh to package and copy it.

Usage: ```./createdeployzip.sh s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip```

Then, when starting the Glue job, use your S3 zip path in the _Python library path_ configuration.
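
If you would rather wire this up from code than through the console, a minimal boto3 sketch is shown below. The job name, role ARN, and script location are placeholders; the console's _Python library path_ field corresponds to the ```--extra-py-files``` default argument.

```python
import boto3

glue = boto3.client("glue")

# Attach the packaged library to an existing Glue job. The "--extra-py-files"
# default argument is what the console's "Python library path" field sets.
# The job name, role ARN, and script location below are placeholders.
glue.update_job(
    JobName="example-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/example-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-bucket/scripts/TestWorkbook.py",
        },
        "DefaultArguments": {
            "--extra-py-files": "s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip",
        },
    },
)
```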

## Bookmarks
The shim is currently set up to delete any existing data under the output prefix when a job runs with bookmarks disabled, so that if you normally run with bookmarks enabled and later need to reprocess the entire dataset, you can rerun the job with bookmarks disabled and the stale output will be cleared before it is rewritten.

## Converting Workbook to Python Script

aws_glue_etl_docker can also be used as a CLI tool to clean up Jupyter metadata from a workbook or convert it to a Python script.

### Clean

The clean command will open all workbooks in a given path and remove any metadata, output and execution information. This keeps the workbooks cleaner in source control.

``` aws_glue_etl_docker clean --path /dir/to/workbooks ```

### Build

The build command will open all workbooks in a given path and convert them to Python scripts. Build will convert any markdown cells to multiline comments. This command will not convert any cells that contain ```#LOCALDEV``` or lines that start with ```!```, as in ```!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker```

``` aws_glue_etl_docker build --path /dir/to/workbooks --outdir /dir/for/output ```
--------------------------------------------------------------------------------
/aws_glue_etl_docker/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/purecloudlabs/aws_glue_etl_docker/b5c9130b8f8c0e133f89ab0742415040f37cade0/aws_glue_etl_docker/__init__.py
--------------------------------------------------------------------------------
/aws_glue_etl_docker/__main__.py:
--------------------------------------------------------------------------------
import json
import os
import argparse
import glob
from pprint import pprint

def cleanWorkbooks(path):
    print('cleaning workbooks ' + path)
    print(glob.glob(path + "/*.ipynb"))
    for workbook in glob.glob(path + "/*.ipynb"):
        input_file = open(workbook)
        notebookContents = json.load(input_file)

        for cell in notebookContents['cells']:
            if 'metadata' in cell:
                cell['metadata'] = {}

            if 'outputs' in cell:
                cell['outputs'] = []

            if 'execution_count' in cell:
                cell['execution_count'] = 0

        with open(workbook, 'w') as out:
            json.dump(notebookContents, out, indent=1, separators=(',', ': '))


def buildWorkbooks(path, outputdir):
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)

    for workbook in glob.glob(path + "/*.ipynb"):
        print(os.path.basename(workbook))

        # with open(workbook) as fp:
        #     for i, line in enumerate(fp):
        #         if "\xe2" in line:
        #             print i, repr(line)

        input_file = open(workbook)
        notebookContents = json.load(input_file)

        out = open('{}/{}'.format(outputdir, workbook.replace(path, "").replace("ipynb", "py")), 'w+')

        for cell in notebookContents['cells']:
            if cell['cell_type'] == "code" and len(cell['source']) > 0 and "#LOCALDEV" not in cell['source'][0]:
                for line in cell['source']:
                    if line[0] != '!':
                        out.write(line)
            elif cell['cell_type'] == "markdown":

                out.write('\n\'\'\'\n')
                for line in cell['source']:
                    out.write("#" + line)
                out.write('\n\'\'\'\n')


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('command', help='Command to run (clean, build)')
    parser.add_argument('--path', help='Path to workbooks')
    parser.add_argument('--outdir', help='Output path')

    args = parser.parse_args()

    if args.command == "clean":
        cleanWorkbooks(args.path)
    elif args.command == "build":
        buildWorkbooks(args.path, args.outdir)

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/aws_glue_etl_docker/glueshim.py:
--------------------------------------------------------------------------------
import sys
import os.path
import shutil
import glob
import pyspark
from pyspark import SparkConf, SparkContext, SQLContext
from pprint import pprint

def _load_data(filePaths, dataset_name, spark_context, groupfiles, groupsize):
    sqlContext = SQLContext(spark_context)
    return sqlContext.read.json(filePaths)

def _write_csv(dataframe, bucket, location, dataset_name, spark_context):
    output_path = '/data/' + bucket + '/' + location

    shutil.rmtree(output_path, True)
    dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header","true").save(output_path)

def _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, spark_context):

    output_path = '/data/' + bucket + '/' + location

    shutil.rmtree(output_path, True)

    if partition_columns != None and len(partition_columns) > 0:
        dataframe.write.partitionBy(partition_columns).parquet(output_path)
    else:
        dataframe.write.parquet(output_path)

def _get_spark_context():
    return (pyspark.SparkContext.getOrCreate(), None)

def _get_all_files_with_prefix(bucket, prefix, spark_context):
    pathToWalk = '/data/' + bucket + '/' + prefix + '**/*.*'
    return glob.glob(pathToWalk, recursive=True)

def _is_in_aws():
    return False

def _get_arguments(default):
    return default

def _finish(self):
    return None

try:
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    import boto3

    def _load_data(file_paths, dataset_name, context, groupfiles, groupsize):

        connection_options = {'paths': file_paths}

        if groupfiles != None:
            connection_options["groupFiles"] = groupfiles

        if groupsize != None:
            connection_options["groupSize"] = groupsize

        glue0 = context.create_dynamic_frame.from_options(connection_type='s3',
                                                          connection_options=connection_options,
                                                          format='json',
                                                          transformation_ctx=dataset_name)

        return glue0.toDF()

    def _write_csv(dataframe, bucket, location, dataset_name, spark_context):
        output_path = "s3://" + bucket + "/" + location
        df_tmp = DynamicFrame.fromDF(dataframe.repartition(1), spark_context, dataset_name)
        spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path}, format = "csv")


    def _delete_files_with_prefix(bucket, prefix):
        if not prefix.endswith('/'):
            prefix = prefix + "/"

        delete_keys = {'Objects' : []}
        s3 = boto3.client('s3')

        paginator = s3.get_paginator('list_objects')
        pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
        for page in pages:
            if page.get('Contents'):
                for obj in page['Contents']:
                    if not obj['Key'].endswith('/'):
                        delete_keys['Objects'].append({'Key': str(obj['Key'])})

                s3.delete_objects(Bucket=bucket, Delete=delete_keys)
                delete_keys = {'Objects' : []}


    def _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, spark_context):
        if "job-bookmark-disable" in sys.argv:
            _delete_files_with_prefix(bucket, location)

        output_path = "s3://" + bucket + "/" + location

        df_tmp = DynamicFrame.fromDF(dataframe, spark_context, dataset_name)

        print("Writing to {} ".format(output_path))

        if partition_columns != None and len(partition_columns) > 0:
            spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path, "partitionKeys": partition_columns }, format = "parquet")
        else:
            spark_context.write_dynamic_frame.from_options(frame = df_tmp, connection_type = "s3", connection_options = {"path": output_path }, format = "parquet")
connection_type = "s3", connection_options = {"path": output_path }, format = "parquet") 110 | 111 | 112 | 113 | def _get_spark_context(): 114 | spark_context = GlueContext(SparkContext.getOrCreate()) 115 | job = Job(spark_context) 116 | args = _get_arguments({}) 117 | job.init(args['JOB_NAME'], args) 118 | 119 | return (spark_context, job) 120 | 121 | def _get_all_files_with_prefix(bucket, prefix, spark_context): 122 | prefixes = set() 123 | s3 = boto3.client('s3') 124 | paginator = s3.get_paginator('list_objects') 125 | pages = paginator.paginate(Bucket=bucket, Prefix=prefix) 126 | for page in pages: 127 | if 'Contents' in page and page['Contents']: 128 | for obj in page['Contents']: 129 | if not obj['Key'].endswith('/') and '/' in obj['Key']: 130 | idx = obj['Key'].rfind('/') 131 | prefixes.add('s3://{}/{}'.format(bucket, obj['Key'][0:idx])) 132 | 133 | return list(prefixes) 134 | 135 | def _get_arguments(defaults): 136 | return getResolvedOptions(sys.argv, ['JOB_NAME'] + defaults.keys()) 137 | 138 | def _is_in_aws(): 139 | return True 140 | 141 | def _finish(self): 142 | if self.job: 143 | try: 144 | self.job.commit() 145 | except NameError: 146 | print("unable to commit job") 147 | 148 | 149 | except Exception as e: 150 | print('local dev') 151 | 152 | class GlueShim: 153 | def __init__(self): 154 | c = _get_spark_context() 155 | self.spark_context = c[0] 156 | self.job = c[1] 157 | self._groupfiles = None 158 | self._groupsize = None 159 | 160 | def arguments(self, defaults): 161 | """Gets the arguments for a job. When running in glue, the response is pulled form sys.argv 162 | 163 | Keyword arguments: 164 | defaults -- default dictionary of options 165 | """ 166 | return _get_arguments(defaults) 167 | 168 | def load_data(self, file_paths, dataset_name): 169 | """Loads data into a dataframe 170 | 171 | Keyword arguments: 172 | file_paths -- list of file paths to pull from, either absolute paths or s3:// uris 173 | dataset_name -- name of this dataset, used for glue bookmarking 174 | """ 175 | return _load_data(file_paths, dataset_name, self.spark_context, self._groupfiles, self._groupsize) 176 | 177 | def get_all_files_with_prefix(self, bucket, prefix): 178 | """Given a bucket and file prefix, this method will return a list of all files with that prefix 179 | 180 | Keyword arguments: 181 | bucket -- bucket name 182 | prefix -- filename prefix 183 | """ 184 | return _get_all_files_with_prefix(bucket, prefix, self.spark_context) 185 | 186 | def write_parquet(self, dataframe, bucket, location, partition_columns, dataset_name): 187 | """Writes a dataframe in parquet format 188 | 189 | Keyword arguments: 190 | dataframe -- dataframe to write out 191 | bucket -- Output bucket name 192 | location -- Output filename prefix 193 | partition_columns -- list of strings to partition by, None for default partitions 194 | dataset_name - dataset name, will be appended to location 195 | 196 | """ 197 | _write_parquet(dataframe, bucket, location, partition_columns, dataset_name, self.spark_context) 198 | 199 | def write_csv(self, dataframe, bucket, location, dataset_name): 200 | """Writes a dataframe in csv format with a partition count of 1 201 | 202 | Keyword arguments: 203 | dataframe -- dataframe to write out 204 | bucket -- Output bucket name 205 | location -- Output filename prefix 206 | dataset_name - dataset name, will be appended to location 207 | 208 | """ 209 | _write_csv(dataframe, bucket, location, dataset_name, self.spark_context) 210 | 211 | def get_spark_context(self): 212 | """ Gets 
/createdeployzip.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
rm -rf aws_glue_etl_docker_deploy
mkdir aws_glue_etl_docker_deploy
cd aws_glue_etl_docker_deploy
mkdir deps
virtualenv -p python2.7 .
pip install -t deps git+https://github.com/purecloudlabs/aws_glue_etl_docker
cd deps && zip -r ../aws_glue_etl_docker_deploy.zip . && cd ..
aws s3 cp ./aws_glue_etl_docker_deploy.zip $1
--------------------------------------------------------------------------------
/examples/TestWorkbook.ipynb:
--------------------------------------------------------------------------------
{
 "nbformat_minor": 1,
 "nbformat": 4,
 "cells": [
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker.git\n",
    "# Because the previous line starts with a !, it will be removed from the output py file"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "source": [
    "#This is a markdown cell which will get converted into an inline comment"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [
    "from aws_glue_etl_docker import glueshim\n",
    "\n",
    "print(\"starting\")\n",
    "print(sys.argv)\n",
    "\n",
    "shim = glueshim.GlueShim()\n",
    "\n",
    "params = {'data_bucket': \"examples\"}\n",
    "\n",
    "files = shim.get_all_files_with_prefix(params['data_bucket'], \"data/\")\n",
    "print(files)\n",
    "\n",
    "\n",
    "data = shim.load_data(files, 'example_data')\n",
    "data.printSchema()\n",
    "data.show()\n",
    "\n",
    "shim.write_parquet(data, params['data_bucket'], \"exampleoutput/parquet/\", None, 'parquetdata' )\n",
    "\n",
    "shim.write_csv(data, params['data_bucket'],\"exampleoutput/csv/\", 'csvdata')",
    "\n",
    "shim.finish()"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "execution_count": 0,
   "cell_type": "code",
   "source": [],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3",
   "language": "python"
  },
  "language_info": {
   "mimetype": "text/x-python",
   "nbconvert_exporter": "python",
   "name": "python",
   "file_extension": ".py",
   "version": "3.6.5",
   "pygments_lexer": "ipython3",
   "codemirror_mode": {
    "version": 3,
    "name": "ipython"
   }
  }
 }
}
--------------------------------------------------------------------------------
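For reference, running ``` aws_glue_etl_docker build --path examples --outdir out ``` against the notebook above would emit roughly the following script; this is reconstructed from the converter logic in __main__.py rather than captured from an actual run. The ```!pip install``` line is dropped and the markdown cell becomes a commented block.

```python
import sys
# Because the previous line starts with a !, it will be removed from the output py file
'''
##This is a markdown cell which will get converted into an inline comment
'''
from aws_glue_etl_docker import glueshim

print("starting")
print(sys.argv)

shim = glueshim.GlueShim()

params = {'data_bucket': "examples"}

files = shim.get_all_files_with_prefix(params['data_bucket'], "data/")
print(files)


data = shim.load_data(files, 'example_data')
data.printSchema()
data.show()

shim.write_parquet(data, params['data_bucket'], "exampleoutput/parquet/", None, 'parquetdata' )

shim.write_csv(data, params['data_bucket'],"exampleoutput/csv/", 'csvdata')
shim.finish()
```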
"car":null } 2 | { "name":"Kevin", "age":36, "car":"tesla" } -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | pip3 install -e . -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | setup( 3 | name = 'aws_glue_etl_docker', 4 | version = '0.6.0', 5 | packages = ['aws_glue_etl_docker'], 6 | license='mit', 7 | 8 | entry_points = { 9 | 'console_scripts': [ 10 | 'aws_glue_etl_docker = aws_glue_etl_docker.__main__:main' 11 | ] 12 | }) 13 | --------------------------------------------------------------------------------