├── .github └── PULL_REQUEST_TEMPLATE.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE.txt ├── README.md ├── build.py ├── cloudformation ├── athenarunner-lambda-params.json ├── athenarunner-lambda.yaml ├── glue-resources-params.json ├── glue-resources.yaml ├── gluerunner-lambda-params.json ├── gluerunner-lambda.yaml ├── step-functions-resources-params.json └── step-functions-resources.yaml ├── glue-scripts ├── join_marketing_and_sales_data.py ├── process_marketing_data.py └── process_sales_data.py ├── lambda ├── athenarunner │ ├── athenarunner-config.json │ ├── athenarunner.py │ ├── requirements.txt │ └── setup.cfg ├── global-config.json ├── gluerunner │ ├── gluerunner-config.json │ ├── gluerunner.py │ ├── requirements.txt │ └── setup.cfg ├── ons3objectcreated │ ├── ons3objectcreated-config.json │ ├── ons3objectcreated.py │ ├── requirements.txt │ └── setup.cfg └── s3-deployment-descriptor.json ├── resources ├── MarketingAndSalesETLOrchestrator-notstarted.png ├── MarketingAndSalesETLOrchestrator-success.png ├── aws-etl-orchestrator-serverless.png └── example-etl-flow.png └── samples ├── MarketingData_QuickSightSample.csv └── SalesPipeline_QuickSightSample.csv /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | # macOS template 3 | # General 4 | .DS_Store 5 | .AppleDouble 6 | .LSOverride 7 | 8 | # Icon must end with two \r 9 | Icon 10 | 11 | # Thumbnails 12 | ._* 13 | 14 | # Files that might appear in the root of a volume 15 | .DocumentRevisions-V100 16 | .fseventsd 17 | .Spotlight-V100 18 | .TemporaryItems 19 | .Trashes 20 | .VolumeIcon.icns 21 | .com.apple.timemachine.donotpresent 22 | 23 | 24 | # Directories potentially created on remote AFP share 25 | .AppleDB 26 | .AppleDesktop 27 | Network Trash Folder 28 | Temporary Items 29 | .apdisk 30 | 31 | # Archives template 32 | # It's better to unpack these files and commit the raw source because 33 | # git has its own built in compression methods. 34 | *.7z 35 | *.jar 36 | *.rar 37 | *.zip 38 | *.gz 39 | *.tgz 40 | *.bzip 41 | *.bz2 42 | *.xz 43 | *.lzma 44 | *.cab 45 | 46 | # Packing-only formats 47 | *.iso 48 | *.tar 49 | 50 | # Package management formats 51 | *.dmg 52 | *.xpi 53 | *.gem 54 | *.egg 55 | *.deb 56 | *.rpm 57 | *.msi 58 | *.msm 59 | *.msp 60 | ### Example user template template 61 | ### Example user template 62 | 63 | # IntelliJ project files 64 | .idea 65 | *.iml 66 | out 67 | gen 68 | notebook 69 | 70 | # GitHub 71 | .github 72 | 73 | # Pynt 74 | *.pyc 75 | 76 | # Python Virtual Environment 77 | venv 78 | 79 | # Copy of customized configuration files 80 | config-copy 81 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/aws-etl-orchestrator/issues), or [recently closed](https://github.com/aws-samples/aws-etl-orchestrator/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels ((enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/aws-etl-orchestrator/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/aws-samples/aws-etl-orchestrator/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | software and associated documentation files (the "Software"), to deal in the Software 5 | without restriction, including without limitation the rights to use, copy, modify, 6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /NOTICE.txt: -------------------------------------------------------------------------------- 1 | aws-etl-orchestrator 2 | Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Table of Contents 2 | ================= 3 | 4 | * [Introduction](#introduction) 5 | * [The challenge of orchestrating an ETL workflow](#the-challenge-of-orchestrating-an-etl-workflow) 6 | * [Example ETL workflow requirements](#example-etl-workflow-requirements) 7 | * [The ETL orchestration architecture and events](#the-etl-orchestration-architecture-and-events) 8 | * [Modeling the ETL orchestration workflow in AWS Step Functions](#modeling-the-etl-orchestration-workflow-in-aws-step-functions) 9 | * [AWS CloudFormation templates](#aws-cloudformation-templates) 10 | * [Handling failed ETL jobs](#handling-failed-etl-jobs) 11 | * [Preparing your development environment](#preparing-your-development-environment) 12 | * [Configuring the project](#configuring-the-project) 13 | * [cloudformation/gluerunner-lambda-params.json](#cloudformationgluerunner-lambda-paramsjson) 14 | * [cloudformation/athenarunner-lambda-params.json](#cloudformationathenarunner-lambda-paramsjson) 15 | * [lambda/s3-deployment-descriptor.json](#lambdas3-deployment-descriptorjson) 16 | * [cloudformation/glue-resources-params.json](#cloudformationglue-resources-paramsjson) 17 | * [lambda/gluerunner/gluerunner-config.json](#lambdagluerunnergluerunner-configjson) 18 | * [cloudformation/step-functions-resources-params.json](#cloudformationstep-functions-resources-paramsjson) 19 | * [Build commands](#build-commands) 20 | * [The packagelambda build command](#the-packagelambda-build-command) 21 | * [The deploylambda build command](#the-deploylambda-build-command) 22 | * [The createstack build command](#the-createstack-build-command) 23 | * [The updatestack build command](#the-updatestack-build-command) 24 | * [The deletestack build command](#the-deletestack-build-command) 25 | * [The deploygluescripts build command](#the-deploygluescripts-build-command) 26 | * [Putting it all together: Example usage of build commands](#putting-it-all-together-example-usage-of-build-commands) 27 | * [License](#license) 28 | 29 | 30 | 31 | # Introduction 32 | Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake. It transforms raw data into useful datasets and, ultimately, into actionable insight. An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption. The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS. 33 | 34 | AWS offers [AWS Glue](https://aws.amazon.com/glue/), which is a service that helps author and deploy ETL jobs. AWS Glue is a fully managed extract, transform, and load service that makes it easy for customers to prepare and load their data for analytics. Other AWS Services also can be used to implement and manage ETL jobs. 
They include: [AWS Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html) (AWS DMS), [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) (using the Steps API), and even [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html). 35 | 36 | 37 | # The challenge of orchestrating an ETL workflow 38 | 39 | How can we orchestrate an ETL workflow that involves a diverse set of ETL technologies? AWS Glue, AWS DMS, Amazon EMR, and other services support [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html), which we could use to chain ETL jobs together. Amazon S3, the central data lake store, also supports CloudWatch Events. But relying on CloudWatch Events alone means that there's no single visual representation of the ETL workflow. Also, tracing the overall ETL workflow's execution status and handling error scenarios can become a challenge. 40 | 41 | In this code sample, I show you how to use [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) and [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) for orchestrating multiple ETL jobs involving a diverse set of technologies in an arbitrarily-complex ETL workflow. AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components. Each component performs a discrete function, or task, allowing you to scale and change applications quickly. Using AWS Step Functions also opens up opportunities for integrating other AWS or external services into your ETL flows. Let’s take an example ETL flow for illustration. 42 | 43 | 44 | # Example ETL workflow requirements 45 | 46 | For our example, we'll use two publicly-available [Amazon QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html) datasets. The first dataset is a [sales pipeline dataset](samples/SalesPipeline_QuickSightSample.csv) (Sales dataset) that contains a list of slightly above 20K sales opportunity records for a fictitious business. Each record has fields that specify: 47 | 48 | * A date, potentially when an opportunity was identified. 49 | * The salesperson’s name. 50 | * A market segment to which the opportunity belongs. 51 | * Forecasted monthly revenue. 52 | 53 | The second dataset is an [online marketing metrics dataset](samples/MarketingData_QuickSightSample.csv) (Marketing dataset). The data set contains records of marketing metrics, aggregated by day. The metrics describe user engagement across various channels (website, mobile, and social media) plus other marketing metrics. The two data sets are unrelated, but we’ll assume that they are for the purpose of this example. 54 | 55 | Imagine there’s a business user who needs to answer questions based on both datasets. Perhaps the user wants to explore the correlations between online user engagement metrics on the one hand, and forecasted sales revenue and opportunities generated on the other hand. The user engagement metrics include website visits, mobile users, and desktop users. 56 | 57 | The steps in the ETL flow chart are: 58 | 59 | 1. **Process the Sales dataset.** Read Sales dataset. Group records by day, aggregating the Forecasted Monthly Revenue field. Rename fields to replace white space with underscores. 
Output the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous outputs. 60 | 61 | 2. **Process the Marketing dataset.** Read Marketing dataset. Rename fields to replace white space with underscores. Output the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous outputs. 62 | 63 | 3. **Join Sales and Marketing datasets.** Read outputs of processing Sales and Marketing datasets. Perform an inner join of both datasets on the date field. Sort in ascending order by date. Output final joined dataset to Amazon S3, overwriting any previous outputs. 64 | Solution Architecture 65 | 66 | So far, this ETL workflow can be implemented with AWS Glue, with the ETL jobs being chained by using job triggers. But you might have other requirements outside of AWS Glue that are part of your end-to-end data processing workflow, such as the following: 67 | 68 | * Both Sales and Marketing datasets are uploaded to an S3 bucket at random times in an interval of up to a week. The PSD and PMD jobs should start as soon as the Sales dataset file is uploaded. Parallel ETL jobs can start and finish anytime, but the final JMSD job can start only after all parallel ETL jobs are complete. 69 | * In addition to PSD and PMD jobs, the orchestration must support more parallel ETL jobs in the future that contribute to the final dataset aggregated by the JMSD job. The additional ETL jobs could be managed by AWS services, such as AWS Database Migration Service, Amazon EMR, Amazon Athena or other non-AWS services. 70 | 71 | 72 | The data engineer takes these requirements and builds the following ETL workflow chart. 73 | ![Example ETL Flow](resources/example-etl-flow.png) 74 | 75 | 76 | # The ETL orchestration architecture and events 77 | Let’s see how we can orchestrate such an ETL flow with AWS Step Functions, AWS Glue, and AWS Lambda. The following diagram shows the ETL orchestration architecture in action. 78 | 79 | ![Architecture](resources/aws-etl-orchestrator-serverless.png) 80 | 81 | The main flow of events starts with an AWS Step Functions state machine. This state machine defines the steps in the orchestrated ETL workflow. A state machine can be triggered through [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/?nc2=h_m1) based on a schedule, through the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/), or using the various AWS SDKs in an AWS Lambda function or some other execution environment. 82 | 83 | As the state machine execution progresses, it invokes the ETL jobs. As shown in the diagram, the invocation happens indirectly through intermediary AWS Lambda functions that you author and set up in your account. We'll call this type of function an ETL Runner. 84 | 85 | While the architecture in the diagram shows Amazon Athena, Amazon EMR, and AWS Glue, **this code sample now includes two ETL Runners for both AWS Glue and Amazon Athena**. You can use these ETL Runners to orchestrate AWS Glue jobs and Amazon Athena queries. You can also follow the pattern and implement more ETL Runners to orchestrate other AWS services or non-AWS tools. 86 | 87 | ETL Runners are invoked by [activity tasks](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-activities.html) in Step Functions. Because of the way AWS Step Functions' activity tasks work, ETL Runners need to periodically poll the AWS Step Functions state machine for tasks. The state machine responds by providing a Task object. 
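In rough terms, the polling an ETL Runner performs looks like the following boto3 sketch (a simplified illustration, not the exact code in `gluerunner.py`; the activity ARN below uses a placeholder account ID):

```python
import json
import boto3

sfn = boto3.client('stepfunctions')

# Long-poll the Step Functions activity for a task. The call returns an
# empty response if no task is scheduled before the poll times out.
response = sfn.get_activity_task(
    activityArn='arn:aws:states:us-east-1:123456789012:activity:GlueRunnerActivity',
    workerName='gluerunner'
)

task_token = response.get('taskToken')
if task_token:
    # The task input carries what the runner needs to start the ETL job,
    # for example the AWS Glue job name and its arguments.
    task_input = json.loads(response['input'])
    # ...start the Glue job here, persist task_token (e.g. in DynamoDB), and
    # later report the outcome back to Step Functions, for instance with:
    # sfn.send_task_success(taskToken=task_token, output=json.dumps({...}))
```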
The Task object contains inputs which enable an ETL Runner to run an ETL job. 88 | 89 | As soon as an ETL Runner receives a task, it starts the respective ETL job. An ETL Runner maintains a state of active jobs in an [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) table. Periodically, the ETL Runner checks the status of active jobs. When an active ETL job completes, the ETL Runners notifies the AWS Step Functions state machine. This allows the ETL workflow in AWS Step Functions to proceed to the next step. 90 | 91 | 92 | ## Modeling the ETL orchestration workflow in AWS Step Functions 93 | 94 | The following snapshot from the AWS Step Functions console shows our example ETL workflow modeled as a state machine. This workflow is what we provide you in the code sample. 95 | 96 | ![ETL flow in AWS Step Functions](resources/MarketingAndSalesETLOrchestrator-notstarted.png) 97 | 98 | When you start an execution of this state machine, it will branch to run two ETL jobs in parallel: Process Sales Data (PSD) and Process Marketing Data (PMD). But, according to the requirements, both ETL jobs should not start until their respective datasets are uploaded to Amazon S3. Hence, we implement Wait activity tasks before both PSD and PMD. When a dataset file is uploaded to Amazon S3, this triggers an AWS Lambda function that notifies the state machine to exit the Wait states. When both PMD and PSD jobs are successful, the JMSD job runs to produce the final dataset. 99 | 100 | Finally, to have this ETL workflow execute once per week, you will need to configure a state machine execution to start once per week using a CloudWatch Event. 101 | 102 | 103 | ## AWS CloudFormation templates 104 | 105 | In this code sample, there are three separate AWS CloudFormation templates. You may choose to structure your own project similarly. 106 | 107 | 1. A template responsible for setting up AWS Glue resources ([glue-resources.yaml](cloudformation/glue-resources.yaml)). For our example ETL flow, the sample template creates three AWS Glue jobs: PSD, PMD, and JMSD. The scripts are pulled by AWS CloudFormation from an Amazon S3 bucket that you own. 108 | 2. A template where the AWS Step Functions state machine is defined ([step-functions-resources.yaml](cloudformation/step-functions-resources.yaml)). The state machine definition in Amazon States Language is embedded in a StateMachine resource within this template. 109 | 3. A template that sets up Glue Runner resources ([gluerunner-resources.yaml](cloudformation/gluerunner-resources.yaml)). Glue Runner is a Python script that is written to be run as an AWS Lambda function. 110 | 111 | For our ETL example, the `step-functions-resources.yaml` CloudFormation template creates a State Machine named `MarketingAndSalesETLOrchestrator`. You can start an execution from the AWS Step Functions console, or through an AWS CLI command. 112 | 113 | 114 | ## Handling failed ETL jobs 115 | 116 | What if a job in the ETL workflow fails? In such a case, there are error-handling strategies available to the ETL workflow developer, from simply notifying an administrator, to fully undoing the effects of the previous jobs through compensating ETL jobs. Detecting and responding to a failed Glue job can be implemented using AWS Step Functions' Catch mechanism, which is explained in [this tutorial](https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-handling-error-conditions.html). 
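As a rough illustration (the state names and ARN below are placeholders, not the exact definitions in `step-functions-resources.yaml`), a `Catch` clause on an ETL task state routes any error to a failure-handling state:

```json
"Run ETL Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:us-east-1:123456789012:activity:GlueRunnerActivity",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "ETL Job Failed"
    }
  ],
  "Next": "Next ETL Step"
}
```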
In this code sample, this state is a do-nothing [Pass](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-pass-state.html) state. You can change that state into a [Task](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-task-state.html) state that triggers your error-handling logic in an AWS Lambda function. 117 | 118 | 119 | > NOTE: To make this code sample easier to setup and execute, build scripts and commands are included to help you with common tasks such as creating the above AWS CloudFormation stacks in your account. Proceed below to learn how to setup and run this example orchestrated ETL flow. 120 | 121 | 122 | 123 | # Preparing your development environment 124 | Here’s a high-level checklist of what you need to do to setup your development environment. 125 | 126 | 1. Sign up for an AWS account if you haven't already and create an Administrator User. The steps are published [here](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html). 127 | 128 | 2. Ensure that you have Python 2.7 and Pip installed on your machine. Instructions for that varies based on your operating system and OS version. 129 | 130 | 3. Create a Python [virtual environment](https://virtualenv.pypa.io/en/stable/) for the project with Virtualenv. This helps keep project’s python dependencies neatly isolated from your Operating System’s default python installation. **Once you’ve created a virtual python environment, activate it before moving on with the following steps**. 131 | 132 | 4. Use Pip to [install AWS CLI](http://docs.aws.amazon.com/cli/latest/userguide/installing.html). [Configure](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) the AWS CLI. It is recommended that the access keys you configure are associated with an IAM User who has full access to the following: 133 | - Amazon S3 134 | - Amazon DynamoDB 135 | - AWS Lambda 136 | - Amazon CloudWatch and CloudWatch Logs 137 | - AWS CloudFormation 138 | - AWS Step Functions 139 | - Creating IAM Roles 140 | 141 | The IAM User can be the Administrator User you created in Step 1. 142 | 143 | 5. Make sure you choose a region where all of the above services are available, such as us-east-1 (N. Virginia). Visit [this page](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) to learn more about service availability in AWS regions. 144 | 145 | 6. Use Pip to install [Pynt](https://github.com/rags/pynt). Pynt is used for the project's Python-based build scripts. 146 | 147 | 148 | # Configuring the project 149 | 150 | This section list configuration files, parameters within them, and parameter default values. The build commands (detailed later) and CloudFormation templates extract their parameters from these files. Also, the Glue Runner AWS Lambda function extract parameters at runtime from `gluerunner-config.json`. 151 | 152 | >**NOTE: Do not remove any of the attributes already specified in these files.** 153 | 154 | >**NOTE: You must set the value of any parameter that has the tag NO-DEFAULT** 155 | 156 | 157 | ## cloudformation/gluerunner-lambda-params.json 158 | 159 | Specifies parameters for creation of the `gluerunner-lambda` CloudFormation stack (as defined in `gluerunner-lambda.yaml` template). It is also used by several build scripts, including `createstack`, and `updatestack`. 
160 | 161 | ```json 162 | [ 163 | { 164 | "ParameterKey": "ArtifactBucketName", 165 | "ParameterValue": "" 166 | }, 167 | { 168 | "ParameterKey": "LambdaSourceS3Key", 169 | "ParameterValue": "src/gluerunner.zip" 170 | }, 171 | { 172 | "ParameterKey": "DDBTableName", 173 | "ParameterValue": "GlueRunnerActiveJobs" 174 | }, 175 | { 176 | "ParameterKey": "GlueRunnerLambdaFunctionName", 177 | "ParameterValue": "gluerunner" 178 | } 179 | ] 180 | ``` 181 | #### Parameters: 182 | 183 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.** 184 | 185 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/gluerunner.zip`) pointing to your AWS Lambda function's .zip package in the artifact bucket. 186 | 187 | * `DDBTableName` - The Amazon DynamoDB table in which the state of active AWS Glue jobs is tracked between Glue Runner AWS Lambda function invocations. 188 | 189 | * `GlueRunnerLambdaFunctionName` - The name to be assigned to the Glue Runner AWS Lambda function. 190 | 191 | 192 | ## cloudformation/athenarunner-lambda-params.json 193 | 194 | Specifies parameters for creation of the `gluerunner-lambda` CloudFormation stack (as defined in `gluerunner-lambda.yaml` template). It is also used by several build scripts, including `createstack`, and `updatestack`. 195 | 196 | ```json 197 | [ 198 | { 199 | "ParameterKey": "ArtifactBucketName", 200 | "ParameterValue": "" 201 | }, 202 | { 203 | "ParameterKey": "LambdaSourceS3Key", 204 | "ParameterValue": "src/athenarunner.zip" 205 | }, 206 | { 207 | "ParameterKey": "DDBTableName", 208 | "ParameterValue": "AthenaRunnerActiveJobs" 209 | }, 210 | { 211 | "ParameterKey": "AthenaRunnerLambdaFunctionName", 212 | "ParameterValue": "athenarunner" 213 | } 214 | ] 215 | ``` 216 | #### Parameters: 217 | 218 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.** 219 | 220 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/athenarunner.zip`) pointing to your AWS Lambda function's .zip package. 221 | 222 | * `DDBTableName` - The Amazon DynamoDB table in which the state of active AWS Athena queries is tracked between Athena Runner AWS Lambda function invocations. 223 | 224 | * `AthenaRunnerLambdaFunctionName` - The name to be assigned to the Athena Runner AWS Lambda function. 225 | 226 | 227 | 228 | ## lambda/s3-deployment-descriptor.json 229 | 230 | Specifies the S3 buckets and prefixes for the Glue Runner lambda function (and other functions that may be added later). It is used by the `deploylambda` build script. 231 | 232 | Sample content: 233 | 234 | ```json 235 | { 236 | "gluerunner": { 237 | "ArtifactBucketName": "", 238 | "LambdaSourceS3Key":"src/gluerunner.zip" 239 | }, 240 | "ons3objectcreated": { 241 | "ArtifactBucketName": "", 242 | "LambdaSourceS3Key":"src/ons3objectcreated.zip" 243 | } 244 | } 245 | ``` 246 | #### Parameters: 247 | 248 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. 
**If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.** 249 | 250 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/gluerunner.zip`) for your AWS Lambda function's .zip package. 251 | 252 | >**NOTE: The values set here must match values set in `cloudformation/gluerunner-lambda-params.json`.** 253 | 254 | 255 | 256 | ## cloudformation/glue-resources-params.json 257 | 258 | Specifies parameters for creation of the `glue-resources` CloudFormation stack (as defined in `glue-resources.yaml` template). It is also read by several build scripts, including `createstack`, `updatestack`, and `deploygluescripts`. 259 | 260 | ```json 261 | [ 262 | { 263 | "ParameterKey": "ArtifactBucketName", 264 | "ParameterValue": "" 265 | }, 266 | { 267 | "ParameterKey": "ETLScriptsPrefix", 268 | "ParameterValue": "scripts" 269 | }, 270 | { 271 | "ParameterKey": "DataBucketName", 272 | "ParameterValue": "" 273 | }, 274 | { 275 | "ParameterKey": "ETLOutputPrefix", 276 | "ParameterValue": "output" 277 | } 278 | ] 279 | ``` 280 | #### Parameters: 281 | 282 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) that will be created by the `step-functions-resources.yaml` CloudFormation template. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.** 283 | 284 | * `ETLScriptsPrefix` - The Amazon S3 prefix (in the format ``example/path`` without leading or trailing '/') to which AWS Glue scripts will be deployed in the artifact bucket. Glue scripts can be found under the `glue-scripts` project directory 285 | 286 | * `DataBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) that will be created by the `step-functions-resources.yaml` CloudFormation template. This is the bucket to which Sales and Marketing datasets must be uploaded. It is also the bucket in which output will be created. **This bucket is created by `step-functions-resources` CloudFormation. CloudFormation stack creation will fail if the bucket already exists.** 287 | 288 | * `ETLOutputPrefix` - The Amazon S3 prefix (in the format ``example/path`` without leading or trailing '/') to which AWS Glue jobs will produce their intermediary outputs. This path will be created in the data bucket. 289 | 290 | 291 | 292 | The parameters are used by AWS CloudFormation during the creation of `glue-resources` stack. 293 | 294 | 295 | ## lambda/gluerunner/gluerunner-config.json 296 | 297 | Specifies the parameters used by Glue Runner AWS Lambda function at run-time. 298 | 299 | ```json 300 | { 301 | "sfn_activity_arn": "arn:aws:states:::activity:GlueRunnerActivity", 302 | "sfn_worker_name": "gluerunner", 303 | "ddb_table": "GlueRunnerActiveJobs", 304 | "ddb_query_limit": 50, 305 | "glue_job_capacity": 10 306 | } 307 | ``` 308 | #### Parameters: 309 | 310 | * `sfn_activity_arn` - AWS Step Functions activity task ARN. This ARN is used to query AWS Step Functions for new tasks (i.e. new AWS Glue jobs to run). The ARN is a combination of the AWS region, your AWS account Id, and the name property of the [AWS::StepFunctions::Activity](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-stepfunctions-activity.html) resource in the `stepfunctions-resources.yaml` CloudFormation template. An ARN looks as follows `arn:aws:states:::activity:`. By default, the activity name is `GlueRunnerActivity`. 
311 | 312 | * `sfn_worker_name` - A property that is passed to AWS Step Functions when getting activity tasks. 313 | 314 | * `ddb_table` - The DynamoDB table name in which state for active AWS Glue jobs is persisted between Glue Runner invocations. **This parameter's value must match the value of `DDBTableName` parameter in the `cloudformation/gluerunner-lambda-params.json` config file.** 315 | 316 | * `ddb_query_limit` - Maximum number of items to be retrieved when the DynamoDB state table is queried. Default is `50` items. 317 | 318 | * `glue_job_capacity` - The capacity, in [Data Processing Units](https://aws.amazon.com/glue/pricing/), which is allocated to every AWS Glue job started by Glue Runner. 319 | 320 | 321 | ## cloudformation/step-functions-resources-params.json 322 | 323 | Specifies parameters for creation of the `step-functions-resources` CloudFormation stack (as defined in `step-functions-resources.yaml` template). It is also read by several build scripts, including `createstack`, `updatestack`, and `deploygluescripts`. 324 | 325 | ```json 326 | [ 327 | { 328 | "ParameterKey": "ArtifactBucketName", 329 | "ParameterValue": "" 330 | }, 331 | { 332 | "ParameterKey": "LambdaSourceS3Key", 333 | "ParameterValue": "src/ons3objectcreated.zip" 334 | }, 335 | { 336 | "ParameterKey": "GlueRunnerActivityName", 337 | "ParameterValue": "GlueRunnerActivity" 338 | }, 339 | { 340 | "ParameterKey": "DataBucketName", 341 | "ParameterValue": "" 342 | } 343 | ] 344 | ``` 345 | #### Parameters: 346 | 347 | * `GlueRunnerActivityName` - The Step Functions activity name that will be polled for tasks by the Glue Runner lambda function. 348 | 349 | Both parameters are also used by AWS CloudFormation during stack creation. 350 | 351 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) to which the `ons3objectcreated` AWS Lambda function package (.zip file) will be deployed. If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions. 352 | 353 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/ons3objectcreated.zip`) for your AWS Lambda function's .zip package. 354 | 355 | * `DataBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix). All OnS3ObjectCreated CloudWatch Events will for the bucket be handled by the `ons3objectcreated` AWS Lambda function. **This bucket will be created by CloudFormation. CloudFormation stack creation will fail if the bucket already exists.** 356 | 357 | 358 | # Build commands 359 | Common interactions with the project have been simplified for you. Using `pynt`, the following tasks are automated with simple commands: 360 | 361 | - Creating, deleting, updating, and checking the status of AWS CloudFormation stacks under `cloudformation` directory 362 | - Packaging lambda code into a .zip file and deploying it into an Amazon S3 bucket 363 | - Updating Glue Runner lambda package directly in AWS Lambda 364 | 365 | For a list of all available tasks, enter the following command in the base directory of this project: 366 | 367 | ```bash 368 | pynt -l 369 | ``` 370 | 371 | Build commands are implemented as Python scripts in the file ```build.py```. The scripts use the AWS Python SDK (Boto) under the hood. They are documented in the following section. 372 | 373 | >Prior to using these build commands, you must configure the project. 
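Note that because the build scripts call Boto directly, they run against whatever AWS credentials and default region your environment resolves to. If needed, you can pin these explicitly before invoking `pynt` (the profile name below is a placeholder):

```bash
export AWS_PROFILE=my-admin-profile
export AWS_DEFAULT_REGION=us-east-1
```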
374 | 375 | 376 | ## The `packagelambda` build command 377 | 378 | Run this command to package the Glue Runner AWS Lambda function and its dependencies into a .zip package. The deployment packages are created under the `build/` directory. 379 | 380 | ```bash 381 | pynt packagelambda # Package all AWS Lambda functions and their dependencies into zip files. 382 | 383 | pynt packagelambda[gluerunner] # Package only the Glue Runner function. 384 | 385 | pynt packagelambda[ons3objectcreated] # Package only the On S3 Object Created function. 386 | ``` 387 | 388 | 389 | ## The `deploylambda` build command 390 | 391 | Run this command before you run `createstack`. The ```deploylambda``` command uploads Glue Runner .zip package to Amazon S3 for pickup by AWS CloudFormation while creating the `gluerunner-lambda` stack. 392 | 393 | Here are sample command invocations. 394 | 395 | ```bash 396 | pynt deploylambda # Deploy all functions .zip files to Amazon S3. 397 | 398 | pynt deploylambda[gluerunner] # Deploy only Glue Runner function to Amazon S3 399 | 400 | pynt deploylambda[ons3objectcreated] # Deploy only On S3 Object Created function to Amazon S3 401 | ``` 402 | 403 | ## The `createstack` build command 404 | The createstack command creates the project's AWS CloudFormation stacks by invoking the `create_stack()` API. 405 | 406 | Valid parameters to this command are `glue-resources`, `gluerunner-lambda`, `step-functions-resources`. 407 | 408 | Note that for `gluerunner-lambda` stack, you must first package and deploy Glue Runner function to Amazon S3 using the `packagelambda` and then `deploylambda` commands for the AWS CloudFormation stack creation to succeed. 409 | 410 | Examples of using `createstack`: 411 | 412 | ```bash 413 | #Create AWS Step Functions resources 414 | pynt createstack["step-functions-resources"] 415 | ``` 416 | ```bash 417 | #Create AWS Glue resources 418 | pynt createstack["glue-resources"] 419 | ``` 420 | ```bash 421 | #Create Glue Runner AWS Lambda function resources 422 | pynt createstack["gluerunner-lambda"] 423 | ``` 424 | 425 | **For a complete example, please refer to the 'Putting it all together' section.** 426 | 427 | Stack creation should take a few minutes. At any time, you can check on the prototype's stack status either through the AWS CloudFormation console or by issuing the following command. 428 | 429 | ```bash 430 | pynt stackstatus 431 | ``` 432 | 433 | 434 | ## The `updatestack` build command 435 | 436 | The `updatestack` command updates any of the projects AWS CloudFormation stacks. Valid parameters to this command are `glue-resources`, `gluerunner-lambda`, `step-functions-resources`. 437 | 438 | You can issue the `updatestack` command as follows. 439 | 440 | ```bash 441 | pynt updatestack["glue-resources"] 442 | ``` 443 | 444 | 445 | ## The `deletestack` build command 446 | 447 | The `deletestack` command, calls the AWS CloudFormation delete_stack() API to delete a CloudFormation stack from your account. You must specify a stack to delete. Valid values are `glue-resources`, `gluerunner-lambda`, `step-functions-resources` 448 | 449 | You can issue the `deletestack` command as follows. 450 | 451 | ```bash 452 | pynt deletestack["glue-resources"] 453 | ``` 454 | 455 | > NOTE: Before deleting `step-functions-resources` stack, you have to delete the S3 bucket specified in the `DataBucketName` parameter value in `step-functions-resources-config.json` configuration file. You can use the `pynt deletes3bucket[]` build command to delete the bucket. 
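For example, assuming the data bucket was named `my-etl-data-bucket` (a placeholder), you could remove it either with the build command or directly with the AWS CLI:

```bash
# Using the project's build command (prompts for confirmation)
pynt deletes3bucket["my-etl-data-bucket"]

# Equivalent AWS CLI commands
aws s3 rm s3://my-etl-data-bucket --recursive
aws s3 rb s3://my-etl-data-bucket
```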
456 | 457 | As with `createstack`, you can check the status of stack deletion using the `stackstatus` build command. 458 | 459 | 460 | ## The `deploygluescripts` build command 461 | 462 | This command deploys AWS Glue scripts to the Amazon S3 path you specified in project config files. 463 | 464 | You can issue the `deploygluescripts` command as follows. 465 | 466 | ```bash 467 | pynt deploygluescripts 468 | ``` 469 | 470 | 471 | # Putting it all together: Example usage of build commands 472 | 473 | 474 | ```bash 475 | ### AFTER PROJECT CONFIGURATION ### 476 | 477 | # Package and deploy Glue Runner AWS Lambda function 478 | pynt packagelambda 479 | pynt deploylambda 480 | 481 | # Deploy AWS Glue scripts 482 | pynt deploygluescripts 483 | 484 | # Create step-functions-resources stack. 485 | # THIS STACK MUST BE CREATED BEFORE `glue-resources` STACK. 486 | pynt createstack["step-functions-resources"] 487 | 488 | # Create glue-resources stack 489 | pynt createstack["glue-resources"] 490 | 491 | # Create gluerunner-lambda stack 492 | pynt createstack["gluerunner-lambda"] 493 | 494 | # Create athenarunner-lambda stack 495 | pynt createstack["athenarunner-lambda"] 496 | 497 | ``` 498 | 499 | Note that the `step-functions-resources` stack **must** be created first, before the `glue-resources` stack. 500 | 501 | Now head to the AWS Step Functions console. Start and observe an execution of the 'MarketingAndSalesETLOrchestrator' state machine. Execution should halt at the 'Wait for XYZ Data' states. At this point, you should upload the sample .CSV files under the `samples` directory to the S3 bucket you specified as the `DataBucketName` parameter value in `step-functions-resources-config.json` configuration file. **Upload the marketing sample file under prefix 'marketing' and the sales sample file under prefix 'sales'. To do that, you may issue the following AWS CLI commands while at the project's root directory:** 502 | 503 | ``` 504 | aws s3 cp samples/MarketingData_QuickSightSample.csv s3://{DataBucketName}/marketing/ 505 | 506 | aws s3 cp samples/SalesPipeline_QuickSightSample.csv s3://{DataBucketName}/sales/ 507 | ``` 508 | 509 | This should allow the state machine to move on to next steps -- Process Sales Data and Process Marketing Data. 510 | 511 | If you have setup and run the sample correctly, you should see this output in the AWS Step Functions console: 512 | 513 | ![ETL flow in AWS Step Functions - Success](resources/MarketingAndSalesETLOrchestrator-success.png) 514 | 515 | This indicates that all jobs have been run and orchestrated successfully. 516 | 517 | 518 | # License 519 | This project is licensed under the MIT-No Attribution (MIT-0) license. 520 | 521 | A copy of the License is located at 522 | 523 | [https://spdx.org/licenses/MIT-0.html](https://spdx.org/licenses/MIT-0.html) -------------------------------------------------------------------------------- /build.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 
8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import os 17 | import shutil 18 | import zipfile 19 | import time 20 | 21 | from pip._vendor.distlib.compat import raw_input 22 | from pynt import task 23 | import boto3 24 | import botocore 25 | from botocore.exceptions import ClientError 26 | import json 27 | import re 28 | 29 | 30 | def write_dir_to_zip(src, zf): 31 | '''Write a directory tree to an open ZipFile object.''' 32 | abs_src = os.path.abspath(src) 33 | for dirname, subdirs, files in os.walk(src): 34 | for filename in files: 35 | absname = os.path.abspath(os.path.join(dirname, filename)) 36 | arcname = absname[len(abs_src) + 1:] 37 | print('zipping {} as {}'.format(os.path.join(dirname, filename), 38 | arcname)) 39 | zf.write(absname, arcname) 40 | 41 | def read_json(jsonf_path): 42 | '''Read a JSON file into a dict.''' 43 | with open(jsonf_path, 'r') as jsonf: 44 | json_text = jsonf.read() 45 | return json.loads(json_text) 46 | 47 | def check_bucket_exists(s3path): 48 | s3 = boto3.resource('s3') 49 | 50 | result = re.search('s3://(.*)/', s3path) 51 | bucketname = s3path if result is None else result.group(1) 52 | bucket = s3.Bucket(bucketname) 53 | exists = True 54 | 55 | try: 56 | 57 | s3.meta.client.head_bucket(Bucket=bucketname) 58 | except botocore.exceptions.ClientError as e: 59 | # If a client error is thrown, then check that it was a 404 error. 60 | # If it was a 404 error, then the bucket does not exist. 
61 | error_code = int(e.response['Error']['Code']) 62 | if error_code == 404: 63 | exists = False 64 | return exists 65 | 66 | @task() 67 | def clean(): 68 | '''Clean build directory.''' 69 | print('Cleaning build directory...') 70 | 71 | if os.path.exists('build'): 72 | shutil.rmtree('build') 73 | 74 | os.mkdir('build') 75 | 76 | @task() 77 | def packagelambda(* functions): 78 | '''Package lambda functions into a deployment-ready zip files.''' 79 | if not os.path.exists('build'): 80 | os.mkdir('build') 81 | 82 | os.chdir("build") 83 | 84 | if len(functions) == 0: 85 | functions = ("athenarunner", "gluerunner", "ons3objectcreated") 86 | 87 | for function in functions: 88 | print('Packaging "{}" lambda function in directory'.format(function)) 89 | zipf = zipfile.ZipFile("%s.zip" % function, "w", zipfile.ZIP_DEFLATED) 90 | 91 | write_dir_to_zip("../lambda/{}/".format(function), zipf) 92 | 93 | zipf.close() 94 | 95 | os.chdir("..") 96 | 97 | return 98 | 99 | 100 | @task() 101 | def updatelambda(*functions): 102 | '''Directly update lambda function code in AWS (without upload to S3).''' 103 | lambda_client = boto3.client('lambda') 104 | 105 | if len(functions) == 0: 106 | functions = ("athenarunner", "gluerunner", "ons3objectcreated") 107 | 108 | for function in functions: 109 | with open('build/%s.zip' % function, 'rb') as zipf: 110 | lambda_client.update_function_code( 111 | FunctionName=function, 112 | ZipFile=zipf.read() 113 | ) 114 | return 115 | 116 | 117 | @task() 118 | def deploylambda(*functions, **kwargs): 119 | '''Upload lambda functions .zip file to S3 for download by CloudFormation stack during creation.''' 120 | 121 | if len(functions) == 0: 122 | functions = ("athenarunner", "gluerunner", "ons3objectcreated") 123 | 124 | region_name = boto3.session.Session().region_name 125 | 126 | s3_client = boto3.client("s3") 127 | 128 | print("Reading .lambda/s3-deployment-descriptor.json...") 129 | 130 | params = read_json("./lambda/s3-deployment-descriptor.json") 131 | 132 | for function in functions: 133 | 134 | src_s3_bucket_name = params[function]['ArtifactBucketName'] 135 | src_s3_key = params[function]['LambdaSourceS3Key'] 136 | 137 | if not src_s3_key and not src_s3_bucket_name: 138 | print( 139 | "ERROR: Both Artifact S3 bucket name and Lambda source S3 key must be specified for function '{}'. FUNCTION NOT DEPLOYED.".format( 140 | function)) 141 | continue 142 | 143 | print("Checking if S3 Bucket '{}' exists...".format(src_s3_bucket_name)) 144 | 145 | if not check_bucket_exists(src_s3_bucket_name): 146 | print("Bucket %s not found. Creating in region {}.".format(src_s3_bucket_name, region_name)) 147 | 148 | if region_name == "us-east-1": 149 | s3_client.create_bucket( 150 | # ACL="authenticated-read", 151 | Bucket=src_s3_bucket_name 152 | ) 153 | else: 154 | s3_client.create_bucket( 155 | # ACL="authenticated-read", 156 | Bucket=src_s3_bucket_name, 157 | CreateBucketConfiguration={ 158 | "LocationConstraint": region_name 159 | } 160 | ) 161 | 162 | print("Uploading function '{}' to '{}'".format(function, src_s3_key)) 163 | 164 | with open('build/{}.zip'.format(function), 'rb') as data: 165 | s3_client.upload_fileobj(data, src_s3_bucket_name, src_s3_key) 166 | 167 | return 168 | 169 | 170 | @task() 171 | def createstack(* stacks, **kwargs): 172 | '''Create stacks using CloudFormation.''' 173 | 174 | if len(stacks) == 0: 175 | print( 176 | "ERROR: Please specify a stack to create. 
Valid values are glue-resources, gluerunner-lambda, step-functions-resources.") 177 | return 178 | 179 | for stack in stacks: 180 | cfn_path = "cloudformation/{}.yaml".format(stack) 181 | cfn_params_path = "cloudformation/{}-params.json".format(stack) 182 | cfn_params = read_json(cfn_params_path) 183 | stack_name = stack 184 | 185 | cfn_file = open(cfn_path, 'r') 186 | cfn_template = cfn_file.read(51200) #Maximum size of a cfn template 187 | 188 | cfn_client = boto3.client('cloudformation') 189 | 190 | print("Attempting to CREATE '%s' stack using CloudFormation." % stack_name) 191 | start_t = time.time() 192 | response = cfn_client.create_stack( 193 | StackName=stack_name, 194 | TemplateBody=cfn_template, 195 | Parameters=cfn_params, 196 | Capabilities=[ 197 | 'CAPABILITY_NAMED_IAM', 198 | ], 199 | ) 200 | 201 | print("Waiting until '%s' stack status is CREATE_COMPLETE" % stack_name) 202 | 203 | try: 204 | # cc           +o 205 | cfn_stack_delete_waiter = cfn_client.get_waiter('stack_create_complete') 206 | cfn_stack_delete_waiter.wait(StackName=stack_name) 207 | print("Stack CREATED in approximately %d secs." % int(time.time() - start_t)) 208 | 209 | except Exception as e: 210 | print("Stack creation FAILED.") 211 | print(e.message) 212 | 213 | 214 | @task() 215 | def updatestack(* stacks, **kwargs): 216 | '''Update a CloudFormation stack.''' 217 | 218 | if len(stacks) == 0: 219 | print( 220 | "ERROR: Please specify a stack to create. Valid values are glue-resources, gluerunner-lambda, step-functions-resources.") 221 | return 222 | 223 | for stack in stacks: 224 | stack_name = stack 225 | cfn_path = "cloudformation/{}.yaml".format(stack) 226 | cfn_params_path = "cloudformation/{}-params.json".format(stack) 227 | cfn_params = read_json(cfn_params_path) 228 | 229 | cfn_file = open(cfn_path, 'r') 230 | cfn_template = cfn_file.read(51200) #Maximum size of a cfn template 231 | 232 | cfn_client = boto3.client('cloudformation') 233 | 234 | print("Attempting to UPDATE '%s' stack using CloudFormation." % stack_name) 235 | try: 236 | start_t = time.time() 237 | response = cfn_client.update_stack( 238 | StackName=stack_name, 239 | TemplateBody=cfn_template, 240 | Parameters=cfn_params, 241 | Capabilities=[ 242 | 'CAPABILITY_NAMED_IAM', 243 | ], 244 | ) 245 | 246 | print("Waiting until '%s' stack status is UPDATE_COMPLETE" % stack_name) 247 | cfn_stack_update_waiter = cfn_client.get_waiter('stack_update_complete') 248 | cfn_stack_update_waiter.wait(StackName=stack_name) 249 | 250 | print("Stack UPDATED in approximately %d secs." 
% int(time.time() - start_t)) 251 | except ClientError as e: 252 | print("EXCEPTION: " + e.response["Error"]["Message"]) 253 | 254 | @task() 255 | def stackstatus(* stacks): 256 | '''Check the status of a CloudFormation stack.''' 257 | 258 | if len(stacks) == 0: 259 | stacks = ("glue-resources", "gluerunner-lambda", "step-functions-resources") 260 | 261 | for stack in stacks: 262 | stack_name = stack 263 | 264 | cfn_client = boto3.client('cloudformation') 265 | 266 | try: 267 | response = cfn_client.describe_stacks( 268 | StackName=stack_name 269 | ) 270 | 271 | if response["Stacks"][0]: 272 | print("Stack '%s' has the status '%s'" % (stack_name, response["Stacks"][0]["StackStatus"])) 273 | 274 | except ClientError as e: 275 | print("EXCEPTION: " + e.response["Error"]["Message"]) 276 | 277 | 278 | @task() 279 | def deletestack(* stacks): 280 | '''Delete stacks using CloudFormation.''' 281 | 282 | if len(stacks) == 0: 283 | print("ERROR: Please specify a stack to delete.") 284 | return 285 | 286 | for stack in stacks: 287 | stack_name = stack 288 | 289 | cfn_client = boto3.client('cloudformation') 290 | 291 | print("Attempting to DELETE '%s' stack using CloudFormation." % stack_name) 292 | start_t = time.time() 293 | response = cfn_client.delete_stack( 294 | StackName=stack_name 295 | ) 296 | 297 | print("Waiting until '%s' stack status is DELETE_COMPLETE" % stack_name) 298 | cfn_stack_delete_waiter = cfn_client.get_waiter('stack_delete_complete') 299 | cfn_stack_delete_waiter.wait(StackName=stack_name) 300 | print("Stack DELETED in approximately %d secs." % int(time.time() - start_t)) 301 | 302 | 303 | @task() 304 | def deploygluescripts(**kwargs): 305 | '''Upload AWS Glue scripts to S3 for download by CloudFormation stack during creation.''' 306 | 307 | region_name = boto3.session.Session().region_name 308 | 309 | s3_client = boto3.client("s3") 310 | 311 | glue_scripts_path = "./glue-scripts/" 312 | 313 | glue_cfn_params = read_json("cloudformation/glue-resources-params.json") 314 | 315 | s3_etl_script_path = '' 316 | bucket_name = '' 317 | prefix = '' 318 | for param in glue_cfn_params: 319 | if param['ParameterKey'] == 'ArtifactBucketName': 320 | bucket_name = param['ParameterValue'] 321 | if param['ParameterKey'] == 'ETLScriptsPrefix': 322 | prefix = param['ParameterValue'] 323 | 324 | if not bucket_name or not prefix: 325 | print( 326 | "ERROR: ArtifactBucketName and ETLScriptsPrefix must be set in 'cloudformation/glue-resources-params.json'.") 327 | return 328 | 329 | s3_etl_script_path = 's3://' + bucket_name + '/' + prefix 330 | 331 | result = re.search('s3://(.+?)/(.*)', s3_etl_script_path) 332 | if result is None: 333 | print("ERROR: Invalid S3 ETL bucket name and/or script prefix.") 334 | return 335 | 336 | print("Checking if S3 Bucket '{}' exists...".format(bucket_name)) 337 | 338 | if not check_bucket_exists(bucket_name): 339 | print("ERROR: S3 bucket '{}' not found.".format(bucket_name)) 340 | return 341 | 342 | for dirname, subdirs, files in os.walk(glue_scripts_path): 343 | for filename in files: 344 | absname = os.path.abspath(os.path.join(dirname, filename)) 345 | print("Uploading AWS Glue script '{}' to '{}/{}'".format(absname, bucket_name, prefix)) 346 | with open(absname, 'rb') as data: 347 | s3_client.upload_fileobj(data, bucket_name, '{}/{}'.format(prefix, filename)) 348 | 349 | return 350 | 351 | 352 | @task() 353 | def deletes3bucket(name): 354 | '''DELETE ALL objects in an Amazon S3 bucket and THE BUCKET ITSELF. 
Use with caution!''' 355 | 356 | proceed = raw_input( 357 | "This command will DELETE ALL DATA in S3 bucket '%s' and the BUCKET ITSELF.\nDo you wish to continue? [Y/N] " \ 358 | % name) 359 | 360 | if proceed.lower() != 'y': 361 | print("Aborting deletion.") 362 | return 363 | 364 | print("Attempting to DELETE ALL OBJECTS in '%s' S3 bucket." % name) 365 | 366 | s3 = boto3.resource('s3') 367 | bucket = s3.Bucket(name) 368 | bucket.objects.delete() 369 | bucket.delete() 370 | return -------------------------------------------------------------------------------- /cloudformation/athenarunner-lambda-params.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "ParameterKey": "ArtifactBucketName", 4 | "ParameterValue": "" 5 | }, 6 | { 7 | "ParameterKey": "LambdaSourceS3Key", 8 | "ParameterValue": "src/athenarunner.zip" 9 | }, 10 | { 11 | "ParameterKey": "DDBTableName", 12 | "ParameterValue": "AthenaRunnerActiveJobs" 13 | }, 14 | { 15 | "ParameterKey": "AthenaRunnerLambdaFunctionName", 16 | "ParameterValue": "athenarunner" 17 | } 18 | ] -------------------------------------------------------------------------------- /cloudformation/athenarunner-lambda.yaml: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | AWSTemplateFormatVersion: 2010-09-09 17 | 18 | Description: The CloudFormation template for AWS resources required by Athena Runner. 19 | 20 | Parameters: 21 | 22 | DDBTableName: 23 | Type: String 24 | Default: "AthenaRunnerActiveQueries" 25 | Description: "Name of the DynamoDB table for persistence & querying of active Athena queries' statuses." 26 | 27 | AthenaRunnerLambdaFunctionName: 28 | Type: String 29 | Default: "athenarunner" 30 | Description: "Name of the Lambda function that mediates between AWS Step Functions and AWS Athena." 31 | 32 | ArtifactBucketName: 33 | Type: String 34 | MinLength: "1" 35 | Description: "Name of the S3 bucket containing source .zip files." 36 | 37 | LambdaSourceS3Key: 38 | Type: String 39 | MinLength: "1" 40 | Description: "Name of the S3 key of Athena Runner lambda function .zip file." 
41 | 42 | Resources: 43 | 44 | AthenaRunnerLambdaExecutionRole: 45 | Type: "AWS::IAM::Role" 46 | Properties: 47 | AssumeRolePolicyDocument: 48 | Version: '2012-10-17' 49 | Statement: 50 | - Effect: Allow 51 | Principal: 52 | Service: 53 | - lambda.amazonaws.com 54 | Action: 55 | - sts:AssumeRole 56 | Path: "/" 57 | 58 | AthenaRunnerPolicy: 59 | Type: "AWS::IAM::Policy" 60 | Properties: 61 | PolicyDocument: { 62 | "Version": "2012-10-17", 63 | "Statement": [{ 64 | "Effect": "Allow", 65 | "Action": [ 66 | "dynamodb:GetItem", 67 | "dynamodb:Query", 68 | "dynamodb:PutItem", 69 | "dynamodb:UpdateItem", 70 | "dynamodb:DeleteItem" 71 | ], 72 | "Resource": !Sub "arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${DDBTableName}" 73 | }, 74 | { 75 | "Effect": "Allow", 76 | "Action": [ 77 | "logs:CreateLogStream", 78 | "logs:PutLogEvents" 79 | ], 80 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*" 81 | }, 82 | { 83 | "Effect": "Allow", 84 | "Action": "logs:CreateLogGroup", 85 | "Resource": "*" 86 | }, 87 | { 88 | "Effect": "Allow", 89 | "Action": [ 90 | "states:SendTaskSuccess", 91 | "states:SendTaskFailure", 92 | "states:SendTaskHeartbeat", 93 | "states:GetActivityTask" 94 | ], 95 | "Resource": "*" 96 | }, 97 | { 98 | "Effect": "Allow", 99 | "Action": [ 100 | "athena:StartQueryExecution", 101 | "athena:GetQueryExecution", 102 | "athena:GetNamedQuery" 103 | ], 104 | "Resource": "*" 105 | } 106 | ] 107 | } 108 | PolicyName: "AthenaRunnerPolicy" 109 | Roles: 110 | - !Ref AthenaRunnerLambdaExecutionRole 111 | 112 | AthenaRunnerLambdaFunction: 113 | Type: AWS::Lambda::Function 114 | Properties: 115 | FunctionName: "athenarunner" 116 | Description: "Starts and monitors AWS Athena queries on behalf of AWS Step Functions for a specific Activity ARN." 
117 | Handler: "athenarunner.handler" 118 | Role: !GetAtt AthenaRunnerLambdaExecutionRole.Arn 119 | Code: 120 | S3Bucket: !Ref ArtifactBucketName 121 | S3Key: !Ref LambdaSourceS3Key 122 | Timeout: 180 #seconds 123 | MemorySize: 128 #MB 124 | Runtime: python2.7 125 | DependsOn: 126 | - AthenaRunnerLambdaExecutionRole 127 | 128 | ScheduledRule: 129 | Type: "AWS::Events::Rule" 130 | Properties: 131 | Description: "ScheduledRule" 132 | ScheduleExpression: "rate(3 minutes)" 133 | State: "ENABLED" 134 | Targets: 135 | - 136 | Arn: !GetAtt AthenaRunnerLambdaFunction.Arn 137 | Id: "AthenaRunner" 138 | 139 | PermissionForEventsToInvokeLambda: 140 | Type: "AWS::Lambda::Permission" 141 | Properties: 142 | FunctionName: !Ref AthenaRunnerLambdaFunction 143 | Action: "lambda:InvokeFunction" 144 | Principal: "events.amazonaws.com" 145 | SourceArn: !GetAtt ScheduledRule.Arn 146 | 147 | AthenaRunnerActiveQueriesTable: 148 | Type: "AWS::DynamoDB::Table" 149 | Properties: 150 | TableName: !Ref DDBTableName 151 | KeySchema: 152 | - KeyType: "HASH" 153 | AttributeName: "sfn_activity_arn" 154 | - KeyType: "RANGE" 155 | AttributeName: "athena_query_execution_id" 156 | AttributeDefinitions: 157 | - AttributeName: "sfn_activity_arn" 158 | AttributeType: "S" 159 | - AttributeName: "athena_query_execution_id" 160 | AttributeType: "S" 161 | ProvisionedThroughput: 162 | WriteCapacityUnits: 10 163 | ReadCapacityUnits: 10 -------------------------------------------------------------------------------- /cloudformation/glue-resources-params.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "ParameterKey": "ArtifactBucketName", 4 | "ParameterValue": "" 5 | }, 6 | { 7 | "ParameterKey": "ETLScriptsPrefix", 8 | "ParameterValue": "scripts" 9 | }, 10 | { 11 | "ParameterKey": "DataBucketName", 12 | "ParameterValue": "" 13 | }, 14 | { 15 | "ParameterKey": "ETLOutputPrefix", 16 | "ParameterValue": "output" 17 | } 18 | ] -------------------------------------------------------------------------------- /cloudformation/glue-resources.yaml: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | 17 | AWSTemplateFormatVersion: "2010-09-09" 18 | Description: > 19 | This template sets up sample AWS Glue resources to be orchestrated by AWS Step Functions. 
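# The table definitions below point at s3://<DataBucketName>/sales/ and
# s3://<DataBucketName>/marketing/. The QuickSight sample CSVs under samples/ are expected
# to be uploaded to those prefixes so the catalog tables resolve to data.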
20 | 21 | Parameters: 22 | 23 | MarketingAndSalesDatabaseName: 24 | Type: String 25 | MinLength: "4" 26 | Default: "marketingandsales_qs" 27 | Description: "Name of the AWS Glue database to contain this CloudFormation template's tables." 28 | 29 | SalesPipelineTableName: 30 | Type: String 31 | MinLength: "4" 32 | Default: "salespipeline_qs" 33 | Description: "Name of the Sales Pipeline data table in AWS Glue." 34 | 35 | MarketingTableName: 36 | Type: String 37 | MinLength: "4" 38 | Default: "marketing_qs" 39 | Description: "Name of the Marketing data table in AWS Glue." 40 | 41 | ETLScriptsPrefix: 42 | Type: String 43 | MinLength: "1" 44 | Description: "Location of the Glue job ETL scripts in S3." 45 | 46 | ETLOutputPrefix: 47 | Type: String 48 | MinLength: "1" 49 | Description: "Name of the S3 output path to which this CloudFormation template's AWS Glue jobs are going to write ETL output." 50 | 51 | DataBucketName: 52 | Type: String 53 | MinLength: "1" 54 | Description: "Name of the S3 bucket in which the source Marketing and Sales data will be uploaded. Bucket is created by this CFT." 55 | 56 | ArtifactBucketName: 57 | Type: String 58 | MinLength: "1" 59 | Description: "Name of the S3 bucket in which the Marketing and Sales ETL scripts reside. Bucket is NOT created by this CFT." 60 | 61 | Resources: 62 | 63 | ### AWS GLUE RESOURCES ### 64 | AWSGlueJobRole: 65 | Type: "AWS::IAM::Role" 66 | Properties: 67 | AssumeRolePolicyDocument: 68 | Version: '2012-10-17' 69 | Statement: 70 | - Effect: Allow 71 | Principal: 72 | Service: 73 | - glue.amazonaws.com 74 | Action: 75 | - sts:AssumeRole 76 | Policies: 77 | - PolicyName: root 78 | PolicyDocument: 79 | Version: 2012-10-17 80 | Statement: 81 | - Effect: Allow 82 | Action: 83 | - "s3:GetObject" 84 | - "s3:PutObject" 85 | - "s3:ListBucket" 86 | - "s3:DeleteObject" 87 | Resource: 88 | - !Sub "arn:aws:s3:::${DataBucketName}" 89 | - !Sub "arn:aws:s3:::${DataBucketName}/*" 90 | - !Sub "arn:aws:s3:::${ArtifactBucketName}" 91 | - !Sub "arn:aws:s3:::${ArtifactBucketName}/*" 92 | ManagedPolicyArns: 93 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 94 | Path: "/" 95 | 96 | 97 | MarketingAndSalesDatabase: 98 | Type: "AWS::Glue::Database" 99 | Properties: 100 | DatabaseInput: 101 | Description: "Marketing and Sales database (Amazon QuickSight Samples)." 102 | Name: !Ref MarketingAndSalesDatabaseName 103 | CatalogId: !Ref AWS::AccountId 104 | 105 | SalesPipelineTable: 106 | Type: "AWS::Glue::Table" 107 | DependsOn: MarketingAndSalesDatabase 108 | Properties: 109 | TableInput: 110 | Description: "Sales Pipeline table (Amazon QuickSight Sample)." 
111 | TableType: "EXTERNAL_TABLE" 112 | Parameters: { 113 | "CrawlerSchemaDeserializerVersion": "1.0", 114 | "compressionType": "none", 115 | "classification": "csv", 116 | "recordCount": "16831", 117 | "typeOfData": "file", 118 | "CrawlerSchemaSerializerVersion": "1.0", 119 | "columnsOrdered": "true", 120 | "objectCount": "1", 121 | "delimiter": ",", 122 | "skip.header.line.count": "1", 123 | "averageRecordSize": "119", 124 | "sizeKey": "2002910" 125 | } 126 | StorageDescriptor: 127 | StoredAsSubDirectories: False 128 | Parameters: { 129 | "CrawlerSchemaDeserializerVersion": "1.0", 130 | "compressionType": "none", 131 | "classification": "csv", 132 | "recordCount": "16831", 133 | "typeOfData": "file", 134 | "CrawlerSchemaSerializerVersion": "1.0", 135 | "columnsOrdered": "true", 136 | "objectCount": "1", 137 | "delimiter": ",", 138 | "skip.header.line.count": "1", 139 | "averageRecordSize": "119", 140 | "sizeKey": "2002910" 141 | } 142 | InputFormat: "org.apache.hadoop.mapred.TextInputFormat" 143 | OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" 144 | Columns: 145 | - Type: string 146 | Name: date 147 | - Type: string 148 | Name: salesperson 149 | - Type: string 150 | Name: lead name 151 | - Type: string 152 | Name: segment 153 | - Type: string 154 | Name: region 155 | - Type: string 156 | Name: target close 157 | - Type: bigint 158 | Name: forecasted monthly revenue 159 | - Type: string 160 | Name: opportunity stage 161 | - Type: bigint 162 | Name: weighted revenue 163 | - Type: boolean 164 | Name: closed opportunity 165 | - Type: boolean 166 | Name: active opportunity 167 | - Type: boolean 168 | Name: latest status entry 169 | SerdeInfo: 170 | Parameters: { 171 | "field.delim": "," 172 | } 173 | SerializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" 174 | Compressed: False 175 | Location: !Sub "s3://${DataBucketName}/sales/" 176 | Retention: 0 177 | Name: !Ref SalesPipelineTableName 178 | DatabaseName: !Ref MarketingAndSalesDatabaseName 179 | CatalogId: !Ref AWS::AccountId 180 | 181 | MarketingTable: 182 | Type: "AWS::Glue::Table" 183 | DependsOn: MarketingAndSalesDatabase 184 | Properties: 185 | TableInput: 186 | Description: "Marketing table (Amazon QuickSight Sample)." 
187 | TableType: "EXTERNAL_TABLE" 188 | Parameters: { 189 | "CrawlerSchemaDeserializerVersion": "1.0", 190 | "compressionType": "none", 191 | "classification": "csv", 192 | "recordCount": "948", 193 | "typeOfData": "file", 194 | "CrawlerSchemaSerializerVersion": "1.0", 195 | "columnsOrdered": "true", 196 | "objectCount": "1", 197 | "delimiter": ",", 198 | "skip.header.line.count": "1", 199 | "averageRecordSize": "160", 200 | "sizeKey": "151746" 201 | } 202 | StorageDescriptor: 203 | StoredAsSubDirectories: False 204 | Parameters: { 205 | "CrawlerSchemaDeserializerVersion": "1.0", 206 | "compressionType": "none", 207 | "classification": "csv", 208 | "recordCount": "948", 209 | "typeOfData": "file", 210 | "CrawlerSchemaSerializerVersion": "1.0", 211 | "columnsOrdered": "true", 212 | "objectCount": "1", 213 | "delimiter": ",", 214 | "skip.header.line.count": "1", 215 | "averageRecordSize": "160", 216 | "sizeKey": "151746" 217 | } 218 | InputFormat: "org.apache.hadoop.mapred.TextInputFormat" 219 | OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" 220 | Columns: 221 | - Type: string 222 | Name: date 223 | - Type: bigint 224 | Name: new visitors seo 225 | - Type: bigint 226 | Name: new visitors cpc 227 | - Type: bigint 228 | Name: new visitors social media 229 | - Type: bigint 230 | Name: return visitors 231 | - Type: bigint 232 | Name: twitter mentions 233 | - Type: bigint 234 | Name: twitter follower adds 235 | - Type: bigint 236 | Name: twitter followers cumulative 237 | - Type: bigint 238 | Name: mailing list adds 239 | - Type: bigint 240 | Name: mailing list cumulative 241 | - Type: bigint 242 | Name: website pageviews 243 | - Type: bigint 244 | Name: website visits 245 | - Type: bigint 246 | Name: website unique visits 247 | - Type: bigint 248 | Name: mobile uniques 249 | - Type: bigint 250 | Name: tablet uniques 251 | - Type: bigint 252 | Name: desktop uniques 253 | - Type: bigint 254 | Name: free sign up 255 | - Type: bigint 256 | Name: paid conversion 257 | - Type: string 258 | Name: events 259 | SerdeInfo: 260 | Parameters: { 261 | "field.delim": "," 262 | } 263 | SerializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" 264 | Compressed: False 265 | Location: !Sub "s3://${DataBucketName}/marketing/" 266 | Retention: 0 267 | Name: !Ref MarketingTableName 268 | DatabaseName: !Ref MarketingAndSalesDatabaseName 269 | CatalogId: !Ref AWS::AccountId 270 | 271 | ProcessSalesDataJob: 272 | Type: "AWS::Glue::Job" 273 | Properties: 274 | Role: !Ref AWSGlueJobRole 275 | Name: "ProcessSalesData" 276 | Command: { 277 | "Name" : "glueetl", 278 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/process_sales_data.py" 279 | } 280 | DefaultArguments: { 281 | "--database_name" : !Ref MarketingAndSalesDatabaseName, 282 | "--table_name" : !Ref SalesPipelineTableName, 283 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/sales" 284 | } 285 | MaxRetries: 0 286 | Description: "Process Sales Pipeline data." 
287 | AllocatedCapacity: 5 288 | 289 | ProcessMarketingDataJob: 290 | Type: "AWS::Glue::Job" 291 | Properties: 292 | Role: !Ref AWSGlueJobRole 293 | Name: "ProcessMarketingData" 294 | Command: { 295 | "Name" : "glueetl", 296 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/process_marketing_data.py" 297 | } 298 | DefaultArguments: { 299 | "--database_name" : !Ref MarketingAndSalesDatabaseName, 300 | "--table_name" : !Ref MarketingTableName, 301 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/marketing" 302 | } 303 | MaxRetries: 0 304 | Description: "Process Marketing data." 305 | AllocatedCapacity: 5 306 | 307 | JoinMarketingAndSalesDataJob: 308 | Type: "AWS::Glue::Job" 309 | Properties: 310 | Role: !Ref AWSGlueJobRole 311 | Name: "JoinMarketingAndSalesData" 312 | Command: { 313 | "Name" : "glueetl", 314 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/join_marketing_and_sales_data.py" 315 | } 316 | DefaultArguments: { 317 | "--database_name": !Ref MarketingAndSalesDatabaseName, 318 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/sales-leads-influenced", 319 | "--s3_sales_data_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/sales", 320 | "--s3_marketing_data_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/marketing" 321 | } 322 | MaxRetries: 0 323 | Description: "Join Marketing and Sales data." 324 | AllocatedCapacity: 5 325 | -------------------------------------------------------------------------------- /cloudformation/gluerunner-lambda-params.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "ParameterKey": "ArtifactBucketName", 4 | "ParameterValue": "" 5 | }, 6 | { 7 | "ParameterKey": "LambdaSourceS3Key", 8 | "ParameterValue": "src/gluerunner.zip" 9 | }, 10 | { 11 | "ParameterKey": "DDBTableName", 12 | "ParameterValue": "GlueRunnerActiveJobs" 13 | }, 14 | { 15 | "ParameterKey": "GlueRunnerLambdaFunctionName", 16 | "ParameterValue": "gluerunner" 17 | } 18 | ] -------------------------------------------------------------------------------- /cloudformation/gluerunner-lambda.yaml: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | AWSTemplateFormatVersion: 2010-09-09 17 | 18 | Description: The CloudFormation template for AWS resources required by Glue Runner. 
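# NOTE: DDBTableName must match the "ddb_table" value in lambda/gluerunner/gluerunner-config.json
# (default "GlueRunnerActiveJobs"), since gluerunner.py reads that config file to find its
# tracking table.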
19 | 20 | Parameters: 21 | 22 | DDBTableName: 23 | Type: String 24 | Default: "GlueRunnerActiveJobs" 25 | Description: "Name of the DynamoDB table for persistence & querying of active Glue jobs' statuses." 26 | 27 | GlueRunnerLambdaFunctionName: 28 | Type: String 29 | Default: "gluerunner" 30 | Description: "Name of the Lambda function that mediates between AWS Step Functions and AWS Glue." 31 | 32 | ArtifactBucketName: 33 | Type: String 34 | MinLength: "1" 35 | Description: "Name of the S3 bucket containing source .zip files." 36 | 37 | LambdaSourceS3Key: 38 | Type: String 39 | MinLength: "1" 40 | Description: "Name of the S3 key of Glue Runner lambda function .zip file." 41 | 42 | Resources: 43 | 44 | GlueRunnerLambdaExecutionRole: 45 | Type: "AWS::IAM::Role" 46 | Properties: 47 | AssumeRolePolicyDocument: 48 | Version: '2012-10-17' 49 | Statement: 50 | - Effect: Allow 51 | Principal: 52 | Service: 53 | - lambda.amazonaws.com 54 | Action: 55 | - sts:AssumeRole 56 | Path: "/" 57 | 58 | GlueRunnerPolicy: 59 | Type: "AWS::IAM::Policy" 60 | Properties: 61 | PolicyDocument: { 62 | "Version": "2012-10-17", 63 | "Statement": [{ 64 | "Effect": "Allow", 65 | "Action": [ 66 | "dynamodb:GetItem", 67 | "dynamodb:Query", 68 | "dynamodb:PutItem", 69 | "dynamodb:UpdateItem", 70 | "dynamodb:DeleteItem" 71 | ], 72 | "Resource": !Sub "arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${DDBTableName}" 73 | }, 74 | { 75 | "Effect": "Allow", 76 | "Action": [ 77 | "logs:CreateLogStream", 78 | "logs:PutLogEvents" 79 | ], 80 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*" 81 | }, 82 | { 83 | "Effect": "Allow", 84 | "Action": "logs:CreateLogGroup", 85 | "Resource": "*" 86 | }, 87 | { 88 | "Effect": "Allow", 89 | "Action": [ 90 | "states:SendTaskSuccess", 91 | "states:SendTaskFailure", 92 | "states:SendTaskHeartbeat", 93 | "states:GetActivityTask" 94 | ], 95 | "Resource": "*" 96 | }, 97 | { 98 | "Effect": "Allow", 99 | "Action": [ 100 | "glue:StartJobRun", 101 | "glue:GetJobRun" 102 | ], 103 | "Resource": "*" 104 | } 105 | ] 106 | } 107 | PolicyName: "GlueRunnerPolicy" 108 | Roles: 109 | - !Ref GlueRunnerLambdaExecutionRole 110 | 111 | GlueRunnerLambdaFunction: 112 | Type: AWS::Lambda::Function 113 | Properties: 114 | FunctionName: "gluerunner" 115 | Description: "Starts and monitors AWS Glue jobs on behalf of AWS Step Functions for a specific Activity ARN." 
116 | Handler: "gluerunner.handler" 117 | Role: !GetAtt GlueRunnerLambdaExecutionRole.Arn 118 | Code: 119 | S3Bucket: !Ref ArtifactBucketName 120 | S3Key: !Ref LambdaSourceS3Key 121 | Timeout: 180 #seconds 122 | MemorySize: 128 #MB 123 | Runtime: python2.7 124 | DependsOn: 125 | - GlueRunnerLambdaExecutionRole 126 | 127 | ScheduledRule: 128 | Type: "AWS::Events::Rule" 129 | Properties: 130 | Description: "ScheduledRule" 131 | ScheduleExpression: "rate(3 minutes)" 132 | State: "ENABLED" 133 | Targets: 134 | - 135 | Arn: !GetAtt GlueRunnerLambdaFunction.Arn 136 | Id: "GlueRunner" 137 | 138 | PermissionForEventsToInvokeLambda: 139 | Type: "AWS::Lambda::Permission" 140 | Properties: 141 | FunctionName: !Ref GlueRunnerLambdaFunction 142 | Action: "lambda:InvokeFunction" 143 | Principal: "events.amazonaws.com" 144 | SourceArn: !GetAtt ScheduledRule.Arn 145 | 146 | GlueRunnerActiveJobsTable: 147 | Type: "AWS::DynamoDB::Table" 148 | Properties: 149 | TableName: !Ref DDBTableName 150 | KeySchema: 151 | - KeyType: "HASH" 152 | AttributeName: "sfn_activity_arn" 153 | - KeyType: "RANGE" 154 | AttributeName: "glue_job_run_id" 155 | AttributeDefinitions: 156 | - AttributeName: "sfn_activity_arn" 157 | AttributeType: "S" 158 | - AttributeName: "glue_job_run_id" 159 | AttributeType: "S" 160 | ProvisionedThroughput: 161 | WriteCapacityUnits: 10 162 | ReadCapacityUnits: 10 -------------------------------------------------------------------------------- /cloudformation/step-functions-resources-params.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "ParameterKey": "ArtifactBucketName", 4 | "ParameterValue": "" 5 | }, 6 | { 7 | "ParameterKey": "LambdaSourceS3Key", 8 | "ParameterValue": "src/ons3objectcreated.zip" 9 | }, 10 | { 11 | "ParameterKey": "GlueRunnerActivityName", 12 | "ParameterValue": "GlueRunnerActivity" 13 | }, 14 | { 15 | "ParameterKey": "DataBucketName", 16 | "ParameterValue": "etl-orchestrator-745-data" 17 | } 18 | ] -------------------------------------------------------------------------------- /cloudformation/step-functions-resources.yaml: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | AWSTemplateFormatVersion: "2010-09-09" 17 | 18 | Description: > 19 | This template manages sample AWS Step Functions resources to be orchestrate AWS Glue jobs and crawlers. 
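# The WaitForSalesDataActivity and WaitForMarketingDataActivity names below follow the
# convention "<DataBucketName>-<object key>", presumably so that the ons3objectcreated
# Lambda can map an uploaded CSV back to the activity it should signal.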
20 | 21 | Parameters: 22 | 23 | MarketingAndSalesDatabaseName: 24 | Type: String 25 | MinLength: "4" 26 | Default: "marketingandsales_qs" 27 | Description: "Name of the AWS Glue database." 28 | 29 | SalesPipelineTableName: 30 | Type: String 31 | MinLength: "4" 32 | Default: "salespipeline_qs" 33 | Description: "Name of the Sales Pipeline data table in AWS Glue." 34 | 35 | MarketingTableName: 36 | Type: String 37 | MinLength: "4" 38 | Default: "marketing_qs" 39 | Description: "Name of the Marketing data table in AWS Glue." 40 | 41 | GlueRunnerActivityName: 42 | Type: String 43 | Default: "GlueRunnerActivity" 44 | Description: "Name of the AWS Step Functions activity to be polled by GlueRunner." 45 | 46 | AthenaRunnerActivityName: 47 | Type: String 48 | Default: "AthenaRunnerActivity" 49 | Description: "Name of the AWS Step Functions activity to be polled by AthenaRunner." 50 | 51 | ArtifactBucketName: 52 | Type: String 53 | MinLength: "1" 54 | Description: "Name of the S3 bucket containing source .zip files. Bucket is NOT created by this CFT." 55 | 56 | LambdaSourceS3Key: 57 | Type: String 58 | MinLength: "1" 59 | Description: "Name of the S3 key of Glue Runner lambda function .zip file." 60 | 61 | DataBucketName: 62 | Type: String 63 | MinLength: "1" 64 | Description: "Name of the S3 bucket in which the source Marketing and Sales data will be uploaded. Bucket is created by this CFT." 65 | 66 | Resources: 67 | 68 | #IAM Resources 69 | StateExecutionRole: 70 | Type: "AWS::IAM::Role" 71 | Properties: 72 | AssumeRolePolicyDocument: 73 | Version: '2012-10-17' 74 | Statement: 75 | - Effect: Allow 76 | Principal: 77 | Service: 78 | - states.us-east-1.amazonaws.com 79 | Action: 80 | - sts:AssumeRole 81 | Policies: 82 | - 83 | PolicyName: "StatesExecutionPolicy" 84 | PolicyDocument: 85 | Version: "2012-10-17" 86 | Statement: 87 | - 88 | Effect: "Allow" 89 | Action: "lambda:InvokeFunction" 90 | Resource: "*" 91 | Path: "/" 92 | 93 | # Activity resources 94 | GlueRunnerActivity: 95 | Type: "AWS::StepFunctions::Activity" 96 | Properties: 97 | Name: !Ref GlueRunnerActivityName 98 | 99 | AthenaRunnerActivity: 100 | Type: "AWS::StepFunctions::Activity" 101 | Properties: 102 | Name: !Ref AthenaRunnerActivityName 103 | 104 | WaitForSalesDataActivity: 105 | Type: "AWS::StepFunctions::Activity" 106 | Properties: 107 | Name: !Sub "${DataBucketName}-SalesPipeline_QuickSightSample.csv" 108 | 109 | WaitForMarketingDataActivity: 110 | Type: "AWS::StepFunctions::Activity" 111 | Properties: 112 | Name: !Sub "${DataBucketName}-MarketingData_QuickSightSample.csv" 113 | 114 | 115 | # State Machine resources 116 | 117 | # State machine for testing Athena Runner 118 | AthenaRunnerTestETLOrchestrator: 119 | Type: "AWS::StepFunctions::StateMachine" 120 | Properties: 121 | StateMachineName: AthenaRunnerTestETLOrchestrator 122 | DefinitionString: 123 | Fn::Sub: 124 | - |- 125 | { 126 | "StartAt": "Configure Athena Query", 127 | "States": { 128 | "Configure Athena Query":{ 129 | "Type": "Pass", 130 | "Result": "{ \"AthenaQueryString\" : \"SELECT * FROM ${GlueTableName} limit 10;\", \"AthenaDatabase\": \"${GlueDatabaseName}\", \"AthenaResultOutputLocation\": \"${AthenaResultOutputLocation}\", \"AthenaResultEncryptionOption\": \"${AthenaResultEncryptionOption}\"}", 131 | "Next": "Execute Athena Query" 132 | }, 133 | "Execute Athena Query":{ 134 | "Type": "Task", 135 | "Resource": "${AthenaRunnerActivityArn}", 136 | "End": true 137 | } 138 | } 139 | } 140 | - { 141 | GlueDatabaseName: !Ref MarketingAndSalesDatabaseName, 142 | 
GlueTableName: !Ref MarketingTableName, 143 | AthenaRunnerActivityArn: !Ref AthenaRunnerActivity, 144 | AthenaResultOutputLocation: !Sub "s3://${DataBucketName}/athena-runner-output/", 145 | AthenaResultEncryptionOption: "SSE_S3" 146 | } 147 | RoleArn: !GetAtt StateExecutionRole.Arn 148 | 149 | MarketingAndSalesETLOrchestrator: 150 | Type: "AWS::StepFunctions::StateMachine" 151 | Properties: 152 | StateMachineName: MarketingAndSalesETLOrchestrator 153 | DefinitionString: 154 | Fn::Sub: 155 | - |- 156 | { 157 | "Comment": "Marketing and Sales ETL Pipeline Orchestrator.", 158 | "StartAt": "Pre-process", 159 | "States": { 160 | 161 | "Pre-process": { 162 | "Type": "Pass", 163 | "Next": "Start Parallel Glue Jobs" 164 | }, 165 | "Start Parallel Glue Jobs": { 166 | "Type": "Parallel", 167 | "Branches": [ 168 | { 169 | "StartAt": "Wait For Sales Data", 170 | "States": { 171 | "Wait For Sales Data": { 172 | "Type": "Task", 173 | "Resource": "${WaitForSalesDataActivityArn}", 174 | "Next": "Prep for Process Sales Data" 175 | }, 176 | "Prep for Process Sales Data": { 177 | "Type": "Pass", 178 | "Result": "{\"GlueJobName\" : \"${ProcessSalesDataGlueJobName}\"}", 179 | "Next": "Process Sales Data" 180 | }, 181 | "Process Sales Data": { 182 | "Type": "Task", 183 | "Resource": "${GlueRunnerActivityArn}", 184 | "End": true 185 | } 186 | } 187 | }, 188 | { 189 | "StartAt": "Wait For Marketing Data", 190 | "States": { 191 | "Wait For Marketing Data": { 192 | "Type": "Task", 193 | "Resource": "${WaitForMarketingDataActivityArn}", 194 | "Next": "Prep for Process Marketing Data" 195 | }, 196 | "Prep for Process Marketing Data": { 197 | "Type": "Pass", 198 | "Result": "{\"GlueJobName\" : \"${ProcessMarketingDataGlueJobName}\"}", 199 | "Next": "Process Marketing Data" 200 | }, 201 | "Process Marketing Data": { 202 | "Type": "Task", 203 | "Resource": "${GlueRunnerActivityArn}", 204 | "End": true 205 | } 206 | } 207 | } 208 | ], 209 | "Next": "Prep For Joining Marketing And Sales Data", 210 | "Catch": [ { 211 | "ErrorEquals": ["GlueJobFailedError"], 212 | "Next": "ETL Job Failed Fallback" 213 | }] 214 | }, 215 | "Prep For Joining Marketing And Sales Data": { 216 | "Type": "Pass", 217 | "Result": "{\"GlueJobName\" : \"${JoinMarketingAndSalesDataJobName}\"}", 218 | "Next": "Join Marketing And Sales Data" 219 | }, 220 | "Join Marketing And Sales Data": { 221 | "Type": "Task", 222 | "Resource": "${GlueRunnerActivityArn}", 223 | "Catch": [ { 224 | "ErrorEquals": ["GlueJobFailedError"], 225 | "Next": "ETL Job Failed Fallback" 226 | }], 227 | "End": true 228 | }, 229 | "ETL Job Failed Fallback": { 230 | "Type": "Pass", 231 | "Result": "This is a fallback from an ETL job failure.", 232 | "End": true 233 | } 234 | } 235 | } 236 | - { 237 | GlueRunnerActivityArn : !Ref GlueRunnerActivity, 238 | WaitForSalesDataActivityArn: !Ref WaitForSalesDataActivity, 239 | WaitForMarketingDataActivityArn: !Ref WaitForMarketingDataActivity, 240 | ProcessSalesDataGlueJobName: "ProcessSalesData", 241 | ProcessMarketingDataGlueJobName: "ProcessMarketingData", 242 | JoinMarketingAndSalesDataJobName: "JoinMarketingAndSalesData" 243 | } 244 | RoleArn: !GetAtt StateExecutionRole.Arn 245 | 246 | 247 | 248 | # On S3 Object Created Lambda Functions resources 249 | 250 | OnS3ObjectCreatedLambdaExecutionRole: 251 | Type: "AWS::IAM::Role" 252 | Properties: 253 | AssumeRolePolicyDocument: 254 | Version: '2012-10-17' 255 | Statement: 256 | - Effect: Allow 257 | Principal: 258 | Service: 259 | - lambda.amazonaws.com 260 | Action: 261 | - sts:AssumeRole 
262 | ManagedPolicyArns: 263 | - arn:aws:iam::aws:policy/AmazonS3FullAccess 264 | - arn:aws:iam::aws:policy/AmazonSNSFullAccess 265 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess 266 | - arn:aws:iam::aws:policy/AWSStepFunctionsFullAccess 267 | Path: "/" 268 | 269 | OnS3ObjectCreatedPolicy: 270 | Type: "AWS::IAM::Policy" 271 | Properties: 272 | PolicyDocument: { 273 | "Version": "2012-10-17", 274 | "Statement": [ 275 | { 276 | "Effect": "Allow", 277 | "Action": [ 278 | "logs:CreateLogStream", 279 | "logs:PutLogEvents" 280 | ], 281 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*" 282 | }, 283 | { 284 | "Effect": "Allow", 285 | "Action": "logs:CreateLogGroup", 286 | "Resource": "*" 287 | }, 288 | { 289 | "Effect": "Allow", 290 | "Action": [ 291 | "states:SendTaskSuccess", 292 | "states:SendTaskFailure", 293 | "states:SendTaskHeartbeat", 294 | "states:GetActivityTask" 295 | ], 296 | "Resource": "*" 297 | } 298 | ] 299 | } 300 | PolicyName: "OnS3ObjectCreatedPolicy" 301 | Roles: 302 | - !Ref OnS3ObjectCreatedLambdaExecutionRole 303 | 304 | 305 | 306 | OnS3ObjectCreatedLambdaFunction: 307 | Type: "AWS::Lambda::Function" 308 | Properties: 309 | FunctionName: "ons3objectcreated" 310 | Description: "Enables a Step Function State Machine to continue execution after a new object is created in S3." 311 | Handler: "ons3objectcreated.handler" 312 | Role: !GetAtt OnS3ObjectCreatedLambdaExecutionRole.Arn 313 | Code: 314 | S3Bucket: !Ref ArtifactBucketName 315 | S3Key: !Ref LambdaSourceS3Key 316 | Timeout: 180 #seconds 317 | MemorySize: 128 #MB 318 | Runtime: python2.7 319 | DependsOn: 320 | - OnS3ObjectCreatedLambdaExecutionRole 321 | 322 | # For every bucket that needs to invoke OnS3ObjectCreated: 323 | 324 | DataBucket: 325 | Type: "AWS::S3::Bucket" 326 | Properties: 327 | BucketName: !Ref DataBucketName 328 | NotificationConfiguration: 329 | LambdaConfigurations: 330 | - Function: !GetAtt OnS3ObjectCreatedLambdaFunction.Arn 331 | Event: "s3:ObjectCreated:*" 332 | Filter: 333 | S3Key: 334 | Rules: 335 | - 336 | Name: "suffix" 337 | Value: "csv" 338 | DependsOn: 339 | - DataBucketPermission 340 | 341 | DataBucketPermission: 342 | Type: "AWS::Lambda::Permission" 343 | Properties: 344 | Action: 'lambda:InvokeFunction' 345 | FunctionName: !Ref OnS3ObjectCreatedLambdaFunction 346 | Principal: s3.amazonaws.com 347 | SourceAccount: !Ref "AWS::AccountId" 348 | SourceArn: !Sub "arn:aws:s3:::${DataBucketName}" -------------------------------------------------------------------------------- /glue-scripts/join_marketing_and_sales_data.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import sys 17 | import pyspark.sql.functions as func 18 | from awsglue.dynamicframe import DynamicFrame 19 | from awsglue.transforms import * 20 | from awsglue.utils import getResolvedOptions 21 | from pyspark.context import SparkContext 22 | from awsglue.context import GlueContext 23 | from awsglue.job import Job 24 | 25 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 's3_sales_data_path', 's3_marketing_data_path']) 26 | s3_output_path = args['s3_output_path'] 27 | s3_sales_data_path = args['s3_sales_data_path'] 28 | s3_marketing_data_path = args['s3_marketing_data_path'] 29 | 30 | sc = SparkContext.getOrCreate() 31 | glueContext = GlueContext(sc) 32 | spark = glueContext.spark_session 33 | job = Job(glueContext) 34 | job.init(args['JOB_NAME'], args) 35 | 36 | salesForecastedByDate_DF = \ 37 | glueContext.spark_session.read.option("header", "true")\ 38 | .load(s3_sales_data_path, format="parquet") 39 | 40 | mktg_DF = \ 41 | glueContext.spark_session.read.option("header", "true")\ 42 | .load(s3_marketing_data_path, format="parquet") 43 | 44 | 45 | salesForecastedByDate_DF\ 46 | .join(mktg_DF, 'date', 'inner')\ 47 | .orderBy(salesForecastedByDate_DF['date']) \ 48 | .write \ 49 | .format('csv') \ 50 | .option('header', 'true') \ 51 | .mode('overwrite') \ 52 | .save(s3_output_path) 53 | -------------------------------------------------------------------------------- /glue-scripts/process_marketing_data.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
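# This Glue job reads the raw marketing table from the Glue Data Catalog, renames a subset
# of columns to snake_case with ApplyMapping, and writes the result as Parquet to the
# --s3_output_path job argument, where join_marketing_and_sales_data.py later picks it up
# (see the DefaultArguments in cloudformation/glue-resources.yaml).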
15 | 16 | import sys 17 | import pyspark.sql.functions as func 18 | from awsglue.dynamicframe import DynamicFrame 19 | from awsglue.transforms import * 20 | from awsglue.utils import getResolvedOptions 21 | from pyspark.context import SparkContext 22 | from awsglue.context import GlueContext 23 | from awsglue.job import Job 24 | 25 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 'database_name', 'table_name']) 26 | s3_output_path = args['s3_output_path'] 27 | database_name = args['database_name'] 28 | table_name = args['table_name'] 29 | 30 | sc = SparkContext.getOrCreate() 31 | glueContext = GlueContext(sc) 32 | spark = glueContext.spark_session 33 | job = Job(glueContext) 34 | job.init(args['JOB_NAME'], args) 35 | 36 | 37 | mktg_DyF = glueContext.create_dynamic_frame\ 38 | .from_catalog(database=database_name, table_name=table_name) 39 | 40 | mktg_DyF = ApplyMapping.apply(frame=mktg_DyF, mappings=[ 41 | ('date', 'string', 'date', 'string'), 42 | ('new visitors seo', 'bigint', 'new_visitors_seo', 'bigint'), 43 | ('new visitors cpc', 'bigint', 'new_visitors_cpc', 'bigint'), 44 | ('new visitors social media', 'bigint', 'new_visitors_social_media', 'bigint'), 45 | ('return visitors', 'bigint', 'return_visitors', 'bigint'), 46 | ], transformation_ctx='applymapping1') 47 | 48 | mktg_DyF.printSchema() 49 | 50 | mktg_DF = mktg_DyF.toDF() 51 | 52 | mktg_DF.write\ 53 | .format('parquet')\ 54 | .option('header', 'true')\ 55 | .mode('overwrite')\ 56 | .save(s3_output_path) 57 | 58 | job.commit() -------------------------------------------------------------------------------- /glue-scripts/process_sales_data.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
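# This Glue job reads the raw sales pipeline table from the Glue Data Catalog, renames the
# columns it needs to snake_case with ApplyMapping, keeps only rows where opportunity_stage
# is 'Lead', aggregates by date (leads_generated, forecasted_lead_revenue), and writes a
# single Parquet output to the --s3_output_path job argument for the downstream join job.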
15 | 16 | import sys 17 | import pyspark.sql.functions as func 18 | from awsglue.dynamicframe import DynamicFrame 19 | from awsglue.transforms import * 20 | from awsglue.utils import getResolvedOptions 21 | from pyspark.context import SparkContext 22 | from awsglue.context import GlueContext 23 | from awsglue.job import Job 24 | 25 | 26 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 'database_name', 'table_name']) 27 | s3_output_path = args['s3_output_path'] 28 | database_name = args['database_name'] 29 | table_name = args['table_name'] 30 | 31 | sc = SparkContext.getOrCreate() 32 | glueContext = GlueContext(sc) 33 | spark = glueContext.spark_session 34 | job = Job(glueContext) 35 | job.init(args['JOB_NAME'], args) 36 | 37 | 38 | 39 | sales_DyF = glueContext.create_dynamic_frame\ 40 | .from_catalog(database=database_name, table_name=table_name) 41 | 42 | sales_DyF = ApplyMapping.apply(frame=sales_DyF, mappings=[ 43 | ('date', 'string', 'date', 'string'), 44 | ('lead name', 'string', 'lead_name', 'string'), 45 | ('forecasted monthly revenue', 'bigint', 'forecasted_monthly_revenue', 'bigint'), 46 | ('opportunity stage', 'string', 'opportunity_stage', 'string'), 47 | ('weighted revenue', 'bigint', 'weighted_revenue', 'bigint'), 48 | ('target close', 'string', 'target_close', 'string'), 49 | ('closed opportunity', 'string', 'closed_opportunity', 'string'), 50 | ('active opportunity', 'string', 'active_opportunity', 'string'), 51 | ('last status entry', 'string', 'last_status_entry', 'string'), 52 | ], transformation_ctx='applymapping1') 53 | 54 | sales_DyF.printSchema() 55 | 56 | sales_DF = sales_DyF.toDF() 57 | 58 | salesForecastedByDate_DF = \ 59 | sales_DF\ 60 | .where(sales_DF['opportunity_stage'] == 'Lead')\ 61 | .groupBy('date')\ 62 | .agg({'forecasted_monthly_revenue': 'sum', 'opportunity_stage': 'count'})\ 63 | .orderBy('date')\ 64 | .withColumnRenamed('count(opportunity_stage)', 'leads_generated')\ 65 | .withColumnRenamed('sum(forecasted_monthly_revenue)', 'forecasted_lead_revenue')\ 66 | .coalesce(1) 67 | 68 | salesForecastedByDate_DF.write\ 69 | .format('parquet')\ 70 | .option('header', 'true')\ 71 | .mode('overwrite')\ 72 | .save(s3_output_path) 73 | 74 | job.commit() 75 | -------------------------------------------------------------------------------- /lambda/athenarunner/athenarunner-config.json: -------------------------------------------------------------------------------- 1 | { 2 | "sfn_activity_arn": "arn:aws:states:us-east-1:074599006431:activity:AthenaRunnerActivity", 3 | "sfn_worker_name": "athenarunner", 4 | "ddb_table": "AthenaRunnerActiveJobs", 5 | "ddb_query_limit": 50, 6 | "sfn_max_executions": 100 7 | } -------------------------------------------------------------------------------- /lambda/athenarunner/athenarunner.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 
8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import boto3 17 | import logging, logging.config 18 | import json 19 | from botocore.client import Config 20 | from boto3.dynamodb.conditions import Key, Attr 21 | 22 | 23 | def load_log_config(): 24 | # Basic config. Replace with your own logging config if required 25 | root = logging.getLogger() 26 | root.setLevel(logging.INFO) 27 | return root 28 | 29 | def load_config(): 30 | with open('athenarunner-config.json', 'r') as conf_file: 31 | conf_json = conf_file.read() 32 | return json.loads(conf_json) 33 | 34 | def is_json(jsonstring): 35 | try: 36 | json_object = json.loads(jsonstring) 37 | except ValueError: 38 | return False 39 | return True 40 | 41 | def start_athena_queries(config): 42 | 43 | ddb_table = dynamodb.Table(config["ddb_table"]) 44 | 45 | logger.debug('Athena runner started') 46 | 47 | sfn_activity_arn = config['sfn_activity_arn'] 48 | sfn_worker_name = config['sfn_worker_name'] 49 | 50 | # Loop until no tasks are available from Step Functions 51 | while True: 52 | 53 | logger.debug('Polling for Athena tasks for Step Functions Activity ARN: {}'.format(sfn_activity_arn)) 54 | 55 | try: 56 | response = sfn.get_activity_task( 57 | activityArn=sfn_activity_arn, 58 | workerName=sfn_worker_name 59 | ) 60 | 61 | except Exception as e: 62 | logger.critical(e.message) 63 | logger.critical( 64 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn)) 65 | raise 66 | 67 | # Keep the new Task Token for reference later 68 | task_token = response.get('taskToken', '') 69 | 70 | # If there are no tasks, return 71 | if not task_token: 72 | logger.debug('No tasks available.') 73 | return 74 | 75 | 76 | logger.info('Received task. 
Task token: {}'.format(task_token)) 77 | 78 | # Parse task input and create Athena query input 79 | task_input = '' 80 | try: 81 | 82 | task_input = json.loads(response['input']) 83 | task_input_dict = json.loads(task_input) 84 | 85 | athena_named_query = task_input_dict.get('AthenaNamedQuery', None) 86 | if athena_named_query is None: 87 | athena_query_string = task_input_dict['AthenaQueryString'] 88 | athena_database = task_input_dict['AthenaDatabase'] 89 | else: 90 | logger.debug('Retrieving details of named query "{}" from Athena'.format(athena_named_query)) 91 | 92 | response = athena.get_named_query( 93 | NamedQueryId=athena_named_query 94 | ) 95 | 96 | athena_query_string = response['NamedQuery']['QueryString'] 97 | athena_database = response['NamedQuery']['Database'] 98 | 99 | athena_result_output_location = task_input_dict['AthenaResultOutputLocation'] 100 | athena_result_encryption_option = task_input_dict.get('AthenaResultEncryptionOption', None) 101 | athena_result_kms_key = task_input_dict.get('AthenaResultKmsKey', "NONE").capitalize() 102 | 103 | athena_result_encryption_config = {} 104 | if athena_result_encryption_option is not None: 105 | 106 | athena_result_encryption_config['EncryptionOption'] = athena_result_encryption_option 107 | 108 | if athena_result_encryption_option in ["SSE_KMS", "CSE_KMS"]: 109 | athena_result_encryption_config['KmsKey'] = athena_result_kms_key 110 | 111 | 112 | except (KeyError, Exception): 113 | logger.critical('Invalid Athena Runner input. Make sure required input parameters are properly specified.') 114 | raise 115 | 116 | 117 | 118 | # Run Athena Query 119 | logger.info('Querying "{}" Athena database using query string: "{}"'.format(athena_database, athena_query_string)) 120 | 121 | try: 122 | 123 | response = athena.start_query_execution( 124 | QueryString=athena_query_string, 125 | QueryExecutionContext={ 126 | 'Database': athena_database 127 | }, 128 | ResultConfiguration={ 129 | 'OutputLocation': athena_result_output_location, 130 | 'EncryptionConfiguration': athena_result_encryption_config 131 | } 132 | ) 133 | 134 | athena_query_execution_id = response['QueryExecutionId'] 135 | 136 | # Store SFN 'Task Token' and 'Query Execution Id' in DynamoDB 137 | 138 | item = { 139 | 'sfn_activity_arn': sfn_activity_arn, 140 | 'athena_query_execution_id': athena_query_execution_id, 141 | 'sfn_task_token': task_token 142 | } 143 | 144 | ddb_table.put_item(Item=item) 145 | 146 | 147 | except Exception as e: 148 | logger.error('Failed to query Athena database "{}" with query string "{}"..'.format(athena_database, athena_query_string)) 149 | logger.error('Reason: {}'.format(e.message)) 150 | logger.info('Sending "Task Failed" signal to Step Functions.') 151 | 152 | response = sfn.send_task_failure( 153 | taskToken=task_token, 154 | error='Failed to start Athena query. Check Athena Runner logs for more details.' 155 | ) 156 | return 157 | 158 | 159 | logger.info('Athena query started. 
Query Execution Id: {}'.format(athena_query_execution_id))
160 |
161 |
162 | def check_athena_queries(config):
163 |
164 |     # Query all items in table for a particular SFN activity ARN
165 |     # This should retrieve records for all started athena queries for this particular activity ARN
166 |     ddb_table = dynamodb.Table(config['ddb_table'])
167 |     sfn_activity_arn = config['sfn_activity_arn']
168 |
169 |     ddb_resp = ddb_table.query(
170 |         KeyConditionExpression=Key('sfn_activity_arn').eq(sfn_activity_arn),
171 |         Limit=config['ddb_query_limit']
172 |     )
173 |
174 |     # For each item...
175 |     for item in ddb_resp['Items']:
176 |
177 |         athena_query_execution_id = item['athena_query_execution_id']
178 |         sfn_task_token = item['sfn_task_token']
179 |
180 |         try:
181 |             logger.debug('Polling Athena query execution status..')
182 |
183 |             # Query athena query execution status...
184 |             athena_resp = athena.get_query_execution(
185 |                 QueryExecutionId=athena_query_execution_id
186 |             )
187 |
188 |             query_exec_resp = athena_resp['QueryExecution']
189 |             query_exec_state = query_exec_resp['Status']['State']
190 |             query_state_change_reason = query_exec_resp['Status'].get('StateChangeReason', '')
191 |
192 |             logger.debug('Query with Execution Id {} is currently in state "{}"'.format(athena_query_execution_id,
193 |                                                                                          query_exec_state))
194 |
195 |             # If Athena query completed, return success:
196 |             if query_exec_state in ['SUCCEEDED']:
197 |
198 |                 logger.info('Query with Execution Id {} SUCCEEDED.'.format(athena_query_execution_id))
199 |
200 |                 # Build an output dict and format it as JSON
201 |                 task_output_dict = {
202 |                     "AthenaQueryString": query_exec_resp['Query'],
203 |                     "AthenaQueryExecutionId": athena_query_execution_id,
204 |                     "AthenaQueryExecutionState": query_exec_state,
205 |                     "AthenaQueryExecutionStateChangeReason": query_state_change_reason,
206 |                     "AthenaQuerySubmissionDateTime": query_exec_resp['Status'].get('SubmissionDateTime', '').strftime(
207 |                         '%x, %-I:%M %p %Z'),
208 |                     "AthenaQueryCompletionDateTime": query_exec_resp['Status'].get('CompletionDateTime', '').strftime(
209 |                         '%x, %-I:%M %p %Z'),
210 |                     "AthenaQueryEngineExecutionTimeInMillis": query_exec_resp['Statistics'].get(
211 |                         'EngineExecutionTimeInMillis', 0),
212 |                     "AthenaQueryDataScannedInBytes": query_exec_resp['Statistics'].get('DataScannedInBytes', 0)
213 |                 }
214 |
215 |                 task_output_json = json.dumps(task_output_dict)
216 |
217 |                 logger.info('Sending "Task Succeeded" signal to Step Functions..')
218 |                 sfn_resp = sfn.send_task_success(
219 |                     taskToken=sfn_task_token,
220 |                     output=task_output_json
221 |                 )
222 |
223 |                 # Delete item
224 |                 resp = ddb_table.delete_item(
225 |                     Key={
226 |                         'sfn_activity_arn': sfn_activity_arn,
227 |                         'athena_query_execution_id': athena_query_execution_id
228 |                     }
229 |                 )
230 |
231 |                 # Task succeeded, next item
232 |
233 |             elif query_exec_state in ['RUNNING', 'QUEUED']:
234 |                 logger.debug(
235 |                     'Query with Execution Id {} hasn\'t completed yet.'.format(athena_query_execution_id))
236 |
237 |                 # Send heartbeat
238 |                 sfn_resp = sfn.send_task_heartbeat(
239 |                     taskToken=sfn_task_token
240 |                 )
241 |
242 |                 logger.debug('Heartbeat sent to Step Functions.')
243 |
244 |                 # Heartbeat sent, next item
245 |
246 |             elif query_exec_state in ['FAILED', 'CANCELLED']:
247 |
248 |                 message = 'Athena query with Execution Id "{}" failed. Last state: {}.
Error message: {}' \ 249 | .format(athena_query_execution_id, query_exec_state, query_state_change_reason) 250 | 251 | logger.error(message) 252 | 253 | message_json = { 254 | "AthenaQueryString": query_exec_resp['Query'], 255 | "AthenaQueryExecutionId": athena_query_execution_id, 256 | "AthenaQueryExecutionState": query_exec_state, 257 | "AthenaQueryExecutionStateChangeReason": query_state_change_reason, 258 | "AthenaQuerySubmissionDateTime": query_exec_resp['Status'].get('SubmissionDateTime', '').strftime( 259 | '%x, %-I:%M %p %Z'), 260 | "AthenaQueryCompletionDateTime": query_exec_resp['Status'].get('CompletionDateTime', '').strftime( 261 | '%x, %-I:%M %p %Z'), 262 | "AthenaQueryEngineExecutionTimeInMillis": query_exec_resp['Statistics'].get( 263 | 'EngineExecutionTimeInMillis', 0), 264 | "AthenaQueryDataScannedInBytes": query_exec_resp['Statistics'].get('DataScannedInBytes', 0) 265 | } 266 | 267 | sfn_resp = sfn.send_task_failure( 268 | taskToken=sfn_task_token, 269 | cause=json.dumps(message_json), 270 | error='AthenaQueryFailedError' 271 | ) 272 | 273 | # Delete item 274 | resp = ddb_table.delete_item( 275 | Key={ 276 | 'sfn_activity_arn': sfn_activity_arn, 277 | 'athena_query_execution_id': athena_query_execution_id 278 | } 279 | ) 280 | 281 | logger.error(message) 282 | 283 | except Exception as e: 284 | logger.error('There was a problem checking status of Athena query..') 285 | logger.error('Glue job Run Id "{}"'.format(athena_query_execution_id)) 286 | logger.error('Reason: {}'.format(e.message)) 287 | logger.info('Checking next Athena query.') 288 | 289 | # Task failed, next item 290 | 291 | 292 | athena = boto3.client('athena') 293 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds 294 | sfn_client_config = Config(connect_timeout=50, read_timeout=70) 295 | sfn = boto3.client('stepfunctions', config=sfn_client_config) 296 | dynamodb = boto3.resource('dynamodb') 297 | 298 | # Load logging config and create logger 299 | logger = load_log_config() 300 | 301 | def handler(event, context): 302 | 303 | logger.debug('*** Athena Runner lambda function starting ***') 304 | 305 | try: 306 | 307 | # Get config (including a single activity ARN) from local file 308 | config = load_config() 309 | 310 | # One round of starting Athena queries 311 | start_athena_queries(config) 312 | 313 | # One round of checking on Athena queries 314 | check_athena_queries(config) 315 | 316 | logger.debug('*** Master Athena Runner terminating ***') 317 | 318 | except Exception as e: 319 | logger.critical('*** ERROR: Athena runner lambda function failed ***') 320 | logger.critical(e.message) 321 | raise 322 | 323 | -------------------------------------------------------------------------------- /lambda/athenarunner/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/athenarunner/requirements.txt -------------------------------------------------------------------------------- /lambda/athenarunner/setup.cfg: -------------------------------------------------------------------------------- 1 | [install] 2 | prefix= -------------------------------------------------------------------------------- /lambda/global-config.json: -------------------------------------------------------------------------------- 1 | {} -------------------------------------------------------------------------------- 
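A minimal, standalone illustration (not a file from this repository) of the task-input encoding that gluerunner.py and athenarunner.py expect: the "Prep for ..." Pass states in cloudformation/step-functions-resources.yaml emit their Result as a JSON string, so the activity input arrives JSON-encoded twice, which is why both runners call json.loads() twice.

import json

# Simulate what sfn.get_activity_task()['input'] contains when a Pass state's Result
# is the string {"GlueJobName" : "ProcessSalesData"}: the value is a JSON string whose
# content is itself JSON.
activity_input = json.dumps(json.dumps({"GlueJobName": "ProcessSalesData"}))

task_input = json.loads(activity_input)      # first decode  -> '{"GlueJobName": "ProcessSalesData"}'
task_input_dict = json.loads(task_input)     # second decode -> {'GlueJobName': 'ProcessSalesData'}
assert task_input_dict["GlueJobName"] == "ProcessSalesData"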
/lambda/gluerunner/gluerunner-config.json: -------------------------------------------------------------------------------- 1 | { 2 | "sfn_activity_arn": "arn:aws:states:::activity:GlueRunnerActivity", 3 | "sfn_worker_name": "gluerunner", 4 | "ddb_table": "GlueRunnerActiveJobs", 5 | "ddb_query_limit": 50, 6 | "glue_job_capacity": 10 7 | } -------------------------------------------------------------------------------- /lambda/gluerunner/gluerunner.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | # software and associated documentation files (the "Software"), to deal in the Software 5 | # without restriction, including without limitation the rights to use, copy, modify, 6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | # permit persons to whom the Software is furnished to do so. 8 | # 9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | import boto3 17 | import logging, logging.config 18 | import json 19 | from botocore.client import Config 20 | from boto3.dynamodb.conditions import Key, Attr 21 | 22 | 23 | def load_log_config(): 24 | # Basic config. Replace with your own logging config if required 25 | root = logging.getLogger() 26 | root.setLevel(logging.INFO) 27 | return root 28 | 29 | def load_config(): 30 | with open('gluerunner-config.json', 'r') as conf_file: 31 | conf_json = conf_file.read() 32 | return json.loads(conf_json) 33 | 34 | def is_json(jsonstring): 35 | try: 36 | json_object = json.loads(jsonstring) 37 | except ValueError: 38 | return False 39 | return True 40 | 41 | def start_glue_jobs(config): 42 | 43 | ddb_table = dynamodb.Table(config["ddb_table"]) 44 | 45 | logger.debug('Glue runner started') 46 | 47 | glue_job_capacity = config['glue_job_capacity'] 48 | sfn_activity_arn = config['sfn_activity_arn'] 49 | sfn_worker_name = config['sfn_worker_name'] 50 | 51 | # Loop until no tasks are available from Step Functions 52 | while True: 53 | 54 | logger.debug('Polling for Glue tasks for Step Functions Activity ARN: {}'.format(sfn_activity_arn)) 55 | 56 | try: 57 | response = sfn.get_activity_task( 58 | activityArn=sfn_activity_arn, 59 | workerName=sfn_worker_name 60 | ) 61 | 62 | except Exception as e: 63 | logger.critical(e.message) 64 | logger.critical( 65 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn)) 66 | raise 67 | 68 | # Keep the new Task Token for reference later 69 | task_token = response.get('taskToken', '') 70 | 71 | # If there are no tasks, return 72 | if not task_token: 73 | logger.debug('No tasks available.') 74 | return 75 | 76 | 77 | logger.info('Received task. 
Task token: {}'.format(task_token)) 78 | 79 | # Parse task input and create Glue job input 80 | task_input = '' 81 | try: 82 | 83 | task_input = json.loads(response['input']) 84 | task_input_dict = json.loads(task_input) 85 | 86 | glue_job_name = task_input_dict['GlueJobName'] 87 | glue_job_capacity = int(task_input_dict.get('GlueJobCapacity', glue_job_capacity)) 88 | 89 | except (KeyError, Exception): 90 | logger.critical('Invalid Glue Runner input. Make sure required input parameters are properly specified.') 91 | raise 92 | 93 | # Extract additional Glue job inputs from task_input_dict 94 | glue_job_args = task_input_dict 95 | 96 | # Run Glue job 97 | logger.info('Running Glue job named "{}"..'.format(glue_job_name)) 98 | 99 | try: 100 | response = glue.start_job_run( 101 | JobName=glue_job_name, 102 | Arguments=glue_job_args, 103 | AllocatedCapacity=glue_job_capacity 104 | ) 105 | 106 | glue_job_run_id = response['JobRunId'] 107 | 108 | # Store SFN 'Task Token' and Glue Job 'Run Id' in DynamoDB 109 | 110 | item = { 111 | 'sfn_activity_arn': sfn_activity_arn, 112 | 'glue_job_name': glue_job_name, 113 | 'glue_job_run_id': glue_job_run_id, 114 | 'sfn_task_token': task_token 115 | } 116 | 117 | ddb_table.put_item(Item=item) 118 | 119 | 120 | except Exception as e: 121 | logger.error('Failed to start Glue job named "{}"..'.format(glue_job_name)) 122 | logger.error('Reason: {}'.format(e.message)) 123 | logger.info('Sending "Task Failed" signal to Step Functions.') 124 | 125 | response = sfn.send_task_failure( 126 | taskToken=task_token, 127 | error='Failed to start Glue job. Check Glue Runner logs for more details.' 128 | ) 129 | return 130 | 131 | 132 | logger.info('Glue job run started. Run Id: {}'.format(glue_job_run_id)) 133 | 134 | 135 | def check_glue_jobs(config): 136 | 137 | # Query all items in table for a particular SFN activity ARN 138 | # This should retrieve records for all started glue jobs for this particular activity ARN 139 | ddb_table = dynamodb.Table(config['ddb_table']) 140 | sfn_activity_arn = config['sfn_activity_arn'] 141 | 142 | ddb_resp = ddb_table.query( 143 | KeyConditionExpression=Key('sfn_activity_arn').eq(sfn_activity_arn), 144 | Limit=config['ddb_query_limit'] 145 | ) 146 | 147 | # For each item... 148 | for item in ddb_resp['Items']: 149 | glue_job_run_id = item['glue_job_run_id'] 150 | glue_job_name = item['glue_job_name'] 151 | sfn_task_token = item['sfn_task_token'] 152 | 153 | try: 154 | 155 | logger.debug('Polling Glue job run status..') 156 | 157 | # Query glue job status... 
158 | glue_resp = glue.get_job_run( 159 | JobName=glue_job_name, 160 | RunId=glue_job_run_id, 161 | PredecessorsIncluded=False 162 | ) 163 | 164 | job_run_state = glue_resp['JobRun']['JobRunState'] 165 | job_run_error_message = glue_resp['JobRun'].get('ErrorMessage', '') 166 | 167 | logger.debug('Job with Run Id {} is currently in state "{}"'.format(glue_job_run_id, job_run_state)) 168 | 169 | # If Glue job completed, return success: 170 | if job_run_state in ['SUCCEEDED']: 171 | 172 | logger.info('Job with Run Id {} SUCCEEDED.'.format(glue_job_run_id)) 173 | 174 | # Build an output dict and format it as JSON 175 | task_output_dict = { 176 | "GlueJobName": glue_job_name, 177 | "GlueJobRunId": glue_job_run_id, 178 | "GlueJobRunState": job_run_state, 179 | "GlueJobStartedOn": glue_resp['JobRun'].get('StartedOn', '').strftime('%x, %-I:%M %p %Z'), 180 | "GlueJobCompletedOn": glue_resp['JobRun'].get('CompletedOn', '').strftime('%x, %-I:%M %p %Z'), 181 | "GlueJobLastModifiedOn": glue_resp['JobRun'].get('LastModifiedOn', '').strftime('%x, %-I:%M %p %Z') 182 | } 183 | 184 | task_output_json = json.dumps(task_output_dict) 185 | 186 | logger.info('Sending "Task Succeeded" signal to Step Functions..') 187 | sfn_resp = sfn.send_task_success( 188 | taskToken=sfn_task_token, 189 | output=task_output_json 190 | ) 191 | 192 | # Delete item 193 | resp = ddb_table.delete_item( 194 | Key={ 195 | 'sfn_activity_arn': sfn_activity_arn, 196 | 'glue_job_run_id': glue_job_run_id 197 | } 198 | ) 199 | 200 | # Task succeeded, next item 201 | 202 | elif job_run_state in ['STARTING', 'RUNNING', 'STOPPING']: 203 | logger.debug('Job with Run Id {} hasn\'t succeeded yet.'.format(glue_job_run_id)) 204 | 205 | # Send heartbeat 206 | sfn_resp = sfn.send_task_heartbeat( 207 | taskToken=sfn_task_token 208 | ) 209 | 210 | logger.debug('Heartbeat sent to Step Functions.') 211 | 212 | # Heartbeat sent, next item 213 | 214 | elif job_run_state in ['FAILED', 'STOPPED']: 215 | 216 | message = 'Glue job "{}" run with Run Id "{}" failed. Last state: {}. 
Error message: {}' \ 217 | .format(glue_job_name, glue_job_run_id[:8] + "...", job_run_state, job_run_error_message) 218 | 219 | logger.error(message) 220 | 221 | message_json = { 222 | 'glue_job_name': glue_job_name, 223 | 'glue_job_run_id': glue_job_run_id, 224 | 'glue_job_run_state': job_run_state, 225 | 'glue_job_run_error_msg': job_run_error_message 226 | } 227 | 228 | sfn_resp = sfn.send_task_failure( 229 | taskToken=sfn_task_token, 230 | cause=json.dumps(message_json), 231 | error='GlueJobFailedError' 232 | ) 233 | 234 | # Delete item 235 | resp = ddb_table.delete_item( 236 | Key={ 237 | 'sfn_activity_arn': sfn_activity_arn, 238 | 'glue_job_run_id': glue_job_run_id 239 | } 240 | ) 241 | 242 | logger.error(message) 243 | # Task failed, next item 244 | except Exception as e: 245 | logger.error('There was a problem checking status of Glue job "{}"..'.format(glue_job_name)) 246 | logger.error('Glue job Run Id "{}"'.format(glue_job_run_id)) 247 | logger.error('Reason: {}'.format(e.message)) 248 | logger.info('Checking next Glue job.') 249 | 250 | 251 | glue = boto3.client('glue') 252 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds 253 | sfn_client_config = Config(connect_timeout=50, read_timeout=70) 254 | sfn = boto3.client('stepfunctions', config=sfn_client_config) 255 | dynamodb = boto3.resource('dynamodb') 256 | 257 | # Load logging config and create logger 258 | logger = load_log_config() 259 | 260 | def handler(event, context): 261 | 262 | logger.debug('*** Glue Runner lambda function starting ***') 263 | 264 | try: 265 | 266 | # Get config (including a single activity ARN) from local file 267 | config = load_config() 268 | 269 | # One round of starting Glue jobs 270 | start_glue_jobs(config) 271 | 272 | # One round of checking on Glue jobs 273 | check_glue_jobs(config) 274 | 275 | logger.debug('*** Master Glue Runner terminating ***') 276 | 277 | except Exception as e: 278 | logger.critical('*** ERROR: Glue runner lambda function failed ***') 279 | logger.critical(e.message) 280 | raise 281 | 282 | -------------------------------------------------------------------------------- /lambda/gluerunner/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/gluerunner/requirements.txt -------------------------------------------------------------------------------- /lambda/gluerunner/setup.cfg: -------------------------------------------------------------------------------- 1 | [install] 2 | prefix= -------------------------------------------------------------------------------- /lambda/ons3objectcreated/ons3objectcreated-config.json: -------------------------------------------------------------------------------- 1 | {} -------------------------------------------------------------------------------- /lambda/ons3objectcreated/ons3objectcreated.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import json 4 | import urllib 5 | import boto3 6 | import logging, logging.config 7 | from botocore.client import Config 8 | 9 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds 10 | sfn_client_config = Config(connect_timeout=50, read_timeout=70) 11 | sfn = boto3.client('stepfunctions', config=sfn_client_config) 12 | 13 | sts = boto3.client('sts') 14 | account_id = 
sts.get_caller_identity().get('Account') 15 | region_name = boto3.session.Session().region_name 16 | 17 | def load_log_config(): 18 | # Basic config. Replace with your own logging config if required 19 | root = logging.getLogger() 20 | root.setLevel(logging.INFO) 21 | return root 22 | 23 | def map_activity_arn(bucket, key): 24 | # Map s3 key to activity ARN based on a convention 25 | # Here, we simply map bucket name plus last element in the s3 object key (i.e. filename) to activity name 26 | key_elements = [x.strip() for x in key.split('/')] 27 | activity_name = '{}-{}'.format(bucket, key_elements[-1]) 28 | 29 | return 'arn:aws:states:{}:{}:activity:{}'.format(region_name, account_id, activity_name) 30 | 31 | # Load logging config and create logger 32 | logger = load_log_config() 33 | 34 | def handler(event, context): 35 | logger.info("Received event: " + json.dumps(event, indent=2)) 36 | 37 | # Get the object from the event and show its content type 38 | bucket = event['Records'][0]['s3']['bucket']['name'] 39 | key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8')) 40 | 41 | # Based on a naming convention that maps s3 keys to activity ARNs, deduce the activity arn 42 | sfn_activity_arn = map_activity_arn(bucket, key) 43 | sfn_worker_name = 'on_s3_object_created' 44 | 45 | try: 46 | try: 47 | response = sfn.get_activity_task( 48 | activityArn=sfn_activity_arn, 49 | workerName=sfn_worker_name 50 | ) 51 | 52 | except Exception as e: 53 | logger.critical(e.message) 54 | logger.critical( 55 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn)) 56 | raise 57 | 58 | # Get the Task Token 59 | sfn_task_token = response.get('taskToken', '') 60 | 61 | logger.info('Sending "Task Succeeded" signal to Step Functions..') 62 | 63 | # Build an output dict and format it as JSON 64 | task_output_dict = { 65 | 'S3BucketName': bucket, 66 | 'S3Key': key, 67 | 'SFNActivityArn': sfn_activity_arn 68 | } 69 | 70 | task_output_json = json.dumps(task_output_dict) 71 | 72 | sfn_resp = sfn.send_task_success( 73 | taskToken=sfn_task_token, 74 | output=task_output_json 75 | ) 76 | 77 | 78 | except Exception as e: 79 | logger.critical(e) 80 | raise e 81 | -------------------------------------------------------------------------------- /lambda/ons3objectcreated/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/ons3objectcreated/requirements.txt -------------------------------------------------------------------------------- /lambda/ons3objectcreated/setup.cfg: -------------------------------------------------------------------------------- 1 | [install] 2 | prefix= -------------------------------------------------------------------------------- /lambda/s3-deployment-descriptor.json: -------------------------------------------------------------------------------- 1 | { 2 | "athenarunner": { 3 | "ArtifactBucketName": "", 4 | "LambdaSourceS3Key": "src/athenarunner.zip" 5 | }, 6 | "gluerunner": { 7 | "ArtifactBucketName": "", 8 | "LambdaSourceS3Key": "src/gluerunner.zip" 9 | }, 10 | "ons3objectcreated": { 11 | "ArtifactBucketName": "", 12 | "LambdaSourceS3Key": "src/ons3objectcreated.zip" 13 | } 14 | } -------------------------------------------------------------------------------- /resources/MarketingAndSalesETLOrchestrator-notstarted.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/MarketingAndSalesETLOrchestrator-notstarted.png -------------------------------------------------------------------------------- /resources/MarketingAndSalesETLOrchestrator-success.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/MarketingAndSalesETLOrchestrator-success.png -------------------------------------------------------------------------------- /resources/aws-etl-orchestrator-serverless.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/aws-etl-orchestrator-serverless.png -------------------------------------------------------------------------------- /resources/example-etl-flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/example-etl-flow.png --------------------------------------------------------------------------------
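`start_glue_jobs()` in lambda/gluerunner/gluerunner.py calls `json.loads` twice on the activity task input, so the input handed to the GlueRunnerActivity is expected to be a JSON-encoded string that itself contains the job parameters; the resulting dict is then forwarded to `glue.start_job_run` as the job's `Arguments`. A minimal sketch of that input contract, using a hypothetical job name and argument key:

```python
import json

# Parameters the Glue Runner expects inside the task input
# (the job name and the extra argument key below are hypothetical).
glue_job_params = {
    "GlueJobName": "ProcessSalesDataJob",          # required by start_glue_jobs()
    "GlueJobCapacity": 5,                          # optional; defaults to glue_job_capacity from gluerunner-config.json
    "--s3_input_path": "s3://example-bucket/in/"   # forwarded (along with the keys above) as the Glue job's Arguments
}

# What Step Functions would deliver as response['input'] to get_activity_task():
# a JSON string whose value is itself a JSON document.
activity_input = json.dumps(json.dumps(glue_job_params))

# gluerunner.py then unwraps it in two steps:
task_input = json.loads(activity_input)        # -> the inner JSON string
task_input_dict = json.loads(task_input)       # -> dict with GlueJobName, GlueJobCapacity, ...
assert task_input_dict["GlueJobName"] == "ProcessSalesDataJob"
```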
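Similarly, lambda/ons3objectcreated/ons3objectcreated.py resolves which Step Functions activity to complete purely by naming convention: the activity name is the bucket name joined with the uploaded object's file name. A small illustration of `map_activity_arn`; the bucket, region, and account id are placeholders, and the file name comes from the samples folder:

```python
# Illustration of the naming convention in map_activity_arn() from ons3objectcreated.py.
bucket = 'example-etl-landing-bucket'               # hypothetical bucket
key = 'sales/SalesPipeline_QuickSightSample.csv'    # sample file from samples/

key_elements = [x.strip() for x in key.split('/')]
activity_name = '{}-{}'.format(bucket, key_elements[-1])

# Region and account id below are placeholders.
activity_arn = 'arn:aws:states:{}:{}:activity:{}'.format('us-east-1', '123456789012', activity_name)

print(activity_arn)
# arn:aws:states:us-east-1:123456789012:activity:example-etl-landing-bucket-SalesPipeline_QuickSightSample.csv
```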