├── .github
└── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── NOTICE.txt
├── README.md
├── build.py
├── cloudformation
├── athenarunner-lambda-params.json
├── athenarunner-lambda.yaml
├── glue-resources-params.json
├── glue-resources.yaml
├── gluerunner-lambda-params.json
├── gluerunner-lambda.yaml
├── step-functions-resources-params.json
└── step-functions-resources.yaml
├── glue-scripts
├── join_marketing_and_sales_data.py
├── process_marketing_data.py
└── process_sales_data.py
├── lambda
├── athenarunner
│ ├── athenarunner-config.json
│ ├── athenarunner.py
│ ├── requirements.txt
│ └── setup.cfg
├── global-config.json
├── gluerunner
│ ├── gluerunner-config.json
│ ├── gluerunner.py
│ ├── requirements.txt
│ └── setup.cfg
├── ons3objectcreated
│ ├── ons3objectcreated-config.json
│ ├── ons3objectcreated.py
│ ├── requirements.txt
│ └── setup.cfg
└── s3-deployment-descriptor.json
├── resources
├── MarketingAndSalesETLOrchestrator-notstarted.png
├── MarketingAndSalesETLOrchestrator-success.png
├── aws-etl-orchestrator-serverless.png
└── example-etl-flow.png
└── samples
├── MarketingData_QuickSightSample.csv
└── SalesPipeline_QuickSightSample.csv
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Issue #, if available:*
2 |
3 | *Description of changes:*
4 |
5 |
6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
7 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by .ignore support plugin (hsz.mobi)
2 | # macOS template
3 | # General
4 | .DS_Store
5 | .AppleDouble
6 | .LSOverride
7 |
8 | # Icon must end with two \r
9 | Icon
10 |
11 | # Thumbnails
12 | ._*
13 |
14 | # Files that might appear in the root of a volume
15 | .DocumentRevisions-V100
16 | .fseventsd
17 | .Spotlight-V100
18 | .TemporaryItems
19 | .Trashes
20 | .VolumeIcon.icns
21 | .com.apple.timemachine.donotpresent
22 |
23 |
24 | # Directories potentially created on remote AFP share
25 | .AppleDB
26 | .AppleDesktop
27 | Network Trash Folder
28 | Temporary Items
29 | .apdisk
30 |
31 | # Archives template
32 | # It's better to unpack these files and commit the raw source because
33 | # git has its own built in compression methods.
34 | *.7z
35 | *.jar
36 | *.rar
37 | *.zip
38 | *.gz
39 | *.tgz
40 | *.bzip
41 | *.bz2
42 | *.xz
43 | *.lzma
44 | *.cab
45 |
46 | # Packing-only formats
47 | *.iso
48 | *.tar
49 |
50 | # Package management formats
51 | *.dmg
52 | *.xpi
53 | *.gem
54 | *.egg
55 | *.deb
56 | *.rpm
57 | *.msi
58 | *.msm
59 | *.msp
60 | ### Example user template template
61 | ### Example user template
62 |
63 | # IntelliJ project files
64 | .idea
65 | *.iml
66 | out
67 | gen
68 | notebook
69 |
70 | # GitHub
71 | .github
72 |
73 | # Pynt
74 | *.pyc
75 |
76 | # Python Virtual Environment
77 | venv
78 |
79 | # Copy of customized configuration files
80 | config-copy
81 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check [existing open](https://github.com/aws-samples/aws-etl-orchestrator/issues), or [recently closed](https://github.com/aws-samples/aws-etl-orchestrator/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/aws-etl-orchestrator/labels/help%20wanted) issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](https://github.com/aws-samples/aws-etl-orchestrator/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT No Attribution
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | software and associated documentation files (the "Software"), to deal in the Software
5 | without restriction, including without limitation the rights to use, copy, modify,
6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/NOTICE.txt:
--------------------------------------------------------------------------------
1 | aws-etl-orchestrator
2 | Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Table of Contents
2 | =================
3 |
4 | * [Introduction](#introduction)
5 | * [The challenge of orchestrating an ETL workflow](#the-challenge-of-orchestrating-an-etl-workflow)
6 | * [Example ETL workflow requirements](#example-etl-workflow-requirements)
7 | * [The ETL orchestration architecture and events](#the-etl-orchestration-architecture-and-events)
8 | * [Modeling the ETL orchestration workflow in AWS Step Functions](#modeling-the-etl-orchestration-workflow-in-aws-step-functions)
9 | * [AWS CloudFormation templates](#aws-cloudformation-templates)
10 | * [Handling failed ETL jobs](#handling-failed-etl-jobs)
11 | * [Preparing your development environment](#preparing-your-development-environment)
12 | * [Configuring the project](#configuring-the-project)
13 | * [cloudformation/gluerunner-lambda-params.json](#cloudformationgluerunner-lambda-paramsjson)
14 | * [cloudformation/athenarunner-lambda-params.json](#cloudformationathenarunner-lambda-paramsjson)
15 | * [lambda/s3-deployment-descriptor.json](#lambdas3-deployment-descriptorjson)
16 | * [cloudformation/glue-resources-params.json](#cloudformationglue-resources-paramsjson)
17 | * [lambda/gluerunner/gluerunner-config.json](#lambdagluerunnergluerunner-configjson)
18 | * [cloudformation/step-functions-resources-params.json](#cloudformationstep-functions-resources-paramsjson)
19 | * [Build commands](#build-commands)
20 | * [The packagelambda build command](#the-packagelambda-build-command)
21 | * [The deploylambda build command](#the-deploylambda-build-command)
22 | * [The createstack build command](#the-createstack-build-command)
23 | * [The updatestack build command](#the-updatestack-build-command)
24 | * [The deletestack build command](#the-deletestack-build-command)
25 | * [The deploygluescripts build command](#the-deploygluescripts-build-command)
26 | * [Putting it all together: Example usage of build commands](#putting-it-all-together-example-usage-of-build-commands)
27 | * [License](#license)
28 |
29 |
30 |
31 | # Introduction
32 | Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake. They transform raw data into useful datasets and, ultimately, into actionable insight. An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption. The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS.
33 |
34 | AWS offers [AWS Glue](https://aws.amazon.com/glue/), which is a service that helps author and deploy ETL jobs. AWS Glue is a fully managed extract, transform, and load service that makes it easy for customers to prepare and load their data for analytics. Other AWS Services also can be used to implement and manage ETL jobs. They include: [AWS Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html) (AWS DMS), [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) (using the Steps API), and even [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html).
35 |
36 |
37 | # The challenge of orchestrating an ETL workflow
38 |
39 | How can we orchestrate an ETL workflow that involves a diverse set of ETL technologies? AWS Glue, AWS DMS, Amazon EMR, and other services support [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html), which we could use to chain ETL jobs together. Amazon S3, the central data lake store, also supports CloudWatch Events. But relying on CloudWatch Events alone means that there's no single visual representation of the ETL workflow. Also, tracing the overall ETL workflow's execution status and handling error scenarios can become a challenge.
40 |
41 | In this code sample, I show you how to use [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) and [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) for orchestrating multiple ETL jobs involving a diverse set of technologies in an arbitrarily-complex ETL workflow. AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components. Each component performs a discrete function, or task, allowing you to scale and change applications quickly. Using AWS Step Functions also opens up opportunities for integrating other AWS or external services into your ETL flows. Let’s take an example ETL flow for illustration.
42 |
43 |
44 | # Example ETL workflow requirements
45 |
46 | For our example, we'll use two publicly-available [Amazon QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html) datasets. The first dataset is a [sales pipeline dataset](samples/SalesPipeline_QuickSightSample.csv) (Sales dataset) that contains slightly more than 20,000 sales opportunity records for a fictitious business. Each record has fields that specify:
47 |
48 | * A date, potentially when an opportunity was identified.
49 | * The salesperson’s name.
50 | * A market segment to which the opportunity belongs.
51 | * Forecasted monthly revenue.
52 |
53 | The second dataset is an [online marketing metrics dataset](samples/MarketingData_QuickSightSample.csv) (Marketing dataset). The dataset contains records of marketing metrics, aggregated by day. The metrics describe user engagement across various channels (website, mobile, and social media) plus other marketing metrics. The two datasets are unrelated, but we’ll assume that they are for the purpose of this example.
54 |
55 | Imagine there’s a business user who needs to answer questions based on both datasets. Perhaps the user wants to explore the correlations between online user engagement metrics on the one hand, and forecasted sales revenue and opportunities generated on the other hand. The user engagement metrics include website visits, mobile users, and desktop users.
56 |
57 | The steps in the ETL flow chart are:
58 |
59 | 1. **Process the Sales dataset (PSD).** Read Sales dataset. Group records by day, aggregating the Forecasted Monthly Revenue field. Rename fields to replace white space with underscores. Output the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous outputs.
60 |
61 | 2. **Process the Marketing dataset (PMD).** Read Marketing dataset. Rename fields to replace white space with underscores. Output the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous outputs.
62 |
63 | 3. **Join Sales and Marketing datasets (JMSD).** Read outputs of processing Sales and Marketing datasets. Perform an inner join of both datasets on the date field. Sort in ascending order by date. Output final joined dataset to Amazon S3, overwriting any previous outputs.
65 |
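The aggregation in step 1 can be sketched in plain Python. This is an illustrative sketch only; the actual jobs under `glue-scripts/` are AWS Glue scripts, and the field names here are assumptions:

```python
from collections import defaultdict

def process_sales(records):
    """Group sales records by date, summing forecasted revenue, and rename
    fields to replace white space with underscores (illustrative only; the
    real job is the Glue script glue-scripts/process_sales_data.py)."""
    revenue_by_date = defaultdict(float)
    for rec in records:
        revenue_by_date[rec["Date"]] += rec["Forecasted Monthly Revenue"]
    # White space in field names becomes underscores, as the steps describe.
    return [
        {"Date": d, "Forecasted_Monthly_Revenue": total}
        for d, total in sorted(revenue_by_date.items())
    ]
```

Step 2 follows the same shape without the aggregation, and step 3 joins the two intermediate outputs on the date field.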
66 | So far, this ETL workflow can be implemented with AWS Glue, with the ETL jobs being chained by using job triggers. But you might have other requirements outside of AWS Glue that are part of your end-to-end data processing workflow, such as the following:
67 |
68 | * Both Sales and Marketing datasets are uploaded to an S3 bucket at random times in an interval of up to a week. The PSD job should start as soon as the Sales dataset file is uploaded, and the PMD job as soon as the Marketing dataset file is uploaded. Parallel ETL jobs can start and finish anytime, but the final JMSD job can start only after all parallel ETL jobs are complete.
69 | * In addition to PSD and PMD jobs, the orchestration must support more parallel ETL jobs in the future that contribute to the final dataset aggregated by the JMSD job. The additional ETL jobs could be managed by AWS services, such as AWS Database Migration Service, Amazon EMR, Amazon Athena or other non-AWS services.
70 |
71 |
72 | The data engineer takes these requirements and builds the following ETL workflow chart.
73 | 
74 |
75 |
76 | # The ETL orchestration architecture and events
77 | Let’s see how we can orchestrate such an ETL flow with AWS Step Functions, AWS Glue, and AWS Lambda. The following diagram shows the ETL orchestration architecture in action.
78 |
79 | 
80 |
81 | The main flow of events starts with an AWS Step Functions state machine. This state machine defines the steps in the orchestrated ETL workflow. A state machine can be triggered through [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/?nc2=h_m1) based on a schedule, through the [AWS Command Line Interface (AWS CLI)](https://aws.amazon.com/cli/), or using the various AWS SDKs in an AWS Lambda function or some other execution environment.
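For instance, a state machine execution can be started from Python with the AWS SDK (boto3). The sketch below assembles the call's arguments; the state machine ARN shown in the comment is a hypothetical placeholder:

```python
import json

def build_start_execution_args(state_machine_arn, name, input_payload):
    """Assemble the arguments for Step Functions' StartExecution call."""
    return {
        "stateMachineArn": state_machine_arn,
        "name": name,                        # must be unique per execution
        "input": json.dumps(input_payload),  # the input is a JSON *string*
    }

# With AWS credentials configured, the call itself would be:
#   import boto3
#   boto3.client("stepfunctions").start_execution(**build_start_execution_args(
#       "arn:aws:states:us-east-1:123456789012:stateMachine:MarketingAndSalesETLOrchestrator",
#       "weekly-run-1", {}))
```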
82 |
83 | As the state machine execution progresses, it invokes the ETL jobs. As shown in the diagram, the invocation happens indirectly through intermediary AWS Lambda functions that you author and set up in your account. We'll call this type of function an ETL Runner.
84 |
85 | While the architecture in the diagram shows Amazon Athena, Amazon EMR, and AWS Glue, **this code sample now includes two ETL Runners for both AWS Glue and Amazon Athena**. You can use these ETL Runners to orchestrate AWS Glue jobs and Amazon Athena queries. You can also follow the pattern and implement more ETL Runners to orchestrate other AWS services or non-AWS tools.
86 |
87 | ETL Runners are invoked by [activity tasks](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-activities.html) in Step Functions. Because of the way AWS Step Functions' activity tasks work, ETL Runners need to periodically poll the AWS Step Functions state machine for tasks. The state machine responds by providing a Task object. The Task object contains inputs which enable an ETL Runner to run an ETL job.
88 |
89 | As soon as an ETL Runner receives a task, it starts the respective ETL job. An ETL Runner maintains the state of active jobs in an [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) table. Periodically, the ETL Runner checks the status of active jobs. When an active ETL job completes, the ETL Runner notifies the AWS Step Functions state machine. This allows the ETL workflow in AWS Step Functions to proceed to the next step.
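The Glue Runner's poll-start-track cycle can be sketched as follows. The function and field names here are illustrative assumptions, not the sample's exact schema; the actual implementation is in `lambda/gluerunner/gluerunner.py`:

```python
import json

def task_to_ddb_item(task_token, task_input_json):
    """Turn a Step Functions activity task into a DynamoDB item that tracks
    the active Glue job between Lambda invocations (illustrative schema)."""
    task_input = json.loads(task_input_json)
    return {
        "sfn_task_token": task_token,
        "glue_job_name": task_input["GlueJobName"],
    }

# The surrounding Lambda flow, with boto3 (requires AWS credentials):
#   sfn = boto3.client("stepfunctions")
#   task = sfn.get_activity_task(activityArn=activity_arn, workerName="gluerunner")
#   if task.get("taskToken"):
#       glue = boto3.client("glue")
#       glue.start_job_run(JobName=json.loads(task["input"])["GlueJobName"])
#       # persist task_to_ddb_item(...) in DynamoDB; on a later invocation,
#       # check glue.get_job_run(...) and call sfn.send_task_success(...)
```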
90 |
91 |
92 | ## Modeling the ETL orchestration workflow in AWS Step Functions
93 |
94 | The following snapshot from the AWS Step Functions console shows our example ETL workflow modeled as a state machine. This workflow is what we provide you in the code sample.
95 |
96 | 
97 |
98 | When you start an execution of this state machine, it will branch to run two ETL jobs in parallel: Process Sales Data (PSD) and Process Marketing Data (PMD). But, according to the requirements, neither ETL job should start until its respective dataset is uploaded to Amazon S3. Hence, we implement Wait activity tasks before both PSD and PMD. When a dataset file is uploaded to Amazon S3, this triggers an AWS Lambda function that notifies the state machine to exit the Wait states. When both PMD and PSD jobs are successful, the JMSD job runs to produce the final dataset.
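In Amazon States Language, the parallel branches and their leading Wait activity tasks follow roughly the shape sketched below as a Python dict. State names and activity ARNs here are illustrative placeholders; the authoritative definition is embedded in `step-functions-resources.yaml`:

```python
import json

# Illustrative ASL skeleton: two parallel branches, each gated by an
# activity-task "wait" state before its Glue job runs.
parallel_state = {
    "Type": "Parallel",
    "Branches": [
        {
            "StartAt": "Wait for Sales Data",
            "States": {
                "Wait for Sales Data": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::activity:WaitForSalesData",  # placeholder ARN
                    "Next": "Process Sales Data",
                },
                "Process Sales Data": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::activity:GlueRunnerActivity",  # placeholder ARN
                    "End": True,
                },
            },
        },
        # ... a symmetric branch for the Marketing dataset ...
    ],
    "Next": "Join Marketing And Sales Data",
}

state_machine_json = json.dumps(parallel_state, indent=2)
```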
99 |
100 | Finally, to have this ETL workflow execute once per week, you will need to configure the state machine to start an execution on a weekly schedule using a CloudWatch Events rule.
101 |
102 |
103 | ## AWS CloudFormation templates
104 |
105 | In this code sample, there are three separate AWS CloudFormation templates. You may choose to structure your own project similarly.
106 |
107 | 1. A template responsible for setting up AWS Glue resources ([glue-resources.yaml](cloudformation/glue-resources.yaml)). For our example ETL flow, the sample template creates three AWS Glue jobs: PSD, PMD, and JMSD. The scripts are pulled by AWS CloudFormation from an Amazon S3 bucket that you own.
108 | 2. A template where the AWS Step Functions state machine is defined ([step-functions-resources.yaml](cloudformation/step-functions-resources.yaml)). The state machine definition in Amazon States Language is embedded in a StateMachine resource within this template.
3. A template that sets up Glue Runner resources ([gluerunner-lambda.yaml](cloudformation/gluerunner-lambda.yaml)). Glue Runner is a Python script that is written to be run as an AWS Lambda function.
110 |
111 | For our ETL example, the `step-functions-resources.yaml` CloudFormation template creates a State Machine named `MarketingAndSalesETLOrchestrator`. You can start an execution from the AWS Step Functions console, or through an AWS CLI command.
112 |
113 |
114 | ## Handling failed ETL jobs
115 |
116 | What if a job in the ETL workflow fails? In such a case, there are error-handling strategies available to the ETL workflow developer, from simply notifying an administrator, to fully undoing the effects of the previous jobs through compensating ETL jobs. Detecting and responding to a failed Glue job can be implemented using AWS Step Functions' Catch mechanism, which is explained in [this tutorial](https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-handling-error-conditions.html). In this code sample, the error-handling state is a do-nothing [Pass](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-pass-state.html) state. You can change that state into a [Task](https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-task-state.html) state that triggers your error-handling logic in an AWS Lambda function.
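A Catch clause on a job's Task state might look like the sketch below, again as a Python dict standing in for the ASL JSON. The state names and ARN are illustrative placeholders:

```python
# Illustrative Catch clause for a Glue job Task state: on any error,
# transition to an error-handling state instead of failing the execution.
task_with_catch = {
    "Type": "Task",
    "Resource": "arn:aws:states:::activity:GlueRunnerActivity",  # placeholder ARN
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],  # catch every error type
            "Next": "ETL Job Failed",       # a Pass state in this sample
        }
    ],
    "End": True,
}
```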
117 |
118 |
119 | > NOTE: To make this code sample easier to set up and run, build scripts and commands are included to help you with common tasks such as creating the above AWS CloudFormation stacks in your account. Proceed below to learn how to set up and run this example orchestrated ETL flow.
120 |
121 |
122 |
123 | # Preparing your development environment
124 | Here’s a high-level checklist of what you need to do to set up your development environment.
125 |
126 | 1. Sign up for an AWS account if you haven't already and create an Administrator User. The steps are published [here](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html).
127 |
128 | 2. Ensure that you have Python 2.7 and Pip installed on your machine. Instructions vary based on your operating system and OS version.
129 |
130 | 3. Create a Python [virtual environment](https://virtualenv.pypa.io/en/stable/) for the project with Virtualenv. This helps keep the project’s Python dependencies neatly isolated from your operating system’s default Python installation. **Once you’ve created the virtual environment, activate it before moving on with the following steps.**
131 |
132 | 4. Use Pip to [install AWS CLI](http://docs.aws.amazon.com/cli/latest/userguide/installing.html). [Configure](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) the AWS CLI. It is recommended that the access keys you configure are associated with an IAM User who has full access to the following:
133 | - Amazon S3
134 | - Amazon DynamoDB
135 | - AWS Lambda
136 | - Amazon CloudWatch and CloudWatch Logs
137 | - AWS CloudFormation
138 | - AWS Step Functions
139 | - Creating IAM Roles
140 |
141 | The IAM User can be the Administrator User you created in Step 1.
142 |
143 | 5. Make sure you choose a region where all of the above services are available, such as us-east-1 (N. Virginia). Visit [this page](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) to learn more about service availability in AWS regions.
144 |
145 | 6. Use Pip to install [Pynt](https://github.com/rags/pynt). Pynt is used for the project's Python-based build scripts.
146 |
147 |
148 | # Configuring the project
149 |
150 | This section lists configuration files, parameters within them, and parameter default values. The build commands (detailed later) and CloudFormation templates extract their parameters from these files. Also, the Glue Runner AWS Lambda function extracts parameters at runtime from `gluerunner-config.json`.
151 |
152 | >**NOTE: Do not remove any of the attributes already specified in these files.**
153 |
154 | >**NOTE: You must set the value of any parameter that has the tag NO-DEFAULT.**
155 |
156 |
157 | ## cloudformation/gluerunner-lambda-params.json
158 |
159 | Specifies parameters for creation of the `gluerunner-lambda` CloudFormation stack (as defined in the `gluerunner-lambda.yaml` template). It is also used by several build scripts, including `createstack` and `updatestack`.
160 |
161 | ```json
162 | [
163 | {
164 | "ParameterKey": "ArtifactBucketName",
165 | "ParameterValue": ""
166 | },
167 | {
168 | "ParameterKey": "LambdaSourceS3Key",
169 | "ParameterValue": "src/gluerunner.zip"
170 | },
171 | {
172 | "ParameterKey": "DDBTableName",
173 | "ParameterValue": "GlueRunnerActiveJobs"
174 | },
175 | {
176 | "ParameterKey": "GlueRunnerLambdaFunctionName",
177 | "ParameterValue": "gluerunner"
178 | }
179 | ]
180 | ```
181 | #### Parameters:
182 |
183 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.**
184 |
185 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/gluerunner.zip`) pointing to your AWS Lambda function's .zip package in the artifact bucket.
186 |
187 | * `DDBTableName` - The Amazon DynamoDB table in which the state of active AWS Glue jobs is tracked between Glue Runner AWS Lambda function invocations.
188 |
189 | * `GlueRunnerLambdaFunctionName` - The name to be assigned to the Glue Runner AWS Lambda function.
190 |
191 |
192 | ## cloudformation/athenarunner-lambda-params.json
193 |
194 | Specifies parameters for creation of the `athenarunner-lambda` CloudFormation stack (as defined in the `athenarunner-lambda.yaml` template). It is also used by several build scripts, including `createstack` and `updatestack`.
195 |
196 | ```json
197 | [
198 | {
199 | "ParameterKey": "ArtifactBucketName",
200 | "ParameterValue": ""
201 | },
202 | {
203 | "ParameterKey": "LambdaSourceS3Key",
204 | "ParameterValue": "src/athenarunner.zip"
205 | },
206 | {
207 | "ParameterKey": "DDBTableName",
208 | "ParameterValue": "AthenaRunnerActiveJobs"
209 | },
210 | {
211 | "ParameterKey": "AthenaRunnerLambdaFunctionName",
212 | "ParameterValue": "athenarunner"
213 | }
214 | ]
215 | ```
216 | #### Parameters:
217 |
218 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.**
219 |
220 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/athenarunner.zip`) pointing to your AWS Lambda function's .zip package.
221 |
222 | * `DDBTableName` - The Amazon DynamoDB table in which the state of active Amazon Athena queries is tracked between Athena Runner AWS Lambda function invocations.
223 |
224 | * `AthenaRunnerLambdaFunctionName` - The name to be assigned to the Athena Runner AWS Lambda function.
225 |
226 |
227 |
228 | ## lambda/s3-deployment-descriptor.json
229 |
230 | Specifies the S3 buckets and prefixes for the Glue Runner lambda function (and other functions that may be added later). It is used by the `deploylambda` build script.
231 |
232 | Sample content:
233 |
234 | ```json
235 | {
236 | "gluerunner": {
237 | "ArtifactBucketName": "",
238 | "LambdaSourceS3Key":"src/gluerunner.zip"
239 | },
240 | "ons3objectcreated": {
241 | "ArtifactBucketName": "",
242 | "LambdaSourceS3Key":"src/ons3objectcreated.zip"
243 | }
244 | }
245 | ```
246 | #### Parameters:
247 |
248 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) in which Glue scripts and Lambda function source will be stored. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.**
249 |
250 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/gluerunner.zip`) for your AWS Lambda function's .zip package.
251 |
252 | >**NOTE: The values set here must match values set in `cloudformation/gluerunner-lambda-params.json`.**
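A deployment script might resolve a function's upload location from this descriptor as sketched below. This is a hypothetical helper for illustration; the actual deployment logic lives in `build.py`:

```python
def lambda_package_s3_url(descriptor, function_name):
    """Resolve the s3:// URL a function's .zip package is deployed to,
    given the parsed s3-deployment-descriptor.json contents.
    (Illustrative helper; not the sample's actual build code.)"""
    entry = descriptor[function_name]
    return "s3://{}/{}".format(entry["ArtifactBucketName"],
                               entry["LambdaSourceS3Key"])
```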
253 |
254 |
255 |
256 | ## cloudformation/glue-resources-params.json
257 |
258 | Specifies parameters for creation of the `glue-resources` CloudFormation stack (as defined in `glue-resources.yaml` template). It is also read by several build scripts, including `createstack`, `updatestack`, and `deploygluescripts`.
259 |
260 | ```json
261 | [
262 | {
263 | "ParameterKey": "ArtifactBucketName",
264 | "ParameterValue": ""
265 | },
266 | {
267 | "ParameterKey": "ETLScriptsPrefix",
268 | "ParameterValue": "scripts"
269 | },
270 | {
271 | "ParameterKey": "DataBucketName",
272 | "ParameterValue": ""
273 | },
274 | {
275 | "ParameterKey": "ETLOutputPrefix",
276 | "ParameterValue": "output"
277 | }
278 | ]
279 | ```
280 | #### Parameters:
281 |
282 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) that will be created by the `step-functions-resources.yaml` CloudFormation template. **If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.**
283 |
284 | * `ETLScriptsPrefix` - The Amazon S3 prefix (in the format ``example/path`` without leading or trailing '/') to which AWS Glue scripts will be deployed in the artifact bucket. Glue scripts can be found under the `glue-scripts` project directory.
285 |
286 | * `DataBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) that will be created by the `step-functions-resources.yaml` CloudFormation template. This is the bucket to which Sales and Marketing datasets must be uploaded. It is also the bucket in which output will be created. **This bucket is created by `step-functions-resources` CloudFormation. CloudFormation stack creation will fail if the bucket already exists.**
287 |
288 | * `ETLOutputPrefix` - The Amazon S3 prefix (in the format ``example/path`` without leading or trailing '/') to which AWS Glue jobs will produce their intermediary outputs. This path will be created in the data bucket.
289 |
290 |
291 |
292 | The parameters are used by AWS CloudFormation during the creation of `glue-resources` stack.
293 |
294 |
295 | ## lambda/gluerunner/gluerunner-config.json
296 |
297 | Specifies the parameters used by the Glue Runner AWS Lambda function at run-time.
298 |
299 | ```json
300 | {
301 | "sfn_activity_arn": "arn:aws:states:::activity:GlueRunnerActivity",
302 | "sfn_worker_name": "gluerunner",
303 | "ddb_table": "GlueRunnerActiveJobs",
304 | "ddb_query_limit": 50,
305 | "glue_job_capacity": 10
306 | }
307 | ```
308 | #### Parameters:
309 |
310 | * `sfn_activity_arn` - AWS Step Functions activity task ARN. This ARN is used to query AWS Step Functions for new tasks (i.e. new AWS Glue jobs to run). The ARN is a combination of the AWS region, your AWS account Id, and the name property of the [AWS::StepFunctions::Activity](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-stepfunctions-activity.html) resource in the `step-functions-resources.yaml` CloudFormation template. An ARN has the format `arn:aws:states:{region}:{account-id}:activity:{activity-name}`. By default, the activity name is `GlueRunnerActivity`.
311 |
312 | * `sfn_worker_name` - A property that is passed to AWS Step Functions when getting activity tasks.
313 |
314 | * `ddb_table` - The DynamoDB table name in which state for active AWS Glue jobs is persisted between Glue Runner invocations. **This parameter's value must match the value of `DDBTableName` parameter in the `cloudformation/gluerunner-lambda-params.json` config file.**
315 |
316 | * `ddb_query_limit` - Maximum number of items to be retrieved when the DynamoDB state table is queried. Default is `50` items.
317 |
318 | * `glue_job_capacity` - The capacity, in [Data Processing Units](https://aws.amazon.com/glue/pricing/), allocated to every AWS Glue job started by Glue Runner.
319 |
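The activity ARN in `sfn_activity_arn` follows the standard Step Functions format described above. A small sketch of composing it from its parts (the function name is illustrative, not part of the project):

```python
def build_activity_arn(region, account_id, activity_name="GlueRunnerActivity"):
    """Compose a Step Functions activity ARN from region, account id, and activity name."""
    return "arn:aws:states:{}:{}:activity:{}".format(region, account_id, activity_name)

arn = build_activity_arn("us-east-1", "123456789012")
print(arn)  # arn:aws:states:us-east-1:123456789012:activity:GlueRunnerActivity
```

Glue Runner passes this ARN (together with `sfn_worker_name`) when polling Step Functions for activity tasks.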
320 |
321 | ## cloudformation/step-functions-resources-params.json
322 |
323 | Specifies parameters for creation of the `step-functions-resources` CloudFormation stack (as defined in `step-functions-resources.yaml` template). It is also read by several build scripts, including `createstack`, `updatestack`, and `deploygluescripts`.
324 |
325 | ```json
326 | [
327 | {
328 | "ParameterKey": "ArtifactBucketName",
329 | "ParameterValue": ""
330 | },
331 | {
332 | "ParameterKey": "LambdaSourceS3Key",
333 | "ParameterValue": "src/ons3objectcreated.zip"
334 | },
335 | {
336 | "ParameterKey": "GlueRunnerActivityName",
337 | "ParameterValue": "GlueRunnerActivity"
338 | },
339 | {
340 | "ParameterKey": "DataBucketName",
341 | "ParameterValue": ""
342 | }
343 | ]
344 | ```
345 | #### Parameters:
346 |
347 | * `ArtifactBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix) to which the `ons3objectcreated` AWS Lambda function package (.zip file) will be deployed. If a bucket with such a name does not exist, the `deploylambda` build command will create it for you with appropriate permissions.
348 |
349 | * `LambdaSourceS3Key` - The Amazon S3 key (e.g. `src/ons3objectcreated.zip`) for your AWS Lambda function's .zip package.
350 |
351 | * `GlueRunnerActivityName` - The Step Functions activity name that will be polled for tasks by the Glue Runner lambda function.
352 |
353 | * `DataBucketName` - The Amazon S3 bucket name (without the `s3://...` prefix). All `OnS3ObjectCreated` CloudWatch Events for the bucket will be handled by the `ons3objectcreated` AWS Lambda function. **This bucket will be created by CloudFormation. CloudFormation stack creation will fail if the bucket already exists.**
354 |
355 | These parameters are also used by AWS CloudFormation during stack creation.
356 |
357 |
358 | # Build commands
359 | Common interactions with the project have been simplified for you. Using `pynt`, the following tasks are automated with simple commands:
360 |
361 | - Creating, deleting, updating, and checking the status of AWS CloudFormation stacks under `cloudformation` directory
362 | - Packaging lambda code into a .zip file and deploying it into an Amazon S3 bucket
363 | - Updating Glue Runner lambda package directly in AWS Lambda
364 |
365 | For a list of all available tasks, enter the following command in the base directory of this project:
366 |
367 | ```bash
368 | pynt -l
369 | ```
370 |
371 | Build commands are implemented as Python scripts in the file `build.py`. The scripts use the AWS SDK for Python (Boto3) under the hood. They are documented in the following section.
372 |
373 | >Prior to using these build commands, you must configure the project.
374 |
375 |
376 | ## The `packagelambda` build command
377 |
378 | Run this command to package each AWS Lambda function and its dependencies into a .zip deployment package. The deployment packages are created under the `build/` directory.
379 |
380 | ```bash
381 | pynt packagelambda # Package all AWS Lambda functions and their dependencies into zip files.
382 |
383 | pynt packagelambda[gluerunner] # Package only the Glue Runner function.
384 |
385 | pynt packagelambda[ons3objectcreated] # Package only the On S3 Object Created function.
386 | ```
387 |
388 |
389 | ## The `deploylambda` build command
390 |
391 | Run this command before you run `createstack`. The `deploylambda` command uploads the Lambda .zip packages to Amazon S3 for pickup by AWS CloudFormation during stack creation.
392 |
393 | Here are sample command invocations.
394 |
395 | ```bash
396 | pynt deploylambda # Deploy all functions .zip files to Amazon S3.
397 |
398 | pynt deploylambda[gluerunner] # Deploy only Glue Runner function to Amazon S3
399 |
400 | pynt deploylambda[ons3objectcreated] # Deploy only On S3 Object Created function to Amazon S3
401 | ```
402 |
403 | ## The `createstack` build command
404 | The `createstack` command creates the project's AWS CloudFormation stacks by invoking the AWS CloudFormation `create_stack()` API.
405 |
406 | Valid parameters to this command are `glue-resources`, `gluerunner-lambda`, `athenarunner-lambda`, and `step-functions-resources`.
407 |
408 | Note that for the `gluerunner-lambda` stack, you must first package and deploy the Glue Runner function to Amazon S3 using the `packagelambda` and `deploylambda` commands for AWS CloudFormation stack creation to succeed.
409 |
410 | Examples of using `createstack`:
411 |
412 | ```bash
413 | #Create AWS Step Functions resources
414 | pynt createstack["step-functions-resources"]
415 | ```
416 | ```bash
417 | #Create AWS Glue resources
418 | pynt createstack["glue-resources"]
419 | ```
420 | ```bash
421 | #Create Glue Runner AWS Lambda function resources
422 | pynt createstack["gluerunner-lambda"]
423 | ```
424 |
425 | **For a complete example, please refer to the 'Putting it all together' section.**
426 |
427 | Stack creation should take a few minutes. At any time, you can check on the prototype's stack status either through the AWS CloudFormation console or by issuing the following command.
428 |
429 | ```bash
430 | pynt stackstatus
431 | ```
432 |
433 |
434 | ## The `updatestack` build command
435 |
436 | The `updatestack` command updates any of the project's AWS CloudFormation stacks. Valid parameters to this command are `glue-resources`, `gluerunner-lambda`, `athenarunner-lambda`, and `step-functions-resources`.
437 |
438 | You can issue the `updatestack` command as follows.
439 |
440 | ```bash
441 | pynt updatestack["glue-resources"]
442 | ```
443 |
444 |
445 | ## The `deletestack` build command
446 |
447 | The `deletestack` command calls the AWS CloudFormation `delete_stack()` API to delete a CloudFormation stack from your account. You must specify a stack to delete. Valid values are `glue-resources`, `gluerunner-lambda`, `athenarunner-lambda`, and `step-functions-resources`.
448 |
449 | You can issue the `deletestack` command as follows.
450 |
451 | ```bash
452 | pynt deletestack["glue-resources"]
453 | ```
454 |
455 | > NOTE: Before deleting the `step-functions-resources` stack, you must first delete the S3 bucket specified as the `DataBucketName` parameter value in the `step-functions-resources-params.json` configuration file. You can use the `pynt deletes3bucket[<bucket-name>]` build command to delete the bucket.
456 |
457 | As with `createstack`, you can check the status of stack deletion using the `stackstatus` build command.
458 |
459 |
460 | ## The `deploygluescripts` build command
461 |
462 | This command deploys AWS Glue scripts to the Amazon S3 path you specified in project config files.
463 |
464 | You can issue the `deploygluescripts` command as follows.
465 |
466 | ```bash
467 | pynt deploygluescripts
468 | ```
469 |
470 |
471 | # Putting it all together: Example usage of build commands
472 |
473 |
474 | ```bash
475 | ### AFTER PROJECT CONFIGURATION ###
476 |
477 | # Package and deploy Glue Runner AWS Lambda function
478 | pynt packagelambda
479 | pynt deploylambda
480 |
481 | # Deploy AWS Glue scripts
482 | pynt deploygluescripts
483 |
484 | # Create step-functions-resources stack.
485 | # THIS STACK MUST BE CREATED BEFORE `glue-resources` STACK.
486 | pynt createstack["step-functions-resources"]
487 |
488 | # Create glue-resources stack
489 | pynt createstack["glue-resources"]
490 |
491 | # Create gluerunner-lambda stack
492 | pynt createstack["gluerunner-lambda"]
493 |
494 | # Create athenarunner-lambda stack
495 | pynt createstack["athenarunner-lambda"]
496 |
497 | ```
498 |
499 | Note that the `step-functions-resources` stack **must** be created first, before the `glue-resources` stack.
500 |
501 | Now head to the AWS Step Functions console. Start and observe an execution of the 'MarketingAndSalesETLOrchestrator' state machine. Execution should halt at the 'Wait for XYZ Data' states. At this point, upload the sample .CSV files under the `samples` directory to the S3 bucket you specified as the `DataBucketName` parameter value in the `step-functions-resources-params.json` configuration file. **Upload the marketing sample file under the prefix 'marketing' and the sales sample file under the prefix 'sales'. To do that, you may issue the following AWS CLI commands from the project's root directory:**
502 |
503 | ```bash
504 | aws s3 cp samples/MarketingData_QuickSightSample.csv s3://{DataBucketName}/marketing/
505 |
506 | aws s3 cp samples/SalesPipeline_QuickSightSample.csv s3://{DataBucketName}/sales/
507 | ```
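The prefix-per-dataset convention above can also be expressed as a tiny helper that maps a local sample file to its destination S3 key (a sketch; the function name is hypothetical, and `{DataBucketName}` must still be replaced with your actual bucket name):

```python
import os

def sample_dest_key(local_path, prefix):
    """Return the S3 key under the required dataset prefix for a local sample file."""
    return "{}/{}".format(prefix, os.path.basename(local_path))

print(sample_dest_key("samples/MarketingData_QuickSightSample.csv", "marketing"))
# marketing/MarketingData_QuickSightSample.csv
```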
508 |
509 | This should allow the state machine to move on to the next steps -- Process Sales Data and Process Marketing Data.
510 |
511 | If you have set up and run the sample correctly, you should see this output in the AWS Step Functions console:
512 |
513 | 
514 |
515 | This indicates that all jobs have been run and orchestrated successfully.
516 |
517 |
518 | # License
519 | This project is licensed under the MIT-No Attribution (MIT-0) license.
520 |
521 | A copy of the License is located at
522 |
523 | [https://spdx.org/licenses/MIT-0.html](https://spdx.org/licenses/MIT-0.html)
--------------------------------------------------------------------------------
/build.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import os
17 | import shutil
18 | import zipfile
19 | import time
20 |
21 | from pip._vendor.distlib.compat import raw_input
22 | from pynt import task
23 | import boto3
24 | import botocore
25 | from botocore.exceptions import ClientError
26 | import json
27 | import re
28 |
29 |
30 | def write_dir_to_zip(src, zf):
31 | '''Write a directory tree to an open ZipFile object.'''
32 | abs_src = os.path.abspath(src)
33 | for dirname, subdirs, files in os.walk(src):
34 | for filename in files:
35 | absname = os.path.abspath(os.path.join(dirname, filename))
36 | arcname = absname[len(abs_src) + 1:]
37 | print('zipping {} as {}'.format(os.path.join(dirname, filename),
38 | arcname))
39 | zf.write(absname, arcname)
40 |
41 | def read_json(jsonf_path):
42 | '''Read a JSON file into a dict.'''
43 | with open(jsonf_path, 'r') as jsonf:
44 | json_text = jsonf.read()
45 | return json.loads(json_text)
46 |
47 | def check_bucket_exists(s3path):
48 | s3 = boto3.resource('s3')
49 |
50 | result = re.search('s3://(.*)/', s3path)
51 | bucketname = s3path if result is None else result.group(1)
52 | bucket = s3.Bucket(bucketname)
53 | exists = True
54 |
55 | try:
56 |
57 | s3.meta.client.head_bucket(Bucket=bucketname)
58 | except botocore.exceptions.ClientError as e:
59 | # If a client error is thrown, then check that it was a 404 error.
60 | # If it was a 404 error, then the bucket does not exist.
61 | error_code = int(e.response['Error']['Code'])
62 | if error_code == 404:
63 | exists = False
64 | return exists
65 |
66 | @task()
67 | def clean():
68 | '''Clean build directory.'''
69 | print('Cleaning build directory...')
70 |
71 | if os.path.exists('build'):
72 | shutil.rmtree('build')
73 |
74 | os.mkdir('build')
75 |
76 | @task()
77 | def packagelambda(* functions):
78 |     '''Package lambda functions into deployment-ready .zip files.'''
79 | if not os.path.exists('build'):
80 | os.mkdir('build')
81 |
82 | os.chdir("build")
83 |
84 | if len(functions) == 0:
85 | functions = ("athenarunner", "gluerunner", "ons3objectcreated")
86 |
87 | for function in functions:
88 | print('Packaging "{}" lambda function in directory'.format(function))
89 | zipf = zipfile.ZipFile("%s.zip" % function, "w", zipfile.ZIP_DEFLATED)
90 |
91 | write_dir_to_zip("../lambda/{}/".format(function), zipf)
92 |
93 | zipf.close()
94 |
95 | os.chdir("..")
96 |
97 | return
98 |
99 |
100 | @task()
101 | def updatelambda(*functions):
102 | '''Directly update lambda function code in AWS (without upload to S3).'''
103 | lambda_client = boto3.client('lambda')
104 |
105 | if len(functions) == 0:
106 | functions = ("athenarunner", "gluerunner", "ons3objectcreated")
107 |
108 | for function in functions:
109 | with open('build/%s.zip' % function, 'rb') as zipf:
110 | lambda_client.update_function_code(
111 | FunctionName=function,
112 | ZipFile=zipf.read()
113 | )
114 | return
115 |
116 |
117 | @task()
118 | def deploylambda(*functions, **kwargs):
119 | '''Upload lambda functions .zip file to S3 for download by CloudFormation stack during creation.'''
120 |
121 | if len(functions) == 0:
122 | functions = ("athenarunner", "gluerunner", "ons3objectcreated")
123 |
124 | region_name = boto3.session.Session().region_name
125 |
126 | s3_client = boto3.client("s3")
127 |
128 |     print("Reading ./lambda/s3-deployment-descriptor.json...")
129 |
130 | params = read_json("./lambda/s3-deployment-descriptor.json")
131 |
132 | for function in functions:
133 |
134 | src_s3_bucket_name = params[function]['ArtifactBucketName']
135 | src_s3_key = params[function]['LambdaSourceS3Key']
136 |
137 |         if not src_s3_key or not src_s3_bucket_name:
138 | print(
139 | "ERROR: Both Artifact S3 bucket name and Lambda source S3 key must be specified for function '{}'. FUNCTION NOT DEPLOYED.".format(
140 | function))
141 | continue
142 |
143 | print("Checking if S3 Bucket '{}' exists...".format(src_s3_bucket_name))
144 |
145 | if not check_bucket_exists(src_s3_bucket_name):
146 |             print("Bucket {} not found. Creating in region {}.".format(src_s3_bucket_name, region_name))
147 |
148 | if region_name == "us-east-1":
149 | s3_client.create_bucket(
150 | # ACL="authenticated-read",
151 | Bucket=src_s3_bucket_name
152 | )
153 | else:
154 | s3_client.create_bucket(
155 | # ACL="authenticated-read",
156 | Bucket=src_s3_bucket_name,
157 | CreateBucketConfiguration={
158 | "LocationConstraint": region_name
159 | }
160 | )
161 |
162 | print("Uploading function '{}' to '{}'".format(function, src_s3_key))
163 |
164 | with open('build/{}.zip'.format(function), 'rb') as data:
165 | s3_client.upload_fileobj(data, src_s3_bucket_name, src_s3_key)
166 |
167 | return
168 |
169 |
170 | @task()
171 | def createstack(* stacks, **kwargs):
172 | '''Create stacks using CloudFormation.'''
173 |
174 | if len(stacks) == 0:
175 | print(
176 | "ERROR: Please specify a stack to create. Valid values are glue-resources, gluerunner-lambda, step-functions-resources.")
177 | return
178 |
179 | for stack in stacks:
180 | cfn_path = "cloudformation/{}.yaml".format(stack)
181 | cfn_params_path = "cloudformation/{}-params.json".format(stack)
182 | cfn_params = read_json(cfn_params_path)
183 | stack_name = stack
184 |
185 | cfn_file = open(cfn_path, 'r')
186 | cfn_template = cfn_file.read(51200) #Maximum size of a cfn template
187 |
188 | cfn_client = boto3.client('cloudformation')
189 |
190 | print("Attempting to CREATE '%s' stack using CloudFormation." % stack_name)
191 | start_t = time.time()
192 | response = cfn_client.create_stack(
193 | StackName=stack_name,
194 | TemplateBody=cfn_template,
195 | Parameters=cfn_params,
196 | Capabilities=[
197 | 'CAPABILITY_NAMED_IAM',
198 | ],
199 | )
200 |
201 | print("Waiting until '%s' stack status is CREATE_COMPLETE" % stack_name)
202 |
203 |         try:
204 |             cfn_stack_create_waiter = cfn_client.get_waiter('stack_create_complete')
205 |             cfn_stack_create_waiter.wait(StackName=stack_name)
206 |             print("Stack CREATED in approximately %d secs." % int(time.time() - start_t))
207 |
208 |         except Exception as e:
209 |             print("Stack creation FAILED.")
210 |             print(str(e))
212 |
213 |
214 | @task()
215 | def updatestack(* stacks, **kwargs):
216 | '''Update a CloudFormation stack.'''
217 |
218 | if len(stacks) == 0:
219 | print(
220 |             "ERROR: Please specify a stack to update. Valid values are glue-resources, gluerunner-lambda, step-functions-resources.")
221 | return
222 |
223 | for stack in stacks:
224 | stack_name = stack
225 | cfn_path = "cloudformation/{}.yaml".format(stack)
226 | cfn_params_path = "cloudformation/{}-params.json".format(stack)
227 | cfn_params = read_json(cfn_params_path)
228 |
229 | cfn_file = open(cfn_path, 'r')
230 | cfn_template = cfn_file.read(51200) #Maximum size of a cfn template
231 |
232 | cfn_client = boto3.client('cloudformation')
233 |
234 | print("Attempting to UPDATE '%s' stack using CloudFormation." % stack_name)
235 | try:
236 | start_t = time.time()
237 | response = cfn_client.update_stack(
238 | StackName=stack_name,
239 | TemplateBody=cfn_template,
240 | Parameters=cfn_params,
241 | Capabilities=[
242 | 'CAPABILITY_NAMED_IAM',
243 | ],
244 | )
245 |
246 | print("Waiting until '%s' stack status is UPDATE_COMPLETE" % stack_name)
247 | cfn_stack_update_waiter = cfn_client.get_waiter('stack_update_complete')
248 | cfn_stack_update_waiter.wait(StackName=stack_name)
249 |
250 | print("Stack UPDATED in approximately %d secs." % int(time.time() - start_t))
251 | except ClientError as e:
252 | print("EXCEPTION: " + e.response["Error"]["Message"])
253 |
254 | @task()
255 | def stackstatus(* stacks):
256 | '''Check the status of a CloudFormation stack.'''
257 |
258 | if len(stacks) == 0:
259 | stacks = ("glue-resources", "gluerunner-lambda", "step-functions-resources")
260 |
261 | for stack in stacks:
262 | stack_name = stack
263 |
264 | cfn_client = boto3.client('cloudformation')
265 |
266 | try:
267 | response = cfn_client.describe_stacks(
268 | StackName=stack_name
269 | )
270 |
271 | if response["Stacks"][0]:
272 | print("Stack '%s' has the status '%s'" % (stack_name, response["Stacks"][0]["StackStatus"]))
273 |
274 | except ClientError as e:
275 | print("EXCEPTION: " + e.response["Error"]["Message"])
276 |
277 |
278 | @task()
279 | def deletestack(* stacks):
280 | '''Delete stacks using CloudFormation.'''
281 |
282 | if len(stacks) == 0:
283 | print("ERROR: Please specify a stack to delete.")
284 | return
285 |
286 | for stack in stacks:
287 | stack_name = stack
288 |
289 | cfn_client = boto3.client('cloudformation')
290 |
291 | print("Attempting to DELETE '%s' stack using CloudFormation." % stack_name)
292 | start_t = time.time()
293 | response = cfn_client.delete_stack(
294 | StackName=stack_name
295 | )
296 |
297 | print("Waiting until '%s' stack status is DELETE_COMPLETE" % stack_name)
298 | cfn_stack_delete_waiter = cfn_client.get_waiter('stack_delete_complete')
299 | cfn_stack_delete_waiter.wait(StackName=stack_name)
300 | print("Stack DELETED in approximately %d secs." % int(time.time() - start_t))
301 |
302 |
303 | @task()
304 | def deploygluescripts(**kwargs):
305 | '''Upload AWS Glue scripts to S3 for download by CloudFormation stack during creation.'''
306 |
307 | region_name = boto3.session.Session().region_name
308 |
309 | s3_client = boto3.client("s3")
310 |
311 | glue_scripts_path = "./glue-scripts/"
312 |
313 | glue_cfn_params = read_json("cloudformation/glue-resources-params.json")
314 |
315 | s3_etl_script_path = ''
316 | bucket_name = ''
317 | prefix = ''
318 | for param in glue_cfn_params:
319 | if param['ParameterKey'] == 'ArtifactBucketName':
320 | bucket_name = param['ParameterValue']
321 | if param['ParameterKey'] == 'ETLScriptsPrefix':
322 | prefix = param['ParameterValue']
323 |
324 | if not bucket_name or not prefix:
325 | print(
326 | "ERROR: ArtifactBucketName and ETLScriptsPrefix must be set in 'cloudformation/glue-resources-params.json'.")
327 | return
328 |
329 | s3_etl_script_path = 's3://' + bucket_name + '/' + prefix
330 |
331 | result = re.search('s3://(.+?)/(.*)', s3_etl_script_path)
332 | if result is None:
333 | print("ERROR: Invalid S3 ETL bucket name and/or script prefix.")
334 | return
335 |
336 | print("Checking if S3 Bucket '{}' exists...".format(bucket_name))
337 |
338 | if not check_bucket_exists(bucket_name):
339 | print("ERROR: S3 bucket '{}' not found.".format(bucket_name))
340 | return
341 |
342 | for dirname, subdirs, files in os.walk(glue_scripts_path):
343 | for filename in files:
344 | absname = os.path.abspath(os.path.join(dirname, filename))
345 | print("Uploading AWS Glue script '{}' to '{}/{}'".format(absname, bucket_name, prefix))
346 | with open(absname, 'rb') as data:
347 | s3_client.upload_fileobj(data, bucket_name, '{}/{}'.format(prefix, filename))
348 |
349 | return
350 |
351 |
352 | @task()
353 | def deletes3bucket(name):
354 | '''DELETE ALL objects in an Amazon S3 bucket and THE BUCKET ITSELF. Use with caution!'''
355 |
356 | proceed = raw_input(
357 | "This command will DELETE ALL DATA in S3 bucket '%s' and the BUCKET ITSELF.\nDo you wish to continue? [Y/N] " \
358 | % name)
359 |
360 | if proceed.lower() != 'y':
361 | print("Aborting deletion.")
362 | return
363 |
364 | print("Attempting to DELETE ALL OBJECTS in '%s' S3 bucket." % name)
365 |
366 | s3 = boto3.resource('s3')
367 | bucket = s3.Bucket(name)
368 | bucket.objects.delete()
369 | bucket.delete()
370 | return
--------------------------------------------------------------------------------
/cloudformation/athenarunner-lambda-params.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "ParameterKey": "ArtifactBucketName",
4 | "ParameterValue": ""
5 | },
6 | {
7 | "ParameterKey": "LambdaSourceS3Key",
8 | "ParameterValue": "src/athenarunner.zip"
9 | },
10 | {
11 | "ParameterKey": "DDBTableName",
12 | "ParameterValue": "AthenaRunnerActiveJobs"
13 | },
14 | {
15 | "ParameterKey": "AthenaRunnerLambdaFunctionName",
16 | "ParameterValue": "athenarunner"
17 | }
18 | ]
--------------------------------------------------------------------------------
/cloudformation/athenarunner-lambda.yaml:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | AWSTemplateFormatVersion: 2010-09-09
17 |
18 | Description: The CloudFormation template for AWS resources required by Athena Runner.
19 |
20 | Parameters:
21 |
22 | DDBTableName:
23 | Type: String
24 | Default: "AthenaRunnerActiveQueries"
25 | Description: "Name of the DynamoDB table for persistence & querying of active Athena queries' statuses."
26 |
27 | AthenaRunnerLambdaFunctionName:
28 | Type: String
29 | Default: "athenarunner"
30 | Description: "Name of the Lambda function that mediates between AWS Step Functions and AWS Athena."
31 |
32 | ArtifactBucketName:
33 | Type: String
34 | MinLength: "1"
35 | Description: "Name of the S3 bucket containing source .zip files."
36 |
37 | LambdaSourceS3Key:
38 | Type: String
39 | MinLength: "1"
40 | Description: "Name of the S3 key of Athena Runner lambda function .zip file."
41 |
42 | Resources:
43 |
44 | AthenaRunnerLambdaExecutionRole:
45 | Type: "AWS::IAM::Role"
46 | Properties:
47 | AssumeRolePolicyDocument:
48 | Version: '2012-10-17'
49 | Statement:
50 | - Effect: Allow
51 | Principal:
52 | Service:
53 | - lambda.amazonaws.com
54 | Action:
55 | - sts:AssumeRole
56 | Path: "/"
57 |
58 | AthenaRunnerPolicy:
59 | Type: "AWS::IAM::Policy"
60 | Properties:
61 | PolicyDocument: {
62 | "Version": "2012-10-17",
63 | "Statement": [{
64 | "Effect": "Allow",
65 | "Action": [
66 | "dynamodb:GetItem",
67 | "dynamodb:Query",
68 | "dynamodb:PutItem",
69 | "dynamodb:UpdateItem",
70 | "dynamodb:DeleteItem"
71 | ],
72 | "Resource": !Sub "arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${DDBTableName}"
73 | },
74 | {
75 | "Effect": "Allow",
76 | "Action": [
77 | "logs:CreateLogStream",
78 | "logs:PutLogEvents"
79 | ],
80 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
81 | },
82 | {
83 | "Effect": "Allow",
84 | "Action": "logs:CreateLogGroup",
85 | "Resource": "*"
86 | },
87 | {
88 | "Effect": "Allow",
89 | "Action": [
90 | "states:SendTaskSuccess",
91 | "states:SendTaskFailure",
92 | "states:SendTaskHeartbeat",
93 | "states:GetActivityTask"
94 | ],
95 | "Resource": "*"
96 | },
97 | {
98 | "Effect": "Allow",
99 | "Action": [
100 | "athena:StartQueryExecution",
101 | "athena:GetQueryExecution",
102 | "athena:GetNamedQuery"
103 | ],
104 | "Resource": "*"
105 | }
106 | ]
107 | }
108 | PolicyName: "AthenaRunnerPolicy"
109 | Roles:
110 | - !Ref AthenaRunnerLambdaExecutionRole
111 |
112 | AthenaRunnerLambdaFunction:
113 | Type: AWS::Lambda::Function
114 | Properties:
115 |       FunctionName: !Ref AthenaRunnerLambdaFunctionName
116 | Description: "Starts and monitors AWS Athena queries on behalf of AWS Step Functions for a specific Activity ARN."
117 | Handler: "athenarunner.handler"
118 | Role: !GetAtt AthenaRunnerLambdaExecutionRole.Arn
119 | Code:
120 | S3Bucket: !Ref ArtifactBucketName
121 | S3Key: !Ref LambdaSourceS3Key
122 | Timeout: 180 #seconds
123 | MemorySize: 128 #MB
124 | Runtime: python2.7
125 | DependsOn:
126 | - AthenaRunnerLambdaExecutionRole
127 |
128 | ScheduledRule:
129 | Type: "AWS::Events::Rule"
130 | Properties:
131 | Description: "ScheduledRule"
132 | ScheduleExpression: "rate(3 minutes)"
133 | State: "ENABLED"
134 | Targets:
135 | -
136 | Arn: !GetAtt AthenaRunnerLambdaFunction.Arn
137 | Id: "AthenaRunner"
138 |
139 | PermissionForEventsToInvokeLambda:
140 | Type: "AWS::Lambda::Permission"
141 | Properties:
142 | FunctionName: !Ref AthenaRunnerLambdaFunction
143 | Action: "lambda:InvokeFunction"
144 | Principal: "events.amazonaws.com"
145 | SourceArn: !GetAtt ScheduledRule.Arn
146 |
147 | AthenaRunnerActiveQueriesTable:
148 | Type: "AWS::DynamoDB::Table"
149 | Properties:
150 | TableName: !Ref DDBTableName
151 | KeySchema:
152 | - KeyType: "HASH"
153 | AttributeName: "sfn_activity_arn"
154 | - KeyType: "RANGE"
155 | AttributeName: "athena_query_execution_id"
156 | AttributeDefinitions:
157 | - AttributeName: "sfn_activity_arn"
158 | AttributeType: "S"
159 | - AttributeName: "athena_query_execution_id"
160 | AttributeType: "S"
161 | ProvisionedThroughput:
162 | WriteCapacityUnits: 10
163 | ReadCapacityUnits: 10
--------------------------------------------------------------------------------
/cloudformation/glue-resources-params.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "ParameterKey": "ArtifactBucketName",
4 | "ParameterValue": ""
5 | },
6 | {
7 | "ParameterKey": "ETLScriptsPrefix",
8 | "ParameterValue": "scripts"
9 | },
10 | {
11 | "ParameterKey": "DataBucketName",
12 | "ParameterValue": ""
13 | },
14 | {
15 | "ParameterKey": "ETLOutputPrefix",
16 | "ParameterValue": "output"
17 | }
18 | ]
--------------------------------------------------------------------------------
/cloudformation/glue-resources.yaml:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 |
17 | AWSTemplateFormatVersion: "2010-09-09"
18 | Description: >
19 | This template sets up sample AWS Glue resources to be orchestrated by AWS Step Functions.
20 |
21 | Parameters:
22 |
23 | MarketingAndSalesDatabaseName:
24 | Type: String
25 | MinLength: "4"
26 | Default: "marketingandsales_qs"
27 | Description: "Name of the AWS Glue database to contain this CloudFormation template's tables."
28 |
29 | SalesPipelineTableName:
30 | Type: String
31 | MinLength: "4"
32 | Default: "salespipeline_qs"
33 | Description: "Name of the Sales Pipeline data table in AWS Glue."
34 |
35 | MarketingTableName:
36 | Type: String
37 | MinLength: "4"
38 | Default: "marketing_qs"
39 | Description: "Name of the Marketing data table in AWS Glue."
40 |
41 | ETLScriptsPrefix:
42 | Type: String
43 | MinLength: "1"
44 | Description: "Location of the Glue job ETL scripts in S3."
45 |
46 | ETLOutputPrefix:
47 | Type: String
48 | MinLength: "1"
49 | Description: "Name of the S3 output path to which this CloudFormation template's AWS Glue jobs are going to write ETL output."
50 |
51 | DataBucketName:
52 | Type: String
53 | MinLength: "1"
54 |     Description: "Name of the S3 bucket to which the source Marketing and Sales data will be uploaded. Bucket is created by this CFT."
55 |
56 | ArtifactBucketName:
57 | Type: String
58 | MinLength: "1"
59 | Description: "Name of the S3 bucket in which the Marketing and Sales ETL scripts reside. Bucket is NOT created by this CFT."
60 |
61 | Resources:
62 |
63 | ### AWS GLUE RESOURCES ###
64 | AWSGlueJobRole:
65 | Type: "AWS::IAM::Role"
66 | Properties:
67 | AssumeRolePolicyDocument:
68 | Version: '2012-10-17'
69 | Statement:
70 | - Effect: Allow
71 | Principal:
72 | Service:
73 | - glue.amazonaws.com
74 | Action:
75 | - sts:AssumeRole
76 | Policies:
77 | - PolicyName: root
78 | PolicyDocument:
79 | Version: 2012-10-17
80 | Statement:
81 | - Effect: Allow
82 | Action:
83 | - "s3:GetObject"
84 | - "s3:PutObject"
85 | - "s3:ListBucket"
86 | - "s3:DeleteObject"
87 | Resource:
88 | - !Sub "arn:aws:s3:::${DataBucketName}"
89 | - !Sub "arn:aws:s3:::${DataBucketName}/*"
90 | - !Sub "arn:aws:s3:::${ArtifactBucketName}"
91 | - !Sub "arn:aws:s3:::${ArtifactBucketName}/*"
92 | ManagedPolicyArns:
93 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
94 | Path: "/"
95 |
96 |
97 | MarketingAndSalesDatabase:
98 | Type: "AWS::Glue::Database"
99 | Properties:
100 | DatabaseInput:
101 | Description: "Marketing and Sales database (Amazon QuickSight Samples)."
102 | Name: !Ref MarketingAndSalesDatabaseName
103 | CatalogId: !Ref AWS::AccountId
104 |
105 | SalesPipelineTable:
106 | Type: "AWS::Glue::Table"
107 | DependsOn: MarketingAndSalesDatabase
108 | Properties:
109 | TableInput:
110 | Description: "Sales Pipeline table (Amazon QuickSight Sample)."
111 | TableType: "EXTERNAL_TABLE"
112 | Parameters: {
113 | "CrawlerSchemaDeserializerVersion": "1.0",
114 | "compressionType": "none",
115 | "classification": "csv",
116 | "recordCount": "16831",
117 | "typeOfData": "file",
118 | "CrawlerSchemaSerializerVersion": "1.0",
119 | "columnsOrdered": "true",
120 | "objectCount": "1",
121 | "delimiter": ",",
122 | "skip.header.line.count": "1",
123 | "averageRecordSize": "119",
124 | "sizeKey": "2002910"
125 | }
126 | StorageDescriptor:
127 | StoredAsSubDirectories: False
128 | Parameters: {
129 | "CrawlerSchemaDeserializerVersion": "1.0",
130 | "compressionType": "none",
131 | "classification": "csv",
132 | "recordCount": "16831",
133 | "typeOfData": "file",
134 | "CrawlerSchemaSerializerVersion": "1.0",
135 | "columnsOrdered": "true",
136 | "objectCount": "1",
137 | "delimiter": ",",
138 | "skip.header.line.count": "1",
139 | "averageRecordSize": "119",
140 | "sizeKey": "2002910"
141 | }
142 | InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
143 | OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
144 | Columns:
145 | - Type: string
146 | Name: date
147 | - Type: string
148 | Name: salesperson
149 | - Type: string
150 | Name: lead name
151 | - Type: string
152 | Name: segment
153 | - Type: string
154 | Name: region
155 | - Type: string
156 | Name: target close
157 | - Type: bigint
158 | Name: forecasted monthly revenue
159 | - Type: string
160 | Name: opportunity stage
161 | - Type: bigint
162 | Name: weighted revenue
163 | - Type: boolean
164 | Name: closed opportunity
165 | - Type: boolean
166 | Name: active opportunity
167 | - Type: boolean
168 | Name: latest status entry
169 | SerdeInfo:
170 | Parameters: {
171 | "field.delim": ","
172 | }
173 | SerializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
174 | Compressed: False
175 | Location: !Sub "s3://${DataBucketName}/sales/"
176 | Retention: 0
177 | Name: !Ref SalesPipelineTableName
178 | DatabaseName: !Ref MarketingAndSalesDatabaseName
179 | CatalogId: !Ref AWS::AccountId
180 |
181 | MarketingTable:
182 | Type: "AWS::Glue::Table"
183 | DependsOn: MarketingAndSalesDatabase
184 | Properties:
185 | TableInput:
186 | Description: "Marketing table (Amazon QuickSight Sample)."
187 | TableType: "EXTERNAL_TABLE"
188 | Parameters: {
189 | "CrawlerSchemaDeserializerVersion": "1.0",
190 | "compressionType": "none",
191 | "classification": "csv",
192 | "recordCount": "948",
193 | "typeOfData": "file",
194 | "CrawlerSchemaSerializerVersion": "1.0",
195 | "columnsOrdered": "true",
196 | "objectCount": "1",
197 | "delimiter": ",",
198 | "skip.header.line.count": "1",
199 | "averageRecordSize": "160",
200 | "sizeKey": "151746"
201 | }
202 | StorageDescriptor:
203 | StoredAsSubDirectories: False
204 | Parameters: {
205 | "CrawlerSchemaDeserializerVersion": "1.0",
206 | "compressionType": "none",
207 | "classification": "csv",
208 | "recordCount": "948",
209 | "typeOfData": "file",
210 | "CrawlerSchemaSerializerVersion": "1.0",
211 | "columnsOrdered": "true",
212 | "objectCount": "1",
213 | "delimiter": ",",
214 | "skip.header.line.count": "1",
215 | "averageRecordSize": "160",
216 | "sizeKey": "151746"
217 | }
218 | InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
219 | OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
220 | Columns:
221 | - Type: string
222 | Name: date
223 | - Type: bigint
224 | Name: new visitors seo
225 | - Type: bigint
226 | Name: new visitors cpc
227 | - Type: bigint
228 | Name: new visitors social media
229 | - Type: bigint
230 | Name: return visitors
231 | - Type: bigint
232 | Name: twitter mentions
233 | - Type: bigint
234 | Name: twitter follower adds
235 | - Type: bigint
236 | Name: twitter followers cumulative
237 | - Type: bigint
238 | Name: mailing list adds
239 | - Type: bigint
240 | Name: mailing list cumulative
241 | - Type: bigint
242 | Name: website pageviews
243 | - Type: bigint
244 | Name: website visits
245 | - Type: bigint
246 | Name: website unique visits
247 | - Type: bigint
248 | Name: mobile uniques
249 | - Type: bigint
250 | Name: tablet uniques
251 | - Type: bigint
252 | Name: desktop uniques
253 | - Type: bigint
254 | Name: free sign up
255 | - Type: bigint
256 | Name: paid conversion
257 | - Type: string
258 | Name: events
259 | SerdeInfo:
260 | Parameters: {
261 | "field.delim": ","
262 | }
263 | SerializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
264 | Compressed: False
265 | Location: !Sub "s3://${DataBucketName}/marketing/"
266 | Retention: 0
267 | Name: !Ref MarketingTableName
268 | DatabaseName: !Ref MarketingAndSalesDatabaseName
269 | CatalogId: !Ref AWS::AccountId
270 |
271 | ProcessSalesDataJob:
272 | Type: "AWS::Glue::Job"
273 | Properties:
274 | Role: !Ref AWSGlueJobRole
275 | Name: "ProcessSalesData"
276 | Command: {
277 | "Name" : "glueetl",
278 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/process_sales_data.py"
279 | }
280 | DefaultArguments: {
281 | "--database_name" : !Ref MarketingAndSalesDatabaseName,
282 | "--table_name" : !Ref SalesPipelineTableName,
283 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/sales"
284 | }
285 | MaxRetries: 0
286 | Description: "Process Sales Pipeline data."
287 | AllocatedCapacity: 5
288 |
289 | ProcessMarketingDataJob:
290 | Type: "AWS::Glue::Job"
291 | Properties:
292 | Role: !Ref AWSGlueJobRole
293 | Name: "ProcessMarketingData"
294 | Command: {
295 | "Name" : "glueetl",
296 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/process_marketing_data.py"
297 | }
298 | DefaultArguments: {
299 | "--database_name" : !Ref MarketingAndSalesDatabaseName,
300 | "--table_name" : !Ref MarketingTableName,
301 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/marketing"
302 | }
303 | MaxRetries: 0
304 | Description: "Process Marketing data."
305 | AllocatedCapacity: 5
306 |
307 | JoinMarketingAndSalesDataJob:
308 | Type: "AWS::Glue::Job"
309 | Properties:
310 | Role: !Ref AWSGlueJobRole
311 | Name: "JoinMarketingAndSalesData"
312 | Command: {
313 | "Name" : "glueetl",
314 | "ScriptLocation": !Sub "s3://${ArtifactBucketName}/${ETLScriptsPrefix}/join_marketing_and_sales_data.py"
315 | }
316 | DefaultArguments: {
317 | "--database_name": !Ref MarketingAndSalesDatabaseName,
318 | "--s3_output_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/sales-leads-influenced",
319 | "--s3_sales_data_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/sales",
320 | "--s3_marketing_data_path": !Sub "s3://${DataBucketName}/${ETLOutputPrefix}/tmp/marketing"
321 | }
322 | MaxRetries: 0
323 | Description: "Join Marketing and Sales data."
324 | AllocatedCapacity: 5
325 |
--------------------------------------------------------------------------------
/cloudformation/gluerunner-lambda-params.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "ParameterKey": "ArtifactBucketName",
4 | "ParameterValue": ""
5 | },
6 | {
7 | "ParameterKey": "LambdaSourceS3Key",
8 | "ParameterValue": "src/gluerunner.zip"
9 | },
10 | {
11 | "ParameterKey": "DDBTableName",
12 | "ParameterValue": "GlueRunnerActiveJobs"
13 | },
14 | {
15 | "ParameterKey": "GlueRunnerLambdaFunctionName",
16 | "ParameterValue": "gluerunner"
17 | }
18 | ]
--------------------------------------------------------------------------------
/cloudformation/gluerunner-lambda.yaml:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | AWSTemplateFormatVersion: 2010-09-09
17 |
18 | Description: The CloudFormation template for AWS resources required by Glue Runner.
19 |
20 | Parameters:
21 |
22 | DDBTableName:
23 | Type: String
24 | Default: "GlueRunnerActiveJobs"
25 | Description: "Name of the DynamoDB table for persistence & querying of active Glue jobs' statuses."
26 |
27 | GlueRunnerLambdaFunctionName:
28 | Type: String
29 | Default: "gluerunner"
30 | Description: "Name of the Lambda function that mediates between AWS Step Functions and AWS Glue."
31 |
32 | ArtifactBucketName:
33 | Type: String
34 | MinLength: "1"
35 | Description: "Name of the S3 bucket containing source .zip files."
36 |
37 | LambdaSourceS3Key:
38 | Type: String
39 | MinLength: "1"
40 |     Description: "S3 key of the Glue Runner Lambda function .zip file."
41 |
42 | Resources:
43 |
44 | GlueRunnerLambdaExecutionRole:
45 | Type: "AWS::IAM::Role"
46 | Properties:
47 | AssumeRolePolicyDocument:
48 | Version: '2012-10-17'
49 | Statement:
50 | - Effect: Allow
51 | Principal:
52 | Service:
53 | - lambda.amazonaws.com
54 | Action:
55 | - sts:AssumeRole
56 | Path: "/"
57 |
58 | GlueRunnerPolicy:
59 | Type: "AWS::IAM::Policy"
60 | Properties:
61 | PolicyDocument: {
62 | "Version": "2012-10-17",
63 | "Statement": [{
64 | "Effect": "Allow",
65 | "Action": [
66 | "dynamodb:GetItem",
67 | "dynamodb:Query",
68 | "dynamodb:PutItem",
69 | "dynamodb:UpdateItem",
70 | "dynamodb:DeleteItem"
71 | ],
72 | "Resource": !Sub "arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${DDBTableName}"
73 | },
74 | {
75 | "Effect": "Allow",
76 | "Action": [
77 | "logs:CreateLogStream",
78 | "logs:PutLogEvents"
79 | ],
80 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
81 | },
82 | {
83 | "Effect": "Allow",
84 | "Action": "logs:CreateLogGroup",
85 | "Resource": "*"
86 | },
87 | {
88 | "Effect": "Allow",
89 | "Action": [
90 | "states:SendTaskSuccess",
91 | "states:SendTaskFailure",
92 | "states:SendTaskHeartbeat",
93 | "states:GetActivityTask"
94 | ],
95 | "Resource": "*"
96 | },
97 | {
98 | "Effect": "Allow",
99 | "Action": [
100 | "glue:StartJobRun",
101 | "glue:GetJobRun"
102 | ],
103 | "Resource": "*"
104 | }
105 | ]
106 | }
107 | PolicyName: "GlueRunnerPolicy"
108 | Roles:
109 | - !Ref GlueRunnerLambdaExecutionRole
110 |
111 | GlueRunnerLambdaFunction:
112 | Type: AWS::Lambda::Function
113 | Properties:
114 |       FunctionName: !Ref GlueRunnerLambdaFunctionName
115 | Description: "Starts and monitors AWS Glue jobs on behalf of AWS Step Functions for a specific Activity ARN."
116 | Handler: "gluerunner.handler"
117 | Role: !GetAtt GlueRunnerLambdaExecutionRole.Arn
118 | Code:
119 | S3Bucket: !Ref ArtifactBucketName
120 | S3Key: !Ref LambdaSourceS3Key
121 | Timeout: 180 #seconds
122 | MemorySize: 128 #MB
123 | Runtime: python2.7
124 | DependsOn:
125 | - GlueRunnerLambdaExecutionRole
126 |
127 | ScheduledRule:
128 | Type: "AWS::Events::Rule"
129 | Properties:
130 | Description: "ScheduledRule"
131 | ScheduleExpression: "rate(3 minutes)"
132 | State: "ENABLED"
133 | Targets:
134 | -
135 | Arn: !GetAtt GlueRunnerLambdaFunction.Arn
136 | Id: "GlueRunner"
137 |
138 | PermissionForEventsToInvokeLambda:
139 | Type: "AWS::Lambda::Permission"
140 | Properties:
141 | FunctionName: !Ref GlueRunnerLambdaFunction
142 | Action: "lambda:InvokeFunction"
143 | Principal: "events.amazonaws.com"
144 | SourceArn: !GetAtt ScheduledRule.Arn
145 |
146 | GlueRunnerActiveJobsTable:
147 | Type: "AWS::DynamoDB::Table"
148 | Properties:
149 | TableName: !Ref DDBTableName
150 | KeySchema:
151 | - KeyType: "HASH"
152 | AttributeName: "sfn_activity_arn"
153 | - KeyType: "RANGE"
154 | AttributeName: "glue_job_run_id"
155 | AttributeDefinitions:
156 | - AttributeName: "sfn_activity_arn"
157 | AttributeType: "S"
158 | - AttributeName: "glue_job_run_id"
159 | AttributeType: "S"
160 | ProvisionedThroughput:
161 | WriteCapacityUnits: 10
162 | ReadCapacityUnits: 10
--------------------------------------------------------------------------------
/cloudformation/step-functions-resources-params.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "ParameterKey": "ArtifactBucketName",
4 | "ParameterValue": ""
5 | },
6 | {
7 | "ParameterKey": "LambdaSourceS3Key",
8 | "ParameterValue": "src/ons3objectcreated.zip"
9 | },
10 | {
11 | "ParameterKey": "GlueRunnerActivityName",
12 | "ParameterValue": "GlueRunnerActivity"
13 | },
14 | {
15 | "ParameterKey": "DataBucketName",
16 | "ParameterValue": "etl-orchestrator-745-data"
17 | }
18 | ]
--------------------------------------------------------------------------------
/cloudformation/step-functions-resources.yaml:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | AWSTemplateFormatVersion: "2010-09-09"
17 |
18 | Description: >
19 |   This template sets up sample AWS Step Functions resources that orchestrate AWS Glue jobs and crawlers.
20 |
21 | Parameters:
22 |
23 | MarketingAndSalesDatabaseName:
24 | Type: String
25 | MinLength: "4"
26 | Default: "marketingandsales_qs"
27 | Description: "Name of the AWS Glue database."
28 |
29 | SalesPipelineTableName:
30 | Type: String
31 | MinLength: "4"
32 | Default: "salespipeline_qs"
33 | Description: "Name of the Sales Pipeline data table in AWS Glue."
34 |
35 | MarketingTableName:
36 | Type: String
37 | MinLength: "4"
38 | Default: "marketing_qs"
39 | Description: "Name of the Marketing data table in AWS Glue."
40 |
41 | GlueRunnerActivityName:
42 | Type: String
43 | Default: "GlueRunnerActivity"
44 | Description: "Name of the AWS Step Functions activity to be polled by GlueRunner."
45 |
46 | AthenaRunnerActivityName:
47 | Type: String
48 | Default: "AthenaRunnerActivity"
49 | Description: "Name of the AWS Step Functions activity to be polled by AthenaRunner."
50 |
51 | ArtifactBucketName:
52 | Type: String
53 | MinLength: "1"
54 | Description: "Name of the S3 bucket containing source .zip files. Bucket is NOT created by this CFT."
55 |
56 | LambdaSourceS3Key:
57 | Type: String
58 | MinLength: "1"
59 |     Description: "S3 key of the OnS3ObjectCreated Lambda function .zip file."
60 |
61 | DataBucketName:
62 | Type: String
63 | MinLength: "1"
64 |     Description: "Name of the S3 bucket to which the source Marketing and Sales data will be uploaded. Bucket is created by this CFT."
65 |
66 | Resources:
67 |
68 | #IAM Resources
69 | StateExecutionRole:
70 | Type: "AWS::IAM::Role"
71 | Properties:
72 | AssumeRolePolicyDocument:
73 | Version: '2012-10-17'
74 | Statement:
75 | - Effect: Allow
76 | Principal:
77 | Service:
78 |             - !Sub "states.${AWS::Region}.amazonaws.com"
79 | Action:
80 | - sts:AssumeRole
81 | Policies:
82 | -
83 | PolicyName: "StatesExecutionPolicy"
84 | PolicyDocument:
85 | Version: "2012-10-17"
86 | Statement:
87 | -
88 | Effect: "Allow"
89 | Action: "lambda:InvokeFunction"
90 | Resource: "*"
91 | Path: "/"
92 |
93 | # Activity resources
94 | GlueRunnerActivity:
95 | Type: "AWS::StepFunctions::Activity"
96 | Properties:
97 | Name: !Ref GlueRunnerActivityName
98 |
99 | AthenaRunnerActivity:
100 | Type: "AWS::StepFunctions::Activity"
101 | Properties:
102 | Name: !Ref AthenaRunnerActivityName
103 |
104 | WaitForSalesDataActivity:
105 | Type: "AWS::StepFunctions::Activity"
106 | Properties:
107 | Name: !Sub "${DataBucketName}-SalesPipeline_QuickSightSample.csv"
108 |
109 | WaitForMarketingDataActivity:
110 | Type: "AWS::StepFunctions::Activity"
111 | Properties:
112 | Name: !Sub "${DataBucketName}-MarketingData_QuickSightSample.csv"
113 |
114 |
115 | # State Machine resources
116 |
117 | # State machine for testing Athena Runner
118 | AthenaRunnerTestETLOrchestrator:
119 | Type: "AWS::StepFunctions::StateMachine"
120 | Properties:
121 | StateMachineName: AthenaRunnerTestETLOrchestrator
122 | DefinitionString:
123 | Fn::Sub:
124 | - |-
125 | {
126 | "StartAt": "Configure Athena Query",
127 | "States": {
128 | "Configure Athena Query":{
129 | "Type": "Pass",
130 | "Result": "{ \"AthenaQueryString\" : \"SELECT * FROM ${GlueTableName} limit 10;\", \"AthenaDatabase\": \"${GlueDatabaseName}\", \"AthenaResultOutputLocation\": \"${AthenaResultOutputLocation}\", \"AthenaResultEncryptionOption\": \"${AthenaResultEncryptionOption}\"}",
131 | "Next": "Execute Athena Query"
132 | },
133 | "Execute Athena Query":{
134 | "Type": "Task",
135 | "Resource": "${AthenaRunnerActivityArn}",
136 | "End": true
137 | }
138 | }
139 | }
140 | - {
141 | GlueDatabaseName: !Ref MarketingAndSalesDatabaseName,
142 | GlueTableName: !Ref MarketingTableName,
143 | AthenaRunnerActivityArn: !Ref AthenaRunnerActivity,
144 | AthenaResultOutputLocation: !Sub "s3://${DataBucketName}/athena-runner-output/",
145 | AthenaResultEncryptionOption: "SSE_S3"
146 | }
147 | RoleArn: !GetAtt StateExecutionRole.Arn
148 |
149 | MarketingAndSalesETLOrchestrator:
150 | Type: "AWS::StepFunctions::StateMachine"
151 | Properties:
152 | StateMachineName: MarketingAndSalesETLOrchestrator
153 | DefinitionString:
154 | Fn::Sub:
155 | - |-
156 | {
157 | "Comment": "Marketing and Sales ETL Pipeline Orchestrator.",
158 | "StartAt": "Pre-process",
159 | "States": {
160 |
161 | "Pre-process": {
162 | "Type": "Pass",
163 | "Next": "Start Parallel Glue Jobs"
164 | },
165 | "Start Parallel Glue Jobs": {
166 | "Type": "Parallel",
167 | "Branches": [
168 | {
169 | "StartAt": "Wait For Sales Data",
170 | "States": {
171 | "Wait For Sales Data": {
172 | "Type": "Task",
173 | "Resource": "${WaitForSalesDataActivityArn}",
174 | "Next": "Prep for Process Sales Data"
175 | },
176 | "Prep for Process Sales Data": {
177 | "Type": "Pass",
178 | "Result": "{\"GlueJobName\" : \"${ProcessSalesDataGlueJobName}\"}",
179 | "Next": "Process Sales Data"
180 | },
181 | "Process Sales Data": {
182 | "Type": "Task",
183 | "Resource": "${GlueRunnerActivityArn}",
184 | "End": true
185 | }
186 | }
187 | },
188 | {
189 | "StartAt": "Wait For Marketing Data",
190 | "States": {
191 | "Wait For Marketing Data": {
192 | "Type": "Task",
193 | "Resource": "${WaitForMarketingDataActivityArn}",
194 | "Next": "Prep for Process Marketing Data"
195 | },
196 | "Prep for Process Marketing Data": {
197 | "Type": "Pass",
198 | "Result": "{\"GlueJobName\" : \"${ProcessMarketingDataGlueJobName}\"}",
199 | "Next": "Process Marketing Data"
200 | },
201 | "Process Marketing Data": {
202 | "Type": "Task",
203 | "Resource": "${GlueRunnerActivityArn}",
204 | "End": true
205 | }
206 | }
207 | }
208 | ],
209 | "Next": "Prep For Joining Marketing And Sales Data",
210 | "Catch": [ {
211 | "ErrorEquals": ["GlueJobFailedError"],
212 | "Next": "ETL Job Failed Fallback"
213 | }]
214 | },
215 | "Prep For Joining Marketing And Sales Data": {
216 | "Type": "Pass",
217 | "Result": "{\"GlueJobName\" : \"${JoinMarketingAndSalesDataJobName}\"}",
218 | "Next": "Join Marketing And Sales Data"
219 | },
220 | "Join Marketing And Sales Data": {
221 | "Type": "Task",
222 | "Resource": "${GlueRunnerActivityArn}",
223 | "Catch": [ {
224 | "ErrorEquals": ["GlueJobFailedError"],
225 | "Next": "ETL Job Failed Fallback"
226 | }],
227 | "End": true
228 | },
229 | "ETL Job Failed Fallback": {
230 | "Type": "Pass",
231 | "Result": "This is a fallback from an ETL job failure.",
232 | "End": true
233 | }
234 | }
235 | }
236 | - {
237 | GlueRunnerActivityArn : !Ref GlueRunnerActivity,
238 | WaitForSalesDataActivityArn: !Ref WaitForSalesDataActivity,
239 | WaitForMarketingDataActivityArn: !Ref WaitForMarketingDataActivity,
240 | ProcessSalesDataGlueJobName: "ProcessSalesData",
241 | ProcessMarketingDataGlueJobName: "ProcessMarketingData",
242 | JoinMarketingAndSalesDataJobName: "JoinMarketingAndSalesData"
243 | }
244 | RoleArn: !GetAtt StateExecutionRole.Arn
245 |
246 |
247 |
248 | # On S3 Object Created Lambda Functions resources
249 |
250 | OnS3ObjectCreatedLambdaExecutionRole:
251 | Type: "AWS::IAM::Role"
252 | Properties:
253 | AssumeRolePolicyDocument:
254 | Version: '2012-10-17'
255 | Statement:
256 | - Effect: Allow
257 | Principal:
258 | Service:
259 | - lambda.amazonaws.com
260 | Action:
261 | - sts:AssumeRole
262 | ManagedPolicyArns:
263 | - arn:aws:iam::aws:policy/AmazonS3FullAccess
264 | - arn:aws:iam::aws:policy/AmazonSNSFullAccess
265 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
266 | - arn:aws:iam::aws:policy/AWSStepFunctionsFullAccess
267 | Path: "/"
268 |
269 | OnS3ObjectCreatedPolicy:
270 | Type: "AWS::IAM::Policy"
271 | Properties:
272 | PolicyDocument: {
273 | "Version": "2012-10-17",
274 | "Statement": [
275 | {
276 | "Effect": "Allow",
277 | "Action": [
278 | "logs:CreateLogStream",
279 | "logs:PutLogEvents"
280 | ],
281 | "Resource": !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
282 | },
283 | {
284 | "Effect": "Allow",
285 | "Action": "logs:CreateLogGroup",
286 | "Resource": "*"
287 | },
288 | {
289 | "Effect": "Allow",
290 | "Action": [
291 | "states:SendTaskSuccess",
292 | "states:SendTaskFailure",
293 | "states:SendTaskHeartbeat",
294 | "states:GetActivityTask"
295 | ],
296 | "Resource": "*"
297 | }
298 | ]
299 | }
300 | PolicyName: "OnS3ObjectCreatedPolicy"
301 | Roles:
302 | - !Ref OnS3ObjectCreatedLambdaExecutionRole
303 |
304 |
305 |
306 | OnS3ObjectCreatedLambdaFunction:
307 | Type: "AWS::Lambda::Function"
308 | Properties:
309 | FunctionName: "ons3objectcreated"
310 | Description: "Enables a Step Function State Machine to continue execution after a new object is created in S3."
311 | Handler: "ons3objectcreated.handler"
312 | Role: !GetAtt OnS3ObjectCreatedLambdaExecutionRole.Arn
313 | Code:
314 | S3Bucket: !Ref ArtifactBucketName
315 | S3Key: !Ref LambdaSourceS3Key
316 | Timeout: 180 #seconds
317 | MemorySize: 128 #MB
318 | Runtime: python2.7
319 | DependsOn:
320 | - OnS3ObjectCreatedLambdaExecutionRole
321 |
322 | # For every bucket that needs to invoke OnS3ObjectCreated:
323 |
324 | DataBucket:
325 | Type: "AWS::S3::Bucket"
326 | Properties:
327 | BucketName: !Ref DataBucketName
328 | NotificationConfiguration:
329 | LambdaConfigurations:
330 | - Function: !GetAtt OnS3ObjectCreatedLambdaFunction.Arn
331 | Event: "s3:ObjectCreated:*"
332 | Filter:
333 | S3Key:
334 | Rules:
335 | -
336 | Name: "suffix"
337 | Value: "csv"
338 | DependsOn:
339 | - DataBucketPermission
340 |
341 | DataBucketPermission:
342 | Type: "AWS::Lambda::Permission"
343 | Properties:
344 | Action: 'lambda:InvokeFunction'
345 | FunctionName: !Ref OnS3ObjectCreatedLambdaFunction
346 | Principal: s3.amazonaws.com
347 | SourceAccount: !Ref "AWS::AccountId"
348 | SourceArn: !Sub "arn:aws:s3:::${DataBucketName}"
--------------------------------------------------------------------------------
/glue-scripts/join_marketing_and_sales_data.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import sys
17 | import pyspark.sql.functions as func
18 | from awsglue.dynamicframe import DynamicFrame
19 | from awsglue.transforms import *
20 | from awsglue.utils import getResolvedOptions
21 | from pyspark.context import SparkContext
22 | from awsglue.context import GlueContext
23 | from awsglue.job import Job
24 |
25 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 's3_sales_data_path', 's3_marketing_data_path'])
26 | s3_output_path = args['s3_output_path']
27 | s3_sales_data_path = args['s3_sales_data_path']
28 | s3_marketing_data_path = args['s3_marketing_data_path']
29 |
30 | sc = SparkContext.getOrCreate()
31 | glueContext = GlueContext(sc)
32 | spark = glueContext.spark_session
33 | job = Job(glueContext)
34 | job.init(args['JOB_NAME'], args)
35 |
36 | salesForecastedByDate_DF = \
37 | glueContext.spark_session.read.option("header", "true")\
38 | .load(s3_sales_data_path, format="parquet")
39 |
40 | mktg_DF = \
41 | glueContext.spark_session.read.option("header", "true")\
42 | .load(s3_marketing_data_path, format="parquet")
43 |
44 |
45 | salesForecastedByDate_DF\
46 | .join(mktg_DF, 'date', 'inner')\
47 | .orderBy(salesForecastedByDate_DF['date']) \
48 | .write \
49 | .format('csv') \
50 | .option('header', 'true') \
51 | .mode('overwrite') \
52 |     .save(s3_output_path)
53 |
54 | # Commit the job, matching the other Glue scripts in this repository.
55 | job.commit()
--------------------------------------------------------------------------------
/glue-scripts/process_marketing_data.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import sys
17 | import pyspark.sql.functions as func
18 | from awsglue.dynamicframe import DynamicFrame
19 | from awsglue.transforms import *
20 | from awsglue.utils import getResolvedOptions
21 | from pyspark.context import SparkContext
22 | from awsglue.context import GlueContext
23 | from awsglue.job import Job
24 |
25 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 'database_name', 'table_name'])
26 | s3_output_path = args['s3_output_path']
27 | database_name = args['database_name']
28 | table_name = args['table_name']
29 |
30 | sc = SparkContext.getOrCreate()
31 | glueContext = GlueContext(sc)
32 | spark = glueContext.spark_session
33 | job = Job(glueContext)
34 | job.init(args['JOB_NAME'], args)
35 |
36 |
37 | mktg_DyF = glueContext.create_dynamic_frame\
38 | .from_catalog(database=database_name, table_name=table_name)
39 |
40 | mktg_DyF = ApplyMapping.apply(frame=mktg_DyF, mappings=[
41 | ('date', 'string', 'date', 'string'),
42 | ('new visitors seo', 'bigint', 'new_visitors_seo', 'bigint'),
43 | ('new visitors cpc', 'bigint', 'new_visitors_cpc', 'bigint'),
44 | ('new visitors social media', 'bigint', 'new_visitors_social_media', 'bigint'),
45 | ('return visitors', 'bigint', 'return_visitors', 'bigint'),
46 | ], transformation_ctx='applymapping1')
47 |
48 | mktg_DyF.printSchema()
49 |
50 | mktg_DF = mktg_DyF.toDF()
51 |
52 | mktg_DF.write\
53 | .format('parquet')\
54 | .option('header', 'true')\
55 | .mode('overwrite')\
56 | .save(s3_output_path)
57 |
58 | job.commit()
--------------------------------------------------------------------------------
/glue-scripts/process_sales_data.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import sys
17 | import pyspark.sql.functions as func
18 | from awsglue.dynamicframe import DynamicFrame
19 | from awsglue.transforms import *
20 | from awsglue.utils import getResolvedOptions
21 | from pyspark.context import SparkContext
22 | from awsglue.context import GlueContext
23 | from awsglue.job import Job
24 |
25 |
26 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_path', 'database_name', 'table_name'])
27 | s3_output_path = args['s3_output_path']
28 | database_name = args['database_name']
29 | table_name = args['table_name']
30 |
31 | sc = SparkContext.getOrCreate()
32 | glueContext = GlueContext(sc)
33 | spark = glueContext.spark_session
34 | job = Job(glueContext)
35 | job.init(args['JOB_NAME'], args)
36 |
37 |
38 |
39 | sales_DyF = glueContext.create_dynamic_frame\
40 | .from_catalog(database=database_name, table_name=table_name)
41 |
42 | sales_DyF = ApplyMapping.apply(frame=sales_DyF, mappings=[
43 | ('date', 'string', 'date', 'string'),
44 | ('lead name', 'string', 'lead_name', 'string'),
45 | ('forecasted monthly revenue', 'bigint', 'forecasted_monthly_revenue', 'bigint'),
46 | ('opportunity stage', 'string', 'opportunity_stage', 'string'),
47 | ('weighted revenue', 'bigint', 'weighted_revenue', 'bigint'),
48 | ('target close', 'string', 'target_close', 'string'),
49 | ('closed opportunity', 'string', 'closed_opportunity', 'string'),
50 | ('active opportunity', 'string', 'active_opportunity', 'string'),
51 | ('last status entry', 'string', 'last_status_entry', 'string'),
52 | ], transformation_ctx='applymapping1')
53 |
54 | sales_DyF.printSchema()
55 |
56 | sales_DF = sales_DyF.toDF()
57 |
58 | salesForecastedByDate_DF = \
59 | sales_DF\
60 | .where(sales_DF['opportunity_stage'] == 'Lead')\
61 | .groupBy('date')\
62 | .agg({'forecasted_monthly_revenue': 'sum', 'opportunity_stage': 'count'})\
63 | .orderBy('date')\
64 | .withColumnRenamed('count(opportunity_stage)', 'leads_generated')\
65 | .withColumnRenamed('sum(forecasted_monthly_revenue)', 'forecasted_lead_revenue')\
66 | .coalesce(1)
67 |
68 | salesForecastedByDate_DF.write\
69 | .format('parquet')\
70 | .option('header', 'true')\
71 | .mode('overwrite')\
72 | .save(s3_output_path)
73 |
74 | job.commit()
75 |
--------------------------------------------------------------------------------
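The DataFrame pipeline above filters to the 'Lead' stage, then per date counts opportunities and sums forecasted revenue. A minimal stdlib-only sketch of the same aggregation (the sample rows are made up):

```python
from collections import defaultdict

def forecast_by_date(rows):
    """Per-date lead count and revenue sum, mirroring the groupBy/agg
    chain in process_sales_data.py (filter on 'Lead' stage first)."""
    agg = defaultdict(lambda: {'leads_generated': 0, 'forecasted_lead_revenue': 0})
    for r in rows:
        if r['opportunity_stage'] == 'Lead':
            agg[r['date']]['leads_generated'] += 1
            agg[r['date']]['forecasted_lead_revenue'] += r['forecasted_monthly_revenue']
    return dict(sorted(agg.items()))  # orderBy('date')

rows = [
    {'date': '1/2/17', 'opportunity_stage': 'Lead', 'forecasted_monthly_revenue': 100},
    {'date': '1/2/17', 'opportunity_stage': 'Closed Won', 'forecasted_monthly_revenue': 50},
    {'date': '1/3/17', 'opportunity_stage': 'Lead', 'forecasted_monthly_revenue': 200},
]
print(forecast_by_date(rows))
```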
/lambda/athenarunner/athenarunner-config.json:
--------------------------------------------------------------------------------
1 | {
2 |     "sfn_activity_arn": "arn:aws:states:::activity:AthenaRunnerActivity",
3 | "sfn_worker_name": "athenarunner",
4 | "ddb_table": "AthenaRunnerActiveJobs",
5 | "ddb_query_limit": 50,
6 | "sfn_max_executions": 100
7 | }
--------------------------------------------------------------------------------
/lambda/athenarunner/athenarunner.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import boto3
17 | import logging, logging.config
18 | import json
19 | from botocore.client import Config
20 | from boto3.dynamodb.conditions import Key, Attr
21 |
22 |
23 | def load_log_config():
24 | # Basic config. Replace with your own logging config if required
25 | root = logging.getLogger()
26 | root.setLevel(logging.INFO)
27 | return root
28 |
29 | def load_config():
30 | with open('athenarunner-config.json', 'r') as conf_file:
31 | conf_json = conf_file.read()
32 | return json.loads(conf_json)
33 |
34 | def is_json(jsonstring):
35 | try:
36 | json_object = json.loads(jsonstring)
37 | except ValueError:
38 | return False
39 | return True
40 |
41 | def start_athena_queries(config):
42 |
43 | ddb_table = dynamodb.Table(config["ddb_table"])
44 |
45 | logger.debug('Athena runner started')
46 |
47 | sfn_activity_arn = config['sfn_activity_arn']
48 | sfn_worker_name = config['sfn_worker_name']
49 |
50 | # Loop until no tasks are available from Step Functions
51 | while True:
52 |
53 | logger.debug('Polling for Athena tasks for Step Functions Activity ARN: {}'.format(sfn_activity_arn))
54 |
55 | try:
56 | response = sfn.get_activity_task(
57 | activityArn=sfn_activity_arn,
58 | workerName=sfn_worker_name
59 | )
60 |
61 | except Exception as e:
62 |             logger.critical(str(e))
63 | logger.critical(
64 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn))
65 | raise
66 |
67 | # Keep the new Task Token for reference later
68 | task_token = response.get('taskToken', '')
69 |
70 | # If there are no tasks, return
71 | if not task_token:
72 | logger.debug('No tasks available.')
73 | return
74 |
75 |
76 | logger.info('Received task. Task token: {}'.format(task_token))
77 |
78 | # Parse task input and create Athena query input
79 | task_input = ''
80 | try:
81 |
82 | task_input = json.loads(response['input'])
83 | task_input_dict = json.loads(task_input)
84 |
85 | athena_named_query = task_input_dict.get('AthenaNamedQuery', None)
86 | if athena_named_query is None:
87 | athena_query_string = task_input_dict['AthenaQueryString']
88 | athena_database = task_input_dict['AthenaDatabase']
89 | else:
90 | logger.debug('Retrieving details of named query "{}" from Athena'.format(athena_named_query))
91 |
92 | response = athena.get_named_query(
93 | NamedQueryId=athena_named_query
94 | )
95 |
96 | athena_query_string = response['NamedQuery']['QueryString']
97 | athena_database = response['NamedQuery']['Database']
98 |
99 | athena_result_output_location = task_input_dict['AthenaResultOutputLocation']
100 | athena_result_encryption_option = task_input_dict.get('AthenaResultEncryptionOption', None)
101 | athena_result_kms_key = task_input_dict.get('AthenaResultKmsKey', "NONE").capitalize()
102 |
103 | athena_result_encryption_config = {}
104 | if athena_result_encryption_option is not None:
105 |
106 | athena_result_encryption_config['EncryptionOption'] = athena_result_encryption_option
107 |
108 | if athena_result_encryption_option in ["SSE_KMS", "CSE_KMS"]:
109 | athena_result_encryption_config['KmsKey'] = athena_result_kms_key
110 |
111 |
112 |     except Exception:
113 | logger.critical('Invalid Athena Runner input. Make sure required input parameters are properly specified.')
114 | raise
115 |
116 |
117 |
118 | # Run Athena Query
119 | logger.info('Querying "{}" Athena database using query string: "{}"'.format(athena_database, athena_query_string))
120 |
121 | try:
122 |
123 | response = athena.start_query_execution(
124 | QueryString=athena_query_string,
125 | QueryExecutionContext={
126 | 'Database': athena_database
127 | },
128 | ResultConfiguration={
129 | 'OutputLocation': athena_result_output_location,
130 | 'EncryptionConfiguration': athena_result_encryption_config
131 | }
132 | )
133 |
134 | athena_query_execution_id = response['QueryExecutionId']
135 |
136 | # Store SFN 'Task Token' and 'Query Execution Id' in DynamoDB
137 |
138 | item = {
139 | 'sfn_activity_arn': sfn_activity_arn,
140 | 'athena_query_execution_id': athena_query_execution_id,
141 | 'sfn_task_token': task_token
142 | }
143 |
144 | ddb_table.put_item(Item=item)
145 |
146 |
147 | except Exception as e:
148 | logger.error('Failed to query Athena database "{}" with query string "{}"..'.format(athena_database, athena_query_string))
149 |             logger.error('Reason: {}'.format(str(e)))
150 | logger.info('Sending "Task Failed" signal to Step Functions.')
151 |
152 | response = sfn.send_task_failure(
153 | taskToken=task_token,
154 | error='Failed to start Athena query. Check Athena Runner logs for more details.'
155 | )
156 | return
157 |
158 |
159 | logger.info('Athena query started. Query Execution Id: {}'.format(athena_query_execution_id))
160 |
161 |
162 | def check_athena_queries(config):
163 |
164 | # Query all items in table for a particular SFN activity ARN
165 | # This should retrieve records for all started athena queries for this particular activity ARN
166 | ddb_table = dynamodb.Table(config['ddb_table'])
167 | sfn_activity_arn = config['sfn_activity_arn']
168 |
169 | ddb_resp = ddb_table.query(
170 | KeyConditionExpression=Key('sfn_activity_arn').eq(sfn_activity_arn),
171 | Limit=config['ddb_query_limit']
172 | )
173 |
174 | # For each item...
175 | for item in ddb_resp['Items']:
176 |
177 | athena_query_execution_id = item['athena_query_execution_id']
178 | sfn_task_token = item['sfn_task_token']
179 |
180 | try:
181 | logger.debug('Polling Athena query execution status..')
182 |
183 | # Query athena query execution status...
184 | athena_resp = athena.get_query_execution(
185 | QueryExecutionId=athena_query_execution_id
186 | )
187 |
188 | query_exec_resp = athena_resp['QueryExecution']
189 | query_exec_state = query_exec_resp['Status']['State']
190 | query_state_change_reason = query_exec_resp['Status'].get('StateChangeReason', '')
191 |
192 |             logger.debug('Query with Execution Id {} is currently in state "{}"'.format(athena_query_execution_id,
193 |                                                                                          query_exec_state))
194 |
195 | # If Athena query completed, return success:
196 | if query_exec_state in ['SUCCEEDED']:
197 |
198 | logger.info('Query with Execution Id {} SUCCEEDED.'.format(athena_query_execution_id))
199 |
200 | # Build an output dict and format it as JSON
201 | task_output_dict = {
202 | "AthenaQueryString": query_exec_resp['Query'],
203 | "AthenaQueryExecutionId": athena_query_execution_id,
204 | "AthenaQueryExecutionState": query_exec_state,
205 | "AthenaQueryExecutionStateChangeReason": query_state_change_reason,
206 | "AthenaQuerySubmissionDateTime": query_exec_resp['Status'].get('SubmissionDateTime', '').strftime(
207 | '%x, %-I:%M %p %Z'),
208 | "AthenaQueryCompletionDateTime": query_exec_resp['Status'].get('CompletionDateTime', '').strftime(
209 | '%x, %-I:%M %p %Z'),
210 | "AthenaQueryEngineExecutionTimeInMillis": query_exec_resp['Statistics'].get(
211 | 'EngineExecutionTimeInMillis', 0),
212 | "AthenaQueryDataScannedInBytes": query_exec_resp['Statistics'].get('DataScannedInBytes', 0)
213 | }
214 |
215 | task_output_json = json.dumps(task_output_dict)
216 |
217 | logger.info('Sending "Task Succeeded" signal to Step Functions..')
218 | sfn_resp = sfn.send_task_success(
219 | taskToken=sfn_task_token,
220 | output=task_output_json
221 | )
222 |
223 | # Delete item
224 | resp = ddb_table.delete_item(
225 | Key={
226 | 'sfn_activity_arn': sfn_activity_arn,
227 | 'athena_query_execution_id': athena_query_execution_id
228 | }
229 | )
230 |
231 | # Task succeeded, next item
232 |
233 | elif query_exec_state in ['RUNNING', 'QUEUED']:
234 |                 logger.debug(
235 |                     'Query with Execution Id {} hasn\'t completed yet.'.format(athena_query_execution_id))
236 |
237 | # Send heartbeat
238 | sfn_resp = sfn.send_task_heartbeat(
239 | taskToken=sfn_task_token
240 | )
241 |
242 | logger.debug('Heartbeat sent to Step Functions.')
243 |
244 | # Heartbeat sent, next item
245 |
246 | elif query_exec_state in ['FAILED', 'CANCELLED']:
247 |
248 | message = 'Athena query with Execution Id "{}" failed. Last state: {}. Error message: {}' \
249 | .format(athena_query_execution_id, query_exec_state, query_state_change_reason)
250 |
251 | logger.error(message)
252 |
253 | message_json = {
254 | "AthenaQueryString": query_exec_resp['Query'],
255 | "AthenaQueryExecutionId": athena_query_execution_id,
256 | "AthenaQueryExecutionState": query_exec_state,
257 | "AthenaQueryExecutionStateChangeReason": query_state_change_reason,
258 | "AthenaQuerySubmissionDateTime": query_exec_resp['Status'].get('SubmissionDateTime', '').strftime(
259 | '%x, %-I:%M %p %Z'),
260 | "AthenaQueryCompletionDateTime": query_exec_resp['Status'].get('CompletionDateTime', '').strftime(
261 | '%x, %-I:%M %p %Z'),
262 | "AthenaQueryEngineExecutionTimeInMillis": query_exec_resp['Statistics'].get(
263 | 'EngineExecutionTimeInMillis', 0),
264 | "AthenaQueryDataScannedInBytes": query_exec_resp['Statistics'].get('DataScannedInBytes', 0)
265 | }
266 |
267 | sfn_resp = sfn.send_task_failure(
268 | taskToken=sfn_task_token,
269 | cause=json.dumps(message_json),
270 | error='AthenaQueryFailedError'
271 | )
272 |
273 | # Delete item
274 | resp = ddb_table.delete_item(
275 | Key={
276 | 'sfn_activity_arn': sfn_activity_arn,
277 | 'athena_query_execution_id': athena_query_execution_id
278 | }
279 | )
280 |
281 | logger.error(message)
282 |
283 | except Exception as e:
284 | logger.error('There was a problem checking status of Athena query..')
285 |         logger.error('Athena query Execution Id "{}"'.format(athena_query_execution_id))
286 |         logger.error('Reason: {}'.format(str(e)))
287 | logger.info('Checking next Athena query.')
288 |
289 | # Task failed, next item
290 |
291 |
292 | athena = boto3.client('athena')
293 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds
294 | sfn_client_config = Config(connect_timeout=50, read_timeout=70)
295 | sfn = boto3.client('stepfunctions', config=sfn_client_config)
296 | dynamodb = boto3.resource('dynamodb')
297 |
298 | # Load logging config and create logger
299 | logger = load_log_config()
300 |
301 | def handler(event, context):
302 |
303 | logger.debug('*** Athena Runner lambda function starting ***')
304 |
305 | try:
306 |
307 | # Get config (including a single activity ARN) from local file
308 | config = load_config()
309 |
310 | # One round of starting Athena queries
311 | start_athena_queries(config)
312 |
313 | # One round of checking on Athena queries
314 | check_athena_queries(config)
315 |
316 |         logger.debug('*** Athena Runner terminating ***')
317 |
318 | except Exception as e:
319 | logger.critical('*** ERROR: Athena runner lambda function failed ***')
320 |         logger.critical(str(e))
321 | raise
322 |
323 |
--------------------------------------------------------------------------------
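Internal lines 103–109 of `athenarunner.py` build the optional `EncryptionConfiguration` block passed to `start_query_execution`: no option means no encryption block, and only `SSE_KMS`/`CSE_KMS` carry a KMS key. The same logic, extracted as a standalone function for clarity (the function name is ours, not part of the runner):

```python
def build_encryption_config(option=None, kms_key='None'):
    """Optional Athena result encryption config: empty when no option is
    given; SSE_KMS and CSE_KMS additionally require the KMS key."""
    config = {}
    if option is not None:
        config['EncryptionOption'] = option
        if option in ('SSE_KMS', 'CSE_KMS'):
            config['KmsKey'] = kms_key
    return config

print(build_encryption_config('SSE_KMS', 'alias/etl-results'))
# {'EncryptionOption': 'SSE_KMS', 'KmsKey': 'alias/etl-results'}
```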
/lambda/athenarunner/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/athenarunner/requirements.txt
--------------------------------------------------------------------------------
/lambda/athenarunner/setup.cfg:
--------------------------------------------------------------------------------
1 | [install]
2 | prefix=
--------------------------------------------------------------------------------
/lambda/global-config.json:
--------------------------------------------------------------------------------
1 | {}
--------------------------------------------------------------------------------
/lambda/gluerunner/gluerunner-config.json:
--------------------------------------------------------------------------------
1 | {
2 | "sfn_activity_arn": "arn:aws:states:::activity:GlueRunnerActivity",
3 | "sfn_worker_name": "gluerunner",
4 | "ddb_table": "GlueRunnerActiveJobs",
5 | "ddb_query_limit": 50,
6 | "glue_job_capacity": 10
7 | }
--------------------------------------------------------------------------------
/lambda/gluerunner/gluerunner.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | # software and associated documentation files (the "Software"), to deal in the Software
5 | # without restriction, including without limitation the rights to use, copy, modify,
6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | # permit persons to whom the Software is furnished to do so.
8 | #
9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 | import boto3
17 | import logging, logging.config
18 | import json
19 | from botocore.client import Config
20 | from boto3.dynamodb.conditions import Key, Attr
21 |
22 |
23 | def load_log_config():
24 | # Basic config. Replace with your own logging config if required
25 | root = logging.getLogger()
26 | root.setLevel(logging.INFO)
27 | return root
28 |
29 | def load_config():
30 | with open('gluerunner-config.json', 'r') as conf_file:
31 | conf_json = conf_file.read()
32 | return json.loads(conf_json)
33 |
34 | def is_json(jsonstring):
35 | try:
36 | json_object = json.loads(jsonstring)
37 | except ValueError:
38 | return False
39 | return True
40 |
41 | def start_glue_jobs(config):
42 |
43 | ddb_table = dynamodb.Table(config["ddb_table"])
44 |
45 | logger.debug('Glue runner started')
46 |
47 | glue_job_capacity = config['glue_job_capacity']
48 | sfn_activity_arn = config['sfn_activity_arn']
49 | sfn_worker_name = config['sfn_worker_name']
50 |
51 | # Loop until no tasks are available from Step Functions
52 | while True:
53 |
54 | logger.debug('Polling for Glue tasks for Step Functions Activity ARN: {}'.format(sfn_activity_arn))
55 |
56 | try:
57 | response = sfn.get_activity_task(
58 | activityArn=sfn_activity_arn,
59 | workerName=sfn_worker_name
60 | )
61 |
62 | except Exception as e:
63 |             logger.critical(str(e))
64 | logger.critical(
65 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn))
66 | raise
67 |
68 | # Keep the new Task Token for reference later
69 | task_token = response.get('taskToken', '')
70 |
71 | # If there are no tasks, return
72 | if not task_token:
73 | logger.debug('No tasks available.')
74 | return
75 |
76 |
77 | logger.info('Received task. Task token: {}'.format(task_token))
78 |
79 | # Parse task input and create Glue job input
80 | task_input = ''
81 | try:
82 |
83 | task_input = json.loads(response['input'])
84 | task_input_dict = json.loads(task_input)
85 |
86 | glue_job_name = task_input_dict['GlueJobName']
87 | glue_job_capacity = int(task_input_dict.get('GlueJobCapacity', glue_job_capacity))
88 |
89 |     except Exception:
90 | logger.critical('Invalid Glue Runner input. Make sure required input parameters are properly specified.')
91 | raise
92 |
93 | # Extract additional Glue job inputs from task_input_dict
94 | glue_job_args = task_input_dict
95 |
96 | # Run Glue job
97 | logger.info('Running Glue job named "{}"..'.format(glue_job_name))
98 |
99 | try:
100 | response = glue.start_job_run(
101 | JobName=glue_job_name,
102 | Arguments=glue_job_args,
103 | AllocatedCapacity=glue_job_capacity
104 | )
105 |
106 | glue_job_run_id = response['JobRunId']
107 |
108 | # Store SFN 'Task Token' and Glue Job 'Run Id' in DynamoDB
109 |
110 | item = {
111 | 'sfn_activity_arn': sfn_activity_arn,
112 | 'glue_job_name': glue_job_name,
113 | 'glue_job_run_id': glue_job_run_id,
114 | 'sfn_task_token': task_token
115 | }
116 |
117 | ddb_table.put_item(Item=item)
118 |
119 |
120 | except Exception as e:
121 | logger.error('Failed to start Glue job named "{}"..'.format(glue_job_name))
122 |         logger.error('Reason: {}'.format(str(e)))
123 | logger.info('Sending "Task Failed" signal to Step Functions.')
124 |
125 | response = sfn.send_task_failure(
126 | taskToken=task_token,
127 | error='Failed to start Glue job. Check Glue Runner logs for more details.'
128 | )
129 | return
130 |
131 |
132 | logger.info('Glue job run started. Run Id: {}'.format(glue_job_run_id))
133 |
134 |
135 | def check_glue_jobs(config):
136 |
137 | # Query all items in table for a particular SFN activity ARN
138 | # This should retrieve records for all started glue jobs for this particular activity ARN
139 | ddb_table = dynamodb.Table(config['ddb_table'])
140 | sfn_activity_arn = config['sfn_activity_arn']
141 |
142 | ddb_resp = ddb_table.query(
143 | KeyConditionExpression=Key('sfn_activity_arn').eq(sfn_activity_arn),
144 | Limit=config['ddb_query_limit']
145 | )
146 |
147 | # For each item...
148 | for item in ddb_resp['Items']:
149 | glue_job_run_id = item['glue_job_run_id']
150 | glue_job_name = item['glue_job_name']
151 | sfn_task_token = item['sfn_task_token']
152 |
153 | try:
154 |
155 | logger.debug('Polling Glue job run status..')
156 |
157 | # Query glue job status...
158 | glue_resp = glue.get_job_run(
159 | JobName=glue_job_name,
160 | RunId=glue_job_run_id,
161 | PredecessorsIncluded=False
162 | )
163 |
164 | job_run_state = glue_resp['JobRun']['JobRunState']
165 | job_run_error_message = glue_resp['JobRun'].get('ErrorMessage', '')
166 |
167 | logger.debug('Job with Run Id {} is currently in state "{}"'.format(glue_job_run_id, job_run_state))
168 |
169 | # If Glue job completed, return success:
170 | if job_run_state in ['SUCCEEDED']:
171 |
172 | logger.info('Job with Run Id {} SUCCEEDED.'.format(glue_job_run_id))
173 |
174 | # Build an output dict and format it as JSON
175 | task_output_dict = {
176 | "GlueJobName": glue_job_name,
177 | "GlueJobRunId": glue_job_run_id,
178 | "GlueJobRunState": job_run_state,
179 | "GlueJobStartedOn": glue_resp['JobRun'].get('StartedOn', '').strftime('%x, %-I:%M %p %Z'),
180 | "GlueJobCompletedOn": glue_resp['JobRun'].get('CompletedOn', '').strftime('%x, %-I:%M %p %Z'),
181 | "GlueJobLastModifiedOn": glue_resp['JobRun'].get('LastModifiedOn', '').strftime('%x, %-I:%M %p %Z')
182 | }
183 |
184 | task_output_json = json.dumps(task_output_dict)
185 |
186 | logger.info('Sending "Task Succeeded" signal to Step Functions..')
187 | sfn_resp = sfn.send_task_success(
188 | taskToken=sfn_task_token,
189 | output=task_output_json
190 | )
191 |
192 | # Delete item
193 | resp = ddb_table.delete_item(
194 | Key={
195 | 'sfn_activity_arn': sfn_activity_arn,
196 | 'glue_job_run_id': glue_job_run_id
197 | }
198 | )
199 |
200 | # Task succeeded, next item
201 |
202 |             elif job_run_state in ['STARTING', 'RUNNING', 'STOPPING']:
203 | logger.debug('Job with Run Id {} hasn\'t succeeded yet.'.format(glue_job_run_id))
204 |
205 | # Send heartbeat
206 | sfn_resp = sfn.send_task_heartbeat(
207 | taskToken=sfn_task_token
208 | )
209 |
210 | logger.debug('Heartbeat sent to Step Functions.')
211 |
212 | # Heartbeat sent, next item
213 |
214 | elif job_run_state in ['FAILED', 'STOPPED']:
215 |
216 | message = 'Glue job "{}" run with Run Id "{}" failed. Last state: {}. Error message: {}' \
217 | .format(glue_job_name, glue_job_run_id[:8] + "...", job_run_state, job_run_error_message)
218 |
219 | logger.error(message)
220 |
221 | message_json = {
222 | 'glue_job_name': glue_job_name,
223 | 'glue_job_run_id': glue_job_run_id,
224 | 'glue_job_run_state': job_run_state,
225 | 'glue_job_run_error_msg': job_run_error_message
226 | }
227 |
228 | sfn_resp = sfn.send_task_failure(
229 | taskToken=sfn_task_token,
230 | cause=json.dumps(message_json),
231 | error='GlueJobFailedError'
232 | )
233 |
234 | # Delete item
235 | resp = ddb_table.delete_item(
236 | Key={
237 | 'sfn_activity_arn': sfn_activity_arn,
238 | 'glue_job_run_id': glue_job_run_id
239 | }
240 | )
241 |
242 | logger.error(message)
243 | # Task failed, next item
244 | except Exception as e:
245 | logger.error('There was a problem checking status of Glue job "{}"..'.format(glue_job_name))
246 | logger.error('Glue job Run Id "{}"'.format(glue_job_run_id))
247 |             logger.error('Reason: {}'.format(str(e)))
248 | logger.info('Checking next Glue job.')
249 |
250 |
251 | glue = boto3.client('glue')
252 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds
253 | sfn_client_config = Config(connect_timeout=50, read_timeout=70)
254 | sfn = boto3.client('stepfunctions', config=sfn_client_config)
255 | dynamodb = boto3.resource('dynamodb')
256 |
257 | # Load logging config and create logger
258 | logger = load_log_config()
259 |
260 | def handler(event, context):
261 |
262 | logger.debug('*** Glue Runner lambda function starting ***')
263 |
264 | try:
265 |
266 | # Get config (including a single activity ARN) from local file
267 | config = load_config()
268 |
269 | # One round of starting Glue jobs
270 | start_glue_jobs(config)
271 |
272 | # One round of checking on Glue jobs
273 | check_glue_jobs(config)
274 |
275 |         logger.debug('*** Glue Runner terminating ***')
276 |
277 | except Exception as e:
278 | logger.critical('*** ERROR: Glue runner lambda function failed ***')
279 |         logger.critical(str(e))
280 | raise
281 |
282 |
--------------------------------------------------------------------------------
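`check_glue_jobs` maps each Glue `JobRunState` to one of three Step Functions signals: task success, a heartbeat, or task failure. That dispatch, isolated as a pure function (our naming; a sketch of the runner's branching, not part of its code):

```python
def signal_for_job_run_state(state):
    """Which Step Functions call gluerunner.py makes for a given
    Glue JobRunState: success, heartbeat, or failure."""
    if state == 'SUCCEEDED':
        return 'send_task_success'
    if state in ('STARTING', 'RUNNING', 'STOPPING'):
        return 'send_task_heartbeat'
    if state in ('FAILED', 'STOPPED'):
        return 'send_task_failure'
    raise ValueError('Unhandled JobRunState: {}'.format(state))

print(signal_for_job_run_state('RUNNING'))  # send_task_heartbeat
```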
/lambda/gluerunner/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/gluerunner/requirements.txt
--------------------------------------------------------------------------------
/lambda/gluerunner/setup.cfg:
--------------------------------------------------------------------------------
1 | [install]
2 | prefix=
--------------------------------------------------------------------------------
/lambda/ons3objectcreated/ons3objectcreated-config.json:
--------------------------------------------------------------------------------
1 | {}
--------------------------------------------------------------------------------
/lambda/ons3objectcreated/ons3objectcreated.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import json
4 | import urllib
5 | import boto3
6 | import logging, logging.config
7 | from botocore.client import Config
8 |
9 | # Because Step Functions client uses long polling, read timeout has to be > 60 seconds
10 | sfn_client_config = Config(connect_timeout=50, read_timeout=70)
11 | sfn = boto3.client('stepfunctions', config=sfn_client_config)
12 |
13 | sts = boto3.client('sts')
14 | account_id = sts.get_caller_identity().get('Account')
15 | region_name = boto3.session.Session().region_name
16 |
17 | def load_log_config():
18 | # Basic config. Replace with your own logging config if required
19 | root = logging.getLogger()
20 | root.setLevel(logging.INFO)
21 | return root
22 |
23 | def map_activity_arn(bucket, key):
24 | # Map s3 key to activity ARN based on a convention
25 | # Here, we simply map bucket name plus last element in the s3 object key (i.e. filename) to activity name
26 | key_elements = [x.strip() for x in key.split('/')]
27 | activity_name = '{}-{}'.format(bucket, key_elements[-1])
28 |
29 | return 'arn:aws:states:{}:{}:activity:{}'.format(region_name, account_id, activity_name)
30 |
31 | # Load logging config and create logger
32 | logger = load_log_config()
33 |
34 | def handler(event, context):
35 | logger.info("Received event: " + json.dumps(event, indent=2))
36 |
37 | # Get the object from the event and show its content type
38 | bucket = event['Records'][0]['s3']['bucket']['name']
39 |     from urllib.parse import unquote_plus  # moved here from urllib in Python 3
40 |     key = unquote_plus(event['Records'][0]['s3']['object']['key'])
40 |
41 | # Based on a naming convention that maps s3 keys to activity ARNs, deduce the activity arn
42 | sfn_activity_arn = map_activity_arn(bucket, key)
43 | sfn_worker_name = 'on_s3_object_created'
44 |
45 | try:
46 | try:
47 | response = sfn.get_activity_task(
48 | activityArn=sfn_activity_arn,
49 | workerName=sfn_worker_name
50 | )
51 |
52 | except Exception as e:
53 |             logger.critical(str(e))
54 | logger.critical(
55 | 'Unrecoverable error invoking get_activity_task for {}.'.format(sfn_activity_arn))
56 | raise
57 |
58 | # Get the Task Token
59 | sfn_task_token = response.get('taskToken', '')
60 |
61 | logger.info('Sending "Task Succeeded" signal to Step Functions..')
62 |
63 | # Build an output dict and format it as JSON
64 | task_output_dict = {
65 | 'S3BucketName': bucket,
66 | 'S3Key': key,
67 | 'SFNActivityArn': sfn_activity_arn
68 | }
69 |
70 | task_output_json = json.dumps(task_output_dict)
71 |
72 | sfn_resp = sfn.send_task_success(
73 | taskToken=sfn_task_token,
74 | output=task_output_json
75 | )
76 |
77 |
78 | except Exception as e:
79 | logger.critical(e)
80 | raise e
81 |
--------------------------------------------------------------------------------
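The naming convention in `map_activity_arn` ties an uploaded S3 object to a Step Functions activity: the activity name is `<bucket>-<filename>`, where the filename is the last element of the object key. A self-contained sketch with placeholder region and account values:

```python
def map_activity_arn(bucket, key, region='us-east-1', account_id='123456789012'):
    """Bucket name plus the last element of the S3 key (the filename)
    form the activity name, per ons3objectcreated.py's convention.
    The region and account id here are placeholders."""
    filename = key.split('/')[-1].strip()
    return 'arn:aws:states:{}:{}:activity:{}-{}'.format(
        region, account_id, bucket, filename)

print(map_activity_arn('etl-bucket', 'inbound/sales/SalesPipeline.csv'))
# arn:aws:states:us-east-1:123456789012:activity:etl-bucket-SalesPipeline.csv
```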
/lambda/ons3objectcreated/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/lambda/ons3objectcreated/requirements.txt
--------------------------------------------------------------------------------
/lambda/ons3objectcreated/setup.cfg:
--------------------------------------------------------------------------------
1 | [install]
2 | prefix=
--------------------------------------------------------------------------------
/lambda/s3-deployment-descriptor.json:
--------------------------------------------------------------------------------
1 | {
2 | "athenarunner": {
3 | "ArtifactBucketName": "",
4 | "LambdaSourceS3Key": "src/athenarunner.zip"
5 | },
6 | "gluerunner": {
7 | "ArtifactBucketName": "",
8 | "LambdaSourceS3Key": "src/gluerunner.zip"
9 | },
10 | "ons3objectcreated": {
11 | "ArtifactBucketName": "",
12 | "LambdaSourceS3Key": "src/ons3objectcreated.zip"
13 | }
14 | }
--------------------------------------------------------------------------------
/resources/MarketingAndSalesETLOrchestrator-notstarted.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/MarketingAndSalesETLOrchestrator-notstarted.png
--------------------------------------------------------------------------------
/resources/MarketingAndSalesETLOrchestrator-success.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/MarketingAndSalesETLOrchestrator-success.png
--------------------------------------------------------------------------------
/resources/aws-etl-orchestrator-serverless.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/aws-etl-orchestrator-serverless.png
--------------------------------------------------------------------------------
/resources/example-etl-flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-etl-orchestrator/6477b181054ac39f5661e6f2cacb4cc885c2f7a5/resources/example-etl-flow.png
--------------------------------------------------------------------------------