├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── app.py ├── athena_query.png ├── cdk.json ├── create_lake.sh ├── glue-crawler.png ├── glue-table.png ├── glue-workflow.png ├── requirements-dev.txt ├── requirements.txt ├── serverless-datalake.drawio.png ├── serverless_datalake.zip ├── serverless_datalake ├── __init__.py ├── infrastructure │ ├── etl_stack.py │ ├── glue_job │ │ └── event-bridge-dev-etl.py │ ├── ingestion_stack.py │ ├── lambda_tester │ │ └── event_simulator.py │ └── test_stack.py └── serverless_datalake_stack.py └── source.bat /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .venv 3 | cdk.out/ 4 | *.pyc -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Serverless Datalake with Amazon EventBridge 3 | 4 | ### Architecture 5 | ![architecture](https://user-images.githubusercontent.com/25897220/210151716-5e670221-db78-4681-b711-1ef8d94ea102.png) 6 | 7 |
8 |
### Overview
This solution demonstrates how to build a serverless ETL pipeline with Amazon EventBridge, Amazon Kinesis Data Firehose, and AWS Glue.

* Amazon EventBridge is a high-throughput serverless event router used here for data ingestion. It provides out-of-the-box integration with many AWS services: custom events from heterogeneous sources can be routed to AWS services without writing any integration logic. With content-based routing, a single event bus can send events to different targets depending on whether the event content matches a rule.

* Kinesis Data Firehose buffers the events and stores them as JSON files in S3.

* AWS Glue is a serverless data integration service. Glue is used to transform the raw event data and to generate a schema with an AWS Glue crawler. The AWS Glue job runs some basic transformations on the incoming raw data, then compresses and partitions it and stores it in Parquet (columnar) format. The job runs on a schedule (every 15 minutes in the dev configuration in cdk.json). A successful job run triggers the AWS Glue crawler, which either creates a schema over the S3 data or updates the schema and other metadata, such as newly added or deleted table partitions.

* Glue Workflow
![glue-workflow](glue-workflow.png)

* Glue Crawler
![glue-crawler](glue-crawler.png)

* Datalake Table
![glue-table](glue-table.png)

* Once the table is created, we can query it through Athena (a boto3 sketch follows below) and visualize it with Amazon QuickSight.
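For illustration, a query against the crawled table can also be issued programmatically. The sketch below is not part of the sample: the database name `serverless-lake` comes from cdk.json, while the table name and the query-results bucket are assumptions — use the names you actually see in the Glue and Athena consoles.
```
import time
import boto3

athena = boto3.client("athena")

# Database name is taken from cdk.json; table name and results bucket are
# assumptions - check the Glue console for the table the crawler created.
DATABASE = "serverless-lake"
TABLE = "processed-tableoutput"                         # hypothetical crawler table name
RESULTS = "s3://<your-athena-results-bucket>/queries/"  # placeholder, replace before running

execution = athena.start_query_execution(
    QueryString=f'SELECT country, count(*) AS txns FROM "{TABLE}" GROUP BY country',
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```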
Athena Query
![athena_query](athena_query.png)

* This solution also provides a test Lambda function that pushes random events to the bus (1,000 per invocation). The lambda is part of the **test_stack.py** nested stack and should be invoked on demand.
Test lambda name: **serverless-event-simulator-dev**

This solution can be used to build a data lake for API usage tracking, state-change notifications, content-based routing, and much more.

### Setup to execute on AWS CloudShell

1. Create a new IAM user (username = LakeAdmin) with Administrator access. Enable console access too.

2. Once the user is created, head to the Security credentials tab and generate an access key/secret key pair for the new IAM user.

3. Copy the access/secret keys for the newly created IAM user to your local machine.

4. Open AWS CloudShell and configure your AWS CLI environment with the access/secret keys of the new admin user using the command below:
```
aws configure
```

5. Clone the serverless-datalake repository from aws-samples:
```
git clone https://github.com/aws-samples/serverless-datalake.git
```

6. Change into the project directory:
```
cd serverless-datalake
```

7. Run the bash script that automates the lake creation process:
```
sh create_lake.sh
```

8. After the stack is deployed successfully, test it by invoking the test lambda that is deployed as part of this stack.
Test lambda name: **serverless-event-simulator-dev**. Each invocation pushes 1,000 random transaction events to the event bus.

9. Verify that raw data is available in the S3 bucket under the prefix 'raw-data/...'.

10. Verify that the Glue job is running.

11. Once the Glue job succeeds, it triggers a Glue crawler that creates a table in the data lake.

12. After the table is created, head to Athena and query the table.

13. Create 3 IAM roles with Administrator access -> cloud-developer / cloud-analyst / cloud-data-engineer.

14. Add an inline policy to the IAM user (LakeAdmin) so that it can switch to these roles:
```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::account-id:role/cloud-*"
  }
}
```

15. Head to AWS Lake Formation and, under Tables, select the table -> Actions -> Grant, then grant privileges to the cloud-developer role.

16. Add column-level security to that grant (choose which columns the role may access).

17. View permissions for the table and revoke IAMAllowedPrincipals. https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation-background.html

18. In an incognito window, log in as the LakeAdmin user.

19. Switch roles, head to Athena, and test column-level security.

20. Back in the main window, in AWS Lake Formation create a Data Filter (a programmatic sketch follows this step):
    a. Under Row filter expression, add country='IN' (see https://docs.aws.amazon.com/lake-formation/latest/dg/data-filters-about.html).
    b. Include the columns you wish that role to view.
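For reference, the same data filter can be created with boto3 instead of the console. This is a minimal sketch, not part of the deployed sample: the database name comes from cdk.json, while the table name, filter name, and column list are assumptions to adjust for your account.
```
import boto3

lakeformation = boto3.client("lakeformation")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Database name comes from cdk.json; table name, filter name and columns are
# assumptions - match them to the table the crawler actually created.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "serverless-lake",
        "TableName": "processed-tableoutput",       # hypothetical crawler table name
        "Name": "cloud-developer-country-in",       # hypothetical filter name
        "RowFilter": {"FilterExpression": "country='IN'"},
        "ColumnNames": ["country", "state", "amount", "currency"],
    }
)
# Afterwards, grant SELECT on this data filter to the cloud-developer role
# (Lake Formation console -> Data filters -> Grant), as in the console steps above.
```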
122 | 123 | Screenshot 2023-03-12 at 3 30 51 AM 124 | 125 | 126 | 127 | 21. Head back to the incognito window and fire the select command. Confirm if RLS and CLS are correctly working for that role. 128 | 22. Configurations for dev environment are defined in cdk.json. S3 bucket name is created on the fly based on account_id and region in which the cdk is deployed 129 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os 3 | 4 | import aws_cdk as cdk 5 | 6 | from serverless_datalake.serverless_datalake_stack import ServerlessDatalakeStack 7 | 8 | account_id = os.getenv('CDK_DEFAULT_ACCOUNT') 9 | region = os.getenv('CDK_DEFAULT_REGION') 10 | 11 | app = cdk.App() 12 | env=cdk.Environment(account=account_id, region=region) 13 | # Set Context Bucket_name 14 | env_name = app.node.try_get_context('environment_name') 15 | bucket_name = f"lake-store-{env_name}-{region}-{account_id}" 16 | ServerlessDatalakeStack(app, "ServerlessDatalakeStack", bucket_name=bucket_name, 17 | # If you don't specify 'env', this stack will be environment-agnostic. 18 | # Account/Region-dependent features and context lookups will not work, 19 | # but a single synthesized template can be deployed anywhere. 20 | 21 | # Uncomment the next line to specialize this stack for the AWS Account 22 | # and Region that are implied by the current CLI configuration. 23 | env=env 24 | 25 | # Uncomment the next line if you know exactly what Account and Region you 26 | # want to deploy the stack to. */ 27 | 28 | #env=cdk.Environment(account='123456789012', region='us-east-1'), 29 | 30 | # For more information, see https://docs.aws.amazon.com/cdk/latest/guide/environments.html 31 | ) 32 | 33 | app.synth() 34 | -------------------------------------------------------------------------------- /athena_query.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/athena_query.png -------------------------------------------------------------------------------- /cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true, 19 | "@aws-cdk/core:stackRelativeExports": true, 20 | "@aws-cdk/aws-rds:lowercaseDbIdentifier": true, 21 | "@aws-cdk/aws-lambda:recognizeVersionProps": true, 22 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 23 | "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true, 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/core:checkSecretUsage": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 29 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 30 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 31 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 32 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 33 | 
"@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 34 | "@aws-cdk/core:enablePartitionLiterals": true, 35 | "@aws-cdk/core:target-partitions": [ 36 | "aws", 37 | "aws-cn" 38 | ], 39 | "dev": { 40 | "bus-name": "serverless-bus-dev", 41 | "rule-name": "custom-event-rule-dev", 42 | "event-pattern": "custom-event-pattern-dev", 43 | "firehose-name": "sample-stream-dev", 44 | "buffering-interval-in-seconds": 120, 45 | "buffering-size-in-mb": 128, 46 | "firehose-s3-rolename": "firehoses3dev", 47 | "eventbus-firehose-rolename": "eventbusfirehosedev", 48 | "glue-job-name": "serverless-etl-job-dev", 49 | "workers": 7, 50 | "worker-type": "G.1X", 51 | "glue-script-location": "glue/scripts/", 52 | "glue-script-uri": "/glue/scripts/event-bridge-dev-etl.py", 53 | "temp-location": "/glue/etl/temp/", 54 | "glue-database-location": "/glue/etl/database/", 55 | "glue-database-name": "serverless-lake", 56 | "glue-table-name": "processed-table", 57 | "glue-crawler": "serverless-lake-crawler", 58 | "glue-output": "/glue/output/", 59 | "glue-job-trigger-name": "scheduled-etl-job-trigger-dev", 60 | "glue-job-cron": "cron(0/15 * * * ? *)", 61 | "glue-crawler-trigger-name": "crawler-trigger-dev", 62 | "glue-script-local-machine": "" 63 | } 64 | } 65 | } 66 | -------------------------------------------------------------------------------- /create_lake.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Assuming we are in Serverless-datalake folder 3 | cd .. 4 | echo "--- Upgrading npm ---" 5 | sudo npm install n stable -g 6 | echo "--- Installing cdk ---" 7 | sudo npm install -g aws-cdk@2.55.1 8 | echo "--- Bootstrapping CDK on account ---" 9 | cdk bootstrap aws://$(aws sts get-caller-identity --query "Account" --output text)/us-east-1 10 | echo "--- Cloning serverless-datalake project from aws-samples ---" 11 | # assuming we have already cloned. 12 | # git clone https://github.com/aws-samples/serverless-datalake.git 13 | cd serverless-datalake 14 | echo "--- Set python virtual environment ---" 15 | python3 -m venv .venv 16 | echo "--- Activate virtual environment ---" 17 | source .venv/bin/activate 18 | echo "--- Install Requirements ---" 19 | pip install -r requirements.txt 20 | echo "--- CDK synthesize ---" 21 | cdk synth -c environment_name=dev 22 | echo "--- CDK deploy ---" 23 | # read -p "Press any key to deploy the Serverless Datalake ..." 
24 | cdk deploy -c environment_name=dev ServerlessDatalakeStack 25 | echo "Lake deployed successfully" 26 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 27 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 28 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 29 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 30 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 31 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 32 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 33 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 34 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json 35 | aws lambda invoke --function-name serverless-event-simulator-dev --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{ "name": "sample" }' response.json -------------------------------------------------------------------------------- /glue-crawler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/glue-crawler.png -------------------------------------------------------------------------------- /glue-table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/glue-table.png -------------------------------------------------------------------------------- /glue-workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/glue-workflow.png -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | pytest==6.2.5 2 | aws-cdk-lib==2.55.1 3 | aws-cdk.aws-glue-alpha==2.55.1a0 4 | constructs>=10.0.0,<11.0.0 5 | 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.55.1 2 | aws-cdk.aws-glue-alpha==2.55.1a0 3 | constructs>=10.0.0,<11.0.0 4 | -------------------------------------------------------------------------------- /serverless-datalake.drawio.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/serverless-datalake.drawio.png -------------------------------------------------------------------------------- /serverless_datalake.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/serverless_datalake.zip -------------------------------------------------------------------------------- /serverless_datalake/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/serverless-datalake/3569b8c035a36e63356fe6423c2490cf442644b1/serverless_datalake/__init__.py -------------------------------------------------------------------------------- /serverless_datalake/infrastructure/etl_stack.py: -------------------------------------------------------------------------------- 1 | from aws_cdk import ( 2 | # Duration, 3 | Stack, 4 | NestedStack, 5 | aws_glue_alpha as _glue_alpha 6 | # aws_sqs as sqs, 7 | ) 8 | import aws_cdk as _cdk 9 | from constructs import Construct 10 | import os 11 | 12 | 13 | class EtlStack(NestedStack): 14 | 15 | def __init__(self, scope: Construct, construct_id: str, bucket_name, **kwargs) -> None: 16 | super().__init__(scope, construct_id, **kwargs) 17 | env = self.node.try_get_context('environment_name') 18 | if env is None: 19 | print('Setting environment as Dev') 20 | print('Fetching Dev Properties') 21 | env = 'dev' 22 | # fetching details from cdk.json 23 | config_details = self.node.try_get_context(env) 24 | # Create a Glue Role 25 | glue_role = _cdk.aws_iam.Role(self, f'etl-role-{env}', assumed_by=_cdk.aws_iam.ServicePrincipal(service='glue.amazonaws.com'), 26 | role_name=f'etlrole{env}', managed_policies=[ 27 | _cdk.aws_iam.ManagedPolicy.from_managed_policy_arn(self, 's3-full-access-etl', 28 | managed_policy_arn='arn:aws:iam::aws:policy/AmazonS3FullAccess'), 29 | _cdk.aws_iam.ManagedPolicy.from_managed_policy_arn(self, 'comprehend-full-access-etl', 30 | managed_policy_arn='arn:aws:iam::aws:policy/ComprehendFullAccess'), 31 | _cdk.aws_iam.ManagedPolicy.from_aws_managed_policy_name( 32 | "service-role/AWSGlueServiceRole") 33 | ] 34 | ) 35 | 36 | s3_bucket = _cdk.aws_s3.Bucket.from_bucket_name( 37 | self, f's3_lake_{env}', bucket_name=bucket_name) 38 | 39 | # Upload Glue Script on S3 40 | _cdk.aws_s3_deployment.BucketDeployment(self, f"glue_scripts_deploy_{env}", 41 | sources=[_cdk.aws_s3_deployment.Source 42 | .asset(os.path.join(os.getcwd(), 'serverless_datalake/infrastructure/glue_job/'))], 43 | destination_bucket=s3_bucket, 44 | destination_key_prefix=config_details[ 45 | "glue-script-location"] 46 | ) 47 | workflow = _cdk.aws_glue.CfnWorkflow(self, f'workflow-{env}', name=f'serverless-etl-{env}', 48 | description='sample etl with eventbridge ingestion', max_concurrent_runs=1) 49 | 50 | # Create a Glue job and run on schedule 51 | glue_job = _cdk.aws_glue.CfnJob( 52 | self, f'serverless-etl-{env}', 53 | name=config_details['glue-job-name'], 54 | number_of_workers=config_details['workers'], 55 | worker_type=config_details['worker-type'], 56 | command=_cdk.aws_glue.CfnJob.JobCommandProperty(name='glueetl', 57 | script_location=f"s3://{bucket_name}{config_details['glue-script-uri']}", 58 | ), 59 | role=glue_role.role_arn, 60 | glue_version='3.0', 61 | 
execution_property=_cdk.aws_glue.CfnJob.ExecutionPropertyProperty( 62 | max_concurrent_runs=1), 63 | description='Serverless etl processing raw data from event-bus', 64 | default_arguments={ 65 | "--enable-metrics": "", 66 | "--enable-job-insights": "true", 67 | '--TempDir': f"s3://{bucket_name}{config_details['temp-location']}", 68 | '--job-bookmark-option': 'job-bookmark-enable', 69 | '--s3_input_location': f's3://{bucket_name}/raw-data/', 70 | '--s3_output_location': f"s3://{bucket_name}{config_details['glue-output']}" 71 | } 72 | ) 73 | 74 | 75 | # Create a Glue Database 76 | glue_database = _cdk.aws_glue.CfnDatabase(self, f'glue-database-{env}', 77 | catalog_id=self.account, database_input=_cdk.aws_glue.CfnDatabase 78 | .DatabaseInputProperty( 79 | location_uri=f"s3://{bucket_name}{config_details['glue-database-location']}", 80 | description='store processed data', name=config_details['glue-database-name'])) 81 | 82 | # Create a Glue Crawler 83 | glue_crawler = _cdk.aws_glue.CfnCrawler(self, f'glue-crawler-{env}', 84 | role=glue_role.role_arn, name=config_details['glue-crawler'], 85 | database_name=config_details['glue-database-name'], 86 | targets=_cdk.aws_glue.CfnCrawler.TargetsProperty(s3_targets=[ 87 | _cdk.aws_glue.CfnCrawler.S3TargetProperty( 88 | path=f"s3://{bucket_name}{config_details['glue-output']}") 89 | ]), table_prefix=config_details['glue-table-name'], 90 | description='Crawl over processed parquet data. This optimizes query cost' 91 | ) 92 | 93 | # Create a Glue cron Trigger ̰ 94 | job_trigger = _cdk.aws_glue.CfnTrigger(self, f'glue-job-trigger-{env}', 95 | name=config_details['glue-job-trigger-name'], 96 | actions=[_cdk.aws_glue.CfnTrigger.ActionProperty(job_name=glue_job.name)], 97 | workflow_name=workflow.name, 98 | type='SCHEDULED', 99 | start_on_creation=True, 100 | schedule=config_details['glue-job-cron']) 101 | 102 | job_trigger.add_depends_on(glue_job) 103 | job_trigger.add_depends_on(glue_crawler) 104 | job_trigger.add_depends_on(glue_database) 105 | 106 | # Create a Glue conditional trigger that triggers the crawler on successful job completion 107 | crawler_trigger = _cdk.aws_glue.CfnTrigger(self, f'glue-crawler-trigger-{env}', 108 | name=config_details['glue-crawler-trigger-name'], 109 | actions=[_cdk.aws_glue.CfnTrigger.ActionProperty( 110 | crawler_name=glue_crawler.name)], 111 | workflow_name=workflow.name, 112 | type='CONDITIONAL', 113 | predicate=_cdk.aws_glue.CfnTrigger.PredicateProperty( 114 | conditions=[_cdk.aws_glue.CfnTrigger.ConditionProperty( 115 | job_name=glue_job.name, 116 | logical_operator='EQUALS', 117 | state="SUCCEEDED" 118 | )] 119 | ), 120 | start_on_creation=True 121 | ) 122 | 123 | crawler_trigger.add_depends_on(glue_job) 124 | crawler_trigger.add_depends_on(glue_crawler) 125 | crawler_trigger.add_depends_on(glue_database) 126 | 127 | 128 | 129 | 130 | #TODO: Create an Athena Workgroup 131 | 132 | -------------------------------------------------------------------------------- /serverless_datalake/infrastructure/glue_job/event-bridge-dev-etl.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import sys 3 | from awsglue.transforms import * 4 | from awsglue.utils import getResolvedOptions 5 | from pyspark.context import SparkContext 6 | from awsglue.context import GlueContext 7 | from awsglue.job import Job 8 | from awsglue.dynamicframe import DynamicFrame 9 | from datetime import datetime, timedelta 10 | from pyspark.sql.functions import to_json, struct, substring, from_json, 
lit, col, when, concat, concat_ws 11 | from pyspark.sql.types import StringType 12 | import hashlib, uuid 13 | 14 | 15 | def get_masked_entities(): 16 | """ 17 | return a list of entities to be masked. 18 | If unstructured, use comprehend to determine entities to be masked 19 | """ 20 | return ["name", "credit_card", "account", "city"] 21 | 22 | def get_encrypted_entities(): 23 | """ 24 | return a list of entities to be masked. 25 | """ 26 | return ["name", "credit_card"] 27 | 28 | 29 | def get_region_name(): 30 | global my_region 31 | my_session = boto3.session.Session() 32 | return my_session.region_name 33 | 34 | 35 | def detect_sensitive_info(r): 36 | """ 37 | return a tuple after masking is complete. 38 | If unstructured, use comprehend to determine entities to be masked 39 | """ 40 | 41 | metadata = r['trnx_msg'] 42 | try: 43 | for entity in get_masked_entities(): 44 | entity_masked = entity + "_masked" 45 | r[entity_masked] = "#######################" 46 | except: 47 | print ("DEBUG:",sys.exc_info()) 48 | 49 | ''' Uncomment to mask unstructured text through Amazon Comprehend ''' 50 | ''' Can result in extendend Glue Job times''' 51 | # client_pii = boto3.client('comprehend', region_name=get_region_name()) 52 | 53 | # try: 54 | # response = client_pii.detect_pii_entities( 55 | # Text = metadata, 56 | # LanguageCode = 'en' 57 | # ) 58 | # clean_text = metadata 59 | # # reversed to not modify the offsets of other entities when substituting 60 | # for NER in reversed(response['Entities']): 61 | # clean_text = clean_text[:NER['BeginOffset']] + NER['Type'] + clean_text[NER['EndOffset']:] 62 | # print(clean_text) 63 | # r['trnx_msg_masked'] = clean_text 64 | # except: 65 | # print ("DEBUG:",sys.exc_info()) 66 | 67 | return r 68 | 69 | 70 | def encrypt_rows(r): 71 | """ 72 | return tuple with encrypted string 73 | Hardcoding salted string. PLease feel free to use SSM and KMS. 74 | """ 75 | salted_string = 'glue_crypt' 76 | encrypted_entities = get_encrypted_entities() 77 | print ("encrypt_rows", salted_string, encrypted_entities) 78 | try: 79 | for entity in encrypted_entities: 80 | salted_entity = r[entity] + salted_string 81 | hashkey = hashlib.sha3_256(salted_entity.encode()).hexdigest() 82 | r[entity + '_encrypted'] = hashkey 83 | except: 84 | print ("DEBUG:",sys.exc_info()) 85 | return r 86 | 87 | 88 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_output_location', 's3_input_location']) 89 | sc = SparkContext() 90 | glueContext = GlueContext(sc) 91 | spark = glueContext.spark_session 92 | job = Job(glueContext) 93 | job.init(args['JOB_NAME'], args) 94 | logger = glueContext.get_logger() 95 | 96 | logger.info('\n -- Glue ETL Job begins -- ') 97 | 98 | s3_input_path = args["s3_input_location"] 99 | s3_output_path = args["s3_output_location"] 100 | 101 | now = datetime.utcnow() 102 | current_date = now.date() 103 | # Go back 5 days in the past. With Glue Bookmarking enabled only newer files are processed 104 | from_date = current_date - timedelta(days=5) 105 | delta = timedelta(days=1) 106 | bucket_path = [] 107 | while from_date <= current_date: 108 | # S3 Paths to process. 
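    # Only the last few daily prefixes are listed here; because job bookmarks are
    # enabled ('--job-bookmark-option': 'job-bookmark-enable' in the job arguments),
    # files already processed under these prefixes are skipped on the next run.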
109 | if len(bucket_path) >= 10: 110 | logger.info(f'\n Backtrack complete: Found all paths up to {from_date} ') 111 | break 112 | lastprocessedtimestamp = str(from_date) 113 | path = s3_input_path + 'year=' + str(from_date.year) + '/month=' + str(from_date.month).zfill( 114 | 2) + '/day=' + str(from_date.day).zfill(2) 115 | logger.info(f'\n AWS Glue will check for newer unprocessed files in path : {path}') 116 | bucket_path.append(path) 117 | from_date += delta 118 | 119 | 120 | 121 | if len(bucket_path) > 0: 122 | # Read S3 data as a Glue Dynamic Frame 123 | datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", 124 | connection_options={'paths': bucket_path, 125 | 'groupFiles': 'inPartition'}, 126 | format="json", transformation_ctx="dtx") 127 | if datasource0 and datasource0.count() > 0: 128 | 129 | logger.info('\n -- Printing datasource schema --') 130 | logger.info(datasource0.printSchema()) 131 | 132 | # Unnests json data into a flat dataframe. 133 | # This glue transform converts a nested json into a flattened Glue DynamicFrame 134 | logger.info('\n -- Unnest dynamic frame --') 135 | unnested_dyf = UnnestFrame.apply(frame=datasource0, transformation_ctx='unnest') 136 | 137 | # Convert to a Spark DataFrame 138 | sparkDF = unnested_dyf.toDF() 139 | 140 | logger.info('\n -- Add new Columns to Spark data frame --') 141 | # Generate Year/Month/day columns from the time field 142 | sparkDF = sparkDF.withColumn('year', substring('time', 1, 4)) 143 | sparkDF = sparkDF.withColumn('month', substring('time', 6, 2)) 144 | sparkDF = sparkDF.withColumn('day', substring('time', 9, 2)) 145 | 146 | # Rename fields 147 | sparkDF = sparkDF.withColumnRenamed("detail.amount.value", "amount") 148 | sparkDF = sparkDF.withColumnRenamed("detail.amount.currency", "currency") 149 | sparkDF = sparkDF.withColumnRenamed("detail.location.country", "country") 150 | sparkDF = sparkDF.withColumnRenamed("detail.location.state", "state") 151 | sparkDF = sparkDF.withColumnRenamed("detail.location.city", "city") 152 | sparkDF = sparkDF.withColumnRenamed("detail.credit_card", "credit_card_number") 153 | sparkDF = sparkDF.withColumnRenamed("detail.transaction_message", "trnx_msg") 154 | 155 | # Concat firstName and LastName columns 156 | sparkDF = sparkDF.withColumn('name', concat_ws(' ', sparkDF["`detail.firstName`"], sparkDF["`detail.lastName`"])) 157 | 158 | dt_cols = sparkDF.columns 159 | # Drop detail.* columns 160 | for col_dt in dt_cols: 161 | if '.' 
in col_dt: 162 | logger.info(f'Dropping Column {col_dt}') 163 | sparkDF = sparkDF.drop(col_dt) 164 | 165 | 166 | transformed_dyf = DynamicFrame.fromDF(sparkDF, glueContext, "trnx") 167 | 168 | # Secure the lake through masking and encryption 169 | # Approach 1: Mask PII data 170 | masked_dyf = Map.apply(frame = transformed_dyf, f = detect_sensitive_info) 171 | masked_dyf.show() 172 | 173 | # Approach 2: Encrypting PII data 174 | # Apply encryption to the identified fields 175 | encrypted_dyf = Map.apply(frame = masked_dyf, f = encrypt_rows) 176 | 177 | 178 | # This Glue Transform drops null fields/columns if present in the dataset 179 | dropnullfields = DropNullFields.apply(frame=encrypted_dyf, transformation_ctx="dropnullfields") 180 | # This function repartitions the dataset into exactly two files however Coalesce is preferred 181 | # dropnullfields = dropnullfields.repartition(2) 182 | 183 | # logger.info('\n -- Printing schema after ETL --') 184 | # logger.info(dropnullfields.printSchema()) 185 | 186 | logger.info('\n -- Storing data in snappy compressed, partitioned, parquet format for optimal query performance --') 187 | # Partition the DynamicFrame by Year/Month/Day. Compression type is snappy. Format is Parquet 188 | datasink4 = glueContext.write_dynamic_frame.from_options(frame=dropnullfields, connection_type="s3", 189 | connection_options={"path": s3_output_path, 190 | "partitionKeys": ['year', 'month', 191 | 'day']}, 192 | format="parquet", transformation_ctx="s3sink") 193 | 194 | 195 | logger.info('Sample Glue Job complete') 196 | # Necessary for Glue Job Bookmarking 197 | job.commit() 198 | -------------------------------------------------------------------------------- /serverless_datalake/infrastructure/ingestion_stack.py: -------------------------------------------------------------------------------- 1 | from aws_cdk import ( 2 | # Duration, 3 | Stack, 4 | NestedStack, 5 | aws_glue_alpha as _glue_alpha 6 | # aws_sqs as sqs, 7 | ) 8 | import aws_cdk as _cdk 9 | from constructs import Construct 10 | import os 11 | from .etl_stack import EtlStack 12 | 13 | 14 | class IngestionStack(NestedStack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, bucket_name, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | env = self.node.try_get_context('environment_name') 19 | if env is None: 20 | print('Setting environment as Dev') 21 | print('Fetching Dev Properties') 22 | env = 'dev' 23 | # fetching details from cdk.json 24 | config_details = self.node.try_get_context(env) 25 | # Create an EventBus 26 | evt_bus = _cdk.aws_events.EventBus( 27 | self, f'bus-{env}', event_bus_name=config_details['bus-name']) 28 | 29 | # Create a rule that triggers on a Custom Bus event and sends data to Firehose 30 | # Define a rule that captures only card-events as an example 31 | custom_rule = _cdk.aws_events.Rule(self, f'bus-rule-{env}', rule_name=config_details['rule-name'], 32 | description="Match a custom-event type", event_bus=evt_bus, 33 | event_pattern=_cdk.aws_events.EventPattern(source=['transactions'], detail_type=['card-event'])) 34 | 35 | # Create Kinesis Firehose as an Event Target 36 | # Step 1: Create a Firehose role 37 | firehose_to_s3_role = _cdk.aws_iam.Role(self, f'firehose_s3_role_{env}', assumed_by=_cdk.aws_iam.ServicePrincipal( 38 | 'firehose.amazonaws.com'), role_name=config_details['firehose-s3-rolename']) 39 | 40 | # Step 2: Create an S3 bucket as a Firehose target 41 | s3_bucket = _cdk.aws_s3.Bucket( 42 | self, f'evtbus_s3_{env}', 
bucket_name=bucket_name) 43 | s3_bucket.grant_read_write(firehose_to_s3_role) 44 | 45 | # Step 3: Create a Firehose Delivery Stream 46 | firehose_stream = _cdk.aws_kinesisfirehose.CfnDeliveryStream(self, f'firehose_stream_{env}', 47 | delivery_stream_name=config_details['firehose-name'], 48 | delivery_stream_type='DirectPut', 49 | s3_destination_configuration=_cdk.aws_kinesisfirehose 50 | .CfnDeliveryStream.S3DestinationConfigurationProperty( 51 | bucket_arn=s3_bucket.bucket_arn, 52 | compression_format='GZIP', 53 | role_arn=firehose_to_s3_role.role_arn, 54 | buffering_hints=_cdk.aws_kinesisfirehose.CfnDeliveryStream.BufferingHintsProperty( 55 | interval_in_seconds=config_details['buffering-interval-in-seconds'], 56 | size_in_m_bs=config_details['buffering-size-in-mb'] 57 | ), 58 | prefix='raw-data/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/', 59 | error_output_prefix='error/!{firehose:random-string}/!{firehose:error-output-type}/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/' 60 | ) 61 | ) 62 | 63 | # Create an IAM role that allows EventBus to communicate with Kinesis Firehose 64 | evt_firehose_role = _cdk.aws_iam.Role(self, f'evt_bus_to_firehose_{env}', 65 | assumed_by=_cdk.aws_iam.ServicePrincipal('events.amazonaws.com'), 66 | role_name=config_details['eventbus-firehose-rolename']) 67 | 68 | evt_firehose_role.add_to_policy(_cdk.aws_iam.PolicyStatement( 69 | resources=[firehose_stream.attr_arn], 70 | actions=['firehose:PutRecord', 'firehose:PutRecordBatch'] 71 | )) 72 | 73 | # Link EventBus/ EventRule/ Firehose Target 74 | custom_rule.add_target(_cdk.aws_events_targets.KinesisFirehoseStream(firehose_stream)) 75 | -------------------------------------------------------------------------------- /serverless_datalake/infrastructure/lambda_tester/event_simulator.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | import datetime 4 | import random 5 | 6 | client = boto3.client('events') 7 | 8 | 9 | def handler(event, context): 10 | 11 | print(f'Event Emitter Sample') 12 | 13 | currencies = ['dollar', 'rupee', 'pound', 'rial'] 14 | locations = ['US-TX-alto road, T4 serein', 'US-FL-palo road, lake view', 'IN-MH-cira street, Sector 17 Vashi', 'IN-GA-MG Road, Sector 25 Navi', 'IN-AP-SB Road, Sector 10 Mokl'] 15 | # First name , Last name, Credit card number 16 | names = ['Adam-Oldham-4024007175687564', 'William-Wong-4250653376577248', 'Karma-Chako-4532695203170069', 'Fraser-Sequeira-376442558724183', 'Prasad-Vedhantham-340657673453698', 'Preeti-Mathias-5247358584639920', 'David-Valles-5409458579753902', 'Nathan-S-374420227894977', 'Sanjay-C-374549020453175', 'Vikas-K-3661894701348823'] 17 | sample_json = { 18 | "amount": { 19 | "value": 50, 20 | "currency": "dollar" 21 | }, 22 | "location": { 23 | "country": "US", 24 | "state": "TX", 25 | }, 26 | "timestamp": "2022-12-31T00:00:00.000Z", 27 | "firstName": "Rav", 28 | "lastName": "G" 29 | } 30 | 31 | for i in range(0, 1000): 32 | sample_json["amount"]["value"] = random.randint(10, 5000) 33 | sample_json["amount"]["currency"] = random.choice(currencies) 34 | location = random.choice(locations).split('-') 35 | sample_json["location"]["country"] = location[0] 36 | sample_json["location"]["state"] = location[1] 37 | sample_json["location"]["city"] = location[2] 38 | name = random.choice(names).split('-') 39 | sample_json["firstName"] = name[0] 40 | sample_json["lastName"] = name[1] 41 | sample_json["credit_card"] = name[2] 42 | 
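        # Build a free-text transaction message that embeds PII; the commented-out
        # Comprehend PII-detection block in the Glue job operates on this field.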
sample_json["transaction_message"] = name[0] + ' with credit card number ' + name[2] + ' made a purchase of ' + sample_json["amount"]["currency"] + '. Residing at ' + location[2] + ',' + location[1] + ', ' + location[0] + '.' 43 | sample_json["timestamp"] = datetime.datetime.utcnow().isoformat()[ 44 | :-3] + 'Z' 45 | 46 | response = client.put_events( 47 | Entries=[ 48 | { 49 | 'Time': datetime.datetime.now(), 50 | 'Source': 'transactions', 51 | 'DetailType': 'card-event', 52 | 'Detail': json.dumps(sample_json), 53 | 'EventBusName': 'serverless-bus-dev' 54 | }, 55 | ] 56 | ) 57 | 58 | #print(response) 59 | print('Simulation Complete. Events should be visible in S3 after 2(configured Firehose Buffer time) minutes') 60 | -------------------------------------------------------------------------------- /serverless_datalake/infrastructure/test_stack.py: -------------------------------------------------------------------------------- 1 | from aws_cdk import ( 2 | # Duration, 3 | Stack, 4 | NestedStack 5 | # aws_sqs as sqs, 6 | ) 7 | import aws_cdk as _cdk 8 | from constructs import Construct 9 | import os 10 | 11 | 12 | class TestStack(NestedStack): 13 | 14 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 15 | super().__init__(scope, construct_id, **kwargs) 16 | env = self.node.try_get_context('environment_name') 17 | 18 | # Define an On-demand Lambda function to randomly push data to the event bus 19 | function = _cdk.aws_lambda.Function(self, f'test_function_{env}', function_name=f'serverless-event-simulator-{env}', 20 | runtime=_cdk.aws_lambda.Runtime.PYTHON_3_9, memory_size=512, handler='event_simulator.handler', 21 | timeout= _cdk.Duration.minutes(10), 22 | code= _cdk.aws_lambda.Code.from_asset(os.path.join(os.getcwd(), 'serverless_datalake/infrastructure/lambda_tester/') )) 23 | 24 | lambda_policy = _cdk.aws_iam.PolicyStatement(actions=[ 25 | "events:*", 26 | ], resources=["*"]) 27 | 28 | function.add_to_role_policy(lambda_policy) 29 | 30 | -------------------------------------------------------------------------------- /serverless_datalake/serverless_datalake_stack.py: -------------------------------------------------------------------------------- 1 | from aws_cdk import ( 2 | # Duration, 3 | Stack 4 | # aws_sqs as sqs, 5 | ) 6 | import aws_cdk as _cdk 7 | import os 8 | from constructs import Construct 9 | from .infrastructure.ingestion_stack import IngestionStack 10 | from .infrastructure.etl_stack import EtlStack 11 | from .infrastructure.test_stack import TestStack 12 | 13 | 14 | class ServerlessDatalakeStack(Stack): 15 | 16 | def tag_my_stack(self, stack): 17 | tags = _cdk.Tags.of(stack) 18 | tags.add("project", "serverless-streaming-etl-sample") 19 | 20 | def __init__(self, scope: Construct, construct_id: str, bucket_name, **kwargs) -> None: 21 | super().__init__(scope, construct_id, **kwargs) 22 | env_name = self.node.try_get_context('environment_name') 23 | if env_name is None: 24 | print('Setting environment to dev as environment_name is not passed during synthesis') 25 | env_name = 'dev' 26 | account_id = os.getenv('CDK_DEFAULT_ACCOUNT') 27 | region = os.getenv('CDK_DEFAULT_REGION') 28 | env = _cdk.Environment(account=account_id, region=region) 29 | ingestion_bus_stack = IngestionStack(self, f'ingestion-bus-stack-{env_name}', bucket_name=bucket_name) 30 | etl_serverless_stack = EtlStack(self, f'etl-serverless-stack-{env}', bucket_name=bucket_name) 31 | test_stack = TestStack(self, f'test_serverless_datalake-{env}') 32 | 
self.tag_my_stack(ingestion_bus_stack) 33 | self.tag_my_stack(etl_serverless_stack) 34 | self.tag_my_stack(test_stack) 35 | etl_serverless_stack.add_dependency(ingestion_bus_stack) 36 | test_stack.add_dependency(etl_serverless_stack) 37 | -------------------------------------------------------------------------------- /source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | --------------------------------------------------------------------------------