├── from-kinesis ├── requirements-dev.txt ├── requirements.txt ├── cdk_stacks │ ├── __init__.py │ ├── kds.py │ ├── vpc.py │ └── redshift_serverless.py ├── .gitignore ├── redshift-query-editor-v2-connection.png ├── cdk.context.json ├── source.bat ├── app.py ├── cdk.json ├── src │ └── main │ │ └── python │ │ └── gen_fake_kinesis_stream_data.py ├── README.md └── redshift_streaming_from_kds.svg ├── from-msk ├── requirements.txt ├── requirements-dev.txt ├── redshift-query-editor-v2-connection.png ├── .gitignore ├── cdk_stacks │ ├── __init__.py │ ├── vpc.py │ ├── redshift_serverless.py │ ├── msk.py │ └── kafka_client_ec2.py ├── cdk.context.json ├── source.bat ├── app.py ├── cdk.json ├── src │ └── main │ │ └── python │ │ └── gen_fake_kafka_data.py ├── README.md └── redshift_streaming_from_msk.svg ├── from-msk-serverless ├── requirements-dev.txt ├── cdk.context.json ├── requirements.txt ├── .gitignore ├── redshift-query-editor-v2-connection.png ├── cdk_stacks │ ├── __init__.py │ ├── vpc.py │ ├── msk_serverless.py │ ├── redshift_serverless.py │ └── kafka_client_ec2.py ├── source.bat ├── app.py ├── cdk.json ├── src │ └── main │ │ └── python │ │ └── gen_fake_data.py ├── README.md └── redshift_streaming_from_msk_serverless.svg ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── CONTRIBUTING.md └── redshift-streaming-ingestion-arch.svg /from-kinesis/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | boto3==1.26.32 2 | mimesis==7.0.0 3 | -------------------------------------------------------------------------------- /from-msk/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.56.0 2 | constructs>=10.0.0,<11.0.0 3 | -------------------------------------------------------------------------------- /from-kinesis/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.56.0 2 | constructs>=10.0.0,<11.0.0 3 | -------------------------------------------------------------------------------- /from-msk-serverless/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | mimesis==4.1.3 # The last to support Python 3.6 and 3.7 2 | -------------------------------------------------------------------------------- /from-msk/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | kafka-python==2.0.2 2 | mimesis==4.1.3 # The last to support Python 3.6 and 3.7 3 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk.context.json: -------------------------------------------------------------------------------- 1 | { 2 | "vpc_name": "default", 3 | "msk_cluster_name": "evdata-serverless" 4 | } 5 | -------------------------------------------------------------------------------- /from-msk-serverless/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.56.0 2 | constructs>=10.0.0,<11.0.0 3 | boto3==1.26.34 4 | botocore==1.29.34 5 | -------------------------------------------------------------------------------- /from-kinesis/cdk_stacks/__init__.py: -------------------------------------------------------------------------------- 1 | from .kds import KdsStack 2 | from .vpc import VpcStack 3 | from .redshift_serverless import RedshiftServerlessStack 4 | -------------------------------------------------------------------------------- 
/from-msk/redshift-query-editor-v2-connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/redshift-streaming-ingestion-patterns/HEAD/from-msk/redshift-query-editor-v2-connection.png -------------------------------------------------------------------------------- /from-kinesis/.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | package-lock.json 3 | __pycache__ 4 | .pytest_cache 5 | .venv 6 | *.egg-info 7 | 8 | # CDK asset staging directory 9 | .cdk.staging 10 | cdk.out 11 | -------------------------------------------------------------------------------- /from-kinesis/redshift-query-editor-v2-connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/redshift-streaming-ingestion-patterns/HEAD/from-kinesis/redshift-query-editor-v2-connection.png -------------------------------------------------------------------------------- /from-msk/.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | package-lock.json 3 | __pycache__ 4 | .pytest_cache 5 | .venv 6 | *.egg-info 7 | 8 | # CDK asset staging directory 9 | .cdk.staging 10 | cdk.out 11 | -------------------------------------------------------------------------------- /from-msk-serverless/.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | package-lock.json 3 | __pycache__ 4 | .pytest_cache 5 | .venv 6 | *.egg-info 7 | 8 | # CDK asset staging directory 9 | .cdk.staging 10 | cdk.out 11 | -------------------------------------------------------------------------------- /from-msk-serverless/redshift-query-editor-v2-connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/redshift-streaming-ingestion-patterns/HEAD/from-msk-serverless/redshift-query-editor-v2-connection.png -------------------------------------------------------------------------------- /from-msk/cdk_stacks/__init__.py: -------------------------------------------------------------------------------- 1 | from .vpc import VpcStack 2 | from .msk import MskStack 3 | from .redshift_serverless import RedshiftServerlessStack 4 | from .kafka_client_ec2 import KafkaClientEC2InstanceStack 5 | -------------------------------------------------------------------------------- /from-kinesis/cdk.context.json: -------------------------------------------------------------------------------- 1 | { 2 | "db_name": "dev", 3 | "namespace": "rss-streaming-from-kds-ns", 4 | "workgroup": "rss-streaming-from-kds-wg", 5 | "vpc_name": "default", 6 | "kinesis_stream_name": "ev_stream_data" 7 | } 8 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk_stacks/__init__.py: -------------------------------------------------------------------------------- 1 | from .vpc import VpcStack 2 | from .msk_serverless import MskServerlessStack 3 | from .redshift_serverless import RedshiftServerlessStack 4 | from .kafka_client_ec2 import KafkaClientEC2InstanceStack 5 | -------------------------------------------------------------------------------- /from-msk/cdk.context.json: -------------------------------------------------------------------------------- 1 | { 2 | "vpc_name": "default", 3 | "msk": { 4 | "cluster_name": "evdata", 5 | "kafka_version": "2.8.1", 6 | 
"broker_instance_type": "kafka.m5.large", 7 | "number_of_broker_nodes": 3, 8 | "broker_ebs_volume_size": 100 9 | } 10 | } 11 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /from-msk/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /from-kinesis/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /from-msk-serverless/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 
11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /from-kinesis/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | 7 | import aws_cdk as cdk 8 | 9 | from cdk_stacks import ( 10 | KdsStack, 11 | VpcStack, 12 | RedshiftServerlessStack 13 | ) 14 | 15 | AWS_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), 16 | region=os.getenv('CDK_DEFAULT_REGION')) 17 | 18 | app = cdk.App() 19 | 20 | vpc_stack = VpcStack(app, 'KdsToRedshiftVpc', 21 | env=AWS_ENV) 22 | kds_stack = KdsStack(app, 'KdsToRedshift') 23 | redshift_stack = RedshiftServerlessStack(app, 'RedshiftStreamingIngestionFromKDS', 24 | vpc_stack.vpc, 25 | env=AWS_ENV) 26 | redshift_stack.add_dependency(vpc_stack) 27 | 28 | app.synth() 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /from-msk/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | 7 | import aws_cdk as cdk 8 | 9 | from cdk_stacks import ( 10 | VpcStack, 11 | MskStack, 12 | KafkaClientEC2InstanceStack, 13 | RedshiftServerlessStack 14 | ) 15 | 16 | AWS_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), 17 | region=os.getenv('CDK_DEFAULT_REGION')) 18 | 19 | app = cdk.App() 20 | 21 | vpc_stack = VpcStack(app, 'MskToRedshiftVpc', 22 | env=AWS_ENV) 23 | 24 | msk_stack = MskStack(app, 'MskToRedshiftServerless', 25 | vpc_stack.vpc, 26 | env=AWS_ENV 27 | ) 28 | msk_stack.add_dependency(vpc_stack) 29 | 30 | redshift_stack = RedshiftServerlessStack(app, 'RedshiftStreamingIngestionFromMSK', 31 | vpc_stack.vpc, 32 | msk_stack.sg_msk_client, 33 | env=AWS_ENV 34 | ) 35 | redshift_stack.add_dependency(msk_stack) 36 | 37 | kafka_client_ec2_stack = KafkaClientEC2InstanceStack(app, 'KafkaClientEC2InstanceStack', 38 | vpc_stack.vpc, 39 | msk_stack.sg_msk_client, 40 | msk_stack.msk_cluster_name, 41 | env=AWS_ENV 42 | ) 43 | kafka_client_ec2_stack.add_dependency(msk_stack) 44 | 45 | app.synth() 46 | 47 | -------------------------------------------------------------------------------- /from-msk-serverless/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | 7 | import aws_cdk as cdk 8 | 9 | from cdk_stacks import ( 10 | VpcStack, 11 | MskServerlessStack, 12 | KafkaClientEC2InstanceStack, 13 | RedshiftServerlessStack 14 | ) 15 | 16 | AWS_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), 17 | region=os.getenv('CDK_DEFAULT_REGION')) 18 | 19 | app = cdk.App() 20 | 21 | vpc_stack = VpcStack(app, 'MSKServerlessToRedshiftVpc', 22 | env=AWS_ENV) 23 | 24 | msk_stack = MskServerlessStack(app, 'MSKServerlessToRedshiftServerless', 25 | vpc_stack.vpc, 26 | env=AWS_ENV 27 | ) 28 | msk_stack.add_dependency(vpc_stack) 29 | 30 | redshift_stack = RedshiftServerlessStack(app, 'RedshiftStreamingFromMSKServerless', 31 | vpc_stack.vpc, 32 | msk_stack.sg_msk_client, 33 | env=AWS_ENV 34 | ) 35 | redshift_stack.add_dependency(msk_stack) 36 | 37 | kafka_client_ec2_stack = KafkaClientEC2InstanceStack(app, 'MSKServerlessClientEC2Stack', 38 | vpc_stack.vpc, 39 | msk_stack.sg_msk_client, 40 | msk_stack.msk_cluster_name, 41 | env=AWS_ENV 42 | ) 43 | kafka_client_ec2_stack.add_dependency(msk_stack) 44 | 45 | app.synth() 46 | 47 | -------------------------------------------------------------------------------- /from-kinesis/cdk_stacks/kds.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import random 6 | import string 7 | 8 | import aws_cdk as cdk 9 | 10 | from aws_cdk import ( 11 | Duration, 12 | Stack, 13 | aws_kinesis, 14 | ) 15 | from constructs import Construct 16 | 17 | random.seed(31) 18 | 19 | 20 | class KdsStack(Stack): 21 | 22 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 23 | super().__init__(scope, construct_id, **kwargs) 24 | 25 | KINESIS_DEFAULT_STREAM_NAME = 
'PUT-Redshift-{}'.format(''.join(random.sample((string.ascii_letters), k=5))) 26 | KINESIS_STREAM_NAME = self.node.try_get_context('kinesis_stream_name') or KINESIS_DEFAULT_STREAM_NAME 27 | 28 | source_kinesis_stream = aws_kinesis.Stream(self, "SourceKinesisStreams", 29 | retention_period=Duration.hours(24), 30 | stream_mode=aws_kinesis.StreamMode.ON_DEMAND, 31 | stream_name=KINESIS_STREAM_NAME) 32 | 33 | self.source_kinesis_stream = source_kinesis_stream 34 | 35 | cdk.CfnOutput(self, f'{self.stack_name}-KinesisDataStreamName', 36 | value=self.source_kinesis_stream.stream_name, export_name=f'{self.stack_name}-KinesisDataStreamName') 37 | -------------------------------------------------------------------------------- /from-msk/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /from-kinesis/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | 
"@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Amazon Redshift - Getting started with Streaming ingestion 2 | 3 | This is a collecton of CDK projects to show how to load data from streaming services into Amazon Redshift. 4 | 5 | ### Streaming ingestion from 6 | 7 | * [Amazon Kinesis Data Streams](./from-kinesis/) 8 | * [Amazon Managed Streaming for Apace Kafka (MSK)](./from-msk/) 9 | * [Amazon MSK Serverless](./from-msk-serverless/) 10 | 11 | ![redshift-streaming-ingestion-arch](./redshift-streaming-ingestion-arch.svg) 12 | 13 | ### References 14 | * [Amazon Redshift - Getting started with streaming ingestion from Amazon Kinesis Data Streams](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html) 15 | * [Amazon Redshift - Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html) 16 | * [Real-time analytics with Amazon Redshift streaming ingestion (2022-04-27)](https://aws.amazon.com/ko/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/) 17 | * [Easy analytics and cost-optimization with Amazon Redshift Serverless (2022-08-30)](https://aws.amazon.com/ko/blogs/big-data/easy-analytics-and-cost-optimization-with-amazon-redshift-serverless/) 18 | * [Amazon Redshift Features Workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/7782379f-2424-4a03-be55-ea7dc1a1c353/en-US) 19 | 20 | ## Security 21 | 22 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 23 | 24 | ## License 25 | 26 | This library is licensed under the MIT-0 License. See the LICENSE file. 
27 | 28 | -------------------------------------------------------------------------------- /from-msk-serverless/src/main/python/gen_fake_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import sys 6 | import argparse 7 | import datetime 8 | import json 9 | import random 10 | import time 11 | 12 | import mimesis 13 | 14 | # Mimesis 5.0 supports Python 3.8, 3.9, and 3.10. 15 | # The Mimesis 4.1.3 is the last to support Python 3.6 and 3.7 16 | # For more information, see https://mimesis.name/en/latest/changelog.html#version-5-0-0 17 | assert mimesis.__version__ == '4.1.3' 18 | 19 | from mimesis import locales 20 | from mimesis.schema import Field, Schema 21 | 22 | random.seed(47) 23 | 24 | 25 | def main(): 26 | parser = argparse.ArgumentParser() 27 | 28 | parser.add_argument('--max-count', default=10, type=int, help='The max number of records to put.(default: 10)') 29 | 30 | options = parser.parse_args() 31 | 32 | CURRENT_YEAR = datetime.date.today().year 33 | start_year, end_year = (CURRENT_YEAR, CURRENT_YEAR) 34 | 35 | _ = Field(locale=locales.EN) 36 | _schema = Schema(schema=lambda: { 37 | "_id": _("uuid"), 38 | "clusterID": str(_("integer_number", start=1, end=50)), 39 | "connectionTime": _("formatted_datetime", fmt="%Y-%m-%d %H:%M:%S", start=start_year, end=end_year), 40 | "kWhDelivered": _("float_number", start=500.0, end=1500.0, precision=2), 41 | "stationID": _("integer_number", start=1, end=467), 42 | "spaceID": f'{_("word")}-{_("integer_number", start=1, end=20)}', # {{random.word}}-{{random.number({"min":1, "max":20})} 43 | "timezone": "America/Los_Angeles", 44 | "userID": str(_("integer_number", start=1000, end=500000)) # cast integer_number to string 45 | }) 46 | 47 | cnt = 0 48 | for record in _schema.create(options.max_count): 49 | cnt += 1 50 | print(json.dumps(record)) 51 | time.sleep(random.choices([0.01, 0.03, 0.05, 0.07, 0.1])[-1]) 52 | 53 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 54 | 55 | 56 | if __name__ == '__main__': 57 | main() 58 | 59 | -------------------------------------------------------------------------------- /from-msk/cdk_stacks/vpc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import aws_cdk as cdk 7 | 8 | from aws_cdk import ( 9 | Stack, 10 | aws_ec2, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class VpcStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | #XXX: For creating this CDK Stack in the existing VPC, 21 | # remove comments from the below codes and 22 | # comments out vpc = aws_ec2.Vpc(..) codes, 23 | # then pass -c vpc_name=your-existing-vpc to cdk command 24 | # for example, 25 | # cdk -c vpc_name=your-existing-vpc syth 26 | # 27 | if str(os.environ.get('USE_DEFAULT_VPC', 'false')).lower() == 'true': 28 | vpc_name = self.node.try_get_context("vpc_name") 29 | self.vpc = aws_ec2.Vpc.from_lookup(self, "ExistingVPC", 30 | is_default=True, 31 | vpc_name=vpc_name) 32 | else: 33 | #XXX: To use more than 2 AZs, be sure to specify the account and region on your stack. 
34 | #XXX: https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_ec2/Vpc.html 35 | self.vpc = aws_ec2.Vpc(self, "MskToRedshiftVpc", 36 | ip_addresses=aws_ec2.IpAddresses.cidr("10.0.0.0/21"), 37 | max_azs=3, 38 | 39 | # 'subnetConfiguration' specifies the "subnet groups" to create. 40 | # Every subnet group will have a subnet for each AZ, so this 41 | # configuration will create `2 groups × 3 AZs = 6` subnets. 42 | subnet_configuration=[ 43 | { 44 | "cidrMask": 24, 45 | "name": "Public", 46 | "subnetType": aws_ec2.SubnetType.PUBLIC, 47 | }, 48 | { 49 | "cidrMask": 24, 50 | "name": "Private", 51 | "subnetType": aws_ec2.SubnetType.PRIVATE_WITH_EGRESS 52 | } 53 | ], 54 | gateway_endpoints={ 55 | "S3": aws_ec2.GatewayVpcEndpointOptions( 56 | service=aws_ec2.GatewayVpcEndpointAwsService.S3 57 | ) 58 | } 59 | ) 60 | 61 | 62 | cdk.CfnOutput(self, 'VPCID', value=self.vpc.vpc_id, 63 | export_name=f'{self.stack_name}-VPCID') 64 | -------------------------------------------------------------------------------- /from-kinesis/cdk_stacks/vpc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import aws_cdk as cdk 7 | 8 | from aws_cdk import ( 9 | Stack, 10 | aws_ec2, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class VpcStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | #XXX: For creating this CDK Stack in the existing VPC, 21 | # remove comments from the below codes and 22 | # comments out vpc = aws_ec2.Vpc(..) codes, 23 | # then pass -c vpc_name=your-existing-vpc to cdk command 24 | # for example, 25 | # cdk -c vpc_name=your-existing-vpc syth 26 | # 27 | if str(os.environ.get('USE_DEFAULT_VPC', 'false')).lower() == 'true': 28 | vpc_name = self.node.try_get_context("vpc_name") 29 | self.vpc = aws_ec2.Vpc.from_lookup(self, "ExistingVPC", 30 | is_default=True, 31 | vpc_name=vpc_name) 32 | else: 33 | #XXX: To use more than 2 AZs, be sure to specify the account and region on your stack. 34 | #XXX: https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_ec2/Vpc.html 35 | self.vpc = aws_ec2.Vpc(self, "KdsToRedshiftVpc", 36 | ip_addresses=aws_ec2.IpAddresses.cidr("10.0.0.0/21"), 37 | max_azs=3, 38 | 39 | # 'subnetConfiguration' specifies the "subnet groups" to create. 40 | # Every subnet group will have a subnet for each AZ, so this 41 | # configuration will create `2 groups × 3 AZs = 6` subnets. 
42 | subnet_configuration=[ 43 | { 44 | "cidrMask": 24, 45 | "name": "Public", 46 | "subnetType": aws_ec2.SubnetType.PUBLIC, 47 | }, 48 | { 49 | "cidrMask": 24, 50 | "name": "Private", 51 | "subnetType": aws_ec2.SubnetType.PRIVATE_WITH_EGRESS 52 | } 53 | ], 54 | gateway_endpoints={ 55 | "S3": aws_ec2.GatewayVpcEndpointOptions( 56 | service=aws_ec2.GatewayVpcEndpointAwsService.S3 57 | ) 58 | } 59 | ) 60 | 61 | 62 | cdk.CfnOutput(self, 'VPCID', value=self.vpc.vpc_id, 63 | export_name=f'{self.stack_name}-VPCID') 64 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk_stacks/vpc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import aws_cdk as cdk 7 | 8 | from aws_cdk import ( 9 | Stack, 10 | aws_ec2, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class VpcStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | #XXX: For creating this CDK Stack in the existing VPC, 21 | # remove comments from the below codes and 22 | # comments out vpc = aws_ec2.Vpc(..) codes, 23 | # then pass -c vpc_name=your-existing-vpc to cdk command 24 | # for example, 25 | # cdk -c vpc_name=your-existing-vpc syth 26 | # 27 | if str(os.environ.get('USE_DEFAULT_VPC', 'false')).lower() == 'true': 28 | vpc_name = self.node.try_get_context("vpc_name") 29 | self.vpc = aws_ec2.Vpc.from_lookup(self, "ExistingVPC", 30 | is_default=True, 31 | vpc_name=vpc_name) 32 | else: 33 | #XXX: To use more than 2 AZs, be sure to specify the account and region on your stack. 34 | #XXX: https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_ec2/Vpc.html 35 | self.vpc = aws_ec2.Vpc(self, "MSKServerlessToRedshiftVpc", 36 | ip_addresses=aws_ec2.IpAddresses.cidr("10.0.0.0/21"), 37 | max_azs=3, 38 | 39 | # 'subnetConfiguration' specifies the "subnet groups" to create. 40 | # Every subnet group will have a subnet for each AZ, so this 41 | # configuration will create `2 groups × 3 AZs = 6` subnets. 
42 | subnet_configuration=[ 43 | { 44 | "cidrMask": 24, 45 | "name": "Public", 46 | "subnetType": aws_ec2.SubnetType.PUBLIC, 47 | }, 48 | { 49 | "cidrMask": 24, 50 | "name": "Private", 51 | "subnetType": aws_ec2.SubnetType.PRIVATE_WITH_EGRESS 52 | } 53 | ], 54 | gateway_endpoints={ 55 | "S3": aws_ec2.GatewayVpcEndpointOptions( 56 | service=aws_ec2.GatewayVpcEndpointAwsService.S3 57 | ) 58 | } 59 | ) 60 | 61 | 62 | cdk.CfnOutput(self, 'VPCID', value=self.vpc.vpc_id, 63 | export_name=f'{self.stack_name}-VPCID') 64 | -------------------------------------------------------------------------------- /from-kinesis/src/main/python/gen_fake_kinesis_stream_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import sys 6 | import argparse 7 | import datetime 8 | import json 9 | import random 10 | import time 11 | 12 | import boto3 13 | 14 | from mimesis.locales import Locale 15 | from mimesis.schema import Field, Schema 16 | 17 | random.seed(47) 18 | 19 | 20 | def main(): 21 | parser = argparse.ArgumentParser() 22 | 23 | parser.add_argument('--region-name', action='store', default='us-east-1', 24 | help='aws region name (default: us-east-1)') 25 | parser.add_argument('--stream-name', help='The name of the stream to put the data record into.') 26 | parser.add_argument('--max-count', default=10, type=int, help='The max number of records to put.') 27 | parser.add_argument('--dry-run', action='store_true') 28 | 29 | options = parser.parse_args() 30 | 31 | CURRENT_YEAR = datetime.date.today().year 32 | start_year, end_year = (CURRENT_YEAR, CURRENT_YEAR) 33 | 34 | _ = Field(locale=Locale.EN) 35 | _schema = Schema(schema=lambda: { 36 | "_id": _("uuid"), 37 | "clusterID": str(_("integer_number", start=1, end=50)), 38 | "connectionTime": _("formatted_datetime", fmt="%Y-%m-%d %H:%M:%S", start=start_year, end=end_year), 39 | "kWhDelivered": _("price"), # DECIMAL(10,2) 40 | "stationID": _("integer_number", start=1, end=467), 41 | "spaceID": f'{_("word")}-{_("integer_number", start=1, end=20)}', # {{random.word}}-{{random.number({"min":1, "max":20})} 42 | "timezone": "America/Los_Angeles", 43 | "userID": str(_("integer_number", start=1000, end=500000)) # cast integer_number to string 44 | }) 45 | 46 | if not options.dry_run: 47 | kinesis_streams_client = boto3.client('kinesis', region_name=options.region_name) 48 | 49 | cnt = 0 50 | for record in _schema.iterator(options.max_count): 51 | cnt += 1 52 | 53 | if options.dry_run: 54 | print(json.dumps(record)) 55 | else: 56 | res = kinesis_streams_client.put_record( 57 | StreamName=options.stream_name, 58 | Data=json.dumps(record), 59 | PartitionKey=record['_id'] 60 | ) 61 | 62 | if cnt % 100 == 0: 63 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 64 | 65 | if res['ResponseMetadata']['HTTPStatusCode'] != 200: 66 | print(res, file=sys.stderr) 67 | time.sleep(random.choices([0.01, 0.03, 0.05, 0.07, 0.1])[-1]) 68 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 69 | 70 | 71 | if __name__ == '__main__': 72 | main() 73 | 74 | -------------------------------------------------------------------------------- /from-msk/src/main/python/gen_fake_kafka_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import sys 6 | import 
argparse 7 | import datetime 8 | import json 9 | import random 10 | import time 11 | 12 | import mimesis 13 | 14 | # Mimesis 5.0 supports Python 3.8, 3.9, and 3.10. 15 | # The Mimesis 4.1.3 is the last to support Python 3.6 and 3.7 16 | # For more information, see https://mimesis.name/en/latest/changelog.html#version-5-0-0 17 | assert mimesis.__version__ == '4.1.3' 18 | 19 | from mimesis import locales 20 | from mimesis.schema import Field, Schema 21 | 22 | from kafka import KafkaProducer 23 | from kafka.errors import KafkaError 24 | 25 | random.seed(47) 26 | 27 | def on_send_success(record_metadata): 28 | print(f"[INFO] {record_metadata.topic}:{record_metadata.partition}:{record_metadata.offset}", file=sys.stderr) 29 | 30 | 31 | def on_send_error(excp): 32 | print("[ERROR]", excp, file=sys.stderr) 33 | 34 | 35 | def main(): 36 | parser = argparse.ArgumentParser() 37 | 38 | parser.add_argument('--bootstrap-servers', help='bootstrap servers') 39 | parser.add_argument('--topic', help='kafka topic') 40 | parser.add_argument('--max-count', default=10, type=int, help='The max number of records to put.') 41 | parser.add_argument('--dry-run', action='store_true') 42 | 43 | options = parser.parse_args() 44 | 45 | CURRENT_YEAR = datetime.date.today().year 46 | start_year, end_year = (CURRENT_YEAR, CURRENT_YEAR) 47 | 48 | _ = Field(locale=locales.EN) 49 | _schema = Schema(schema=lambda: { 50 | "_id": _("uuid"), 51 | "clusterID": str(_("integer_number", start=1, end=50)), 52 | "connectionTime": _("formatted_datetime", fmt="%Y-%m-%d %H:%M:%S", start=start_year, end=end_year), 53 | "kWhDelivered": _("float_number", start=500.0, end=1500.0, precision=2), 54 | "stationID": _("integer_number", start=1, end=467), 55 | "spaceID": f'{_("word")}-{_("integer_number", start=1, end=20)}', # {{random.word}}-{{random.number({"min":1, "max":20})} 56 | "timezone": "America/Los_Angeles", 57 | "userID": str(_("integer_number", start=1000, end=500000)) # cast integer_number to string 58 | }) 59 | 60 | if not options.dry_run: 61 | kafka_producer = KafkaProducer(bootstrap_servers=options.bootstrap_servers, retries=5) 62 | 63 | cnt = 0 64 | for record in _schema.create(options.max_count): 65 | cnt += 1 66 | 67 | if options.dry_run: 68 | print(json.dumps(record)) 69 | else: 70 | partition_key = record['_id'].encode(encoding='utf-8') 71 | message = json.dumps(record).encode(encoding='utf-8') 72 | kafka_producer.send(options.topic, key=partition_key, value=message).add_callback(on_send_success).add_errback(on_send_error) 73 | 74 | if cnt % 100 == 0: 75 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 76 | kafka_producer.flush() 77 | 78 | time.sleep(random.choices([0.01, 0.03, 0.05, 0.07, 0.1])[-1]) 79 | 80 | if not options.dry_run: 81 | kafka_producer.flush() 82 | 83 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 84 | 85 | 86 | if __name__ == '__main__': 87 | main() 88 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk_stacks/msk_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import re 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_msk 15 | ) 16 | from constructs import Construct 17 | 18 | random.seed(43) 19 | 20 | class MskServerlessStack(Stack): 21 | 22 | def __init__(self, scope: 
Construct, construct_id: str, vpc, **kwargs) -> None: 23 | super().__init__(scope, construct_id, **kwargs) 24 | 25 | msk_cluster_name = self.node.try_get_context('msk_cluster_name') 26 | 27 | _MSK_DEFAULT_CLUSTER_NAME = 'MSKServerless-{}'.format(''.join(random.choices((string.ascii_letters), k=5))) 28 | MSK_CLUSTER_NAME = msk_cluster_name or _MSK_DEFAULT_CLUSTER_NAME 29 | assert len(MSK_CLUSTER_NAME) <= 64 and re.fullmatch(r'[a-zA-Z]+[a-zA-Z0-9-]*', MSK_CLUSTER_NAME) 30 | 31 | MSK_CLIENT_SG_NAME = 'msk-client-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 32 | sg_msk_client = aws_ec2.SecurityGroup(self, 'KafkaClientSecurityGroup', 33 | vpc=vpc, 34 | allow_all_outbound=True, 35 | description='security group for Amazon MSK client', 36 | security_group_name=MSK_CLIENT_SG_NAME 37 | ) 38 | cdk.Tags.of(sg_msk_client).add('Name', MSK_CLIENT_SG_NAME) 39 | 40 | MSK_CLUSTER_SG_NAME = 'msk-cluster-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 41 | sg_msk_cluster = aws_ec2.SecurityGroup(self, 'MSKSecurityGroup', 42 | vpc=vpc, 43 | allow_all_outbound=True, 44 | description='security group for Amazon MSK Cluster', 45 | security_group_name=MSK_CLUSTER_SG_NAME 46 | ) 47 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9098), 48 | description='msk client security group') 49 | cdk.Tags.of(sg_msk_cluster).add('Name', MSK_CLUSTER_SG_NAME) 50 | 51 | msk_serverless_cluster = aws_msk.CfnServerlessCluster(self, "MSKServerlessCfnCluster", 52 | #XXX: A serverless cluster must use SASL/IAM authentication 53 | client_authentication=aws_msk.CfnServerlessCluster.ClientAuthenticationProperty( 54 | sasl=aws_msk.CfnServerlessCluster.SaslProperty( 55 | iam=aws_msk.CfnServerlessCluster.IamProperty( 56 | enabled=True 57 | ) 58 | ) 59 | ), 60 | cluster_name=MSK_CLUSTER_NAME, 61 | vpc_configs=[aws_msk.CfnServerlessCluster.VpcConfigProperty( 62 | subnet_ids=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids, 63 | security_groups=[sg_msk_client.security_group_id, sg_msk_cluster.security_group_id] 64 | )] 65 | ) 66 | 67 | self.sg_msk_client = sg_msk_client 68 | self.msk_cluster_name = msk_serverless_cluster.cluster_name 69 | 70 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterName', value=msk_serverless_cluster.cluster_name, 71 | export_name=f'{self.stack_name}-MSKClusterName') 72 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterArn', value=msk_serverless_cluster.attr_arn, 73 | export_name=f'{self.stack_name}-MSKClusterArn') 74 | 75 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can.
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
60 | -------------------------------------------------------------------------------- /from-kinesis/cdk_stacks/redshift_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_ec2, 10 | aws_iam, 11 | aws_redshiftserverless, 12 | aws_secretsmanager 13 | ) 14 | from constructs import Construct 15 | 16 | 17 | class RedshiftServerlessStack(Stack): 18 | 19 | def __init__(self, scope: Construct, construct_id: str, vpc, **kwargs) -> None: 20 | super().__init__(scope, construct_id, **kwargs) 21 | 22 | secret_name = self.node.try_get_context('aws_secret_name') 23 | rs_admin_user_secret = aws_secretsmanager.Secret.from_secret_name_v2(self, 24 | 'RedshiftAdminUserSecret', 25 | secret_name) 26 | 27 | REDSHIFT_DB_NAME = self.node.try_get_context('db_name') or 'dev' 28 | REDSHIFT_NAMESPACE_NAME = self.node.try_get_context('namespace') or 'rss-demo-ns' 29 | REDSHIFT_WORKGROUP_NAME = self.node.try_get_context('workgroup') or 'rss-demo-wg' 30 | 31 | sg_rs_client = aws_ec2.SecurityGroup(self, 'RedshiftClientSG', 32 | vpc=vpc, 33 | allow_all_outbound=True, 34 | description='security group for redshift client', 35 | security_group_name='redshift-client-sg' 36 | ) 37 | cdk.Tags.of(sg_rs_client).add('Name', 'redshift-client-sg') 38 | 39 | sg_rs_cluster = aws_ec2.SecurityGroup(self, 'RedshiftClusterSG', 40 | vpc=vpc, 41 | allow_all_outbound=True, 42 | description='security group for redshift cluster nodes', 43 | security_group_name='redshift-cluster-sg' 44 | ) 45 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_client, connection=aws_ec2.Port.tcp(5439), 46 | description='redshift-client-sg') 47 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_cluster, connection=aws_ec2.Port.all_tcp(), 48 | description='redshift-cluster-sg') 49 | cdk.Tags.of(sg_rs_cluster).add('Name', 'redshift-cluster-sg') 50 | 51 | redshift_streaming_role = aws_iam.Role(self, "RedshiftStreamingRole", 52 | role_name='RedshiftStreamingRole', 53 | assumed_by=aws_iam.ServicePrincipal('redshift.amazonaws.com'), 54 | managed_policies=[ 55 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonKinesisReadOnlyAccess'), 56 | ] 57 | ) 58 | 59 | cfn_rss_namespace = aws_redshiftserverless.CfnNamespace(self, 'RedshiftServerlessCfnNamespace', 60 | namespace_name=REDSHIFT_NAMESPACE_NAME, 61 | admin_username=rs_admin_user_secret.secret_value_from_json("admin_username").unsafe_unwrap(), 62 | admin_user_password=rs_admin_user_secret.secret_value_from_json("admin_user_password").unsafe_unwrap(), 63 | db_name=REDSHIFT_DB_NAME, 64 | iam_roles=[redshift_streaming_role.role_arn], 65 | log_exports=['userlog', 'connectionlog', 'useractivitylog'] 66 | ) 67 | 68 | cfn_rss_workgroup = aws_redshiftserverless.CfnWorkgroup(self, 'RedshiftServerlessCfnWorkgroup', 69 | workgroup_name=REDSHIFT_WORKGROUP_NAME, 70 | base_capacity=128, 71 | enhanced_vpc_routing=True, 72 | namespace_name=cfn_rss_namespace.namespace_name, 73 | publicly_accessible=False, 74 | security_group_ids=[sg_rs_cluster.security_group_id], 75 | subnet_ids=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids 76 | ) 77 | cfn_rss_workgroup.add_dependency(cfn_rss_namespace) 78 | cfn_rss_workgroup.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 79 | 80 | cdk.CfnOutput(self, f'{self.stack_name}-NamespaceName', 81 | value=cfn_rss_workgroup.namespace_name, 
export_name=f'{self.stack_name}-NamespaceName') 82 | cdk.CfnOutput(self, f'{self.stack_name}-WorkgroupName', 83 | value=cfn_rss_workgroup.workgroup_name, export_name=f'{self.stack_name}-WorkgroupName') 84 | 85 | -------------------------------------------------------------------------------- /from-msk/cdk_stacks/redshift_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_ec2, 10 | aws_iam, 11 | aws_redshiftserverless, 12 | aws_secretsmanager 13 | ) 14 | from constructs import Construct 15 | 16 | 17 | class RedshiftServerlessStack(Stack): 18 | 19 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, **kwargs) -> None: 20 | super().__init__(scope, construct_id, **kwargs) 21 | 22 | secret_name = self.node.try_get_context('aws_secret_name') 23 | rs_admin_user_secret = aws_secretsmanager.Secret.from_secret_name_v2(self, 24 | 'RedshiftAdminUserSecret', 25 | secret_name) 26 | 27 | REDSHIFT_DB_NAME = self.node.try_get_context('db_name') or 'dev' 28 | REDSHIFT_NAMESPACE_NAME = self.node.try_get_context('namespace') or 'rss-streaming-from-msk-ns' 29 | REDSHIFT_WORKGROUP_NAME = self.node.try_get_context('workgroup') or 'rss-streaming-from-msk-wg' 30 | 31 | sg_rs_client = aws_ec2.SecurityGroup(self, 'RedshiftClientSG', 32 | vpc=vpc, 33 | allow_all_outbound=True, 34 | description='security group for redshift client', 35 | security_group_name='redshift-client-sg' 36 | ) 37 | cdk.Tags.of(sg_rs_client).add('Name', 'redshift-client-sg') 38 | 39 | sg_rs_cluster = aws_ec2.SecurityGroup(self, 'RedshiftClusterSG', 40 | vpc=vpc, 41 | allow_all_outbound=True, 42 | description='security group for redshift cluster nodes', 43 | security_group_name='redshift-cluster-sg' 44 | ) 45 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_client, connection=aws_ec2.Port.tcp(5439), 46 | description='redshift-client-sg') 47 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_cluster, connection=aws_ec2.Port.all_tcp(), 48 | description='redshift-cluster-sg') 49 | cdk.Tags.of(sg_rs_cluster).add('Name', 'redshift-cluster-sg') 50 | 51 | msk_access_policy_doc = aws_iam.PolicyDocument() 52 | msk_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 53 | "sid": "MSKPolicy", 54 | "effect": aws_iam.Effect.ALLOW, 55 | "resources": ["*"], 56 | "actions": [ 57 | "kafka:GetBootstrapBrokers" 58 | ] 59 | })) 60 | 61 | msk_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 62 | "sid": "MSKIAMpolicy", 63 | "effect": aws_iam.Effect.ALLOW, 64 | "resources": [ 65 | f"arn:aws:kafka:*:{cdk.Aws.ACCOUNT_ID}:cluster/*/*", 66 | f"arn:aws:kafka:*:{cdk.Aws.ACCOUNT_ID}:topic/*/*/*" 67 | ], 68 | "actions": [ 69 | "kafka-cluster:ReadData", 70 | "kafka-cluster:DescribeTopic", 71 | "kafka-cluster:Connect" 72 | ] 73 | })) 74 | 75 | redshift_streaming_role = aws_iam.Role(self, 'RedshiftStreamingRole', 76 | role_name='RedshiftStreamingRole', 77 | assumed_by=aws_iam.ServicePrincipal('redshift.amazonaws.com'), 78 | inline_policies={ 79 | 'MSKAccessPolicy': msk_access_policy_doc 80 | } 81 | ) 82 | 83 | cfn_rss_namespace = aws_redshiftserverless.CfnNamespace(self, 'RedshiftServerlessCfnNamespace', 84 | namespace_name=REDSHIFT_NAMESPACE_NAME, 85 | admin_username=rs_admin_user_secret.secret_value_from_json("admin_username").unsafe_unwrap(), 86 | 
admin_user_password=rs_admin_user_secret.secret_value_from_json("admin_user_password").unsafe_unwrap(), 87 | db_name=REDSHIFT_DB_NAME, 88 | iam_roles=[redshift_streaming_role.role_arn], 89 | log_exports=['userlog', 'connectionlog', 'useractivitylog'] 90 | ) 91 | 92 | cfn_rss_workgroup = aws_redshiftserverless.CfnWorkgroup(self, 'RedshiftServerlessCfnWorkgroup', 93 | workgroup_name=REDSHIFT_WORKGROUP_NAME, 94 | base_capacity=128, 95 | enhanced_vpc_routing=True, 96 | namespace_name=cfn_rss_namespace.namespace_name, 97 | publicly_accessible=False, 98 | security_group_ids=[sg_rs_cluster.security_group_id, sg_msk_client.security_group_id], 99 | subnet_ids=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids 100 | ) 101 | cfn_rss_workgroup.add_dependency(cfn_rss_namespace) 102 | cfn_rss_workgroup.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 103 | 104 | cdk.CfnOutput(self, f'{self.stack_name}-NamespaceName', 105 | value=cfn_rss_workgroup.namespace_name, export_name=f'{self.stack_name}-NamespaceName') 106 | cdk.CfnOutput(self, f'{self.stack_name}-WorkgroupName', 107 | value=cfn_rss_workgroup.workgroup_name, export_name=f'{self.stack_name}-WorkgroupName') 108 | 109 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk_stacks/redshift_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_ec2, 10 | aws_iam, 11 | aws_redshiftserverless, 12 | aws_secretsmanager 13 | ) 14 | from constructs import Construct 15 | 16 | 17 | class RedshiftServerlessStack(Stack): 18 | 19 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, **kwargs) -> None: 20 | super().__init__(scope, construct_id, **kwargs) 21 | 22 | secret_name = self.node.try_get_context('aws_secret_name') 23 | rs_admin_user_secret = aws_secretsmanager.Secret.from_secret_name_v2(self, 24 | 'RedshiftAdminUserSecret', 25 | secret_name) 26 | 27 | REDSHIFT_DB_NAME = self.node.try_get_context('db_name') or 'dev' 28 | REDSHIFT_NAMESPACE_NAME = self.node.try_get_context('namespace') or 'rss-streaming-from-msk-serverless-ns' 29 | REDSHIFT_WORKGROUP_NAME = self.node.try_get_context('workgroup') or 'rss-streaming-from-msk-serverless-wg' 30 | 31 | sg_rs_client = aws_ec2.SecurityGroup(self, 'RedshiftClientSG', 32 | vpc=vpc, 33 | allow_all_outbound=True, 34 | description='security group for redshift client', 35 | security_group_name='redshift-client-sg' 36 | ) 37 | cdk.Tags.of(sg_rs_client).add('Name', 'redshift-client-sg') 38 | 39 | sg_rs_cluster = aws_ec2.SecurityGroup(self, 'RedshiftClusterSG', 40 | vpc=vpc, 41 | allow_all_outbound=True, 42 | description='security group for redshift cluster nodes', 43 | security_group_name='redshift-cluster-sg' 44 | ) 45 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_client, connection=aws_ec2.Port.tcp(5439), 46 | description='redshift-client-sg') 47 | sg_rs_cluster.add_ingress_rule(peer=sg_rs_cluster, connection=aws_ec2.Port.all_tcp(), 48 | description='redshift-cluster-sg') 49 | cdk.Tags.of(sg_rs_cluster).add('Name', 'redshift-cluster-sg') 50 | 51 | msk_serverless_access_policy_doc = aws_iam.PolicyDocument() 52 | msk_serverless_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 53 | "effect": aws_iam.Effect.ALLOW, 54 | "resources": ["*"], 55 | "actions": [ 56 | 
"kafka:GetBootstrapBrokers" 57 | ] 58 | })) 59 | 60 | msk_serverless_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 61 | "effect": aws_iam.Effect.ALLOW, 62 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/*/*" ], 63 | "actions": [ 64 | "kafka-cluster:Connect", 65 | "kafka-cluster:AlterCluster", 66 | "kafka-cluster:DescribeCluster" 67 | ] 68 | })) 69 | 70 | msk_serverless_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 71 | "effect": aws_iam.Effect.ALLOW, 72 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/*/*" ], 73 | "actions": [ 74 | "kafka-cluster:*Topic*", 75 | "kafka-cluster:WriteData", 76 | "kafka-cluster:ReadData" 77 | ] 78 | })) 79 | 80 | msk_serverless_access_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 81 | "effect": aws_iam.Effect.ALLOW, 82 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/*/*" ], 83 | "actions": [ 84 | "kafka-cluster:AlterGroup", 85 | "kafka-cluster:DescribeGroup" 86 | ] 87 | })) 88 | 89 | redshift_streaming_role = aws_iam.Role(self, 'RedshiftStreamingRole', 90 | role_name='RedshiftStreamingRole', 91 | assumed_by=aws_iam.ServicePrincipal('redshift.amazonaws.com'), 92 | inline_policies={ 93 | 'MSKServerlessAccessPolicy': msk_serverless_access_policy_doc 94 | } 95 | ) 96 | 97 | cfn_rss_namespace = aws_redshiftserverless.CfnNamespace(self, 'RedshiftServerlessCfnNamespace', 98 | namespace_name=REDSHIFT_NAMESPACE_NAME, 99 | admin_username=rs_admin_user_secret.secret_value_from_json("admin_username").unsafe_unwrap(), 100 | admin_user_password=rs_admin_user_secret.secret_value_from_json("admin_user_password").unsafe_unwrap(), 101 | db_name=REDSHIFT_DB_NAME, 102 | iam_roles=[redshift_streaming_role.role_arn], 103 | log_exports=['userlog', 'connectionlog', 'useractivitylog'] 104 | ) 105 | 106 | cfn_rss_workgroup = aws_redshiftserverless.CfnWorkgroup(self, 'RedshiftServerlessCfnWorkgroup', 107 | workgroup_name=REDSHIFT_WORKGROUP_NAME, 108 | base_capacity=128, 109 | enhanced_vpc_routing=True, 110 | namespace_name=cfn_rss_namespace.namespace_name, 111 | publicly_accessible=False, 112 | security_group_ids=[sg_rs_cluster.security_group_id, sg_msk_client.security_group_id], 113 | subnet_ids=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids 114 | ) 115 | cfn_rss_workgroup.add_dependency(cfn_rss_namespace) 116 | cfn_rss_workgroup.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 117 | 118 | cdk.CfnOutput(self, f'{self.stack_name}-NamespaceName', 119 | value=cfn_rss_workgroup.namespace_name, export_name=f'{self.stack_name}-NamespaceName') 120 | cdk.CfnOutput(self, f'{self.stack_name}-WorkgroupName', 121 | value=cfn_rss_workgroup.workgroup_name, export_name=f'{self.stack_name}-WorkgroupName') 122 | 123 | -------------------------------------------------------------------------------- /from-msk/cdk_stacks/msk.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import re 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_msk 15 | ) 16 | from constructs import Construct 17 | 18 | random.seed(43) 19 | 20 | class MskStack(Stack): 21 | 22 | def __init__(self, scope: Construct, construct_id: str, vpc, **kwargs) -> None: 23 | super().__init__(scope, construct_id, **kwargs) 24 | 25 | msk_config = 
self.node.try_get_context('msk') 26 | 27 | _MSK_DEFAULT_CLUSTER_NAME = 'MSK-{}'.format(''.join(random.choices((string.ascii_letters), k=5))) 28 | MSK_CLUSTER_NAME = msk_config.get('cluster_name', _MSK_DEFAULT_CLUSTER_NAME) 29 | assert len(MSK_CLUSTER_NAME) <= 64 and re.fullmatch(r'[a-zA-Z]+[a-zA-Z0-9-]*', MSK_CLUSTER_NAME) 30 | 31 | # Supported Apache Kafka versions: 32 | # https://docs.aws.amazon.com/msk/latest/developerguide/supported-kafka-versions.html 33 | KAFA_VERSION = msk_config.get('kafka_version', '2.8.1') 34 | 35 | KAFA_BROKER_INSTANCE_TYPE = msk_config.get('broker_instance_type', 'kafka.m5.large') 36 | KAFA_NUMBER_OF_BROKER_NODES = int(msk_config.get('number_of_broker_nodes', 3)) 37 | 38 | KAFA_BROKER_EBS_VOLUME_SIZE = int(msk_config.get('broker_ebs_volume_size', 100)) 39 | assert (1 <= KAFA_BROKER_EBS_VOLUME_SIZE and KAFA_BROKER_EBS_VOLUME_SIZE <= 16384) 40 | 41 | MSK_CLIENT_SG_NAME = 'use-msk-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 42 | sg_msk_client = aws_ec2.SecurityGroup(self, 'KafkaClientSecurityGroup', 43 | vpc=vpc, 44 | allow_all_outbound=True, 45 | description='security group for Amazon MSK client', 46 | security_group_name=MSK_CLIENT_SG_NAME 47 | ) 48 | cdk.Tags.of(sg_msk_client).add('Name', MSK_CLIENT_SG_NAME) 49 | 50 | MSK_CLUSTER_SG_NAME = 'msk-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 51 | sg_msk_cluster = aws_ec2.SecurityGroup(self, 'MSKSecurityGroup', 52 | vpc=vpc, 53 | allow_all_outbound=True, 54 | description='security group for Amazon MSK Cluster', 55 | security_group_name=MSK_CLUSTER_SG_NAME 56 | ) 57 | # For more information about the numbers of the ports that Amazon MSK uses to communicate with client machines, 58 | # see https://docs.aws.amazon.com/msk/latest/developerguide/port-info.html 59 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(2181), 60 | description='allow msk client to communicate with Apache ZooKeeper in plaintext') 61 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(2182), 62 | description='allow msk client to communicate with Apache ZooKeeper by using TLS encryption') 63 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9092), 64 | description='allow msk client to communicate with brokers in plaintext') 65 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9094), 66 | description='allow msk client to communicate with brokers by using TLS encryption') 67 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9098), 68 | description='msk client security group') 69 | cdk.Tags.of(sg_msk_cluster).add('Name', MSK_CLUSTER_SG_NAME) 70 | 71 | msk_broker_ebs_storage_info = aws_msk.CfnCluster.EBSStorageInfoProperty(volume_size=KAFA_BROKER_EBS_VOLUME_SIZE) 72 | 73 | msk_broker_storage_info = aws_msk.CfnCluster.StorageInfoProperty( 74 | ebs_storage_info=msk_broker_ebs_storage_info 75 | ) 76 | 77 | msk_broker_node_group_info = aws_msk.CfnCluster.BrokerNodeGroupInfoProperty( 78 | client_subnets=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids, 79 | instance_type=KAFA_BROKER_INSTANCE_TYPE, 80 | security_groups=[sg_msk_client.security_group_id, sg_msk_cluster.security_group_id], 81 | storage_info=msk_broker_storage_info 82 | ) 83 | 84 | msk_encryption_info = aws_msk.CfnCluster.EncryptionInfoProperty( 85 | encryption_in_transit=aws_msk.CfnCluster.EncryptionInTransitProperty( 86 | client_broker='TLS_PLAINTEXT', 87 | 
in_cluster=True 88 | ) 89 | ) 90 | 91 | msk_cluster = aws_msk.CfnCluster(self, 'AWSKafkaCluster', 92 | broker_node_group_info=msk_broker_node_group_info, 93 | cluster_name=MSK_CLUSTER_NAME, 94 | #XXX: Supported Apache Kafka versions 95 | # https://docs.aws.amazon.com/msk/latest/developerguide/supported-kafka-versions.html 96 | kafka_version=KAFA_VERSION, 97 | number_of_broker_nodes=KAFA_NUMBER_OF_BROKER_NODES, 98 | encryption_info=msk_encryption_info, 99 | enhanced_monitoring='PER_TOPIC_PER_BROKER' 100 | ) 101 | 102 | self.sg_msk_client = sg_msk_client 103 | self.msk_cluster_name = msk_cluster.cluster_name 104 | 105 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterName', value=msk_cluster.cluster_name, export_name=f'{self.stack_name}-MSKClusterName') 106 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterArn', value=msk_cluster.attr_arn, export_name=f'{self.stack_name}-MSKClusterArn') 107 | cdk.CfnOutput(self, f'{self.stack_name}-MSKVersion', value=msk_cluster.kafka_version, export_name=f'{self.stack_name}-MSKVersion') 108 | -------------------------------------------------------------------------------- /from-msk/cdk_stacks/kafka_client_ec2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_iam, 15 | aws_s3_assets 16 | ) 17 | from constructs import Construct 18 | 19 | random.seed(47) 20 | 21 | 22 | class KafkaClientEC2InstanceStack(Stack): 23 | 24 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, msk_cluster_name, **kwargs) -> None: 25 | super().__init__(scope, construct_id, **kwargs) 26 | 27 | KAFKA_CLIENT_EC2_SG_NAME = 'kafka-client-ec2-sg-{}'.format(''.join(random.choices((string.ascii_lowercase), k=5))) 28 | sg_kafka_client_ec2_instance = aws_ec2.SecurityGroup(self, 'KafkaClientEC2InstanceSG', 29 | vpc=vpc, 30 | allow_all_outbound=True, 31 | description='security group for Kafka Client EC2 Instance', 32 | security_group_name=KAFKA_CLIENT_EC2_SG_NAME 33 | ) 34 | cdk.Tags.of(sg_kafka_client_ec2_instance).add('Name', KAFKA_CLIENT_EC2_SG_NAME) 35 | sg_kafka_client_ec2_instance.add_ingress_rule(peer=aws_ec2.Peer.ipv4("0.0.0.0/0"), 36 | connection=aws_ec2.Port.tcp(22)) 37 | 38 | #XXX: For more information, see https://docs.aws.amazon.com/msk/latest/developerguide/create-iam-role.html 39 | kafka_client_iam_policy = aws_iam.Policy(self, 'KafkaClientIAMPolicy', 40 | statements=[ 41 | aws_iam.PolicyStatement(**{ 42 | "effect": aws_iam.Effect.ALLOW, 43 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/{msk_cluster_name}/*" ], 44 | "actions": [ 45 | "kafka-cluster:Connect", 46 | "kafka-cluster:AlterCluster", 47 | "kafka-cluster:DescribeCluster" 48 | ] 49 | }), 50 | aws_iam.PolicyStatement(**{ 51 | "effect": aws_iam.Effect.ALLOW, 52 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/{msk_cluster_name}/*" ], 53 | "actions": [ 54 | "kafka-cluster:*Topic*", 55 | "kafka-cluster:WriteData", 56 | "kafka-cluster:ReadData" 57 | ] 58 | }), 59 | aws_iam.PolicyStatement(**{ 60 | "effect": aws_iam.Effect.ALLOW, 61 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/{msk_cluster_name}/*" ], 62 | "actions": [ 63 | "kafka-cluster:AlterGroup", 64 | "kafka-cluster:DescribeGroup" 65 | ] 66 | }) 67 | ] 68 | ) 69 
| kafka_client_iam_policy.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 70 | 71 | kafka_client_ec2_instance_role = aws_iam.Role(self, 'KafkaClientEC2InstanceRole', 72 | role_name=f'KafkaClientEC2InstanceRole-{self.stack_name}', 73 | assumed_by=aws_iam.ServicePrincipal('ec2.amazonaws.com'), 74 | managed_policies=[ 75 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMManagedInstanceCore') 76 | ] 77 | ) 78 | 79 | kafka_client_iam_policy.attach_to_role(kafka_client_ec2_instance_role) 80 | 81 | amzn_linux = aws_ec2.MachineImage.latest_amazon_linux( 82 | generation=aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2, 83 | edition=aws_ec2.AmazonLinuxEdition.STANDARD, 84 | virtualization=aws_ec2.AmazonLinuxVirt.HVM, 85 | storage=aws_ec2.AmazonLinuxStorage.GENERAL_PURPOSE, 86 | cpu_type=aws_ec2.AmazonLinuxCpuType.X86_64 87 | ) 88 | 89 | msk_client_ec2_instance = aws_ec2.Instance(self, 'KafkaClientEC2Instance', 90 | instance_type=aws_ec2.InstanceType.of(instance_class=aws_ec2.InstanceClass.BURSTABLE2, 91 | instance_size=aws_ec2.InstanceSize.MICRO), 92 | machine_image=amzn_linux, 93 | vpc=vpc, 94 | availability_zone=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).availability_zones[0], 95 | instance_name='KafkaClientInstance', 96 | role=kafka_client_ec2_instance_role, 97 | security_group=sg_kafka_client_ec2_instance, 98 | vpc_subnets=aws_ec2.SubnetSelection(subnet_type=aws_ec2.SubnetType.PUBLIC) 99 | ) 100 | msk_client_ec2_instance.add_security_group(sg_msk_client) 101 | 102 | # test data generator script in S3 as Asset 103 | user_data_asset = aws_s3_assets.Asset(self, 'KafkaClientEC2UserData', 104 | path=os.path.join(os.path.dirname(__file__), '../src/main/python/gen_fake_kafka_data.py')) 105 | user_data_asset.grant_read(msk_client_ec2_instance.role) 106 | 107 | USER_DATA_LOCAL_PATH = msk_client_ec2_instance.user_data.add_s3_download_command( 108 | bucket=user_data_asset.bucket, 109 | bucket_key=user_data_asset.s3_object_key 110 | ) 111 | 112 | commands = ''' 113 | yum update -y 114 | yum install python3.7 -y 115 | yum install java-1.8.0-openjdk-devel -y 116 | yum install -y jq 117 | 118 | cd /home/ec2-user 119 | mkdir -p opt && cd opt 120 | wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz 121 | tar -xzf kafka_2.12-2.8.1.tgz 122 | ln -nsf kafka_2.12-2.8.1 kafka 123 | 124 | chown -R ec2-user /home/ec2-user/opt 125 | chgrp -R ec2-user /home/ec2-user/opt 126 | 127 | cd /home/ec2-user 128 | wget https://bootstrap.pypa.io/get-pip.py 129 | su -c "python3.7 get-pip.py --user" -s /bin/sh ec2-user 130 | su -c "/home/ec2-user/.local/bin/pip3 install boto3 --user" -s /bin/sh ec2-user 131 | 132 | echo 'export PATH=$HOME/.local/bin:$HOME/opt/kafka/bin:$PATH' >> .bash_profile 133 | ''' 134 | 135 | commands += f''' 136 | su -c "/home/ec2-user/.local/bin/pip3 install kafka-python==2.0.2 mimesis==4.1.3 --user" -s /bin/sh ec2-user 137 | cp {USER_DATA_LOCAL_PATH} /home/ec2-user/gen_fake_kafka_data.py & chown -R ec2-user /home/ec2-user/gen_fake_kafka_data.py 138 | ''' 139 | 140 | msk_client_ec2_instance.user_data.add_commands(commands) 141 | 142 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstancePublicDNS', 143 | value=msk_client_ec2_instance.instance_public_dns_name, 144 | export_name=f'{self.stack_name}-EC2InstancePublicDNS') 145 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceId', 146 | value=msk_client_ec2_instance.instance_id, 147 | export_name=f'{self.stack_name}-EC2InstanceId') 148 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceAZ', 149 | 
value=msk_client_ec2_instance.instance_availability_zone, 150 | export_name=f'{self.stack_name}-EC2InstanceAZ') 151 | 152 | -------------------------------------------------------------------------------- /from-msk-serverless/cdk_stacks/kafka_client_ec2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_iam, 15 | aws_s3_assets 16 | ) 17 | from constructs import Construct 18 | 19 | random.seed(37) 20 | 21 | 22 | class KafkaClientEC2InstanceStack(Stack): 23 | 24 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, msk_cluster_name, **kwargs) -> None: 25 | super().__init__(scope, construct_id, **kwargs) 26 | 27 | KAFKA_CLIENT_EC2_SG_NAME = 'kafka-client-ec2-sg-{}'.format(''.join(random.choices((string.ascii_lowercase), k=5))) 28 | sg_kafka_client_ec2_instance = aws_ec2.SecurityGroup(self, 'KafkaClientEC2InstanceSG', 29 | vpc=vpc, 30 | allow_all_outbound=True, 31 | description='security group for Kafka Client EC2 Instance', 32 | security_group_name=KAFKA_CLIENT_EC2_SG_NAME 33 | ) 34 | cdk.Tags.of(sg_kafka_client_ec2_instance).add('Name', KAFKA_CLIENT_EC2_SG_NAME) 35 | sg_kafka_client_ec2_instance.add_ingress_rule(peer=aws_ec2.Peer.ipv4("0.0.0.0/0"), 36 | connection=aws_ec2.Port.tcp(22)) 37 | 38 | #XXX: For more information, see https://docs.aws.amazon.com/msk/latest/developerguide/create-iam-role.html 39 | kafka_client_iam_policy = aws_iam.Policy(self, 'KafkaClientIAMPolicy', 40 | statements=[ 41 | aws_iam.PolicyStatement(**{ 42 | "effect": aws_iam.Effect.ALLOW, 43 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/{msk_cluster_name}/*" ], 44 | "actions": [ 45 | "kafka-cluster:Connect", 46 | "kafka-cluster:AlterCluster", 47 | "kafka-cluster:DescribeCluster" 48 | ] 49 | }), 50 | aws_iam.PolicyStatement(**{ 51 | "effect": aws_iam.Effect.ALLOW, 52 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/{msk_cluster_name}/*" ], 53 | "actions": [ 54 | "kafka-cluster:*Topic*", 55 | "kafka-cluster:WriteData", 56 | "kafka-cluster:ReadData" 57 | ] 58 | }), 59 | aws_iam.PolicyStatement(**{ 60 | "effect": aws_iam.Effect.ALLOW, 61 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/{msk_cluster_name}/*" ], 62 | "actions": [ 63 | "kafka-cluster:AlterGroup", 64 | "kafka-cluster:DescribeGroup" 65 | ] 66 | }) 67 | ] 68 | ) 69 | kafka_client_iam_policy.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 70 | 71 | kafka_client_ec2_instance_role = aws_iam.Role(self, 'KafkaClientEC2InstanceRole', 72 | role_name=f'KafkaClientEC2InstanceRole-{self.stack_name}', 73 | assumed_by=aws_iam.ServicePrincipal('ec2.amazonaws.com'), 74 | managed_policies=[ 75 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMManagedInstanceCore') 76 | ] 77 | ) 78 | 79 | kafka_client_iam_policy.attach_to_role(kafka_client_ec2_instance_role) 80 | 81 | amzn_linux = aws_ec2.MachineImage.latest_amazon_linux( 82 | generation=aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2, 83 | edition=aws_ec2.AmazonLinuxEdition.STANDARD, 84 | virtualization=aws_ec2.AmazonLinuxVirt.HVM, 85 | storage=aws_ec2.AmazonLinuxStorage.GENERAL_PURPOSE, 86 | cpu_type=aws_ec2.AmazonLinuxCpuType.X86_64 87 | ) 88 | 89 | msk_client_ec2_instance = aws_ec2.Instance(self, 
'KafkaClientEC2Instance', 90 | instance_type=aws_ec2.InstanceType.of(instance_class=aws_ec2.InstanceClass.BURSTABLE2, 91 | instance_size=aws_ec2.InstanceSize.MICRO), 92 | machine_image=amzn_linux, 93 | vpc=vpc, 94 | availability_zone=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).availability_zones[0], 95 | instance_name=f'KafkaClientInstance-{self.stack_name}', 96 | role=kafka_client_ec2_instance_role, 97 | security_group=sg_kafka_client_ec2_instance, 98 | vpc_subnets=aws_ec2.SubnetSelection(subnet_type=aws_ec2.SubnetType.PUBLIC) 99 | ) 100 | msk_client_ec2_instance.add_security_group(sg_msk_client) 101 | 102 | # test data generator script in S3 as Asset 103 | user_data_asset = aws_s3_assets.Asset(self, 'KafkaClientEC2UserData', 104 | path=os.path.join(os.path.dirname(__file__), '../src/main/python/gen_fake_data.py')) 105 | user_data_asset.grant_read(msk_client_ec2_instance.role) 106 | 107 | USER_DATA_LOCAL_PATH = msk_client_ec2_instance.user_data.add_s3_download_command( 108 | bucket=user_data_asset.bucket, 109 | bucket_key=user_data_asset.s3_object_key 110 | ) 111 | 112 | commands = ''' 113 | yum update -y 114 | yum install python3.7 -y 115 | yum install java-11 -y 116 | yum install -y jq 117 | 118 | mkdir -p /home/ec2-user/opt 119 | cd /home/ec2-user/opt 120 | wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz 121 | tar -xzf kafka_2.12-2.8.1.tgz 122 | ln -nsf kafka_2.12-2.8.1 kafka 123 | 124 | cd /home/ec2-user/opt/kafka/libs 125 | wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.1/aws-msk-iam-auth-1.1.1-all.jar 126 | 127 | chown -R ec2-user /home/ec2-user/opt 128 | chgrp -R ec2-user /home/ec2-user/opt 129 | 130 | cd /home/ec2-user 131 | wget https://bootstrap.pypa.io/get-pip.py 132 | su -c "python3.7 get-pip.py --user" -s /bin/sh ec2-user 133 | su -c "/home/ec2-user/.local/bin/pip3 install boto3 --user" -s /bin/sh ec2-user 134 | 135 | cat < msk_serverless_client.properties 136 | security.protocol=SASL_SSL 137 | sasl.mechanism=AWS_MSK_IAM 138 | sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required; 139 | sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler 140 | EOF 141 | 142 | ln -nsf msk_serverless_client.properties client.properties 143 | chown -R ec2-user /home/ec2-user/msk_serverless_client.properties 144 | chown -R ec2-user /home/ec2-user/client.properties 145 | 146 | echo 'export PATH=$HOME/opt/kafka/bin:$PATH' >> .bash_profile 147 | ''' 148 | 149 | commands += f''' 150 | su -c "/home/ec2-user/.local/bin/pip3 install mimesis==4.1.3 --user" -s /bin/sh ec2-user 151 | cp {USER_DATA_LOCAL_PATH} /home/ec2-user/gen_fake_data.py & chown -R ec2-user /home/ec2-user/gen_fake_data.py 152 | ''' 153 | 154 | msk_client_ec2_instance.user_data.add_commands(commands) 155 | 156 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstancePublicDNS', 157 | value=msk_client_ec2_instance.instance_public_dns_name, 158 | export_name=f'{self.stack_name}-EC2InstancePublicDNS') 159 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceId', 160 | value=msk_client_ec2_instance.instance_id, 161 | export_name=f'{self.stack_name}-EC2InstanceId') 162 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceAZ', 163 | value=msk_client_ec2_instance.instance_availability_zone, 164 | export_name=f'{self.stack_name}-EC2InstanceAZ') 165 | 166 | -------------------------------------------------------------------------------- /from-kinesis/README.md: 
-------------------------------------------------------------------------------- 1 | 2 | # Amazon Redshift Streaming Ingestion from Kinesis Data Streams CDK Python project! 3 | 4 | ![redshift_streaming_ingestion_from_kinesis_data_streams](./redshift_streaming_from_kds.svg) 5 | 6 | This is an Amazon Redshift Streaming Ingestion from Kinesis Data Streams project for CDK development with Python. 7 | 8 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 9 | 10 | This project is set up like a standard Python project. The initialization 11 | process also creates a virtualenv within this project, stored under the `.venv` 12 | directory. To create the virtualenv it assumes that there is a `python3` 13 | (or `python` for Windows) executable in your path with access to the `venv` 14 | package. If for any reason the automatic creation of the virtualenv fails, 15 | you can create the virtualenv manually. 16 | 17 | To manually create a virtualenv on MacOS and Linux: 18 | 19 | ``` 20 | $ python3 -m venv .venv 21 | ``` 22 | 23 | After the init process completes and the virtualenv is created, you can use the following 24 | step to activate your virtualenv. 25 | 26 | ``` 27 | $ source .venv/bin/activate 28 | ``` 29 | 30 | If you are a Windows platform, you would activate the virtualenv like this: 31 | 32 | ``` 33 | % .venv\Scripts\activate.bat 34 | ``` 35 | 36 | Once the virtualenv is activated, you can install the required dependencies. 37 | 38 | ``` 39 | $ pip install -r requirements.txt 40 | ``` 41 | 42 | :information_source: Before you deploy this project, you should create an AWS Secret for your Redshift Serverless Admin user. You can create an AWS Secret like this: 43 | 44 |
 45 | $ aws secretsmanager create-secret \
 46 |     --name "your_redshift_secret_name" \
 47 |     --description "(Optional) description of the secret" \
 48 |     --secret-string '{"admin_username": "admin", "admin_user_password": "password_of_at_least_8_characters"}'
 49 | 
50 | 51 | At this point you can now synthesize the CloudFormation template for this code. 52 | 53 |
 54 | (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 55 | (.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
 56 | (.venv) $ cdk synth \
 57 |               -c aws_secret_name='your_redshift_secret_name'
 58 | 
59 | 60 | Use `cdk deploy` command to create the stack shown above. 61 | 62 |
 63 | (.venv) $ cdk deploy \
 64 |               -c aws_secret_name='your_redshift_secret_name'
 65 | 
66 | 67 | To add additional dependencies, for example other CDK libraries, just add 68 | them to your `setup.py` file and rerun the `pip install -r requirements.txt` 69 | command. 70 | 71 | ## Clean Up 72 | 73 | Delete the CloudFormation stack by running the below command. 74 | 75 |
 76 | (.venv) $ cdk destroy --force \
 77 |               -c aws_secret_name='your_redshift_secret_name'
 78 | 
79 | 80 | ## Useful commands 81 | 82 | * `cdk ls` list all stacks in the app 83 | * `cdk synth` emits the synthesized CloudFormation template 84 | * `cdk deploy` deploy this stack to your default AWS account/region 85 | * `cdk diff` compare deployed stack with current state 86 | * `cdk docs` open CDK documentation 87 | 88 | Enjoy! 89 | 90 | ## Run Test 91 | 92 | #### Amazon Redshift setup 93 | 94 | These steps show you how to configure the materialized view to ingest data. 95 | 96 | 1. Connect to the Redshift query editor v2 97 | 98 | ![redshift-query-editor-v2-connection](./redshift-query-editor-v2-connection.png) 99 | 100 | 2. Create an external schema to map the data from Kinesis to a Redshift object. 101 |
102 |    CREATE EXTERNAL SCHEMA evdata FROM KINESIS
103 |    IAM_ROLE 'arn:aws:iam::{AWS-ACCOUNT-ID}:role/RedshiftStreamingRole';
104 |    
105 | For information about how to configure the IAM role, see [Getting started with streaming ingestion from Amazon Kinesis Data Streams](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html). 106 | 107 | 3. Create a materialized view to consume the stream data. 108 | 109 | Note that Kinesis stream names are case-sensitive and can contain both uppercase and lowercase letters. To use case-sensitive identifiers, you can set the configuration setting `enable_case_sensitive_identifier` to true at either the session or cluster level. 110 |
111 |    -- To create and use case sensitive identifiers
112 |    SET enable_case_sensitive_identifier TO true;
113 | 
114 |    -- To check if enable_case_sensitive_identifier is turned on
115 |    SHOW enable_case_sensitive_identifier;
116 |    
117 | 118 | The following example defines a materialized view with JSON source data.
119 | Create the materialized view so it’s distributed on the UUID value from the stream and is sorted by the `refresh_time` value. The `refresh_time` is the start time of the materialized view refresh that loaded the record. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream. 120 |
121 |    CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(6) sortkey(1) AUTO REFRESH YES AS
122 |     SELECT refresh_time,
123 |        approximate_arrival_timestamp,
124 |        partition_key,
125 |        shard_id,
126 |        sequence_number,
127 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'_id',true)::character(36) as ID,
128 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'clusterID',true)::varchar(30) as clusterID,
129 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'connectionTime',true)::varchar(20) as connectionTime,
130 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'kWhDelivered',true)::DECIMAL(10,2) as kWhDelivered,
131 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'stationID',true)::INTEGER as stationID,
132 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'spaceID',true)::varchar(100) as spaceID,
133 |        json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'timezone',true)::varchar(30)as timezone,
134 |        json_extract_path_text(from_varbyte(kinesis_data,'utf-8'),'userID',true)::varchar(30) as userID
135 |     FROM evdata."ev_stream_data"
136 |     WHERE LENGTH(kinesis_data) < 65355;
137 |    
138 | The code above filters out records larger than **65355** bytes because `json_extract_path_text` is limited to the varchar data type. Define the materialized view so that there aren’t any type conversion errors. 139 | 140 | 4. Refreshing materialized views for streaming ingestion 141 | 142 | The materialized view is auto-refreshed as long as there is new data on the KDS stream (a test-data producer sketch for this stream appears after the query examples below). You can also disable auto-refresh, run a manual refresh, or schedule a manual refresh using the Redshift Console UI.
143 | To update the data in a materialized view, you can use the `REFRESH MATERIALIZED VIEW` statement at any time. 144 |
145 |    REFRESH MATERIALIZED VIEW ev_station_data_extract;
146 |    
147 | 148 | #### Query the stream 149 | 150 | 1. Query data in the materialized view. 151 |
152 |    SELECT *
153 |    FROM ev_station_data_extract;
154 |    
155 | 2. Query the refreshed materialized view to get usage statistics. 156 |
157 |    SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
158 |       ,SUM(kWhDelivered) AS Energy_Consumed
159 |       ,count(distinct userID) AS #Users
160 |    FROM ev_station_data_extract
161 |    GROUP BY to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
162 |    ORDER BY 1 DESC;
163 |    
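
#### Generate test data (optional)

The steps above assume records are already arriving on the `ev_stream_data` Kinesis data stream. The repository ships `src/main/python/gen_fake_kinesis_stream_data.py` for that purpose; the block below is only a minimal, hypothetical sketch of an equivalent boto3 producer, with field names copied from the materialized view definition above (the actual script's options and output format are not reproduced here).

```python
#!/usr/bin/env python3
# Hypothetical test-data producer sketch; the repository's
# src/main/python/gen_fake_kinesis_stream_data.py serves the same purpose.
# The keys below mirror the columns extracted by ev_station_data_extract.
import json
import random
import time
import uuid
from datetime import datetime, timezone

import boto3

STREAM_NAME = 'ev_stream_data'  # the stream queried as evdata."ev_stream_data"


def fake_record():
    return {
        '_id': str(uuid.uuid4()),
        'clusterID': f'cluster-{random.randint(1, 5)}',
        'connectionTime': datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S'),
        'kWhDelivered': round(random.uniform(1.0, 50.0), 2),
        'stationID': random.randint(1, 100),
        'spaceID': f'space-{random.randint(1, 20)}',
        'timezone': 'America/Los_Angeles',
        'userID': f'user-{random.randint(1, 1000)}',
    }


if __name__ == '__main__':
    kinesis = boto3.client('kinesis')
    while True:
        record = fake_record()
        # One JSON document per Kinesis record; the partition key spreads
        # records across shards.
        kinesis.put_record(StreamName=STREAM_NAME,
                           Data=json.dumps(record).encode('utf-8'),
                           PartitionKey=record['_id'])
        time.sleep(0.5)
```

Once records are flowing, the auto-refreshed materialized view above should start returning rows.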
164 | 165 | 166 | ## References 167 | 168 | * [Amazon Redshift - Getting started with streaming ingestion from Amazon Kinesis Data Streams](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html) 169 | * [Amazon Redshift - Electric vehicle station-data streaming ingestion tutorial, using Kinesis](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-example-station-data.html) 170 | * [Amazon Redshift Configuration Reference - enable_case_sensitive_identifier](https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html) 171 | * [Real-time analytics with Amazon Redshift streaming ingestion (2022-04-27)](https://aws.amazon.com/ko/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/) 172 | ![amazon-redshift-streaming-ingestion](https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/04/14/BDB-2193-image001.png) 173 | * [Mimesis: Fake Data Generator](https://mimesis.name/en/latest/index.html) - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages 174 | 175 | -------------------------------------------------------------------------------- /from-msk-serverless/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Amazon Redshift Streaming Ingestion from Amazon MSK Serverless CDK Python project! 3 | 4 | ![redshift_streaming_from_msk_serverless](./redshift_streaming_from_msk_serverless.svg) 5 | 6 | This is an Amazon Redshift Streaming Ingestion from MSK Serverless project for CDK development with Python. 7 | 8 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 9 | 10 | This project is set up like a standard Python project. The initialization 11 | process also creates a virtualenv within this project, stored under the `.venv` 12 | directory. To create the virtualenv it assumes that there is a `python3` 13 | (or `python` for Windows) executable in your path with access to the `venv` 14 | package. If for any reason the automatic creation of the virtualenv fails, 15 | you can create the virtualenv manually. 16 | 17 | To manually create a virtualenv on MacOS and Linux: 18 | 19 | ``` 20 | $ python3 -m venv .venv 21 | ``` 22 | 23 | After the init process completes and the virtualenv is created, you can use the following 24 | step to activate your virtualenv. 25 | 26 | ``` 27 | $ source .venv/bin/activate 28 | ``` 29 | 30 | If you are a Windows platform, you would activate the virtualenv like this: 31 | 32 | ``` 33 | % .venv\Scripts\activate.bat 34 | ``` 35 | 36 | Once the virtualenv is activated, you can install the required dependencies. 37 | 38 | ``` 39 | $ pip install -r requirements.txt 40 | ``` 41 | 42 | At this point you can now synthesize the CloudFormation template for this code. 43 | 44 |
 45 | (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 46 | (.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
 47 | (.venv) $ cdk synth \
 48 |               -c aws_secret_name='your_redshift_secret_name'
 49 | 
50 | 51 | :information_source: Before you deploy this project, you should create an AWS Secret for your Redshift Serverless Admin user. You can create an AWS Secret like this: 52 | 53 |
 54 | $ aws secretsmanager create-secret \
 55 |     --name "your_redshift_secret_name" \
 56 |     --description "(Optional) description of the secret" \
 57 |     --secret-string '{"admin_username": "admin", "admin_user_password": "password_of_at_least_8_characters"}'
 58 | 
59 | 60 | Use `cdk deploy` command to create the stack shown above. 61 | 62 |
 63 | (.venv) $ cdk deploy \
 64 |               -c aws_secret_name='your_redshift_secret_name'
 65 | 
66 | 67 | To add additional dependencies, for example other CDK libraries, just add 68 | them to your `setup.py` file and rerun the `pip install -r requirements.txt` 69 | command. 70 | 71 | ## Clean Up 72 | 73 | Delete the CloudFormation stack by running the below command. 74 | 75 |
 76 | (.venv) $ cdk destroy --force \
 77 |               -c aws_secret_name='your_redshift_secret_name'
 78 | 
79 | 80 | ## Run Test 81 | 82 | #### Amazon MSK Serverless setup 83 | 84 | After the MSK Serverless cluster is successfully created, you can create a topic and produce messages to it, as in the following example. 85 | 86 | 1. Get cluster information 87 |
 88 |    $ export MSK_SERVERLESS_CLUSTER_ARN=$(aws kafka list-clusters-v2 | jq -r '.ClusterInfoList[] | select(.ClusterName == "your-msk-cluster-name") | .ClusterArn')
 89 |    $ aws kafka describe-cluster-v2 --cluster-arn $MSK_SERVERLESS_CLUSTER_ARN
 90 |    {
 91 |      "ClusterInfo": {
 92 |        "ClusterType": "SERVERLESS",
 93 |        "ClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/your-msk-cluster-name/813876e5-2023-4882-88c4-58ad8599da5a-s2",
 94 |        "ClusterName": "your-msk-cluster-name",
 95 |        "CreationTime": "2022-12-16T02:26:31.369000+00:00",
 96 |        "CurrentVersion": "K2EUQ1WTGCTBG2",
 97 |        "State": "ACTIVE",
 98 |        "Tags": {},
 99 |        "Serverless": {
100 |          "VpcConfigs": [
101 |            {
102 |              "SubnetIds": [
103 |                "subnet-0113628395a293b98",
104 |                "subnet-090240f6a94a4b5aa",
105 |                "subnet-036e818e577297ddc"
106 |              ],
107 |              "SecurityGroupIds": [
108 |                "sg-0bd8f5ce976b51769",
109 |                "sg-0869c9987c033aaf1"
110 |              ]
111 |            }
112 |          ],
113 |          "ClientAuthentication": {
114 |            "Sasl": {
115 |              "Iam": {
116 |                "Enabled": true
117 |              }
118 |            }
119 |          }
120 |        }
121 |      }
122 |    }
123 |    
124 | 125 | 2. Get bootstrap brokers 126 | 127 |
128 |    $ aws kafka get-bootstrap-brokers --cluster-arn $MSK_SERVERLESS_CLUSTER_ARN
129 |    {
130 |      "BootstrapBrokerStringSaslIam": "boot-deligu0c.c1.kafka-serverless.{region}.amazonaws.com:9098"
131 |    }
132 |    
133 | 134 | 3. Connect to the MSK client EC2 host. 135 | 136 | You can connect to an EC2 instance using the EC2 Instance Connect CLI.
137 | Install the `ec2instanceconnectcli` Python package and use the **mssh** command with the instance ID as follows. 138 |
139 |    $ sudo pip install ec2instanceconnectcli
140 |    $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE
141 |    
142 | 143 | 4. Create an Apache Kafka topic 144 | After connecting to your EC2 host, use the client machine to create a topic on the cluster. 145 | Run the following command to create a topic called `ev_stream_data`. 146 |
147 |    [ec2-user@ip-172-31-0-180 ~]$ export PATH=$HOME/opt/kafka/bin:$PATH
148 |    [ec2-user@ip-172-31-0-180 ~]$ export BS={BootstrapBrokerString}
149 |    [ec2-user@ip-172-31-0-180 ~]$ kafka-topics.sh --bootstrap-server $BS --command-config client.properties --create --topic ev_stream_data --partitions 3 --replication-factor 2
150 |    
151 | 152 | `client.properties` is a property file containing configs to be passed to the Admin Client. It is used only with the `--bootstrap-server` option for describing and altering broker configs.
153 | For more information, see [Getting started using MSK Serverless clusters - Step 3: Create a client machine](https://docs.aws.amazon.com/msk/latest/developerguide/create-serverless-cluster-client.html) 154 |
155 |    [ec2-user@ip-172-31-0-180 ~]$ cat client.properties
156 |    security.protocol=SASL_SSL
157 |    sasl.mechanism=AWS_MSK_IAM
158 |    sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
159 |    sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
160 |    
161 | 162 | 5. Produce and consume data 163 | 164 | **(1) To produce messages** 165 | 166 | Run the following command to generate messages into the topic on the cluster (a sketch of the JSON records this script emits appears after the query examples below). 167 | 168 |
169 |    [ec2-user@ip-172-31-0-180 ~]$ python3 gen_fake_data.py | kafka-console-producer.sh --bootstrap-server $BS --producer.config client.properties --topic ev_stream_data
170 |    
171 | 172 | **(2) To consume messages** 173 | 174 | Keep the connection to the client machine open, and then open a second, separate connection to that machine in a new window. 175 | 176 |
177 |    [ec2-user@ip-172-31-0-180 ~]$ kafka-console-consumer.sh --bootstrap-server $BS --consumer.config client.properties --topic ev_stream_data --from-beginning
178 |    
179 | 180 | You start seeing the messages you entered earlier when you used the console producer command. 181 | Enter more messages in the producer window, and watch them appear in the consumer window. 182 | 183 | #### Amazon Redshift setup 184 | 185 | These steps show you how to configure the materialized view to ingest data. 186 | 187 | 1. Connect to the Redshift query editor v2 188 | 189 | ![redshift-query-editor-v2-connection](./redshift-query-editor-v2-connection.png) 190 | 191 | 2. Create an external schema to map the data from MSK to a Redshift object. 192 |
193 |    CREATE EXTERNAL SCHEMA evdata
194 |    FROM MSK
195 |    IAM_ROLE 'arn:aws:iam::{AWS-ACCOUNT-ID}:role/RedshiftStreamingRole'
196 |    AUTHENTICATION iam
197 |    CLUSTER_ARN 'arn:aws:kafka:{region}:{AWS-ACCOUNT-ID}:cluster/{cluster-name}/b5555d61-dbd3-4dab-b9d4-67921490c072-3';
198 |    
199 | For information about how to configure the IAM role, see [Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html). 200 | 201 | 3. Create a materialized view to consume the stream data. 202 | 203 | Note that MSK cluster names are case-sensitive and can contain both uppercase and lowercase letters. To use case-sensitive identifiers, you can set the configuration setting `enable_case_sensitive_identifier` to true at either the session or cluster level. 204 |
205 |    -- To create and use case sensitive identifiers
206 |    SET enable_case_sensitive_identifier TO true;
207 | 
208 |    -- To check if enable_case_sensitive_identifier is turned on
209 |    SHOW enable_case_sensitive_identifier;
210 |    
211 | 212 | The following example defines a materialized view with JSON source data.
213 | Create the materialized view so it’s distributed on the UUID value from the stream and is sorted by the `refresh_time` value. The `refresh_time` is the start time of the materialized view refresh that loaded the record. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream. 214 |
215 |    CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(6) sortkey(1) AUTO REFRESH YES AS
216 |      SELECT refresh_time,
217 |        kafka_timestamp_type,
218 |        kafka_timestamp,
219 |        kafka_key,
220 |        kafka_partition,
221 |        kafka_offset,
222 |        kafka_headers,
223 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'_id',true)::character(36) as ID,
224 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'clusterID',true)::varchar(30) as clusterID,
225 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'connectionTime',true)::varchar(20) as connectionTime,
226 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'kWhDelivered',true)::DECIMAL(10,2) as kWhDelivered,
227 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'stationID',true)::INTEGER as stationID,
228 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'spaceID',true)::varchar(100) as spaceID,
229 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'timezone',true)::varchar(30)as timezone,
230 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'userID',true)::varchar(30) as userID
231 |      FROM evdata.ev_stream_data
232 |      WHERE LENGTH(kafka_value) < 65355 AND CAN_JSON_PARSE(kafka_value);
233 |    
234 | The code above filters out records larger than **65355** bytes because `json_extract_path_text` is limited to the varchar data type. Define the materialized view so that there aren’t any type conversion errors. 235 | 236 | 4. Refreshing materialized views for streaming ingestion 237 | 238 | The materialized view is auto-refreshed as long as there is new data on the MSK stream. You can also disable auto-refresh, run a manual refresh, or schedule a manual refresh using the Redshift Console UI.
239 | To update the data in a materialized view, you can use the `REFRESH MATERIALIZED VIEW` statement at any time. 240 |
241 |    REFRESH MATERIALIZED VIEW ev_station_data_extract;
242 |    
243 | 244 | #### Query the stream 245 | 246 | 1. Query data in the materialized view. 247 |
248 |    SELECT *
249 |    FROM ev_station_data_extract;
250 |    
251 | 2. Query the refreshed materialized view to get usage statistics. 252 |
253 |    SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
254 |       ,SUM(kWhDelivered) AS Energy_Consumed
255 |       ,count(distinct userID) AS #Users
256 |    FROM ev_station_data_extract
257 |    GROUP BY to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
258 |    ORDER BY 1 DESC;
259 |    
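
#### About the test records

The produce step above pipes the output of `gen_fake_data.py` into `kafka-console-producer.sh`, one JSON document per line. The script's exact output is not reproduced here; the following is only a minimal, hypothetical sketch of such a generator, with keys chosen to match the columns extracted by the materialized view above.

```python
#!/usr/bin/env python3
# Hypothetical JSON-lines generator sketch; the repository's
# src/main/python/gen_fake_data.py plays this role. Each printed line is
# treated by kafka-console-producer.sh as one message on ev_stream_data.
import json
import random
import sys
import time
import uuid
from datetime import datetime, timezone


def fake_record():
    return {
        '_id': str(uuid.uuid4()),
        'clusterID': f'cluster-{random.randint(1, 5)}',
        'connectionTime': datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S'),
        'kWhDelivered': round(random.uniform(1.0, 50.0), 2),
        'stationID': random.randint(1, 100),
        'spaceID': f'space-{random.randint(1, 20)}',
        'timezone': 'America/Los_Angeles',
        'userID': f'user-{random.randint(1, 1000)}',
    }


if __name__ == '__main__':
    try:
        while True:
            # One JSON document per line on stdout, ready for piping.
            print(json.dumps(fake_record()), flush=True)
            time.sleep(0.5)
    except (BrokenPipeError, KeyboardInterrupt):
        sys.exit(0)
```

Piping its stdout into `kafka-console-producer.sh --producer.config client.properties`, as shown in the produce step, puts one message per line onto the topic.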
260 | 261 | ## Useful commands 262 | 263 | * `cdk ls` list all stacks in the app 264 | * `cdk synth` emits the synthesized CloudFormation template 265 | * `cdk deploy` deploy this stack to your default AWS account/region 266 | * `cdk diff` compare deployed stack with current state 267 | * `cdk docs` open CDK documentation 268 | 269 | Enjoy! 270 | 271 | ## References 272 | 273 | * [Getting started using MSK Serverless clusters](https://docs.aws.amazon.com/msk/latest/developerguide/serverless-getting-started.html) 274 | * [Configuration for MSK Serverless clusters](https://docs.aws.amazon.com/msk/latest/developerguide/serverless-config.html) 275 | * [Amazon Redshift - Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html) 276 | * [Amazon Redshift - Electric vehicle station-data streaming ingestion tutorial, using Kinesis](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-example-station-data.html) 277 | * [Amazon Redshift Configuration Reference - enable_case_sensitive_identifier](https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html) 278 | * [Real-time analytics with Amazon Redshift streaming ingestion (2022-04-27)](https://aws.amazon.com/ko/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/) 279 | ![amazon-redshift-streaming-ingestion](https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/04/14/BDB-2193-image001.png) 280 | * [Connect using the EC2 Instance Connect CLI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html#ec2-instance-connect-connecting-ec2-cli) 281 |
282 |    $ sudo pip install ec2instanceconnectcli
283 |    $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE # ec2-instance-id
284 |    
285 | * [ec2instanceconnectcli](https://pypi.org/project/ec2instanceconnectcli/): This Python CLI package handles publishing keys through EC2 Instance Connectand using them to connect to EC2 instances. 286 | * [Mimesis: Fake Data Generator](https://mimesis.name/en/latest/index.html) - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages 287 | 288 | -------------------------------------------------------------------------------- /from-msk/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Amazon Redshift Streaming Ingestion from Amazon Managed Streaming for Apache Kafka (MSK) CDK Python project! 3 | 4 | ![redshift_streaming_from_msk](./redshift_streaming_from_msk.svg) 5 | 6 | This is an Amazon Redshift Streaming Ingestion from MSK project for CDK development with Python. 7 | 8 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 9 | 10 | This project is set up like a standard Python project. The initialization 11 | process also creates a virtualenv within this project, stored under the `.venv` 12 | directory. To create the virtualenv it assumes that there is a `python3` 13 | (or `python` for Windows) executable in your path with access to the `venv` 14 | package. If for any reason the automatic creation of the virtualenv fails, 15 | you can create the virtualenv manually. 16 | 17 | To manually create a virtualenv on MacOS and Linux: 18 | 19 | ``` 20 | $ python3 -m venv .venv 21 | ``` 22 | 23 | After the init process completes and the virtualenv is created, you can use the following 24 | step to activate your virtualenv. 25 | 26 | ``` 27 | $ source .venv/bin/activate 28 | ``` 29 | 30 | If you are a Windows platform, you would activate the virtualenv like this: 31 | 32 | ``` 33 | % .venv\Scripts\activate.bat 34 | ``` 35 | 36 | Once the virtualenv is activated, you can install the required dependencies. 37 | 38 | ``` 39 | $ pip install -r requirements.txt 40 | ``` 41 | 42 | At this point you can now synthesize the CloudFormation template for this code. 43 | 44 |
 45 | (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 46 | (.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
 47 | (.venv) $ cdk synth \
 48 |               -c aws_secret_name='your_redshift_secret_name'
 49 | 
50 | 51 | :information_source: Before you deploy this project, you should create an AWS Secret for your Redshift Serverless Admin user. You can create an AWS Secret like this: 52 | 53 |
 54 | $ aws secretsmanager create-secret \
 55 |     --name "your_redshift_secret_name" \
 56 |     --description "(Optional) description of the secret" \
 57 |     --secret-string '{"admin_username": "admin", "admin_user_password": "password_of_at_least_8_characters"}'
 58 | 
59 | 60 | Use `cdk deploy` command to create the stack shown above. 61 | 62 |
 63 | (.venv) $ cdk deploy \
 64 |               -c aws_secret_name='your_redshift_secret_name'
 65 | 
66 | 67 | To add additional dependencies, for example other CDK libraries, just add 68 | them to your `setup.py` file and rerun the `pip install -r requirements.txt` 69 | command. 70 | 71 | ## Clean Up 72 | 73 | Delete the CloudFormation stack by running the below command. 74 | 75 |
 76 | (.venv) $ cdk destroy --force \
 77 |               -c aws_secret_name='your_redshift_secret_name'
 78 | 
79 | 80 | ## Run Test 81 | 82 | #### Amazon MSK setup 83 | 84 | After the MSK cluster is successfully created, you can create a topic and produce messages to it, as in the following example. 85 | 86 | 1. Get cluster information 87 |
 88 |    $ export MSK_CLUSTER_ARN=$(aws kafka list-clusters-v2 | jq -r '.ClusterInfoList[] | select(.ClusterName == "your-msk-cluster-name") | .ClusterArn')
 89 |    $ aws kafka describe-cluster-v2 --cluster-arn $MSK_CLUSTER_ARN
 90 |    {
 91 |     "ClusterInfo": {
 92 |      "ClusterType": "PROVISIONED",
 93 |      "ClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/{msk-cluster-name}/b5555d61-dbd3-4dab-b9d4-67921490c072-3",
 94 |      "ClusterName": "{msk-cluster-name}",
 95 |      "CreationTime": "2022-12-20T12:26:29.328000+00:00",
 96 |      "CurrentVersion": "K3AEGXETSR30VB",
 97 |      "State": "ACTIVE",
 98 |      "Tags": {},
 99 |      "Provisioned": {
100 |       "BrokerNodeGroupInfo": {
101 |        "BrokerAZDistribution": "DEFAULT",
102 |        "ClientSubnets": [
103 |         "subnet-036e818e577297ddc",
104 |         "subnet-0113628395a293b98",
105 |         "subnet-090240f6a94a4b5aa"
106 |        ],
107 |        "InstanceType": "kafka.m5.large",
108 |        "SecurityGroups": [
109 |         "sg-05c2c5594681e3910",
110 |         "sg-066ae826faa6a4bb6"
111 |        ],
112 |        "StorageInfo": {
113 |         "EbsStorageInfo": {
114 |          "VolumeSize": 100
115 |         }
116 |        },
117 |        "ConnectivityInfo": {
118 |         "PublicAccess": {
119 |          "Type": "DISABLED"
120 |         }
121 |        }
122 |       },
123 |       "CurrentBrokerSoftwareInfo": {
124 |        "KafkaVersion": "2.8.1"
125 |       },
126 |       "EncryptionInfo": {
127 |        "EncryptionAtRest": {
128 |         "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:123456789012:key/cfd8f515-305d-4e6b-9c53-90869065347e"
129 |        },
130 |        "EncryptionInTransit": {
131 |         "ClientBroker": "TLS_PLAINTEXT",
132 |         "InCluster": true
133 |        }
134 |       },
135 |       "EnhancedMonitoring": "PER_TOPIC_PER_BROKER",
136 |       "OpenMonitoring": {
137 |        "Prometheus": {
138 |         "JmxExporter": {
139 |          "EnabledInBroker": false
140 |         },
141 |         "NodeExporter": {
142 |          "EnabledInBroker": false
143 |         }
144 |        }
145 |       },
146 |       "NumberOfBrokerNodes": 3,
147 |       "ZookeeperConnectString": "z-1.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2181,z-3.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2181,z-2.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2181",
148 |       "ZookeeperConnectStringTls": "z-1.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2182,z-3.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2182,z-2.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:2182",
149 |       "StorageMode": "LOCAL"
150 |      }
151 |     }
152 |    }
153 |    
154 | 155 | 2. Get bootstrap brokers 156 | 157 |
158 |    $ aws kafka get-bootstrap-brokers --cluster-arn $MSK_CLUSTER_ARN
159 |    {
160 |     "BootstrapBrokerString": "b-2.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9092,b-1.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9092,b-3.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9092",
161 |     "BootstrapBrokerStringTls": "b-2.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9094,b-1.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9094,b-3.{msk-cluster-name}.0o776j.c3.kafka.us-east-1.amazonaws.com:9094"
162 |    }
163 |    
164 | 165 | 3. Connect to the MSK client EC2 host. 166 | 167 | You can connect to an EC2 instance using the EC2 Instance Connect CLI.
168 | Install the `ec2instanceconnectcli` Python package and use the **mssh** command with the instance ID as follows. 169 |
170 |    $ sudo pip install ec2instanceconnectcli
171 |    $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE
172 |    
173 | 174 | 4. Create an Apache Kafka topic 175 | After connecting to your EC2 host, use the client machine to create a topic on the cluster. 176 | Run the following command to create a topic called `ev_stream_data`. 177 |
178 |    [ec2-user@ip-172-31-0-180 ~]$ export PATH=$HOME/opt/kafka/bin:$PATH
179 |    [ec2-user@ip-172-31-0-180 ~]$ export BS={BootstrapBrokerString}
180 |    [ec2-user@ip-172-31-0-180 ~]$ kafka-topics.sh --create --bootstrap-server $BS --topic ev_stream_data --partitions 3 --replication-factor 2
181 |    
182 | 183 | 5. Produce and consume data 184 | 185 | **(1) To produce messages** 186 | 187 | Run the following command to generate messages into the topic on the cluster (a Python producer sketch appears after the query examples below). 188 | 189 |
190 |    [ec2-user@ip-172-31-0-180 ~]$ python3 gen_fake_kafka_data.py --bootstrap-servers $BS --topic ev_stream_data
191 |    
192 | 193 | **(2) To consume messages** 194 | 195 | Keep the connection to the client machine open, and then open a second, separate connection to that machine in a new window. 196 | 197 |
198 |    [ec2-user@ip-172-31-0-180 ~]$ kafka-console-consumer.sh --bootstrap-server $BS --topic ev_stream_data --from-beginning
199 |    
200 | 201 | You start seeing the messages you entered earlier when you used the console producer command. 202 | Enter more messages in the producer window, and watch them appear in the consumer window. 203 | 204 | #### Amazon Redshift setup 205 | 206 | These steps show you how to configure the materialized view to ingest data. 207 | 208 | 1. Connect to the Redshift query editor v2 209 | 210 | ![redshift-query-editor-v2-connection](./redshift-query-editor-v2-connection.png) 211 | 212 | 2. Create an external schema to map the data from MSK to a Redshift object. 213 |
214 |    CREATE EXTERNAL SCHEMA evdata
215 |    FROM MSK
216 |    IAM_ROLE 'arn:aws:iam::{AWS-ACCOUNT-ID}:role/RedshiftStreamingRole'
217 |    AUTHENTICATION none
218 |    CLUSTER_ARN 'arn:aws:kafka:{region}:{AWS-ACCOUNT-ID}:cluster/{cluster-name}/b5555d61-dbd3-4dab-b9d4-67921490c072-3';
219 |    
220 | For information about how to configure the IAM role, see [Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html). 221 | 222 | 3. Create a materialized view to consume the stream data. 223 | 224 | Note that MSK cluster names are case-sensitive and can contain both uppercase and lowercase letters. To use case-sensitive identifiers, you can set the configuration setting `enable_case_sensitive_identifier` to true at either the session or cluster level. 225 |
226 |    -- To create and use case sensitive identifiers
227 |    SET enable_case_sensitive_identifier TO true;
228 | 
229 |    -- To check if enable_case_sensitive_identifier is turned on
230 |    SHOW enable_case_sensitive_identifier;
231 |    
232 | 233 | The following example defines a materialized view with JSON source data.
234 | Create the materialized view so it’s distributed on the UUID value from the stream and is sorted by the `refresh_time` value. The `refresh_time` is the start time of the materialized view refresh that loaded the record. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream. 235 |
236 |    CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(6) sortkey(1) AUTO REFRESH YES AS
237 |      SELECT refresh_time,
238 |        kafka_timestamp_type,
239 |        kafka_timestamp,
240 |        kafka_key,
241 |        kafka_partition,
242 |        kafka_offset,
243 |        kafka_headers,
244 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'_id',true)::character(36) as ID,
245 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'clusterID',true)::varchar(30) as clusterID,
246 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'connectionTime',true)::varchar(20) as connectionTime,
247 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'kWhDelivered',true)::DECIMAL(10,2) as kWhDelivered,
248 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'stationID',true)::INTEGER as stationID,
249 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'spaceID',true)::varchar(100) as spaceID,
250 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'timezone',true)::varchar(30)as timezone,
251 |        json_extract_path_text(from_varbyte(kafka_value,'utf-8'),'userID',true)::varchar(30) as userID
252 |      FROM evdata.ev_stream_data
253 |      WHERE LENGTH(kafka_value) < 65355 AND CAN_JSON_PARSE(kafka_value);
254 |    
255 | The code above filters out records larger than **65355** bytes because `json_extract_path_text` is limited to the varchar data type. Define the materialized view so that there aren’t any type conversion errors. 256 | 257 | 4. Refreshing materialized views for streaming ingestion 258 | 259 | The materialized view is auto-refreshed as long as there is new data on the MSK stream. You can also disable auto-refresh, run a manual refresh, or schedule a manual refresh using the Redshift Console UI.
260 | To update the data in a materialized view, you can use the `REFRESH MATERIALIZED VIEW` statement at any time. 261 |
262 |    REFRESH MATERIALIZED VIEW ev_station_data_extract;
263 |    
264 | 265 | #### Query the stream 266 | 267 | 1. Query data in the materialized view. 268 |
269 |    SELECT *
270 |    FROM ev_station_data_extract;
271 |    
272 | 2. Query the refreshed materialized view to get usage statistics. 273 |
274 |    SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
275 |       ,SUM(kWhDelivered) AS Energy_Consumed
276 |       ,count(distinct userID) AS #Users
277 |    FROM ev_station_data_extract
278 |    GROUP BY to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
279 |    ORDER BY 1 DESC;
280 |    
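
#### Test-data producer sketch

The produce step above runs `gen_fake_kafka_data.py --bootstrap-servers $BS --topic ev_stream_data`. That script is not reproduced here; the block below is only a minimal, hypothetical sketch of such a producer using `kafka-python` (listed in `requirements-dev.txt` and installed on the client EC2 instance), with field names taken from the materialized view definition above.

```python
#!/usr/bin/env python3
# Hypothetical kafka-python producer sketch; the repository's
# src/main/python/gen_fake_kafka_data.py plays this role.
import argparse
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python==2.0.2


def fake_record():
    return {
        '_id': str(uuid.uuid4()),
        'clusterID': f'cluster-{random.randint(1, 5)}',
        'connectionTime': datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S'),
        'kWhDelivered': round(random.uniform(1.0, 50.0), 2),
        'stationID': random.randint(1, 100),
        'spaceID': f'space-{random.randint(1, 20)}',
        'timezone': 'America/Los_Angeles',
        'userID': f'user-{random.randint(1, 1000)}',
    }


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--bootstrap-servers', required=True,
                        help='comma-separated plaintext broker endpoints (port 9092)')
    parser.add_argument('--topic', default='ev_stream_data')
    args = parser.parse_args()

    # This stack's cluster allows TLS_PLAINTEXT, so a plain connection on
    # port 9092 works from inside the VPC.
    producer = KafkaProducer(bootstrap_servers=args.bootstrap_servers.split(','),
                             value_serializer=lambda v: json.dumps(v).encode('utf-8'))
    while True:
        producer.send(args.topic, fake_record())
        producer.flush()
        time.sleep(0.5)
```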
281 | 282 | ## Useful commands 283 | 284 | * `cdk ls` list all stacks in the app 285 | * `cdk synth` emits the synthesized CloudFormation template 286 | * `cdk deploy` deploy this stack to your default AWS account/region 287 | * `cdk diff` compare deployed stack with current state 288 | * `cdk docs` open CDK documentation 289 | 290 | Enjoy! 291 | 292 | ## References 293 | 294 | * [Amazon Redshift - Getting started with streaming ingestion from Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html) 295 | * [Amazon Redshift - Electric vehicle station-data streaming ingestion tutorial, using Kinesis](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-example-station-data.html) 296 | * [Amazon Redshift Configuration Reference - enable_case_sensitive_identifier](https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html) 297 | * [Real-time analytics with Amazon Redshift streaming ingestion (2022-04-27)](https://aws.amazon.com/ko/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/) 298 | ![amazon-redshift-streaming-ingestion](https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/04/14/BDB-2193-image001.png) 299 | * [Connect using the EC2 Instance Connect CLI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html#ec2-instance-connect-connecting-ec2-cli) 300 |
301 |    $ sudo pip install ec2instanceconnectcli
302 |    $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE # ec2-instance-id
303 |    
304 | * [ec2instanceconnectcli](https://pypi.org/project/ec2instanceconnectcli/): This Python CLI package handles publishing keys through EC2 Instance Connectand using them to connect to EC2 instances. 305 | * [Mimesis: Fake Data Generator](https://mimesis.name/en/latest/index.html) - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages 306 | -------------------------------------------------------------------------------- /from-kinesis/redshift_streaming_from_kds.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[diagram labels: AWS Cloud / VPC / Private subnet / Kinesis Data Streams / Redshift Serverless / devices]
-------------------------------------------------------------------------------- /from-msk-serverless/redshift_streaming_from_msk_serverless.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[diagram labels: AWS Cloud / VPC / Private subnet / MSK Serverless / Redshift Serverless / devices]
-------------------------------------------------------------------------------- /from-msk/redshift_streaming_from_msk.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[diagram labels: AWS Cloud / VPC / Private subnet / MSK (Provisioned) / Redshift Serverless / devices]
-------------------------------------------------------------------------------- /redshift-streaming-ingestion-arch.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[diagram labels: devices / Amazon Kinesis Data Streams / Amazon Managed Streaming for Apache Kafka / Amazon MSK Serverless / Streaming Table / Real-time Materialized View / Permanent Tables / Amazon Redshift]
--------------------------------------------------------------------------------