├── msk-to-iceberg ├── requirements-dev.txt ├── requirements.txt ├── assets │ ├── iceberg-table.png │ ├── iceberg-data-level-01.png │ ├── iceberg-data-level-02.png │ └── iceberg-data-level-03.png ├── cdk_stacks │ ├── __init__.py │ ├── glue_catalog_database.py │ ├── s3.py │ ├── vpc.py │ ├── lakeformation_permissions.py │ ├── glue_msk_connection.py │ ├── glue_streaming_job.py │ ├── glue_job_role.py │ ├── msk.py │ └── kafka_client_ec2.py ├── source.bat ├── cdk.json ├── src │ ├── utils │ │ └── gen_fake_data.py │ └── main │ │ └── python │ │ ├── spark_dataframe_insert_iceberg_from_kafka.py │ │ ├── spark_sql_insert_overwrite_iceberg_from_kafka.py │ │ └── spark_sql_merge_into_iceberg_from_kafka.py ├── app.py ├── README.md └── glue-streaming-data-from-kafka-to-iceberg-table.svg ├── msk-serverless-to-iceberg ├── requirements-dev.txt ├── requirements.txt ├── assets │ ├── iceberg-table.png │ ├── iceberg-data-level-01.png │ ├── iceberg-data-level-02.png │ └── iceberg-data-level-03.png ├── cdk_stacks │ ├── __init__.py │ ├── glue_catalog_database.py │ ├── s3.py │ ├── vpc.py │ ├── lakeformation_permissions.py │ ├── glue_msk_connection.py │ ├── msk_serverless.py │ ├── glue_streaming_job.py │ ├── glue_job_role.py │ └── kafka_client_ec2.py ├── source.bat ├── cdk.json ├── src │ ├── utils │ │ └── gen_fake_data.py │ └── main │ │ └── python │ │ ├── spark_dataframe_insert_iceberg_from_msk_serverless.py │ │ ├── spark_sql_insert_overwrite_iceberg_from_msk_serverless.py │ │ └── spark_sql_merge_into_iceberg_from_msk_serverless.py ├── app.py ├── README.md └── glue-streaming-data-from-msk-serverless-to-iceberg-table.svg ├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md └── CONTRIBUTING.md /msk-to-iceberg/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | mimesis==4.1.3 # The last to support Python 3.6 and 3.7 2 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/requirements-dev.txt: -------------------------------------------------------------------------------- 1 | mimesis==4.1.3 # The last to support Python 3.6 and 3.7 2 | -------------------------------------------------------------------------------- /msk-to-iceberg/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.61.1 2 | constructs>=10.0.0,<11.0.0 3 | 4 | boto3==1.26.55 -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.61.1 2 | constructs>=10.0.0,<11.0.0 3 | 4 | boto3==1.26.55 -------------------------------------------------------------------------------- /msk-to-iceberg/assets/iceberg-table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-to-iceberg/assets/iceberg-table.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.swp 3 | package-lock.json 4 | __pycache__ 5 | .pytest_cache 6 | .venv 7 | *.egg-info 8 | 9 | # CDK asset staging directory 10 | .cdk.staging 11 | cdk.out 12 | -------------------------------------------------------------------------------- 
/msk-to-iceberg/assets/iceberg-data-level-01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-to-iceberg/assets/iceberg-data-level-01.png -------------------------------------------------------------------------------- /msk-to-iceberg/assets/iceberg-data-level-02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-to-iceberg/assets/iceberg-data-level-02.png -------------------------------------------------------------------------------- /msk-to-iceberg/assets/iceberg-data-level-03.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-to-iceberg/assets/iceberg-data-level-03.png -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/assets/iceberg-table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-serverless-to-iceberg/assets/iceberg-table.png -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/assets/iceberg-data-level-01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-serverless-to-iceberg/assets/iceberg-data-level-01.png -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/assets/iceberg-data-level-02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-serverless-to-iceberg/assets/iceberg-data-level-02.png -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/assets/iceberg-data-level-03.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-glue-streaming-ingestion-from-kafka-to-apache-iceberg/HEAD/msk-serverless-to-iceberg/assets/iceberg-data-level-03.png -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 
5 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/__init__.py: -------------------------------------------------------------------------------- 1 | from .vpc import VpcStack 2 | from .kafka_client_ec2 import KafkaClientEC2InstanceStack 3 | from .msk import MskStack 4 | from .glue_job_role import GlueJobRoleStack 5 | from .glue_msk_connection import GlueMSKConnectionStack 6 | from .glue_catalog_database import GlueCatalogDatabaseStack 7 | from .glue_streaming_job import GlueStreamingJobStack 8 | from .lakeformation_permissions import DataLakePermissionsStack 9 | from .s3 import S3BucketStack 10 | -------------------------------------------------------------------------------- /msk-to-iceberg/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/__init__.py: -------------------------------------------------------------------------------- 1 | from .vpc import VpcStack 2 | from .kafka_client_ec2 import KafkaClientEC2InstanceStack 3 | from .msk_serverless import MskServerlessStack 4 | from .glue_job_role import GlueJobRoleStack 5 | from .glue_msk_connection import GlueMSKConnectionStack 6 | from .glue_catalog_database import GlueCatalogDatabaseStack 7 | from .glue_streaming_job import GlueStreamingJobStack 8 | from .lakeformation_permissions import DataLakePermissionsStack 9 | from .s3 import S3BucketStack 10 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 
8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/glue_catalog_database.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_glue 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class GlueCatalogDatabaseStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_kinesis_table = self.node.try_get_context('glue_job_input_arguments') 20 | database_name = glue_kinesis_table['--database_name'] 21 | 22 | cfn_database = aws_glue.CfnDatabase(self, "GlueCfnDatabase", 23 | catalog_id=cdk.Aws.ACCOUNT_ID, 24 | database_input=aws_glue.CfnDatabase.DatabaseInputProperty( 25 | name=database_name 26 | ) 27 | ) 28 | cfn_database.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 29 | 30 | cdk.CfnOutput(self, f'{self.stack_name}_GlueDatabaseName', 31 | value=cfn_database.database_input.name) 32 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/glue_catalog_database.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_glue 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class GlueCatalogDatabaseStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_kinesis_table = self.node.try_get_context('glue_job_input_arguments') 20 | database_name = glue_kinesis_table['--database_name'] 21 | 22 | cfn_database = aws_glue.CfnDatabase(self, "GlueCfnDatabase", 23 | catalog_id=cdk.Aws.ACCOUNT_ID, 24 | database_input=aws_glue.CfnDatabase.DatabaseInputProperty( 25 | name=database_name 26 | ) 27 | ) 28 | cfn_database.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 29 | 30 | cdk.CfnOutput(self, f'{self.stack_name}_GlueDatabaseName', 31 | value=cfn_database.database_input.name) 32 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/s3.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | from urllib.parse import urlparse 6 | 7 | import aws_cdk as cdk 8 | 9 | from aws_cdk import ( 10 | Stack, 11 | aws_s3 as s3 12 | ) 13 | 14 | from constructs import Construct 15 | 16 | 17 | class S3BucketStack(Stack): 18 | 19 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 20 | super().__init__(scope, 
construct_id, **kwargs) 21 | 22 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 23 | s3_path = glue_job_input_arguments["--iceberg_s3_path"] 24 | s3_bucket_name = urlparse(s3_path).netloc 25 | 26 | s3_bucket = s3.Bucket(self, "s3bucket", 27 | removal_policy=cdk.RemovalPolicy.DESTROY, #XXX: Default: cdk.RemovalPolicy.RETAIN - The bucket will be orphaned 28 | bucket_name=s3_bucket_name) 29 | 30 | self.s3_bucket_name = s3_bucket.bucket_name 31 | 32 | cdk.CfnOutput(self, f'{self.stack_name}_S3Bucket', value=self.s3_bucket_name) 33 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/s3.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | from urllib.parse import urlparse 6 | 7 | import aws_cdk as cdk 8 | 9 | from aws_cdk import ( 10 | Stack, 11 | aws_s3 as s3 12 | ) 13 | 14 | from constructs import Construct 15 | 16 | 17 | class S3BucketStack(Stack): 18 | 19 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 20 | super().__init__(scope, construct_id, **kwargs) 21 | 22 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 23 | s3_path = glue_job_input_arguments["--iceberg_s3_path"] 24 | s3_bucket_name = urlparse(s3_path).netloc 25 | 26 | s3_bucket = s3.Bucket(self, "s3bucket", 27 | removal_policy=cdk.RemovalPolicy.DESTROY, #XXX: Default: cdk.RemovalPolicy.RETAIN - The bucket will be orphaned 28 | bucket_name=s3_bucket_name) 29 | 30 | self.s3_bucket_name = s3_bucket.bucket_name 31 | 32 | cdk.CfnOutput(self, f'{self.stack_name}_S3Bucket', value=self.s3_bucket_name) 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AWS Glue Streaming Ingestion from Kafka to Apache Iceberg 2 | 3 | This is a collection of AWS CDK projects that show how to directly ingest streaming data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) and MSK Serverless into an Apache Iceberg table in Amazon S3 with AWS Glue Streaming. 4 | 5 | | Project | Description | Tags | 6 | |---------|-------------|------| 7 | | [msk-to-iceberg](./msk-to-iceberg/) | ![glue-streaming-ingestion-from-msk-to-iceberg-arch](./msk-to-iceberg/glue-streaming-data-from-kafka-to-iceberg-table.svg) | AWS Glue Streaming, Amazon Managed Streaming for Apache Kafka (MSK), S3, Apache Iceberg | 8 | | [msk-serverless-to-iceberg](./msk-serverless-to-iceberg/) | ![glue-streaming-ingestion-from-msk-serverless-to-iceberg-arch](./msk-serverless-to-iceberg/glue-streaming-data-from-msk-serverless-to-iceberg-table.svg) | AWS Glue Streaming, MSK Serverless, S3, Apache Iceberg | 9 | 10 | Enjoy! 11 | 12 | ## Security 13 | 14 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 15 | 16 | ## License 17 | 18 | This library is licensed under the MIT-0 License. See the LICENSE file. 
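Both projects provision a Kafka client EC2 instance and ship `src/utils/gen_fake_data.py`, which prints tab-separated `key\tjson` records with `name`, `age`, and `m_time` fields for the source topic. As a rough illustration of what lands on that topic, here is a minimal producer sketch using the `kafka-python` package; the package, the topic name, and the plaintext bootstrap endpoint are assumptions for illustration only (a real MSK or MSK Serverless cluster typically requires TLS and/or IAM authentication), so follow each project's README for the actual ingestion steps.

```python
# Hedged sketch only: kafka-python is NOT a dependency of these projects,
# and the topic name and bootstrap endpoint are made-up placeholders.
import json
from datetime import datetime

from kafka import KafkaProducer  # pip install kafka-python (assumption)

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],  # placeholder endpoint
    key_serializer=str.encode,             # keys are plain strings
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Same record shape that gen_fake_data.py prints as "<name>\t<json>".
record = {
    "name": "Arica",
    "age": 33,
    "m_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
}

# The record key mirrors the partition key printed by gen_fake_data.py.
producer.send("example-topic", key=record["name"], value=record)
producer.flush()
```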
19 | 20 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true, 37 | "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true, 38 | "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true, 39 | "@aws-cdk/aws-route53-patters:useCertificate": true, 40 | "@aws-cdk/customresources:installLatestAwsSdkDefault": false 41 | } 42 | } 43 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true, 37 | "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true, 38 | "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true, 39 | "@aws-cdk/aws-route53-patters:useCertificate": true, 40 | "@aws-cdk/customresources:installLatestAwsSdkDefault": false 41 | } 42 | } 43 | 
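Note that the two `cdk.json` files above only carry CDK feature flags, while the stacks resolve their parameters at synth time with calls such as `self.node.try_get_context('glue_job_input_arguments')`, `try_get_context('glue_job_name')`, and `try_get_context('msk_cluster_name')`. Those values are expected to be supplied as additional `context` entries or with `cdk deploy -c key=value`. The sketch below only illustrates how such a lookup resolves; its keys and values are purely hypothetical placeholders (the real ones are documented in each project's README).

```python
# Hedged sketch: illustrates the context lookup pattern the stacks rely on.
# The keys/values below are placeholders, not this repository's actual config.
import aws_cdk as cdk

app = cdk.App(context={
    "glue_job_input_arguments": {
        "--database_name": "iceberg_demo_db",                # placeholder
        "--iceberg_s3_path": "s3://example-bucket/iceberg",  # placeholder
    }
})

# Same lookup pattern used inside the stacks via self.node.try_get_context(...)
job_args = app.node.try_get_context("glue_job_input_arguments") or {}
print(job_args["--database_name"])    # -> iceberg_demo_db
print(job_args["--iceberg_s3_path"])  # -> s3://example-bucket/iceberg
```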
-------------------------------------------------------------------------------- /msk-to-iceberg/src/utils/gen_fake_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import sys 6 | import argparse 7 | from datetime import datetime 8 | import json 9 | import random 10 | import time 11 | 12 | import mimesis 13 | 14 | # Mimesis 5.0 supports Python 3.8, 3.9, and 3.10. 15 | # The Mimesis 4.1.3 is the last to support Python 3.6 and 3.7 16 | # For more information, see https://mimesis.name/en/latest/changelog.html#version-5-0-0 17 | assert mimesis.__version__ == '4.1.3' 18 | 19 | from mimesis import locales 20 | from mimesis.schema import Field, Schema 21 | 22 | random.seed(47) 23 | 24 | 25 | def main(): 26 | parser = argparse.ArgumentParser() 27 | 28 | parser.add_argument('--max-count', default=10, type=int, help='The max number of records to put.(default: 10)') 29 | 30 | options = parser.parse_args() 31 | 32 | _CURRENT_YEAR = datetime.now().year 33 | _NAMES = 'Arica,Burton,Cory,Fernando,Gonzalo,Kenton,Linsey,Micheal,Ricky,Takisha'.split(',') 34 | 35 | #XXX: For more information about synthetic data schema, see 36 | # https://github.com/aws-samples/aws-glue-streaming-etl-blog/blob/master/config/generate_data.py 37 | _ = Field(locale=locales.EN) 38 | 39 | _schema = Schema(schema=lambda: { 40 | # "name": _("first_name"), 41 | "name": _("choice", items=_NAMES), 42 | "age": _("age"), 43 | "m_time": _("formatted_datetime", fmt="%Y-%m-%d %H:%M:%S", start=_CURRENT_YEAR, end=_CURRENT_YEAR) 44 | }) 45 | 46 | cnt = 0 47 | for record in _schema.create(options.max_count): 48 | cnt += 1 49 | partition_key = record['name'] 50 | print(f"{partition_key}\t{json.dumps(record)}") 51 | time.sleep(random.choices([0.01, 0.03, 0.05, 0.07, 0.1])[-1]) 52 | 53 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 54 | 55 | 56 | if __name__ == '__main__': 57 | main() 58 | 59 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/src/utils/gen_fake_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import sys 6 | import argparse 7 | from datetime import datetime 8 | import json 9 | import random 10 | import time 11 | 12 | import mimesis 13 | 14 | # Mimesis 5.0 supports Python 3.8, 3.9, and 3.10. 
15 | # The Mimesis 4.1.3 is the last to support Python 3.6 and 3.7 16 | # For more information, see https://mimesis.name/en/latest/changelog.html#version-5-0-0 17 | assert mimesis.__version__ == '4.1.3' 18 | 19 | from mimesis import locales 20 | from mimesis.schema import Field, Schema 21 | 22 | random.seed(47) 23 | 24 | 25 | def main(): 26 | parser = argparse.ArgumentParser() 27 | 28 | parser.add_argument('--max-count', default=10, type=int, help='The max number of records to put.(default: 10)') 29 | 30 | options = parser.parse_args() 31 | 32 | _CURRENT_YEAR = datetime.now().year 33 | _NAMES = 'Arica,Burton,Cory,Fernando,Gonzalo,Kenton,Linsey,Micheal,Ricky,Takisha'.split(',') 34 | 35 | #XXX: For more information about synthetic data schema, see 36 | # https://github.com/aws-samples/aws-glue-streaming-etl-blog/blob/master/config/generate_data.py 37 | _ = Field(locale=locales.EN) 38 | 39 | _schema = Schema(schema=lambda: { 40 | # "name": _("first_name"), 41 | "name": _("choice", items=_NAMES), 42 | "age": _("age"), 43 | "m_time": _("formatted_datetime", fmt="%Y-%m-%d %H:%M:%S", start=_CURRENT_YEAR, end=_CURRENT_YEAR) 44 | }) 45 | 46 | cnt = 0 47 | for record in _schema.create(options.max_count): 48 | cnt += 1 49 | partition_key = record['name'] 50 | print(f"{partition_key}\t{json.dumps(record)}") 51 | time.sleep(random.choices([0.01, 0.03, 0.05, 0.07, 0.1])[-1]) 52 | 53 | print(f'[INFO] {cnt} records are processed', file=sys.stderr) 54 | 55 | 56 | if __name__ == '__main__': 57 | main() 58 | 59 | -------------------------------------------------------------------------------- /msk-to-iceberg/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os 3 | 4 | import aws_cdk as cdk 5 | 6 | from cdk_stacks import ( 7 | VpcStack, 8 | MskStack, 9 | KafkaClientEC2InstanceStack, 10 | GlueJobRoleStack, 11 | GlueMSKConnectionStack, 12 | GlueCatalogDatabaseStack, 13 | GlueStreamingJobStack, 14 | DataLakePermissionsStack, 15 | S3BucketStack 16 | ) 17 | 18 | 19 | APP_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), 20 | region=os.getenv('CDK_DEFAULT_REGION')) 21 | 22 | app = cdk.App() 23 | 24 | s3_bucket = S3BucketStack(app, 'KafkaToIcebergS3Path') 25 | 26 | vpc_stack = VpcStack(app, 'KafkaToIcebergStackVpc', 27 | env=APP_ENV 28 | ) 29 | vpc_stack.add_dependency(s3_bucket) 30 | 31 | msk_stack = MskStack(app, 'KafkaAsGlueStreamingJobDataSource', 32 | vpc_stack.vpc, 33 | env=APP_ENV 34 | ) 35 | msk_stack.add_dependency(vpc_stack) 36 | 37 | kafka_client_ec2_stack = KafkaClientEC2InstanceStack(app, 'KafkaClientEC2Instance', 38 | vpc_stack.vpc, 39 | msk_stack.sg_msk_client, 40 | msk_stack.msk_cluster_name, 41 | env=APP_ENV 42 | ) 43 | kafka_client_ec2_stack.add_dependency(msk_stack) 44 | 45 | glue_msk_connection = GlueMSKConnectionStack(app, 'GlueMSKConnection', 46 | vpc_stack.vpc, 47 | msk_stack.sg_msk_client, 48 | env=APP_ENV 49 | ) 50 | glue_msk_connection.add_dependency(msk_stack) 51 | 52 | glue_job_role = GlueJobRoleStack(app, 'GlueStreamingMSKtoIcebergJobRole') 53 | glue_job_role.add_dependency(msk_stack) 54 | 55 | glue_database = GlueCatalogDatabaseStack(app, 'GlueIcebergDatabase') 56 | 57 | grant_lake_formation_permissions = DataLakePermissionsStack(app, 'GrantLFPermissionsOnGlueJobRole', 58 | glue_job_role.glue_job_role 59 | ) 60 | grant_lake_formation_permissions.add_dependency(glue_database) 61 | grant_lake_formation_permissions.add_dependency(glue_job_role) 62 | 63 | glue_streaming_job = 
GlueStreamingJobStack(app, 'GlueStreamingJobMSKtoIceberg', 64 | glue_job_role.glue_job_role, 65 | glue_msk_connection.msk_connection_info 66 | ) 67 | glue_streaming_job.add_dependency(grant_lake_formation_permissions) 68 | 69 | app.synth() 70 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os 3 | 4 | import aws_cdk as cdk 5 | 6 | from cdk_stacks import ( 7 | VpcStack, 8 | MskServerlessStack, 9 | KafkaClientEC2InstanceStack, 10 | GlueJobRoleStack, 11 | GlueMSKConnectionStack, 12 | GlueCatalogDatabaseStack, 13 | GlueStreamingJobStack, 14 | DataLakePermissionsStack, 15 | S3BucketStack 16 | ) 17 | 18 | 19 | APP_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), 20 | region=os.getenv('CDK_DEFAULT_REGION')) 21 | 22 | app = cdk.App() 23 | 24 | s3_bucket = S3BucketStack(app, 'MSKServerlessToIcebergS3Path') 25 | 26 | vpc_stack = VpcStack(app, 'MSKServerlessToIcebergStackVpc', 27 | env=APP_ENV 28 | ) 29 | vpc_stack.add_dependency(s3_bucket) 30 | 31 | msk_stack = MskServerlessStack(app, 'MSKServerlessAsGlueStreamingJobDataSource', 32 | vpc_stack.vpc, 33 | env=APP_ENV 34 | ) 35 | msk_stack.add_dependency(s3_bucket) 36 | 37 | kafka_client_ec2_stack = KafkaClientEC2InstanceStack(app, 'MSKServerlessClientEC2Instance', 38 | vpc_stack.vpc, 39 | msk_stack.sg_msk_client, 40 | msk_stack.msk_cluster_name, 41 | env=APP_ENV 42 | ) 43 | kafka_client_ec2_stack.add_dependency(msk_stack) 44 | 45 | glue_msk_connection = GlueMSKConnectionStack(app, 'GlueMSKServerlessConnection', 46 | vpc_stack.vpc, 47 | msk_stack.sg_msk_client, 48 | env=APP_ENV 49 | ) 50 | glue_msk_connection.add_dependency(msk_stack) 51 | 52 | glue_job_role = GlueJobRoleStack(app, 'GlueStreamingMSKServerlessToIcebergJobRole', 53 | msk_stack.msk_cluster_name, 54 | ) 55 | glue_job_role.add_dependency(msk_stack) 56 | 57 | glue_database = GlueCatalogDatabaseStack(app, 'GlueIcebergDatabase') 58 | 59 | grant_lake_formation_permissions = DataLakePermissionsStack(app, 'GrantLFPermissionsOnGlueJobRole', 60 | glue_job_role.glue_job_role 61 | ) 62 | grant_lake_formation_permissions.add_dependency(glue_database) 63 | grant_lake_formation_permissions.add_dependency(glue_job_role) 64 | 65 | glue_streaming_job = GlueStreamingJobStack(app, 'GlueStreamingJobMSKServerlessToIceberg', 66 | glue_job_role.glue_job_role, 67 | glue_msk_connection.msk_connection_info 68 | ) 69 | glue_streaming_job.add_dependency(grant_lake_formation_permissions) 70 | 71 | app.synth() 72 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/vpc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import aws_cdk as cdk 7 | 8 | from aws_cdk import ( 9 | Stack, 10 | aws_ec2, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class VpcStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | #XXX: For creating this CDK Stack in the existing VPC, 21 | # remove comments from the below codes and 22 | # comments out vpc = aws_ec2.Vpc(..) 
codes, 23 | # then pass -c vpc_name=your-existing-vpc to cdk command 24 | # for example, 25 | # cdk -c vpc_name=your-existing-vpc syth 26 | # 27 | if str(os.environ.get('USE_DEFAULT_VPC', 'false')).lower() == 'true': 28 | vpc_name = self.node.try_get_context("vpc_name") or "default" 29 | self.vpc = aws_ec2.Vpc.from_lookup(self, "MSKVpc", 30 | is_default=True, 31 | vpc_name=vpc_name) 32 | else: 33 | #XXX: To use more than 2 AZs, be sure to specify the account and region on your stack. 34 | #XXX: https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_ec2/Vpc.html 35 | self.vpc = aws_ec2.Vpc(self, "MSKServerlessVpc", 36 | ip_addresses=aws_ec2.IpAddresses.cidr("10.0.0.0/21"), 37 | max_azs=3, 38 | 39 | # 'subnetConfiguration' specifies the "subnet groups" to create. 40 | # Every subnet group will have a subnet for each AZ, so this 41 | # configuration will create `2 groups × 3 AZs = 6` subnets. 42 | subnet_configuration=[ 43 | { 44 | "cidrMask": 24, 45 | "name": "Public", 46 | "subnetType": aws_ec2.SubnetType.PUBLIC, 47 | }, 48 | { 49 | "cidrMask": 24, 50 | "name": "Private", 51 | "subnetType": aws_ec2.SubnetType.PRIVATE_WITH_EGRESS 52 | } 53 | ], 54 | gateway_endpoints={ 55 | "S3": aws_ec2.GatewayVpcEndpointOptions( 56 | service=aws_ec2.GatewayVpcEndpointAwsService.S3 57 | ) 58 | } 59 | ) 60 | 61 | 62 | #XXX: The Name field of every Export member must be specified and 63 | # consist only of alphanumeric characters,colons, or hyphens. 64 | cdk.CfnOutput(self, 'VPCID', value=self.vpc.vpc_id, 65 | export_name=f'{self.stack_name}-VPCID') 66 | 67 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/vpc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import aws_cdk as cdk 7 | 8 | from aws_cdk import ( 9 | Stack, 10 | aws_ec2, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class VpcStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | #XXX: For creating this CDK Stack in the existing VPC, 21 | # remove comments from the below codes and 22 | # comments out vpc = aws_ec2.Vpc(..) codes, 23 | # then pass -c vpc_name=your-existing-vpc to cdk command 24 | # for example, 25 | # cdk -c vpc_name=your-existing-vpc syth 26 | # 27 | if str(os.environ.get('USE_DEFAULT_VPC', 'false')).lower() == 'true': 28 | vpc_name = self.node.try_get_context("vpc_name") or "default" 29 | self.vpc = aws_ec2.Vpc.from_lookup(self, "MSKServerlessVpc", 30 | is_default=True, 31 | vpc_name=vpc_name) 32 | else: 33 | #XXX: To use more than 2 AZs, be sure to specify the account and region on your stack. 34 | #XXX: https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_ec2/Vpc.html 35 | self.vpc = aws_ec2.Vpc(self, "MSKServerlessVpc", 36 | ip_addresses=aws_ec2.IpAddresses.cidr("10.0.0.0/21"), 37 | max_azs=3, 38 | 39 | # 'subnetConfiguration' specifies the "subnet groups" to create. 40 | # Every subnet group will have a subnet for each AZ, so this 41 | # configuration will create `2 groups × 3 AZs = 6` subnets. 
42 | subnet_configuration=[ 43 | { 44 | "cidrMask": 24, 45 | "name": "Public", 46 | "subnetType": aws_ec2.SubnetType.PUBLIC, 47 | }, 48 | { 49 | "cidrMask": 24, 50 | "name": "Private", 51 | "subnetType": aws_ec2.SubnetType.PRIVATE_WITH_EGRESS 52 | } 53 | ], 54 | gateway_endpoints={ 55 | "S3": aws_ec2.GatewayVpcEndpointOptions( 56 | service=aws_ec2.GatewayVpcEndpointAwsService.S3 57 | ) 58 | } 59 | ) 60 | 61 | 62 | #XXX: The Name field of every Export member must be specified and 63 | # consist only of alphanumeric characters,colons, or hyphens. 64 | cdk.CfnOutput(self, 'VPCID', value=self.vpc.vpc_id, 65 | export_name=f'{self.stack_name}-VPCID') 66 | 67 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/lakeformation_permissions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_lakeformation 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class DataLakePermissionsStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, glue_job_role, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 20 | database_name = glue_job_input_arguments["--database_name"] 21 | 22 | #XXXX: The role assumed by cdk is not a data lake administrator. 23 | # So, deploying PrincipalPermissions meets the error such as: 24 | # "Resource does not exist or requester is not authorized to access requested permissions." 25 | # In order to solve the error, it is necessary to promote the cdk execution role to the data lake administrator. 
26 | # For example, https://github.com/aws-samples/data-lake-as-code/blob/mainline/lib/stacks/datalake-stack.ts#L68 27 | cfn_data_lake_settings = aws_lakeformation.CfnDataLakeSettings(self, "CfnDataLakeSettings", 28 | admins=[aws_lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty( 29 | data_lake_principal_identifier=cdk.Fn.sub(self.synthesizer.cloud_formation_execution_role_arn) 30 | )] 31 | ) 32 | 33 | cfn_principal_permissions = aws_lakeformation.CfnPrincipalPermissions(self, "CfnPrincipalPermissions", 34 | permissions=["SELECT", "INSERT", "DELETE", "DESCRIBE", "ALTER"], 35 | permissions_with_grant_option=[], 36 | principal=aws_lakeformation.CfnPrincipalPermissions.DataLakePrincipalProperty( 37 | data_lake_principal_identifier=glue_job_role.role_arn 38 | ), 39 | resource=aws_lakeformation.CfnPrincipalPermissions.ResourceProperty( 40 | #XXX: Can't specify a TableWithColumns resource and a Table resource 41 | table=aws_lakeformation.CfnPrincipalPermissions.TableResourceProperty( 42 | catalog_id=cdk.Aws.ACCOUNT_ID, 43 | database_name=database_name, 44 | # name="ALL_TABLES", 45 | table_wildcard={} 46 | ) 47 | ) 48 | ) 49 | cfn_principal_permissions.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 50 | 51 | #XXX: In order to keep resource destruction order, 52 | # set dependency between CfnDataLakeSettings and CfnPrincipalPermissions 53 | cfn_principal_permissions.add_dependency(cfn_data_lake_settings) 54 | 55 | cdk.CfnOutput(self, f'{self.stack_name}_Principal', 56 | value=cfn_principal_permissions.attr_principal_identifier) 57 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/lakeformation_permissions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_lakeformation 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class DataLakePermissionsStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, glue_job_role, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 20 | database_name = glue_job_input_arguments["--database_name"] 21 | 22 | #XXXX: The role assumed by cdk is not a data lake administrator. 23 | # So, deploying PrincipalPermissions meets the error such as: 24 | # "Resource does not exist or requester is not authorized to access requested permissions." 25 | # In order to solve the error, it is necessary to promote the cdk execution role to the data lake administrator. 
26 | # For example, https://github.com/aws-samples/data-lake-as-code/blob/mainline/lib/stacks/datalake-stack.ts#L68 27 | cfn_data_lake_settings = aws_lakeformation.CfnDataLakeSettings(self, "CfnDataLakeSettings", 28 | admins=[aws_lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty( 29 | data_lake_principal_identifier=cdk.Fn.sub(self.synthesizer.cloud_formation_execution_role_arn) 30 | )] 31 | ) 32 | 33 | cfn_principal_permissions = aws_lakeformation.CfnPrincipalPermissions(self, "CfnPrincipalPermissions", 34 | permissions=["SELECT", "INSERT", "DELETE", "DESCRIBE", "ALTER"], 35 | permissions_with_grant_option=[], 36 | principal=aws_lakeformation.CfnPrincipalPermissions.DataLakePrincipalProperty( 37 | data_lake_principal_identifier=glue_job_role.role_arn 38 | ), 39 | resource=aws_lakeformation.CfnPrincipalPermissions.ResourceProperty( 40 | #XXX: Can't specify a TableWithColumns resource and a Table resource 41 | table=aws_lakeformation.CfnPrincipalPermissions.TableResourceProperty( 42 | catalog_id=cdk.Aws.ACCOUNT_ID, 43 | database_name=database_name, 44 | # name="ALL_TABLES", 45 | table_wildcard={} 46 | ) 47 | ) 48 | ) 49 | cfn_principal_permissions.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 50 | 51 | #XXX: In order to keep resource destruction order, 52 | # set dependency between CfnDataLakeSettings and CfnPrincipalPermissions 53 | cfn_principal_permissions.add_dependency(cfn_data_lake_settings) 54 | 55 | cdk.CfnOutput(self, f'{self.stack_name}_Principal', 56 | value=cfn_principal_permissions.attr_principal_identifier) 57 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/glue_msk_connection.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import boto3 6 | 7 | import aws_cdk as cdk 8 | 9 | from aws_cdk import ( 10 | Stack, 11 | aws_ec2, 12 | aws_glue, 13 | aws_msk 14 | ) 15 | from constructs import Construct 16 | 17 | 18 | class GlueMSKConnectionStack(Stack): 19 | 20 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, **kwargs) -> None: 21 | super().__init__(scope, construct_id, **kwargs) 22 | 23 | sg_glue_cluster = aws_ec2.SecurityGroup(self, 'GlueClusterSecurityGroup', 24 | vpc=vpc, 25 | allow_all_outbound=True, 26 | description='security group for Amazon Glue Cluster', 27 | security_group_name='glue-cluster-sg' 28 | ) 29 | sg_glue_cluster.add_ingress_rule(peer=sg_glue_cluster, connection=aws_ec2.Port.all_tcp(), 30 | description='inter-communication between glue cluster nodes') 31 | cdk.Tags.of(sg_glue_cluster).add('Name', 'glue-cluster-sg') 32 | 33 | msk_cluster_name = self.node.try_get_context('msk').get('cluster_name') 34 | msk_client = boto3.client('kafka', region_name=vpc.env.region) 35 | response = msk_client.list_clusters(ClusterNameFilter=msk_cluster_name) 36 | msk_cluster_info_list = response['ClusterInfoList'] 37 | if not msk_cluster_info_list: 38 | kafka_bootstrap_servers = "localhost:9094" 39 | else: 40 | msk_cluster_arn = msk_cluster_info_list[0]['ClusterArn'] 41 | msk_brokers = msk_client.get_bootstrap_brokers(ClusterArn=msk_cluster_arn) 42 | kafka_bootstrap_servers = msk_brokers['BootstrapBrokerString'] 43 | assert kafka_bootstrap_servers 44 | 45 | connection_properties = { 46 | "KAFKA_BOOTSTRAP_SERVERS": kafka_bootstrap_servers, 47 | "KAFKA_SSL_ENABLED": "false" 48 | } 49 | 50 | subnet = 
vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnets[0] 51 | 52 | connection_input_property = aws_glue.CfnConnection.ConnectionInputProperty( 53 | connection_type="KAFKA", 54 | connection_properties=connection_properties, 55 | name="msk-connector", 56 | physical_connection_requirements=aws_glue.CfnConnection.PhysicalConnectionRequirementsProperty( 57 | security_group_id_list=[sg_msk_client.security_group_id, sg_glue_cluster.security_group_id], 58 | subnet_id=subnet.subnet_id, 59 | availability_zone=subnet.availability_zone 60 | ) 61 | ) 62 | 63 | msk_connection = aws_glue.CfnConnection(self, 'GlueMSKConnector', 64 | catalog_id=cdk.Aws.ACCOUNT_ID, 65 | connection_input=connection_input_property 66 | ) 67 | 68 | self.msk_connection_info = msk_connection.connection_input 69 | 70 | cdk.CfnOutput(self, f'{self.stack_name}-MSKConnectorName', value=self.msk_connection_info.name) 71 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/glue_msk_connection.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import boto3 6 | 7 | import aws_cdk as cdk 8 | 9 | from aws_cdk import ( 10 | Stack, 11 | aws_ec2, 12 | aws_glue, 13 | aws_msk 14 | ) 15 | from constructs import Construct 16 | 17 | 18 | class GlueMSKConnectionStack(Stack): 19 | 20 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, **kwargs) -> None: 21 | super().__init__(scope, construct_id, **kwargs) 22 | 23 | sg_glue_cluster = aws_ec2.SecurityGroup(self, 'GlueClusterSecurityGroup', 24 | vpc=vpc, 25 | allow_all_outbound=True, 26 | description='security group for Amazon Glue Cluster', 27 | security_group_name='glue-cluster-sg' 28 | ) 29 | sg_glue_cluster.add_ingress_rule(peer=sg_glue_cluster, connection=aws_ec2.Port.all_tcp(), 30 | description='inter-communication between glue cluster nodes') 31 | cdk.Tags.of(sg_glue_cluster).add('Name', 'glue-cluster-sg') 32 | 33 | msk_cluster_name = self.node.try_get_context('msk_cluster_name') 34 | msk_client = boto3.client('kafka', region_name=vpc.env.region) 35 | response = msk_client.list_clusters_v2(ClusterNameFilter=msk_cluster_name) 36 | msk_cluster_info_list = response['ClusterInfoList'] 37 | if not msk_cluster_info_list: 38 | kafka_bootstrap_servers = "localhost:9094" 39 | else: 40 | msk_cluster_arn = msk_cluster_info_list[0]['ClusterArn'] 41 | msk_brokers = msk_client.get_bootstrap_brokers(ClusterArn=msk_cluster_arn) 42 | kafka_bootstrap_servers = msk_brokers['BootstrapBrokerStringSaslIam'] 43 | assert kafka_bootstrap_servers 44 | 45 | connection_properties = { 46 | "KAFKA_BOOTSTRAP_SERVERS": kafka_bootstrap_servers, 47 | "KAFKA_SSL_ENABLED": "false" 48 | } 49 | 50 | subnet = vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnets[0] 51 | 52 | connection_input_property = aws_glue.CfnConnection.ConnectionInputProperty( 53 | connection_type="KAFKA", 54 | connection_properties=connection_properties, 55 | name="msk-serverless-connector", 56 | physical_connection_requirements=aws_glue.CfnConnection.PhysicalConnectionRequirementsProperty( 57 | security_group_id_list=[sg_msk_client.security_group_id, sg_glue_cluster.security_group_id], 58 | subnet_id=subnet.subnet_id, 59 | availability_zone=subnet.availability_zone 60 | ) 61 | ) 62 | 63 | msk_connection = aws_glue.CfnConnection(self, 'GlueMSKConnector', 64 | 
catalog_id=cdk.Aws.ACCOUNT_ID, 65 | connection_input=connection_input_property 66 | ) 67 | 68 | self.msk_connection_info = msk_connection.connection_input 69 | 70 | cdk.CfnOutput(self, f'{self.stack_name}-MSKConnectorName', value=self.msk_connection_info.name) 71 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/msk_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import random 6 | import string 7 | 8 | import aws_cdk as cdk 9 | 10 | from aws_cdk import ( 11 | Stack, 12 | aws_ec2, 13 | aws_msk 14 | ) 15 | from constructs import Construct 16 | 17 | random.seed(47) 18 | 19 | 20 | class MskServerlessStack(Stack): 21 | 22 | def __init__(self, scope: Construct, construct_id: str, vpc, **kwargs) -> None: 23 | super().__init__(scope, construct_id, **kwargs) 24 | 25 | msk_cluster_name = self.node.try_get_context("msk_cluster_name") 26 | 27 | MSK_CLIENT_SG_NAME = 'msk-client-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 28 | sg_msk_client = aws_ec2.SecurityGroup(self, 'KafkaClientSecurityGroup', 29 | vpc=vpc, 30 | allow_all_outbound=True, 31 | description='security group for Amazon MSK client', 32 | security_group_name=MSK_CLIENT_SG_NAME 33 | ) 34 | cdk.Tags.of(sg_msk_client).add('Name', MSK_CLIENT_SG_NAME) 35 | 36 | MSK_CLUSTER_SG_NAME = 'msk-cluster-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 37 | sg_msk_cluster = aws_ec2.SecurityGroup(self, 'MSKSecurityGroup', 38 | vpc=vpc, 39 | allow_all_outbound=True, 40 | description='security group for Amazon MSK Cluster', 41 | security_group_name=MSK_CLUSTER_SG_NAME 42 | ) 43 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9098), 44 | description='msk client security group') 45 | cdk.Tags.of(sg_msk_cluster).add('Name', MSK_CLUSTER_SG_NAME) 46 | 47 | msk_serverless_cluster = aws_msk.CfnServerlessCluster(self, "MSKServerlessCfnCluster", 48 | #XXX: A serverless cluster must use SASL/IAM authentication 49 | client_authentication=aws_msk.CfnServerlessCluster.ClientAuthenticationProperty( 50 | sasl=aws_msk.CfnServerlessCluster.SaslProperty( 51 | iam=aws_msk.CfnServerlessCluster.IamProperty( 52 | enabled=True 53 | ) 54 | ) 55 | ), 56 | cluster_name=msk_cluster_name, 57 | vpc_configs=[aws_msk.CfnServerlessCluster.VpcConfigProperty( 58 | subnet_ids=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids, 59 | security_groups=[sg_msk_client.security_group_id, sg_msk_cluster.security_group_id] 60 | )] 61 | ) 62 | 63 | self.sg_msk_client = sg_msk_client 64 | self.msk_cluster_name = msk_serverless_cluster.cluster_name 65 | self.msk_cluster_arn = msk_serverless_cluster.attr_arn 66 | 67 | cdk.CfnOutput(self, f'{self.stack_name}-SecurityGroupId', 68 | value=self.sg_msk_client.security_group_id) 69 | cdk.CfnOutput(self, f'{self.stack_name}-ClusterName', 70 | value=self.msk_cluster_name) 71 | cdk.CfnOutput(self, f'{self.stack_name}-ClusterArn', 72 | value=self.msk_cluster_arn) 73 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. 
Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
60 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/glue_streaming_job.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_glue, 10 | aws_s3 as s3, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class GlueStreamingJobStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, glue_job_role, msk_connection_info, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | glue_assets_s3_bucket_name = self.node.try_get_context('glue_assets_s3_bucket_name') 21 | glue_job_script_file_name = self.node.try_get_context('glue_job_script_file_name') 22 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 23 | 24 | msk_connection_name = msk_connection_info.name 25 | kafka_bootstrap_servers = msk_connection_info.connection_properties['KAFKA_BOOTSTRAP_SERVERS'] 26 | 27 | glue_job_default_arguments = { 28 | "--enable-metrics": "true", 29 | "--enable-spark-ui": "true", 30 | "--spark-event-logs-path": f"s3://{glue_assets_s3_bucket_name}/sparkHistoryLogs/", 31 | "--enable-job-insights": "false", 32 | "--enable-glue-datacatalog": "true", 33 | "--enable-continuous-cloudwatch-log": "true", 34 | "--job-bookmark-option": "job-bookmark-disable", 35 | "--job-language": "python", 36 | "--TempDir": f"s3://{glue_assets_s3_bucket_name}/temporary/", 37 | "--kafka_connection_name": msk_connection_name, 38 | "--kafka_bootstrap_servers": kafka_bootstrap_servers, 39 | } 40 | 41 | glue_job_default_arguments.update(glue_job_input_arguments) 42 | 43 | glue_job_name = self.node.try_get_context('glue_job_name') 44 | 45 | glue_connections_name = self.node.try_get_context('glue_connections_name') 46 | 47 | glue_cfn_job = aws_glue.CfnJob(self, "GlueStreamingETLJob", 48 | command=aws_glue.CfnJob.JobCommandProperty( 49 | name="gluestreaming", 50 | python_version="3", 51 | script_location="s3://{glue_assets}/scripts/{glue_job_script_file_name}".format( 52 | glue_assets=glue_assets_s3_bucket_name, 53 | glue_job_script_file_name=glue_job_script_file_name 54 | ) 55 | ), 56 | role=glue_job_role.role_arn, 57 | 58 | #XXX: Set only AllocatedCapacity or MaxCapacity 59 | # Do not set Allocated Capacity if using Worker Type and Number of Workers 60 | # allocated_capacity=2, 61 | connections=aws_glue.CfnJob.ConnectionsListProperty( 62 | connections=[glue_connections_name, msk_connection_name] 63 | ), 64 | default_arguments=glue_job_default_arguments, 65 | description="This job loads the data from MSK to Apache Iceberg table in S3.", 66 | execution_property=aws_glue.CfnJob.ExecutionPropertyProperty( 67 | max_concurrent_runs=1 68 | ), 69 | #XXX: check AWS Glue Version in https://docs.aws.amazon.com/glue/latest/dg/add-job.html#create-job 70 | glue_version="3.0", 71 | #XXX: Do not set Max Capacity if using Worker Type and Number of Workers 72 | # max_capacity=2, 73 | max_retries=0, 74 | name=glue_job_name, 75 | # notification_property=aws_glue.CfnJob.NotificationPropertyProperty( 76 | # notify_delay_after=10 # 10 minutes 77 | # ), 78 | number_of_workers=2, 79 | timeout=2880, 80 | worker_type="G.1X" # ['Standard' | 'G.1X' | 'G.2X' | 'G.025x'] 81 | ) 82 | 83 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobName', value=glue_cfn_job.name) 84 | cdk.CfnOutput(self, 
f'{self.stack_name}_GlueJobRoleArn', value=glue_job_role.role_arn) 85 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/glue_streaming_job.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_glue, 10 | aws_s3 as s3, 11 | ) 12 | from constructs import Construct 13 | 14 | 15 | class GlueStreamingJobStack(Stack): 16 | 17 | def __init__(self, scope: Construct, construct_id: str, glue_job_role, msk_connection_info, **kwargs) -> None: 18 | super().__init__(scope, construct_id, **kwargs) 19 | 20 | glue_assets_s3_bucket_name = self.node.try_get_context('glue_assets_s3_bucket_name') 21 | glue_job_script_file_name = self.node.try_get_context('glue_job_script_file_name') 22 | glue_job_input_arguments = self.node.try_get_context('glue_job_input_arguments') 23 | 24 | msk_connection_name = msk_connection_info.name 25 | kafka_bootstrap_servers = msk_connection_info.connection_properties['KAFKA_BOOTSTRAP_SERVERS'] 26 | 27 | glue_job_default_arguments = { 28 | "--enable-metrics": "true", 29 | "--enable-spark-ui": "true", 30 | "--spark-event-logs-path": f"s3://{glue_assets_s3_bucket_name}/sparkHistoryLogs/", 31 | "--enable-job-insights": "false", 32 | "--enable-glue-datacatalog": "true", 33 | "--enable-continuous-cloudwatch-log": "true", 34 | "--job-bookmark-option": "job-bookmark-disable", 35 | "--job-language": "python", 36 | "--TempDir": f"s3://{glue_assets_s3_bucket_name}/temporary/", 37 | "--kafka_connection_name": msk_connection_name, 38 | "--kafka_bootstrap_servers": kafka_bootstrap_servers, 39 | } 40 | 41 | glue_job_default_arguments.update(glue_job_input_arguments) 42 | 43 | glue_job_name = self.node.try_get_context('glue_job_name') 44 | 45 | glue_connections_name = self.node.try_get_context('glue_connections_name') 46 | 47 | glue_cfn_job = aws_glue.CfnJob(self, "GlueStreamingETLJob", 48 | command=aws_glue.CfnJob.JobCommandProperty( 49 | name="gluestreaming", 50 | python_version="3", 51 | script_location="s3://{glue_assets}/scripts/{glue_job_script_file_name}".format( 52 | glue_assets=glue_assets_s3_bucket_name, 53 | glue_job_script_file_name=glue_job_script_file_name 54 | ) 55 | ), 56 | role=glue_job_role.role_arn, 57 | 58 | #XXX: Set only AllocatedCapacity or MaxCapacity 59 | # Do not set Allocated Capacity if using Worker Type and Number of Workers 60 | # allocated_capacity=2, 61 | connections=aws_glue.CfnJob.ConnectionsListProperty( 62 | connections=[glue_connections_name, msk_connection_name] 63 | ), 64 | default_arguments=glue_job_default_arguments, 65 | description="This job loads the data from MSK to Apache Iceberg table in S3.", 66 | execution_property=aws_glue.CfnJob.ExecutionPropertyProperty( 67 | max_concurrent_runs=1 68 | ), 69 | #XXX: check AWS Glue Version in https://docs.aws.amazon.com/glue/latest/dg/add-job.html#create-job 70 | glue_version="3.0", 71 | #XXX: Do not set Max Capacity if using Worker Type and Number of Workers 72 | # max_capacity=2, 73 | max_retries=0, 74 | name=glue_job_name, 75 | # notification_property=aws_glue.CfnJob.NotificationPropertyProperty( 76 | # notify_delay_after=10 # 10 minutes 77 | # ), 78 | number_of_workers=2, 79 | timeout=2880, 80 | worker_type="G.1X" # ['Standard' | 'G.1X' | 'G.2X' | 'G.025x'] 81 | ) 82 | 83 | cdk.CfnOutput(self, 
f'{self.stack_name}_GlueJobName', value=glue_cfn_job.name) 84 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobRoleArn', value=glue_job_role.role_arn) 85 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/glue_job_role.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_iam 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class GlueJobRoleStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_job_role_policy_doc = aws_iam.PolicyDocument() 20 | glue_job_role_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 21 | "sid": "AWSGlueJobDynamoDBAccess", 22 | "effect": aws_iam.Effect.ALLOW, 23 | #XXX: The ARN will be formatted as follows: 24 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 25 | "resources": [self.format_arn(service="dynamodb", resource="table", resource_name="*")], 26 | "actions": [ 27 | "dynamodb:BatchGetItem", 28 | "dynamodb:DescribeStream", 29 | "dynamodb:DescribeTable", 30 | "dynamodb:GetItem", 31 | "dynamodb:Query", 32 | "dynamodb:Scan", 33 | "dynamodb:BatchWriteItem", 34 | "dynamodb:CreateTable", 35 | "dynamodb:DeleteTable", 36 | "dynamodb:DeleteItem", 37 | "dynamodb:UpdateTable", 38 | "dynamodb:UpdateItem", 39 | "dynamodb:PutItem" 40 | ] 41 | })) 42 | 43 | glue_job_role_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 44 | "sid": "AWSGlueJobS3Access", 45 | "effect": aws_iam.Effect.ALLOW, 46 | #XXX: The ARN will be formatted as follows: 47 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 48 | "resources": ["*"], 49 | "actions": [ 50 | "s3:GetBucketLocation", 51 | "s3:ListBucket", 52 | "s3:GetBucketAcl", 53 | "s3:GetObject", 54 | "s3:PutObject", 55 | "s3:DeleteObject" 56 | ] 57 | })) 58 | 59 | glue_job_role = aws_iam.Role(self, 'GlueJobRole', 60 | role_name='GlueStreamingJobRole-MSK2Iceberg', 61 | assumed_by=aws_iam.ServicePrincipal('glue.amazonaws.com'), 62 | inline_policies={ 63 | 'aws_glue_job_role_policy': glue_job_role_policy_doc 64 | }, 65 | managed_policies=[ 66 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('service-role/AWSGlueServiceRole'), 67 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMReadOnlyAccess'), 68 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonEC2ContainerRegistryReadOnly'), 69 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AWSGlueConsoleFullAccess'), 70 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonMSKReadOnlyAccess'), 71 | # aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonKinesisReadOnlyAccess') 72 | ] 73 | ) 74 | 75 | #XXX: When creating a notebook with a role, that role is then passed to interactive sessions 76 | # so that the same role can be used in both places. 77 | # As such, the `iam:PassRole` permission needs to be part of the role's policy. 
78 | # More info at: https://docs.aws.amazon.com/glue/latest/ug/notebook-getting-started.html 79 | # 80 | glue_job_role.add_to_policy(aws_iam.PolicyStatement(**{ 81 | "sid": "AWSGlueJobIAMPassRole", 82 | "effect": aws_iam.Effect.ALLOW, 83 | #XXX: The ARN will be formatted as follows: 84 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 85 | "resources": [self.format_arn(service="iam", region="", resource="role", resource_name=glue_job_role.role_name)], 86 | "conditions": { 87 | "StringLike": { 88 | "iam:PassedToService": [ 89 | "glue.amazonaws.com" 90 | ] 91 | } 92 | }, 93 | "actions": [ 94 | "iam:PassRole" 95 | ] 96 | })) 97 | 98 | self.glue_job_role = glue_job_role 99 | 100 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobRole', value=self.glue_job_role.role_name) 101 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobRoleArn', value=self.glue_job_role.role_arn) 102 | -------------------------------------------------------------------------------- /msk-to-iceberg/src/main/python/spark_dataframe_insert_iceberg_from_kafka.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | 8 | from awsglue.transforms import * 9 | from awsglue.utils import getResolvedOptions 10 | from pyspark.context import SparkContext 11 | from awsglue.context import GlueContext 12 | from awsglue.job import Job 13 | from awsglue import DynamicFrame 14 | 15 | from pyspark.conf import SparkConf 16 | from pyspark.sql.types import * 17 | from pyspark.sql.functions import ( 18 | col, 19 | from_json, 20 | to_timestamp 21 | ) 22 | 23 | 24 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 25 | 'catalog', 26 | 'database_name', 27 | 'table_name', 28 | 'primary_key', 29 | 'kafka_topic_name', 30 | 'starting_offsets_of_kafka_topic', 31 | 'kafka_connection_name', 32 | 'kafka_bootstrap_servers', 33 | 'iceberg_s3_path', 34 | 'lock_table_name', 35 | 'aws_region', 36 | 'window_size' 37 | ]) 38 | 39 | CATALOG = args['catalog'] 40 | 41 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 42 | 43 | DATABASE = args['database_name'] 44 | TABLE_NAME = args['table_name'] 45 | PRIMARY_KEY = args['primary_key'] 46 | 47 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 48 | 49 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 50 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 51 | KAFKA_BOOTSTRAP_SERVERS = args['kafka_bootstrap_servers'] 52 | 53 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 54 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 55 | 56 | AWS_REGION = args['aws_region'] 57 | WINDOW_SIZE = args.get('window_size', '100 seconds') 58 | 59 | def setSparkIcebergConf() -> SparkConf: 60 | conf_list = [ 61 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 62 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 63 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 65 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 66 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 67 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 68 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 69 | ] 70 | spark_conf = 
SparkConf().setAll(conf_list) 71 | return spark_conf 72 | 73 | # Set the Spark + Glue context 74 | conf = setSparkIcebergConf() 75 | sc = SparkContext(conf=conf) 76 | glueContext = GlueContext(sc) 77 | spark = glueContext.spark_session 78 | job = Job(glueContext) 79 | job.init(args['JOB_NAME'], args) 80 | 81 | options_read = { 82 | "kafka.bootstrap.servers": KAFKA_BOOTSTRAP_SERVERS, 83 | "subscribe": KAFKA_TOPIC_NAME, 84 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC 85 | } 86 | 87 | schema = StructType([ 88 | StructField("name", StringType(), False), 89 | StructField("age", IntegerType(), True), 90 | StructField("m_time", StringType(), False), 91 | ]) 92 | 93 | streaming_data = spark.readStream.format("kafka").options(**options_read).load() 94 | 95 | stream_data_df = streaming_data \ 96 | .select(from_json(col("value").cast("string"), schema).alias("source_table")) \ 97 | .select("source_table.*") \ 98 | .withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 99 | 100 | table_id = f"{CATALOG}.{DATABASE}.{TABLE_NAME}" 101 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 102 | 103 | #XXX: Writing against partitioned table 104 | # https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#writing-against-partitioned-table 105 | # Complete output mode not supported when there are no streaming aggregations on streaming DataFrame/Datasets 106 | query = stream_data_df.writeStream \ 107 | .format("iceberg") \ 108 | .outputMode("append") \ 109 | .trigger(processingTime=WINDOW_SIZE) \ 110 | .option("path", table_id) \ 111 | .option("fanout-enabled", "true") \ 112 | .option("checkpointLocation", checkpointPath) \ 113 | .start() 114 | 115 | query.awaitTermination() 116 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/src/main/python/spark_dataframe_insert_iceberg_from_msk_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | 8 | from awsglue.transforms import * 9 | from awsglue.utils import getResolvedOptions 10 | from pyspark.context import SparkContext 11 | from awsglue.context import GlueContext 12 | from awsglue.job import Job 13 | from awsglue import DynamicFrame 14 | 15 | from pyspark.conf import SparkConf 16 | from pyspark.sql.types import * 17 | from pyspark.sql.functions import ( 18 | col, 19 | from_json, 20 | to_timestamp 21 | ) 22 | 23 | 24 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 25 | 'catalog', 26 | 'database_name', 27 | 'table_name', 28 | 'primary_key', 29 | 'kafka_topic_name', 30 | 'starting_offsets_of_kafka_topic', 31 | 'kafka_connection_name', 32 | 'kafka_bootstrap_servers', 33 | 'iceberg_s3_path', 34 | 'lock_table_name', 35 | 'aws_region', 36 | 'window_size' 37 | ]) 38 | 39 | CATALOG = args['catalog'] 40 | 41 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 42 | 43 | DATABASE = args['database_name'] 44 | TABLE_NAME = args['table_name'] 45 | PRIMARY_KEY = args['primary_key'] 46 | 47 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 48 | 49 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 50 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 51 | KAFKA_BOOTSTRAP_SERVERS = args['kafka_bootstrap_servers'] 52 | 53 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 54 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 
55 | 56 | AWS_REGION = args['aws_region'] 57 | WINDOW_SIZE = args.get('window_size', '100 seconds') 58 | 59 | def setSparkIcebergConf() -> SparkConf: 60 | conf_list = [ 61 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 62 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 63 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 65 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 66 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 67 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 68 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 69 | ] 70 | spark_conf = SparkConf().setAll(conf_list) 71 | return spark_conf 72 | 73 | # Set the Spark + Glue context 74 | conf = setSparkIcebergConf() 75 | sc = SparkContext(conf=conf) 76 | glueContext = GlueContext(sc) 77 | spark = glueContext.spark_session 78 | job = Job(glueContext) 79 | job.init(args['JOB_NAME'], args) 80 | 81 | options_read = { 82 | "kafka.bootstrap.servers": KAFKA_BOOTSTRAP_SERVERS, 83 | "subscribe": KAFKA_TOPIC_NAME, 84 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC, 85 | "kafka.security.protocol": "SASL_SSL", 86 | "kafka.sasl.mechanism": "AWS_MSK_IAM", 87 | "kafka.sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;", 88 | "kafka.sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler" 89 | } 90 | 91 | schema = StructType([ 92 | StructField("name", StringType(), False), 93 | StructField("age", IntegerType(), True), 94 | StructField("m_time", StringType(), False), 95 | ]) 96 | 97 | streaming_data = spark.readStream.format("kafka").options(**options_read).load() 98 | 99 | stream_data_df = streaming_data \ 100 | .select(from_json(col("value").cast("string"), schema).alias("source_table")) \ 101 | .select("source_table.*") \ 102 | .withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 103 | 104 | table_id = f"{CATALOG}.{DATABASE}.{TABLE_NAME}" 105 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 106 | 107 | #XXX: Writing against partitioned table 108 | # https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#writing-against-partitioned-table 109 | # Complete output mode not supported when there are no streaming aggregations on streaming DataFrame/Datasets 110 | query = stream_data_df.writeStream \ 111 | .format("iceberg") \ 112 | .outputMode("append") \ 113 | .trigger(processingTime=WINDOW_SIZE) \ 114 | .option("path", table_id) \ 115 | .option("fanout-enabled", "true") \ 116 | .option("checkpointLocation", checkpointPath) \ 117 | .start() 118 | 119 | query.awaitTermination() 120 | -------------------------------------------------------------------------------- /msk-to-iceberg/src/main/python/spark_sql_insert_overwrite_iceberg_from_kafka.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | import traceback 8 | 9 | from awsglue.transforms import * 10 | from awsglue.utils import getResolvedOptions 11 | from pyspark.context import SparkContext 12 | from awsglue.context import GlueContext 13 | from awsglue.job import Job 14 | from awsglue import 
DynamicFrame 15 | 16 | from pyspark.conf import SparkConf 17 | from pyspark.sql import DataFrame, Row 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.types import * 20 | from pyspark.sql.functions import ( 21 | col, 22 | desc, 23 | row_number, 24 | to_timestamp 25 | ) 26 | 27 | 28 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 29 | 'catalog', 30 | 'database_name', 31 | 'table_name', 32 | 'primary_key', 33 | 'kafka_topic_name', 34 | 'starting_offsets_of_kafka_topic', 35 | 'kafka_connection_name', 36 | 'iceberg_s3_path', 37 | 'lock_table_name', 38 | 'aws_region', 39 | 'window_size' 40 | ]) 41 | 42 | CATALOG = args['catalog'] 43 | 44 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 45 | 46 | DATABASE = args['database_name'] 47 | TABLE_NAME = args['table_name'] 48 | PRIMARY_KEY = args['primary_key'] 49 | 50 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 51 | 52 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 53 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 54 | 55 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 56 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 57 | 58 | AWS_REGION = args['aws_region'] 59 | WINDOW_SIZE = args.get('window_size', '100 seconds') 60 | 61 | def setSparkIcebergConf() -> SparkConf: 62 | conf_list = [ 63 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 65 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 66 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 67 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 68 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 69 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 70 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 71 | ] 72 | spark_conf = SparkConf().setAll(conf_list) 73 | return spark_conf 74 | 75 | # Set the Spark + Glue context 76 | conf = setSparkIcebergConf() 77 | sc = SparkContext(conf=conf) 78 | glueContext = GlueContext(sc) 79 | spark = glueContext.spark_session 80 | job = Job(glueContext) 81 | job.init(args['JOB_NAME'], args) 82 | 83 | kafka_options = { 84 | "connectionName": KAFKA_CONNECTION_NAME, 85 | "topicName": KAFKA_TOPIC_NAME, 86 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC, 87 | "inferSchema": "true", 88 | "classification": "json" 89 | } 90 | 91 | streaming_data = glueContext.create_data_frame.from_options( 92 | connection_type="kafka", 93 | connection_options=kafka_options, 94 | transformation_ctx="kafka_df" 95 | ) 96 | 97 | def processBatch(data_frame, batch_id): 98 | if data_frame.count() > 0: 99 | stream_data_dynf = DynamicFrame.fromDF( 100 | data_frame, glueContext, "from_data_frame" 101 | ) 102 | 103 | _df = spark.sql(f"SELECT * FROM {CATALOG}.{DATABASE}.{TABLE_NAME} LIMIT 0") 104 | 105 | #XXX: Apply De-duplication logic on input data to pick up the latest record based on timestamp and operation 106 | window = Window.partitionBy(PRIMARY_KEY).orderBy(desc("m_time")) 107 | stream_data_df = stream_data_dynf.toDF() 108 | stream_data_df = stream_data_df.withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 109 | upsert_data_df = stream_data_df.withColumn("row", row_number().over(window)) \ 110 | .filter(col("row") == 1).drop("row") \ 111 | .select(_df.schema.names) 112 | 113 | 
upsert_data_df.createOrReplaceTempView(f"{TABLE_NAME}_upsert") 114 | # print(f"Table '{TABLE_NAME}' is inserting overwrite...") 115 | 116 | sql_query = f""" 117 | INSERT OVERWRITE {CATALOG}.{DATABASE}.{TABLE_NAME} SELECT * FROM {TABLE_NAME}_upsert 118 | """ 119 | try: 120 | spark.sql(sql_query) 121 | except Exception as ex: 122 | traceback.print_exc() 123 | raise ex 124 | 125 | 126 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 127 | 128 | glueContext.forEachBatch( 129 | frame=streaming_data, 130 | batch_function=processBatch, 131 | options={ 132 | "windowSize": WINDOW_SIZE, 133 | "checkpointLocation": checkpointPath, 134 | } 135 | ) 136 | 137 | job.commit() 138 | -------------------------------------------------------------------------------- /msk-to-iceberg/src/main/python/spark_sql_merge_into_iceberg_from_kafka.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | import traceback 8 | 9 | from awsglue.transforms import * 10 | from awsglue.utils import getResolvedOptions 11 | from awsglue.context import GlueContext 12 | from awsglue.job import Job 13 | from awsglue import DynamicFrame 14 | 15 | from pyspark.context import SparkContext 16 | from pyspark.conf import SparkConf 17 | from pyspark.sql import DataFrame, Row 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.functions import ( 20 | col, 21 | desc, 22 | row_number, 23 | to_timestamp 24 | ) 25 | 26 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 27 | 'catalog', 28 | 'database_name', 29 | 'table_name', 30 | 'primary_key', 31 | 'kafka_topic_name', 32 | 'starting_offsets_of_kafka_topic', 33 | 'kafka_connection_name', 34 | 'iceberg_s3_path', 35 | 'lock_table_name', 36 | 'aws_region', 37 | 'window_size' 38 | ]) 39 | 40 | CATALOG = args['catalog'] 41 | 42 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 43 | 44 | DATABASE = args['database_name'] 45 | TABLE_NAME = args['table_name'] 46 | PRIMARY_KEY = args['primary_key'] 47 | 48 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 49 | 50 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 51 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 52 | 53 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 54 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 55 | 56 | AWS_REGION = args['aws_region'] 57 | WINDOW_SIZE = args.get('window_size', '100 seconds') 58 | 59 | def setSparkIcebergConf() -> SparkConf: 60 | conf_list = [ 61 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 62 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 63 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 65 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 66 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 67 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 68 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 69 | ] 70 | spark_conf = SparkConf().setAll(conf_list) 71 | return spark_conf 72 | 73 | # Set the Spark + Glue context 74 | conf = setSparkIcebergConf() 75 | sc = SparkContext(conf=conf) 76 | glueContext = GlueContext(sc) 77 | spark = 
glueContext.spark_session 78 | job = Job(glueContext) 79 | job.init(args['JOB_NAME'], args) 80 | 81 | kafka_options = { 82 | "connectionName": KAFKA_CONNECTION_NAME, 83 | "topicName": KAFKA_TOPIC_NAME, 84 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC, 85 | "inferSchema": "true", 86 | "classification": "json" 87 | } 88 | 89 | streaming_data = glueContext.create_data_frame.from_options( 90 | connection_type="kafka", 91 | connection_options=kafka_options, 92 | transformation_ctx="kafka_df" 93 | ) 94 | 95 | def processBatch(data_frame, batch_id): 96 | if data_frame.count() > 0: 97 | stream_data_dynf = DynamicFrame.fromDF( 98 | data_frame, glueContext, "from_data_frame" 99 | ) 100 | 101 | tables_df = spark.sql(f"SHOW TABLES IN {CATALOG}.{DATABASE}") 102 | table_list = tables_df.select('tableName').rdd.flatMap(lambda x: x).collect() 103 | if f"{TABLE_NAME}" not in table_list: 104 | print(f"Table {TABLE_NAME} doesn't exist in {CATALOG}.{DATABASE}.") 105 | else: 106 | _df = spark.sql(f"SELECT * FROM {CATALOG}.{DATABASE}.{TABLE_NAME} LIMIT 0") 107 | 108 | #XXX: Apply De-duplication logic on input data to pick up the latest record based on timestamp and operation 109 | window = Window.partitionBy(PRIMARY_KEY).orderBy(desc("m_time")) 110 | stream_data_df = stream_data_dynf.toDF() 111 | stream_data_df = stream_data_df.withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 112 | upsert_data_df = stream_data_df.withColumn("row", row_number().over(window)) \ 113 | .filter(col("row") == 1).drop("row") \ 114 | .select(_df.schema.names) 115 | 116 | upsert_data_df.createOrReplaceTempView(f"{TABLE_NAME}_upsert") 117 | # print(f"Table '{TABLE_NAME}' is upserting...") 118 | 119 | try: 120 | spark.sql(f"""MERGE INTO {CATALOG}.{DATABASE}.{TABLE_NAME} t 121 | USING {TABLE_NAME}_upsert s ON s.{PRIMARY_KEY} = t.{PRIMARY_KEY} 122 | WHEN MATCHED THEN UPDATE SET * 123 | WHEN NOT MATCHED THEN INSERT * 124 | """) 125 | except Exception as ex: 126 | traceback.print_exc() 127 | raise ex 128 | 129 | 130 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 131 | 132 | glueContext.forEachBatch( 133 | frame=streaming_data, 134 | batch_function=processBatch, 135 | options={ 136 | "windowSize": WINDOW_SIZE, 137 | "checkpointLocation": checkpointPath, 138 | } 139 | ) 140 | 141 | job.commit() 142 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/src/main/python/spark_sql_insert_overwrite_iceberg_from_msk_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | import traceback 8 | 9 | from awsglue.transforms import * 10 | from awsglue.utils import getResolvedOptions 11 | from pyspark.context import SparkContext 12 | from awsglue.context import GlueContext 13 | from awsglue.job import Job 14 | from awsglue import DynamicFrame 15 | 16 | from pyspark.conf import SparkConf 17 | from pyspark.sql import DataFrame, Row 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.types import * 20 | from pyspark.sql.functions import ( 21 | col, 22 | desc, 23 | row_number, 24 | to_timestamp 25 | ) 26 | 27 | 28 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 29 | 'catalog', 30 | 'database_name', 31 | 'table_name', 32 | 'primary_key', 33 | 'kafka_topic_name', 34 | 'starting_offsets_of_kafka_topic', 35 | 'kafka_connection_name', 36 | 
'iceberg_s3_path', 37 | 'lock_table_name', 38 | 'aws_region', 39 | 'window_size' 40 | ]) 41 | 42 | CATALOG = args['catalog'] 43 | 44 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 45 | 46 | DATABASE = args['database_name'] 47 | TABLE_NAME = args['table_name'] 48 | PRIMARY_KEY = args['primary_key'] 49 | 50 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 51 | 52 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 53 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 54 | 55 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 56 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 57 | 58 | AWS_REGION = args['aws_region'] 59 | WINDOW_SIZE = args.get('window_size', '100 seconds') 60 | 61 | def setSparkIcebergConf() -> SparkConf: 62 | conf_list = [ 63 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 65 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 66 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 67 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 68 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 69 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 70 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 71 | ] 72 | spark_conf = SparkConf().setAll(conf_list) 73 | return spark_conf 74 | 75 | # Set the Spark + Glue context 76 | conf = setSparkIcebergConf() 77 | sc = SparkContext(conf=conf) 78 | glueContext = GlueContext(sc) 79 | spark = glueContext.spark_session 80 | job = Job(glueContext) 81 | job.init(args['JOB_NAME'], args) 82 | 83 | kafka_options = { 84 | "connectionName": KAFKA_CONNECTION_NAME, 85 | "topicName": KAFKA_TOPIC_NAME, 86 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC, 87 | "inferSchema": "true", 88 | "classification": "json", 89 | 90 | #XXX: the properties below are required for IAM Access control for MSK Serverless 91 | "kafka.security.protocol": "SASL_SSL", 92 | "kafka.sasl.mechanism": "AWS_MSK_IAM", 93 | "kafka.sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;", 94 | "kafka.sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler" 95 | } 96 | 97 | streaming_data = glueContext.create_data_frame.from_options( 98 | connection_type="kafka", 99 | connection_options=kafka_options, 100 | transformation_ctx="kafka_df" 101 | ) 102 | 103 | def processBatch(data_frame, batch_id): 104 | if data_frame.count() > 0: 105 | stream_data_dynf = DynamicFrame.fromDF( 106 | data_frame, glueContext, "from_data_frame" 107 | ) 108 | 109 | _df = spark.sql(f"SELECT * FROM {CATALOG}.{DATABASE}.{TABLE_NAME} LIMIT 0") 110 | 111 | #XXX: Apply De-duplication logic on input data to pick up the latest record based on timestamp and operation 112 | window = Window.partitionBy(PRIMARY_KEY).orderBy(desc("m_time")) 113 | stream_data_df = stream_data_dynf.toDF() 114 | stream_data_df = stream_data_df.withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 115 | upsert_data_df = stream_data_df.withColumn("row", row_number().over(window)) \ 116 | .filter(col("row") == 1).drop("row") \ 117 | .select(_df.schema.names) 118 | 119 | upsert_data_df.createOrReplaceTempView(f"{TABLE_NAME}_upsert") 120 | # print(f"Table '{TABLE_NAME}' is inserting overwrite...") 121 | 122 | sql_query = f""" 123 | INSERT 
OVERWRITE {CATALOG}.{DATABASE}.{TABLE_NAME} SELECT * FROM {TABLE_NAME}_upsert 124 | """ 125 | try: 126 | spark.sql(sql_query) 127 | except Exception as ex: 128 | traceback.print_exc() 129 | raise ex 130 | 131 | 132 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 133 | 134 | glueContext.forEachBatch( 135 | frame=streaming_data, 136 | batch_function=processBatch, 137 | options={ 138 | "windowSize": WINDOW_SIZE, 139 | "checkpointLocation": checkpointPath, 140 | } 141 | ) 142 | 143 | job.commit() 144 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/msk.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import re 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_msk 15 | ) 16 | from constructs import Construct 17 | 18 | random.seed(43) 19 | 20 | 21 | class MskStack(Stack): 22 | 23 | def __init__(self, scope: Construct, construct_id: str, vpc, **kwargs) -> None: 24 | super().__init__(scope, construct_id, **kwargs) 25 | 26 | msk_config = self.node.try_get_context('msk') 27 | 28 | MSK_CLUSTER_NAME = msk_config['cluster_name'] 29 | assert len(MSK_CLUSTER_NAME) <= 64 and re.fullmatch(r'[a-zA-Z]+[a-zA-Z0-9-]*', MSK_CLUSTER_NAME) 30 | 31 | # Supported Apache Kafka versions: 32 | # https://docs.aws.amazon.com/msk/latest/developerguide/supported-kafka-versions.html 33 | KAFA_VERSION = msk_config.get('kafka_version', '2.8.1') 34 | 35 | KAFA_BROKER_INSTANCE_TYPE = msk_config.get('broker_instance_type', 'kafka.m5.large') 36 | KAFA_NUMBER_OF_BROKER_NODES = int(msk_config.get('number_of_broker_nodes', 3)) 37 | 38 | KAFA_BROKER_EBS_VOLUME_SIZE = int(msk_config.get('broker_ebs_volume_size', 100)) 39 | assert (1 <= KAFA_BROKER_EBS_VOLUME_SIZE and KAFA_BROKER_EBS_VOLUME_SIZE <= 16384) 40 | 41 | MSK_CLIENT_SG_NAME = 'use-msk-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 42 | sg_msk_client = aws_ec2.SecurityGroup(self, 'KafkaClientSecurityGroup', 43 | vpc=vpc, 44 | allow_all_outbound=True, 45 | description='security group for Amazon MSK client', 46 | security_group_name=MSK_CLIENT_SG_NAME 47 | ) 48 | cdk.Tags.of(sg_msk_client).add('Name', MSK_CLIENT_SG_NAME) 49 | 50 | MSK_CLUSTER_SG_NAME = 'msk-sg-{}'.format(''.join(random.sample((string.ascii_lowercase), k=5))) 51 | sg_msk_cluster = aws_ec2.SecurityGroup(self, 'MSKSecurityGroup', 52 | vpc=vpc, 53 | allow_all_outbound=True, 54 | description='security group for Amazon MSK Cluster', 55 | security_group_name=MSK_CLUSTER_SG_NAME 56 | ) 57 | # For more information about the numbers of the ports that Amazon MSK uses to communicate with client machines, 58 | # see https://docs.aws.amazon.com/msk/latest/developerguide/port-info.html 59 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(2181), 60 | description='allow msk client to communicate with Apache ZooKeeper in plaintext') 61 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(2182), 62 | description='allow msk client to communicate with Apache ZooKeeper by using TLS encryption') 63 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9092), 64 | description='allow msk client to communicate with brokers in plaintext') 65 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, 
connection=aws_ec2.Port.tcp(9094), 66 | description='allow msk client to communicate with brokers by using TLS encryption') 67 | sg_msk_cluster.add_ingress_rule(peer=sg_msk_client, connection=aws_ec2.Port.tcp(9098), 68 | description='msk client security group') 69 | cdk.Tags.of(sg_msk_cluster).add('Name', MSK_CLUSTER_SG_NAME) 70 | 71 | msk_broker_ebs_storage_info = aws_msk.CfnCluster.EBSStorageInfoProperty(volume_size=KAFA_BROKER_EBS_VOLUME_SIZE) 72 | 73 | msk_broker_storage_info = aws_msk.CfnCluster.StorageInfoProperty( 74 | ebs_storage_info=msk_broker_ebs_storage_info 75 | ) 76 | 77 | msk_broker_node_group_info = aws_msk.CfnCluster.BrokerNodeGroupInfoProperty( 78 | client_subnets=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).subnet_ids, 79 | instance_type=KAFA_BROKER_INSTANCE_TYPE, 80 | security_groups=[sg_msk_client.security_group_id, sg_msk_cluster.security_group_id], 81 | storage_info=msk_broker_storage_info 82 | ) 83 | 84 | msk_encryption_info = aws_msk.CfnCluster.EncryptionInfoProperty( 85 | encryption_in_transit=aws_msk.CfnCluster.EncryptionInTransitProperty( 86 | client_broker='TLS_PLAINTEXT', 87 | in_cluster=True 88 | ) 89 | ) 90 | 91 | msk_cluster = aws_msk.CfnCluster(self, 'AWSKafkaCluster', 92 | broker_node_group_info=msk_broker_node_group_info, 93 | cluster_name=MSK_CLUSTER_NAME, 94 | #XXX: Supported Apache Kafka versions 95 | # https://docs.aws.amazon.com/msk/latest/developerguide/supported-kafka-versions.html 96 | kafka_version=KAFA_VERSION, 97 | number_of_broker_nodes=KAFA_NUMBER_OF_BROKER_NODES, 98 | encryption_info=msk_encryption_info, 99 | enhanced_monitoring='PER_TOPIC_PER_BROKER' 100 | ) 101 | 102 | self.sg_msk_client = sg_msk_client 103 | self.msk_cluster_name = msk_cluster.cluster_name 104 | 105 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterName', value=msk_cluster.cluster_name, export_name=f'{self.stack_name}-MSKClusterName') 106 | cdk.CfnOutput(self, f'{self.stack_name}-MSKClusterArn', value=msk_cluster.attr_arn, export_name=f'{self.stack_name}-MSKClusterArn') 107 | cdk.CfnOutput(self, f'{self.stack_name}-MSKVersion', value=msk_cluster.kafka_version, export_name=f'{self.stack_name}-MSKVersion') 108 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/src/main/python/spark_sql_merge_into_iceberg_from_msk_serverless.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import sys 7 | import traceback 8 | 9 | from awsglue.transforms import * 10 | from awsglue.utils import getResolvedOptions 11 | from awsglue.context import GlueContext 12 | from awsglue.job import Job 13 | from awsglue import DynamicFrame 14 | 15 | from pyspark.context import SparkContext 16 | from pyspark.conf import SparkConf 17 | from pyspark.sql import DataFrame, Row 18 | from pyspark.sql.window import Window 19 | from pyspark.sql.functions import ( 20 | col, 21 | desc, 22 | row_number, 23 | to_timestamp 24 | ) 25 | 26 | args = getResolvedOptions(sys.argv, ['JOB_NAME', 27 | 'catalog', 28 | 'database_name', 29 | 'table_name', 30 | 'primary_key', 31 | 'kafka_topic_name', 32 | 'starting_offsets_of_kafka_topic', 33 | 'kafka_connection_name', 34 | 'iceberg_s3_path', 35 | 'lock_table_name', 36 | 'aws_region', 37 | 'window_size' 38 | ]) 39 | 40 | CATALOG = args['catalog'] 41 | 42 | ICEBERG_S3_PATH = args['iceberg_s3_path'] 43 | 44 | DATABASE = 
args['database_name'] 45 | TABLE_NAME = args['table_name'] 46 | PRIMARY_KEY = args['primary_key'] 47 | 48 | DYNAMODB_LOCK_TABLE = args['lock_table_name'] 49 | 50 | KAFKA_TOPIC_NAME = args['kafka_topic_name'] 51 | KAFKA_CONNECTION_NAME = args['kafka_connection_name'] 52 | 53 | #XXX: starting_offsets_of_kafka_topic: ['latest', 'earliest'] 54 | STARTING_OFFSETS_OF_KAFKA_TOPIC = args.get('starting_offsets_of_kafka_topic', 'latest') 55 | 56 | AWS_REGION = args['aws_region'] 57 | WINDOW_SIZE = args.get('window_size', '100 seconds') 58 | 59 | def setSparkIcebergConf() -> SparkConf: 60 | conf_list = [ 61 | (f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog"), 62 | (f"spark.sql.catalog.{CATALOG}.warehouse", ICEBERG_S3_PATH), 63 | (f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"), 64 | (f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"), 65 | (f"spark.sql.catalog.{CATALOG}.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager"), 66 | (f"spark.sql.catalog.{CATALOG}.lock.table", DYNAMODB_LOCK_TABLE), 67 | ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), 68 | ("spark.sql.iceberg.handle-timestamp-without-timezone", "true") 69 | ] 70 | spark_conf = SparkConf().setAll(conf_list) 71 | return spark_conf 72 | 73 | # Set the Spark + Glue context 74 | conf = setSparkIcebergConf() 75 | sc = SparkContext(conf=conf) 76 | glueContext = GlueContext(sc) 77 | spark = glueContext.spark_session 78 | job = Job(glueContext) 79 | job.init(args['JOB_NAME'], args) 80 | 81 | kafka_options = { 82 | "connectionName": KAFKA_CONNECTION_NAME, 83 | "topicName": KAFKA_TOPIC_NAME, 84 | "startingOffsets": STARTING_OFFSETS_OF_KAFKA_TOPIC, 85 | "inferSchema": "true", 86 | "classification": "json", 87 | 88 | #XXX: the properties below are required for IAM Access control for MSK Serverless 89 | "kafka.security.protocol": "SASL_SSL", 90 | "kafka.sasl.mechanism": "AWS_MSK_IAM", 91 | "kafka.sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;", 92 | "kafka.sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler" 93 | } 94 | 95 | streaming_data = glueContext.create_data_frame.from_options( 96 | connection_type="kafka", 97 | connection_options=kafka_options, 98 | transformation_ctx="kafka_df" 99 | ) 100 | 101 | def processBatch(data_frame, batch_id): 102 | if data_frame.count() > 0: 103 | stream_data_dynf = DynamicFrame.fromDF( 104 | data_frame, glueContext, "from_data_frame" 105 | ) 106 | 107 | tables_df = spark.sql(f"SHOW TABLES IN {CATALOG}.{DATABASE}") 108 | table_list = tables_df.select('tableName').rdd.flatMap(lambda x: x).collect() 109 | if f"{TABLE_NAME}" not in table_list: 110 | print(f"Table {TABLE_NAME} doesn't exist in {CATALOG}.{DATABASE}.") 111 | else: 112 | _df = spark.sql(f"SELECT * FROM {CATALOG}.{DATABASE}.{TABLE_NAME} LIMIT 0") 113 | 114 | #XXX: Apply De-duplication logic on input data to pick up the latest record based on timestamp and operation 115 | window = Window.partitionBy(PRIMARY_KEY).orderBy(desc("m_time")) 116 | stream_data_df = stream_data_dynf.toDF() 117 | stream_data_df = stream_data_df.withColumn('m_time', to_timestamp(col('m_time'), 'yyyy-MM-dd HH:mm:ss')) 118 | upsert_data_df = stream_data_df.withColumn("row", row_number().over(window)) \ 119 | .filter(col("row") == 1).drop("row") \ 120 | .select(_df.schema.names) 121 | 122 | upsert_data_df.createOrReplaceTempView(f"{TABLE_NAME}_upsert") 123 | # print(f"Table 
'{TABLE_NAME}' is upserting...") 124 | 125 | try: 126 | spark.sql(f"""MERGE INTO {CATALOG}.{DATABASE}.{TABLE_NAME} t 127 | USING {TABLE_NAME}_upsert s ON s.{PRIMARY_KEY} = t.{PRIMARY_KEY} 128 | WHEN MATCHED THEN UPDATE SET * 129 | WHEN NOT MATCHED THEN INSERT * 130 | """) 131 | except Exception as ex: 132 | traceback.print_exc() 133 | raise ex 134 | 135 | 136 | checkpointPath = os.path.join(args["TempDir"], args["JOB_NAME"], "checkpoint/") 137 | 138 | glueContext.forEachBatch( 139 | frame=streaming_data, 140 | batch_function=processBatch, 141 | options={ 142 | "windowSize": WINDOW_SIZE, 143 | "checkpointLocation": checkpointPath, 144 | } 145 | ) 146 | 147 | job.commit() 148 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/glue_job_role.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import aws_cdk as cdk 6 | 7 | from aws_cdk import ( 8 | Stack, 9 | aws_iam 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class GlueJobRoleStack(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, msk_cluster_name, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | glue_job_role_policy_doc = aws_iam.PolicyDocument() 20 | glue_job_role_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 21 | "sid": "AWSGlueJobDynamoDBAccess", 22 | "effect": aws_iam.Effect.ALLOW, 23 | #XXX: The ARN will be formatted as follows: 24 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 25 | "resources": [self.format_arn(service="dynamodb", resource="table", resource_name="*")], 26 | "actions": [ 27 | "dynamodb:BatchGetItem", 28 | "dynamodb:DescribeStream", 29 | "dynamodb:DescribeTable", 30 | "dynamodb:GetItem", 31 | "dynamodb:Query", 32 | "dynamodb:Scan", 33 | "dynamodb:BatchWriteItem", 34 | "dynamodb:CreateTable", 35 | "dynamodb:DeleteTable", 36 | "dynamodb:DeleteItem", 37 | "dynamodb:UpdateTable", 38 | "dynamodb:UpdateItem", 39 | "dynamodb:PutItem" 40 | ] 41 | })) 42 | 43 | glue_job_role_policy_doc.add_statements(aws_iam.PolicyStatement(**{ 44 | "sid": "AWSGlueJobS3Access", 45 | "effect": aws_iam.Effect.ALLOW, 46 | #XXX: The ARN will be formatted as follows: 47 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 48 | "resources": ["*"], 49 | "actions": [ 50 | "s3:GetBucketLocation", 51 | "s3:ListBucket", 52 | "s3:GetBucketAcl", 53 | "s3:GetObject", 54 | "s3:PutObject", 55 | "s3:DeleteObject" 56 | ] 57 | })) 58 | 59 | glue_job_role = aws_iam.Role(self, 'GlueJobRole', 60 | role_name='GlueJobRole-MSKServerless2Iceberg', 61 | assumed_by=aws_iam.ServicePrincipal('glue.amazonaws.com'), 62 | inline_policies={ 63 | 'aws_glue_job_role_policy': glue_job_role_policy_doc 64 | }, 65 | managed_policies=[ 66 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('service-role/AWSGlueServiceRole'), 67 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMReadOnlyAccess'), 68 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonEC2ContainerRegistryReadOnly'), 69 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AWSGlueConsoleFullAccess'), 70 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonMSKReadOnlyAccess'), 71 | # aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonKinesisReadOnlyAccess') 72 | ] 73 | ) 74 | 75 | #XXX: When creating a notebook with a role, that 
role is then passed to interactive sessions 76 | # so that the same role can be used in both places. 77 | # As such, the `iam:PassRole` permission needs to be part of the role's policy. 78 | # More info at: https://docs.aws.amazon.com/glue/latest/ug/notebook-getting-started.html 79 | # 80 | glue_job_role.add_to_policy(aws_iam.PolicyStatement(**{ 81 | "sid": "AWSGlueJobIAMPassRole", 82 | "effect": aws_iam.Effect.ALLOW, 83 | #XXX: The ARN will be formatted as follows: 84 | # arn:{partition}:{service}:{region}:{account}:{resource}{sep}{resource-name} 85 | "resources": [self.format_arn(service="iam", region="", resource="role", resource_name=glue_job_role.role_name)], 86 | "conditions": { 87 | "StringLike": { 88 | "iam:PassedToService": [ 89 | "glue.amazonaws.com" 90 | ] 91 | } 92 | }, 93 | "actions": [ 94 | "iam:PassRole" 95 | ] 96 | })) 97 | 98 | #XXX: For more information, see https://docs.aws.amazon.com/msk/latest/developerguide/create-iam-role.html 99 | kafka_access_control_iam_policy = aws_iam.Policy(self, 'KafkaAccessControlIAMPolicy', 100 | statements=[ 101 | aws_iam.PolicyStatement(**{ 102 | "effect": aws_iam.Effect.ALLOW, 103 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/{msk_cluster_name}/*" ], 104 | "actions": [ 105 | "kafka-cluster:Connect", 106 | "kafka-cluster:AlterCluster", 107 | "kafka-cluster:DescribeCluster" 108 | ] 109 | }), 110 | aws_iam.PolicyStatement(**{ 111 | "effect": aws_iam.Effect.ALLOW, 112 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/{msk_cluster_name}/*" ], 113 | "actions": [ 114 | "kafka-cluster:*Topic*", 115 | "kafka-cluster:WriteData", 116 | "kafka-cluster:ReadData" 117 | ] 118 | }), 119 | aws_iam.PolicyStatement(**{ 120 | "effect": aws_iam.Effect.ALLOW, 121 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/{msk_cluster_name}/*" ], 122 | "actions": [ 123 | "kafka-cluster:AlterGroup", 124 | "kafka-cluster:DescribeGroup" 125 | ] 126 | }) 127 | ] 128 | ) 129 | kafka_access_control_iam_policy.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 130 | kafka_access_control_iam_policy.attach_to_role(glue_job_role) 131 | 132 | self.glue_job_role = glue_job_role 133 | 134 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobRole', value=self.glue_job_role.role_name) 135 | cdk.CfnOutput(self, f'{self.stack_name}_GlueJobRoleArn', value=self.glue_job_role.role_arn) 136 | -------------------------------------------------------------------------------- /msk-to-iceberg/cdk_stacks/kafka_client_ec2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_iam, 15 | aws_s3_assets 16 | ) 17 | from constructs import Construct 18 | 19 | random.seed(37) 20 | 21 | 22 | class KafkaClientEC2InstanceStack(Stack): 23 | 24 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, msk_cluster_name, **kwargs) -> None: 25 | super().__init__(scope, construct_id, **kwargs) 26 | 27 | KAFKA_CLIENT_EC2_SG_NAME = 'kafka-client-ec2-sg-{}'.format(''.join(random.choices((string.ascii_lowercase), k=5))) 28 | sg_kafka_client_ec2_instance = aws_ec2.SecurityGroup(self, 'KafkaClientEC2InstanceSG', 29 | vpc=vpc, 30 | allow_all_outbound=True, 31 | description='security group for Kafka Client EC2 Instance', 32 | 
security_group_name=KAFKA_CLIENT_EC2_SG_NAME 33 | ) 34 | cdk.Tags.of(sg_kafka_client_ec2_instance).add('Name', KAFKA_CLIENT_EC2_SG_NAME) 35 | sg_kafka_client_ec2_instance.add_ingress_rule(peer=aws_ec2.Peer.ipv4("0.0.0.0/0"), 36 | connection=aws_ec2.Port.tcp(22)) 37 | 38 | #XXX: For more information, see https://docs.aws.amazon.com/msk/latest/developerguide/create-iam-role.html 39 | kafka_client_iam_policy = aws_iam.Policy(self, 'KafkaClientIAMPolicy', 40 | statements=[ 41 | aws_iam.PolicyStatement(**{ 42 | "effect": aws_iam.Effect.ALLOW, 43 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/{msk_cluster_name}/*" ], 44 | "actions": [ 45 | "kafka-cluster:Connect", 46 | "kafka-cluster:AlterCluster", 47 | "kafka-cluster:DescribeCluster" 48 | ] 49 | }), 50 | aws_iam.PolicyStatement(**{ 51 | "effect": aws_iam.Effect.ALLOW, 52 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/{msk_cluster_name}/*" ], 53 | "actions": [ 54 | "kafka-cluster:*Topic*", 55 | "kafka-cluster:WriteData", 56 | "kafka-cluster:ReadData" 57 | ] 58 | }), 59 | aws_iam.PolicyStatement(**{ 60 | "effect": aws_iam.Effect.ALLOW, 61 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/{msk_cluster_name}/*" ], 62 | "actions": [ 63 | "kafka-cluster:AlterGroup", 64 | "kafka-cluster:DescribeGroup" 65 | ] 66 | }) 67 | ] 68 | ) 69 | kafka_client_iam_policy.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 70 | 71 | kafka_client_ec2_instance_role = aws_iam.Role(self, 'KafkaClientEC2InstanceRole', 72 | role_name=f'KafkaClientEC2InstanceRole-{self.stack_name}', 73 | assumed_by=aws_iam.ServicePrincipal('ec2.amazonaws.com'), 74 | managed_policies=[ 75 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMManagedInstanceCore'), 76 | #XXX: EC2 instance should be able to access S3 for user data 77 | # aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonS3ReadOnlyAccess') 78 | ] 79 | ) 80 | 81 | kafka_client_iam_policy.attach_to_role(kafka_client_ec2_instance_role) 82 | 83 | amzn_linux = aws_ec2.MachineImage.latest_amazon_linux( 84 | generation=aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2, 85 | edition=aws_ec2.AmazonLinuxEdition.STANDARD, 86 | virtualization=aws_ec2.AmazonLinuxVirt.HVM, 87 | storage=aws_ec2.AmazonLinuxStorage.GENERAL_PURPOSE, 88 | cpu_type=aws_ec2.AmazonLinuxCpuType.X86_64 89 | ) 90 | 91 | msk_client_ec2_instance = aws_ec2.Instance(self, 'KafkaClientEC2Instance', 92 | instance_type=aws_ec2.InstanceType.of(instance_class=aws_ec2.InstanceClass.BURSTABLE2, 93 | instance_size=aws_ec2.InstanceSize.MICRO), 94 | machine_image=amzn_linux, 95 | vpc=vpc, 96 | availability_zone=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).availability_zones[0], 97 | instance_name=f'KafkaClientInstance-{self.stack_name}', 98 | role=kafka_client_ec2_instance_role, 99 | security_group=sg_kafka_client_ec2_instance, 100 | vpc_subnets=aws_ec2.SubnetSelection(subnet_type=aws_ec2.SubnetType.PUBLIC) 101 | ) 102 | msk_client_ec2_instance.add_security_group(sg_msk_client) 103 | 104 | # test data generator script in S3 as Asset 105 | user_data_asset = aws_s3_assets.Asset(self, 'KafkaClientEC2UserData', 106 | path=os.path.join(os.path.dirname(__file__), '../src/utils/gen_fake_data.py')) 107 | user_data_asset.grant_read(msk_client_ec2_instance.role) 108 | 109 | USER_DATA_LOCAL_PATH = msk_client_ec2_instance.user_data.add_s3_download_command( 110 | bucket=user_data_asset.bucket, 111 | bucket_key=user_data_asset.s3_object_key 112 | ) 113 | 114 | commands = 
''' 115 | yum update -y 116 | yum install python3.7 -y 117 | yum install java-11 -y 118 | yum install -y jq 119 | 120 | mkdir -p /home/ec2-user/opt 121 | cd /home/ec2-user/opt 122 | wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz 123 | tar -xzf kafka_2.12-2.8.1.tgz 124 | ln -nsf kafka_2.12-2.8.1 kafka 125 | 126 | cd /home/ec2-user/opt/kafka/libs 127 | wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.1/aws-msk-iam-auth-1.1.1-all.jar 128 | 129 | chown -R ec2-user /home/ec2-user/opt 130 | chgrp -R ec2-user /home/ec2-user/opt 131 | 132 | cd /home/ec2-user 133 | wget https://bootstrap.pypa.io/get-pip.py 134 | su -c "python3.7 get-pip.py --user" -s /bin/sh ec2-user 135 | su -c "/home/ec2-user/.local/bin/pip3 install boto3 --user" -s /bin/sh ec2-user 136 | 137 | cat < msk_serverless_client.properties 138 | security.protocol=SASL_SSL 139 | sasl.mechanism=AWS_MSK_IAM 140 | sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required; 141 | sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler 142 | EOF 143 | 144 | ln -nsf msk_serverless_client.properties client.properties 145 | chown -R ec2-user /home/ec2-user/msk_serverless_client.properties 146 | chown -R ec2-user /home/ec2-user/client.properties 147 | 148 | echo 'export PATH=$HOME/opt/kafka/bin:$PATH' >> .bash_profile 149 | ''' 150 | 151 | commands += f''' 152 | su -c "/home/ec2-user/.local/bin/pip3 install mimesis==4.1.3 --user" -s /bin/sh ec2-user 153 | cp {USER_DATA_LOCAL_PATH} /home/ec2-user/gen_fake_data.py & chown -R ec2-user /home/ec2-user/gen_fake_data.py 154 | ''' 155 | 156 | msk_client_ec2_instance.user_data.add_commands(commands) 157 | 158 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstancePublicDNS', 159 | value=msk_client_ec2_instance.instance_public_dns_name, 160 | export_name=f'{self.stack_name}-EC2InstancePublicDNS') 161 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceId', 162 | value=msk_client_ec2_instance.instance_id, 163 | export_name=f'{self.stack_name}-EC2InstanceId') 164 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceAZ', 165 | value=msk_client_ec2_instance.instance_availability_zone, 166 | export_name=f'{self.stack_name}-EC2InstanceAZ') 167 | 168 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/cdk_stacks/kafka_client_ec2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # vim: tabstop=2 shiftwidth=2 softtabstop=2 expandtab 4 | 5 | import os 6 | import random 7 | import string 8 | 9 | import aws_cdk as cdk 10 | 11 | from aws_cdk import ( 12 | Stack, 13 | aws_ec2, 14 | aws_iam, 15 | aws_s3_assets 16 | ) 17 | from constructs import Construct 18 | 19 | random.seed(37) 20 | 21 | 22 | class KafkaClientEC2InstanceStack(Stack): 23 | 24 | def __init__(self, scope: Construct, construct_id: str, vpc, sg_msk_client, msk_cluster_name, **kwargs) -> None: 25 | super().__init__(scope, construct_id, **kwargs) 26 | 27 | KAFKA_CLIENT_EC2_SG_NAME = 'kafka-client-ec2-sg-{}'.format(''.join(random.choices((string.ascii_lowercase), k=5))) 28 | sg_kafka_client_ec2_instance = aws_ec2.SecurityGroup(self, 'KafkaClientEC2InstanceSG', 29 | vpc=vpc, 30 | allow_all_outbound=True, 31 | description='security group for Kafka Client EC2 Instance', 32 | security_group_name=KAFKA_CLIENT_EC2_SG_NAME 33 | ) 34 | cdk.Tags.of(sg_kafka_client_ec2_instance).add('Name', KAFKA_CLIENT_EC2_SG_NAME) 35 | 
sg_kafka_client_ec2_instance.add_ingress_rule(peer=aws_ec2.Peer.ipv4("0.0.0.0/0"), 36 | connection=aws_ec2.Port.tcp(22)) 37 | 38 | #XXX: For more information, see https://docs.aws.amazon.com/msk/latest/developerguide/create-iam-role.html 39 | kafka_client_iam_policy = aws_iam.Policy(self, 'KafkaClientIAMPolicy', 40 | statements=[ 41 | aws_iam.PolicyStatement(**{ 42 | "effect": aws_iam.Effect.ALLOW, 43 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:cluster/{msk_cluster_name}/*" ], 44 | "actions": [ 45 | "kafka-cluster:Connect", 46 | "kafka-cluster:AlterCluster", 47 | "kafka-cluster:DescribeCluster" 48 | ] 49 | }), 50 | aws_iam.PolicyStatement(**{ 51 | "effect": aws_iam.Effect.ALLOW, 52 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:topic/{msk_cluster_name}/*" ], 53 | "actions": [ 54 | "kafka-cluster:Connect", 55 | "kafka-cluster:*Topic*", 56 | "kafka-cluster:WriteData", 57 | "kafka-cluster:ReadData" 58 | ] 59 | }), 60 | aws_iam.PolicyStatement(**{ 61 | "effect": aws_iam.Effect.ALLOW, 62 | "resources": [ f"arn:aws:kafka:{cdk.Aws.REGION}:{cdk.Aws.ACCOUNT_ID}:group/{msk_cluster_name}/*" ], 63 | "actions": [ 64 | "kafka-cluster:AlterGroup", 65 | "kafka-cluster:DescribeGroup" 66 | ] 67 | }) 68 | ] 69 | ) 70 | kafka_client_iam_policy.apply_removal_policy(cdk.RemovalPolicy.DESTROY) 71 | 72 | kafka_client_ec2_instance_role = aws_iam.Role(self, 'KafkaClientEC2InstanceRole', 73 | role_name=f'KafkaClientEC2InstanceRole-{self.stack_name}', 74 | assumed_by=aws_iam.ServicePrincipal('ec2.amazonaws.com'), 75 | managed_policies=[ 76 | aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonSSMManagedInstanceCore'), 77 | #XXX: EC2 instance should be able to access S3 for user data 78 | # aws_iam.ManagedPolicy.from_aws_managed_policy_name('AmazonS3ReadOnlyAccess') 79 | ] 80 | ) 81 | 82 | kafka_client_iam_policy.attach_to_role(kafka_client_ec2_instance_role) 83 | 84 | amzn_linux = aws_ec2.MachineImage.latest_amazon_linux( 85 | generation=aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2, 86 | edition=aws_ec2.AmazonLinuxEdition.STANDARD, 87 | virtualization=aws_ec2.AmazonLinuxVirt.HVM, 88 | storage=aws_ec2.AmazonLinuxStorage.GENERAL_PURPOSE, 89 | cpu_type=aws_ec2.AmazonLinuxCpuType.X86_64 90 | ) 91 | 92 | msk_client_ec2_instance = aws_ec2.Instance(self, 'KafkaClientEC2Instance', 93 | instance_type=aws_ec2.InstanceType.of(instance_class=aws_ec2.InstanceClass.BURSTABLE2, 94 | instance_size=aws_ec2.InstanceSize.MICRO), 95 | machine_image=amzn_linux, 96 | vpc=vpc, 97 | availability_zone=vpc.select_subnets(subnet_type=aws_ec2.SubnetType.PRIVATE_WITH_EGRESS).availability_zones[0], 98 | instance_name=f'KafkaClientInstance-{self.stack_name}', 99 | role=kafka_client_ec2_instance_role, 100 | security_group=sg_kafka_client_ec2_instance, 101 | vpc_subnets=aws_ec2.SubnetSelection(subnet_type=aws_ec2.SubnetType.PUBLIC) 102 | ) 103 | msk_client_ec2_instance.add_security_group(sg_msk_client) 104 | 105 | # test data generator script in S3 as Asset 106 | user_data_asset = aws_s3_assets.Asset(self, 'KafkaClientEC2UserData', 107 | path=os.path.join(os.path.dirname(__file__), '../src/utils/gen_fake_data.py')) 108 | user_data_asset.grant_read(msk_client_ec2_instance.role) 109 | 110 | USER_DATA_LOCAL_PATH = msk_client_ec2_instance.user_data.add_s3_download_command( 111 | bucket=user_data_asset.bucket, 112 | bucket_key=user_data_asset.s3_object_key 113 | ) 114 | 115 | commands = ''' 116 | yum update -y 117 | yum install python3.7 -y 118 | yum install java-11 -y 119 | yum install -y jq 120 | 
121 | mkdir -p /home/ec2-user/opt 122 | cd /home/ec2-user/opt 123 | wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz 124 | tar -xzf kafka_2.12-2.8.1.tgz 125 | ln -nsf kafka_2.12-2.8.1 kafka 126 | 127 | cd /home/ec2-user/opt/kafka/libs 128 | wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.1/aws-msk-iam-auth-1.1.1-all.jar 129 | 130 | chown -R ec2-user /home/ec2-user/opt 131 | chgrp -R ec2-user /home/ec2-user/opt 132 | 133 | cd /home/ec2-user 134 | wget https://bootstrap.pypa.io/get-pip.py 135 | su -c "python3.7 get-pip.py --user" -s /bin/sh ec2-user 136 | su -c "/home/ec2-user/.local/bin/pip3 install boto3 --user" -s /bin/sh ec2-user 137 | 138 | cat <<EOF > msk_serverless_client.properties 139 | security.protocol=SASL_SSL 140 | sasl.mechanism=AWS_MSK_IAM 141 | sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required; 142 | sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler 143 | EOF 144 | 145 | ln -nsf msk_serverless_client.properties client.properties 146 | chown -R ec2-user /home/ec2-user/msk_serverless_client.properties 147 | chown -R ec2-user /home/ec2-user/client.properties 148 | 149 | echo 'export PATH=$HOME/opt/kafka/bin:$PATH' >> .bash_profile 150 | ''' 151 | 152 | commands += f''' 153 | su -c "/home/ec2-user/.local/bin/pip3 install mimesis==4.1.3 --user" -s /bin/sh ec2-user 154 | cp {USER_DATA_LOCAL_PATH} /home/ec2-user/gen_fake_data.py && chown -R ec2-user /home/ec2-user/gen_fake_data.py 155 | ''' 156 | 157 | msk_client_ec2_instance.user_data.add_commands(commands) 158 | 159 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstancePublicDNS', 160 | value=msk_client_ec2_instance.instance_public_dns_name, 161 | export_name=f'{self.stack_name}-EC2InstancePublicDNS') 162 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceId', 163 | value=msk_client_ec2_instance.instance_id, 164 | export_name=f'{self.stack_name}-EC2InstanceId') 165 | cdk.CfnOutput(self, f'{self.stack_name}-EC2InstanceAZ', 166 | value=msk_client_ec2_instance.instance_availability_zone, 167 | export_name=f'{self.stack_name}-EC2InstanceAZ') 168 | 169 | -------------------------------------------------------------------------------- /msk-to-iceberg/README.md: -------------------------------------------------------------------------------- 1 | 2 | # AWS Glue Streaming ETL Job with Amazon MSK and Apache Iceberg 3 | 4 | ![glue-streaming-data-from-kafka-to-iceberg-table](./glue-streaming-data-from-kafka-to-iceberg-table.svg) 5 | 6 | In this project, we create a streaming ETL job in AWS Glue to integrate Iceberg with a streaming use case and create an in-place updatable data lake on Amazon S3. 7 | 8 | After streaming data are ingested from Amazon Managed Streaming for Apache Kafka (MSK) to Amazon S3, you can query the data with [Amazon Athena](http://aws.amazon.com/athena). 9 | 10 | This project can be deployed with [AWS CDK Python](https://docs.aws.amazon.com/cdk/api/v2/). 11 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 12 | 13 | This project is set up like a standard Python project. The initialization 14 | process also creates a virtualenv within this project, stored under the `.venv` 15 | directory. To create the virtualenv it assumes that there is a `python3` 16 | (or `python` for Windows) executable in your path with access to the `venv` 17 | package. If for any reason the automatic creation of the virtualenv fails, 18 | you can create the virtualenv manually.
19 | 20 | To manually create a virtualenv on MacOS and Linux: 21 | 22 | ``` 23 | $ python3 -m venv .venv 24 | ``` 25 | 26 | After the init process completes and the virtualenv is created, you can use the following 27 | step to activate your virtualenv. 28 | 29 | ``` 30 | $ source .venv/bin/activate 31 | ``` 32 | 33 | If you are on a Windows platform, you would activate the virtualenv like this: 34 | 35 | ``` 36 | % .venv\Scripts\activate.bat 37 | ``` 38 | 39 | Once the virtualenv is activated, you can install the required dependencies. 40 | 41 | ``` 42 | (.venv) $ pip install -r requirements.txt 43 | ``` 44 | 45 | In case of `AWS Glue 3.0`, before synthesizing the CloudFormation template, **you first need to set up the Apache Iceberg connector for AWS Glue in order to use Apache Iceberg with AWS Glue jobs.** (For more information, see [References](#references) (2)) 46 | 47 | Then you should set up the cdk context configuration file, `cdk.context.json`, appropriately. 48 | 49 | For example: 50 |
 51 | {
 52 |   "vpc_name": "default",
 53 |   "msk": {
 54 |     "cluster_name": "iceberg-demo-stream",
 55 |     "kafka_version": "2.8.1",
 56 |     "broker_instance_type": "kafka.m5.large",
 57 |     "number_of_broker_nodes": 3,
 58 |     "broker_ebs_volume_size": 100
 59 |   },
 60 |   "glue_assets_s3_bucket_name": "aws-glue-assets-123456789012-atq4q5u",
 61 |   "glue_job_script_file_name": "spark_sql_merge_into_iceberg_from_kafka.py",
 62 |   "glue_job_name": "streaming_data_from_kafka_into_iceberg_table",
 63 |   "glue_job_input_arguments": {
 64 |     "--catalog": "job_catalog",
 65 |     "--database_name": "iceberg_demo_db",
 66 |     "--table_name": "iceberg_demo_table",
 67 |     "--primary_key": "name",
 68 |     "--kafka_topic_name": "ev_stream_data",
 69 |     "--starting_offsets_of_kafka_topic": "latest",
 70 |     "--iceberg_s3_path": "s3://glue-iceberg-demo-atq4q5u/iceberg_demo_db",
 71 |     "--lock_table_name": "iceberg_lock",
 72 |     "--aws_region": "us-east-1",
 73 |     "--window_size": "100 seconds",
 74 |     "--extra-jars": "s3://aws-glue-assets-123456789012-atq4q5u/extra-jars/aws-sdk-java-2.17.224.jar",
 75 |     "--user-jars-first": "true"
 76 |   },
 77 |   "glue_connections_name": "iceberg-connection"
 78 | }
 79 | 
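The `glue_job_input_arguments` above are delivered to the Glue job as job parameters. As a rough sketch (argument names are taken from the example configuration above; the scripts under `src/main/python/` remain the source of truth), a job script can read them with `getResolvedOptions`:

```
import sys

from awsglue.utils import getResolvedOptions

# Resolve the job parameters defined in "glue_job_input_arguments" (names without the leading "--").
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'catalog',
    'database_name',
    'table_name',
    'primary_key',
    'kafka_topic_name',
    'starting_offsets_of_kafka_topic',
    'iceberg_s3_path',
    'lock_table_name',
    'aws_region',
    'window_size'
])

CATALOG = args['catalog']
ICEBERG_S3_PATH = args['iceberg_s3_path']
```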
80 | 81 | :information_source: The `--primary_key` option should be set to the Iceberg table's primary key column name. 82 | 83 | :warning: **You should create an S3 bucket for the glue job script and upload the glue job script file into that S3 bucket.** 84 | 85 | At this point you can now synthesize the CloudFormation template for this code. 86 | 87 |
 88 | (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 89 | (.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
 90 | (.venv) $ cdk synth --all
 91 | 
92 | 93 | To add additional dependencies, for example other CDK libraries, just add 94 | them to your `setup.py` file and rerun the `pip install -r requirements.txt` 95 | command. 96 | 97 | ## Run Test 98 | 99 | 1. Set up the **Apache Iceberg connector for AWS Glue** to use Apache Iceberg with AWS Glue jobs. 100 | 2. Create an S3 bucket for the Apache Iceberg table 101 |
102 |    (.venv) $ cdk deploy KafkaToIcebergS3Path
103 |    
104 | 3. Create an MSK cluster 105 |
106 |    (.venv) $ cdk deploy KafkaToIcebergStackVpc KafkaAsGlueStreamingJobDataSource
107 |    
108 | 4. Create an MSK connection for the Glue Streaming Job 109 |
110 |    (.venv) $ cdk deploy GlueMSKConnection
111 |    
112 | For more information, see [References](#references) (8) 113 | 5. Create an IAM Role for the Glue Streaming Job 114 |
115 |    (.venv) $ cdk deploy GlueStreamingMSKtoIcebergJobRole
116 |    
117 | 6. Set up a Kafka Client Machine 118 |
119 |    (.venv) $ cdk deploy KafkaClientEC2Instance
120 |    
121 | 7. Create a Glue Database for an Apache Iceberg table 122 |
123 |    (.venv) $ cdk deploy GlueIcebergDatabase
124 |    
125 | 8. Upload the **AWS SDK for Java 2.x** jar file into S3 126 |
127 |    (.venv) $ wget https://repo1.maven.org/maven2/software/amazon/awssdk/aws-sdk-java/2.17.224/aws-sdk-java-2.17.224.jar
128 |    (.venv) $ aws s3 cp aws-sdk-java-2.17.224.jar s3://aws-glue-assets-123456789012-atq4q5u/extra-jars/aws-sdk-java-2.17.224.jar
129 |    
130 | A Glue Streaming Job might fail because of the following error: 131 |
132 |    py4j.protocol.Py4JJavaError: An error occurred while calling o135.start.
133 |    : java.lang.NoSuchMethodError: software.amazon.awssdk.utils.SystemSetting.getStringValueFromEnvironmentVariable(Ljava/lang/String;)Ljava/util/Optional
134 |    
135 | We can work around the problem by starting the Glue Job with the additional parameters: 136 |
137 |    --extra-jars s3://path/to/aws-sdk-for-java-v2.jar
138 |    --user-jars-first true
139 |    
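   These two parameters can also be supplied per job run instead of being baked into the job definition. A sketch with `boto3` (job name and S3 path are the example values used in this README):

   ```
   import boto3

   glue = boto3.client('glue', region_name='us-east-1')

   # Pass the workaround parameters as run arguments for a single run.
   glue.start_job_run(
       JobName='streaming_data_from_kafka_into_iceberg_table',
       Arguments={
           '--extra-jars': 's3://aws-glue-assets-123456789012-atq4q5u/extra-jars/aws-sdk-java-2.17.224.jar',
           '--user-jars-first': 'true'
       }
   )
   ```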
140 | In order to do this, we might need to upload the **AWS SDK for Java 2.x** jar file into S3. 141 | 9. Create a Glue Streaming Job 142 | 143 | * (step 1) Select one of the Glue Job scripts and upload it into S3 144 | 145 | **List of Glue Job Scripts** 146 | | File name | Spark Writes | 147 | |-----------|--------------| 148 | | spark_dataframe_insert_iceberg_from_kafka.py | DataFrame append | 149 | | spark_sql_insert_overwrite_iceberg_from_kafka.py | SQL insert overwrite | 150 | | spark_sql_merge_into_iceberg_from_kafka.py | SQL merge into | 151 | 152 |
153 |      (.venv) $ ls src/main/python/
154 |       spark_dataframe_insert_iceberg_from_kafka.py
155 |       spark_sql_insert_overwrite_iceberg_from_kafka.py
156 |       spark_sql_merge_into_iceberg_from_kafka.py
157 |      (.venv) $ aws s3 mb s3://aws-glue-assets-123456789012-atq4q5u --region us-east-1
158 |      (.venv) $ aws s3 cp src/main/python/spark_sql_merge_into_iceberg_from_kafka.py s3://aws-glue-assets-123456789012-atq4q5u/scripts/
159 |      
160 | 161 | * (step 2) Provision the Glue Streaming Job 162 | 163 |
164 |      (.venv) $ cdk deploy GrantLFPermissionsOnGlueJobRole \
165 |                           GlueStreamingJobMSKtoIceberg
166 |      
167 | 168 | 10. Create a table with partitioned data in Amazon Athena 169 | 170 | Go to [Athena](https://console.aws.amazon.com/athena/home) on the AWS Management console. 171 | 172 | * (step 1) Create a database 173 | 174 | In order to create a new database called `iceberg_demo_db`, enter the following statement in the Athena query editor 175 | and click the **Run** button to execute the query. 176 | 177 |
178 |      CREATE DATABASE IF NOT EXISTS iceberg_demo_db
179 |      
180 | 181 | * (step 2) Create a table 182 | 183 | Copy the following query into the Athena query editor, replace the example bucket name in the `LOCATION` clause with your own S3 bucket name, and execute the query to create a new table. 184 |
185 |       CREATE TABLE iceberg_demo_db.iceberg_demo_table (
186 |         name string,
187 |         age int,
188 |         m_time timestamp
189 |       )
190 |       PARTITIONED BY (`name`)
191 |       LOCATION 's3://glue-iceberg-demo-atq4q5u/iceberg_demo_db/iceberg_demo_table'
192 |       TBLPROPERTIES (
193 |         'table_type'='iceberg'
194 |       );
195 |       
196 | If the query is successful, a table named `iceberg_demo_table` is created and displayed on the left panel under the **Tables** section. 197 | 198 | If you get an error, check if (a) you have updated the `LOCATION` to the correct S3 bucket name, (b) you have `iceberg_demo_db` selected under the **Database** dropdown, and (c) you have `AwsDataCatalog` selected as the **Data source**. 199 | 200 | :information_source: If you fail to create the table, give Athena users access permissions on `iceberg_demo_db` through [AWS Lake Formation](https://console.aws.amazon.com/lakeformation/home), or you can grant any principal using Athena access to `iceberg_demo_db` by running the following command: 201 |
202 |       (.venv) $ aws lakeformation grant-permissions \
203 |                 --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:user/example-user-id \
204 |                 --permissions CREATE_TABLE DESCRIBE ALTER DROP \
205 |                 --resource '{ "Database": { "Name": "iceberg_demo_db" } }'
206 |       (.venv) $ aws lakeformation grant-permissions \
207 |               --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:user/example-user-id \
208 |               --permissions SELECT DESCRIBE ALTER INSERT DELETE DROP \
209 |               --resource '{ "Table": {"DatabaseName": "iceberg_demo_db", "TableWildcard": {}} }'
210 |       
211 | 212 | 11. Make sure the glue job can access the Iceberg table in the Glue Catalog database 213 | 214 | We can check the granted permissions by running the following command: 215 |
216 |     (.venv) $ aws lakeformation list-permissions | jq -r '.PrincipalResourcePermissions[] | select(.Principal.DataLakePrincipalIdentifier | endswith(":role/GlueStreamingJobRole-MSK2Iceberg"))'
217 |     
218 | If nothing is found, we need to manually grant the glue job the required permissions by running the following command: 219 |
220 |     (.venv) $ aws lakeformation grant-permissions \
221 |                --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:role/GlueStreamingJobRole-MSK2Iceberg \
222 |                --permissions SELECT DESCRIBE ALTER INSERT DELETE \
223 |                --resource '{ "Table": {"DatabaseName": "iceberg_demo_db", "TableWildcard": {}} }'
224 |     
225 | 226 | 12. Run the glue job to load data from MSK into S3 227 |
228 |     (.venv) $ aws glue start-job-run --job-name streaming_data_from_kafka_into_iceberg_table
229 |     
230 | 231 | 13. Generate streaming data 232 | 233 | 1. Connect to the MSK client EC2 host. 234 | 235 | You can connect to an EC2 instance using the EC2 Instance Connect CLI.
Install the `ec2instanceconnectcli` Python package and use the **mssh** command with the instance ID as follows. 237 |
238 |        $ sudo pip install ec2instanceconnectcli
239 |        $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE
240 |        
241 | 242 | 2. Create an Apache Kafka topic 243 | After connecting to your EC2 host, use the client machine to create a topic on the cluster. 244 | Run the following command to create a topic called `ev_stream_data`. 245 |
246 |        [ec2-user@ip-172-31-0-180 ~]$ export PATH=$HOME/opt/kafka/bin:$PATH
247 |        [ec2-user@ip-172-31-0-180 ~]$ export BS={BootstrapBrokerString}
248 |        [ec2-user@ip-172-31-0-180 ~]$ kafka-topics.sh --bootstrap-server $BS \
249 |           --create \
250 |           --topic ev_stream_data \
251 |           --partitions 3 \
252 |           --replication-factor 2
253 |        
254 | 255 | 3. Produce and consume data 256 | 257 | **(1) To produce messages** 258 | 259 | Run the following command to generate messages into the topic on the cluster. 260 | 261 |
262 |        [ec2-user@ip-172-31-0-180 ~]$ python3 gen_fake_data.py | kafka-console-producer.sh \
263 |           --bootstrap-server $BS \
264 |           --topic ev_stream_data \
265 |           --property parse.key=true \
266 |           --property key.separator='\t'
267 |        
268 | 269 | **(2) To consume messages** 270 | 271 | Keep the connection to the client machine open, and then open a second, separate connection to that machine in a new window. 272 | 273 |
274 |        [ec2-user@ip-172-31-0-180 ~]$ kafka-console-consumer.sh --bootstrap-server $BS \
275 |           --topic ev_stream_data \
276 |           --from-beginning
277 |        
278 | 279 | You should start seeing the messages you entered earlier when you used the console producer command. 280 | Enter more messages in the producer window, and watch them appear in the consumer window. 281 | 282 | We can synthetically generate data in JSON format using a simple Python application (a sketch follows the example below). 283 | 284 | Synthetic data example, ordered by `name` and `m_time`: 285 |
286 |     {"name": "Arica", "age": 48, "m_time": "2023-04-11 19:13:21"}
287 |     {"name": "Arica", "age": 32, "m_time": "2023-10-20 17:24:17"}
288 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
289 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
290 |     {"name": "Gonzalo", "age": 37, "m_time": "2023-01-11 06:18:26"}
291 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
292 |     {"name": "Micheal", "age": 45, "m_time": "2023-04-07 06:18:17"}
293 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
294 |     {"name": "Takisha", "age": 48, "m_time": "2023-12-20 16:44:13"}
295 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
296 |     
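    The repository ships `src/utils/gen_fake_data.py` for this (it uses `mimesis`, see `requirements-dev.txt`). The following is only a simplified stand-in built on the standard library; it prints one key, a tab, and a JSON record per line, which matches `parse.key=true` and the tab key separator in the producer command above:

    ```
    import json
    import random
    from datetime import datetime, timedelta

    NAMES = ['Arica', 'Fernando', 'Gonzalo', 'Micheal', 'Takisha']

    def gen_record():
        # Random record shaped like the example data above.
        return {
            'name': random.choice(NAMES),
            'age': random.randint(10, 70),
            'm_time': (datetime(2023, 1, 1)
                       + timedelta(seconds=random.randint(0, 364 * 24 * 3600))).strftime('%Y-%m-%d %H:%M:%S')
        }

    if __name__ == '__main__':
        for _ in range(10):
            record = gen_record()
            # parse.key=true with a tab separator expects "key<TAB>value" per line.
            print(f"{record['name']}\t{json.dumps(record)}")
    ```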
297 | 298 | Spark Writes using `DataFrame append` inserts all records into the Iceberg table. 299 |
300 |     {"name": "Arica", "age": 48, "m_time": "2023-04-11 19:13:21"}
301 |     {"name": "Arica", "age": 32, "m_time": "2023-10-20 17:24:17"}
302 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
303 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
304 |     {"name": "Gonzalo", "age": 37, "m_time": "2023-01-11 06:18:26"}
305 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
306 |     {"name": "Micheal", "age": 45, "m_time": "2023-04-07 06:18:17"}
307 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
308 |     {"name": "Takisha", "age": 48, "m_time": "2023-12-20 16:44:13"}
309 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
310 |     
311 | 312 | Spark Writes using `SQL insert overwrite` or `SQL merge into` insert the last updated records into the Iceberg table. 313 |
314 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
315 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
316 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
317 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
318 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
319 |     
320 | 321 | 14. Check streaming data in S3 322 | 323 | After `3~5` minutes, you can see that the streaming data have been delivered from **MSK** to **S3**. 324 | 325 | ![iceberg-table](./assets/iceberg-table.png) 326 | ![iceberg-table](./assets/iceberg-data-level-01.png) 327 | ![iceberg-table](./assets/iceberg-data-level-02.png) 328 | ![iceberg-table](./assets/iceberg-data-level-03.png) 329 | 330 | 15. Run test query 331 | 332 | Enter the following SQL statement and execute the query. 333 |
334 |     SELECT COUNT(*)
335 |     FROM iceberg_demo_db.iceberg_demo_table;
336 |     
337 | 338 | ## Clean Up 339 | 1. Stop the glue streaming job. 340 |
341 |    (.venv) $ JOB_RUN_IDS=$(aws glue get-job-runs --job-name streaming_data_from_kafka_into_iceberg_table | jq -r '.JobRuns[] | select(.JobRunState == "RUNNING") | .Id')
342 |    (.venv) $ aws glue batch-stop-job-run --job-name streaming_data_from_kafka_into_iceberg_table --job-run-ids $JOB_RUN_IDS
343 |    
344 | 2. Delete the CloudFormation stack by running the below command. 345 |
346 |    (.venv) $ cdk destroy --all
347 |    
348 | 349 | ## Useful commands 350 | 351 | * `cdk ls` list all stacks in the app 352 | * `cdk synth` emits the synthesized CloudFormation template 353 | * `cdk deploy` deploy this stack to your default AWS account/region 354 | * `cdk diff` compare deployed stack with current state 355 | * `cdk docs` open CDK documentation 356 | 357 | ## References 358 | 359 | * (1) [AWS Glue versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html): The AWS Glue version determines the versions of Apache Spark and Python that AWS Glue supports. 360 | * (2) [Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel \(2022-06-21\)](https://aws.amazon.com/ko/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/) 361 | * (3) [Spark Stream Processing with Amazon EMR using Apache Kafka streams running in Amazon MSK (2022-06-30)](https://yogender027mae.medium.com/spark-stream-processing-with-amazon-emr-using-apache-kafka-streams-running-in-amazon-msk-9776036c18d9) 362 | * [yogenderPalChandra/AmazonMSK-EMR-tem-data](https://github.com/yogenderPalChandra/AmazonMSK-EMR-tem-data) - This is repo for medium article "Spark Stream Processing with Amazon EMR using Kafka streams running in Amazon MSK" 363 | * (4) [Amazon Athena Using Iceberg tables](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html) 364 | * (5) [Streaming ETL jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html) 365 | * (6) [AWS Glue job parameters](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html) 366 | * (7) [Connection types and options for ETL in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kafka) 367 | * (8) [Creating an AWS Glue connection for an Apache Kafka data stream](https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html#create-conn-streaming) 368 | * (9) [Crafting serverless streaming ETL jobs with AWS Glue (2020-10-14)](https://aws.amazon.com/ko/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/) 369 | * (10) [Apache Iceberg - Spark Writes with SQL (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-writes/) 370 | * (11) [Apache Iceberg - Spark Structured Streaming (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/) 371 | * (12) [Apache Iceberg - Writing against partitioned table (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#writing-against-partitioned-table) 372 | * Iceberg supports append and complete output modes: 373 | * `append`: appends the rows of every micro-batch to the table 374 | * `complete`: replaces the table contents every micro-batch 375 | 376 | Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write against partitioned table.
377 | Otherwise, you might encounter the following error: 378 |
379 |        pyspark.sql.utils.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrame/Datasets;
380 |        
381 | * (13) [Apache Iceberg - Maintenance for streaming tables (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#maintenance-for-streaming-tables) 382 | * (14) [awsglue python package](https://github.com/awslabs/aws-glue-libs): The awsglue Python package contains the Python portion of the AWS Glue library. This library extends PySpark to support serverless ETL on AWS. 383 | 384 | ## Troubleshooting 385 | 386 | * Granting database or table permissions error using AWS CDK 387 | * Error message: 388 |
389 |      AWS::LakeFormation::PrincipalPermissions | CfnPrincipalPermissions Resource handler returned message: "Resource does not exist or requester is not authorized to access requested permissions. (Service: LakeFormation, Status Code: 400, Request ID: f4d5e58b-29b6-4889-9666-7e38420c9035)" (RequestToken: 4a4bb1d6-b051-032f-dd12-5951d7b4d2a9, HandlerErrorCode: AccessDenied)
390 |      
391 | * Solution: 392 | 393 | The role assumed by cdk is not a data lake administrator. (e.g., `cdk-hnb659fds-deploy-role-12345678912-us-east-1`)
So, deploying `PrincipalPermissions` fails with an error such as: 395 | 396 | `Resource does not exist or requester is not authorized to access requested permissions.` 397 | 398 | In order to solve the error, it is necessary to promote the cdk execution role to a data lake administrator, as sketched below.
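    A minimal CDK sketch of that promotion (the role ARN below assumes the default CDK bootstrap qualifier `hnb659fds`; adjust it to your environment, and note this is an illustration rather than code from this repository):

    ```
    import aws_cdk as cdk
    from aws_cdk import aws_lakeformation

    class DataLakeAdminStack(cdk.Stack):
        def __init__(self, scope, construct_id, **kwargs):
            super().__init__(scope, construct_id, **kwargs)

            # Default CDK bootstrap deploy role; the qualifier must match your bootstrap.
            cdk_deploy_role_arn = (f"arn:aws:iam::{self.account}:role/"
                                   f"cdk-hnb659fds-deploy-role-{self.account}-{self.region}")

            # Register the CDK deploy role as a Lake Formation data lake administrator.
            aws_lakeformation.CfnDataLakeSettings(self, "CDKDataLakeSettings",
                admins=[
                    aws_lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty(
                        data_lake_principal_identifier=cdk_deploy_role_arn
                    )
                ]
            )
    ```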
For example, https://github.com/aws-samples/data-lake-as-code/blob/mainline/lib/stacks/datalake-stack.ts#L68 400 | 401 | * Reference: 402 | 403 | [https://github.com/aws-samples/data-lake-as-code](https://github.com/aws-samples/data-lake-as-code) - Data Lake as Code 404 | 405 | -------------------------------------------------------------------------------- /msk-serverless-to-iceberg/README.md: -------------------------------------------------------------------------------- 1 | 2 | # AWS Glue Streaming ETL Job with Amazon MSK Serverless and Apache Iceberg 3 | 4 | ![glue-streaming-data-from-msk-serverless-to-iceberg-table](./glue-streaming-data-from-msk-serverless-to-iceberg-table.svg) 5 | 6 | In this project, we create a streaming ETL job in AWS Glue to integrate Iceberg with a streaming use case and create an in-place updatable data lake on Amazon S3. 7 | 8 | After streaming data is ingested from Amazon MSK Serverless to Amazon S3, you can query the data with [Amazon Athena](http://aws.amazon.com/athena). 9 | 10 | This project can be deployed with [AWS CDK Python](https://docs.aws.amazon.com/cdk/api/v2/). 11 | The `cdk.json` file tells the CDK Toolkit how to execute your app. 12 | 13 | This project is set up like a standard Python project. The initialization 14 | process also creates a virtualenv within this project, stored under the `.venv` 15 | directory. To create the virtualenv it assumes that there is a `python3` 16 | (or `python` for Windows) executable in your path with access to the `venv` 17 | package. If for any reason the automatic creation of the virtualenv fails, 18 | you can create the virtualenv manually. 19 | 20 | To manually create a virtualenv on MacOS and Linux: 21 | 22 | ``` 23 | $ python3 -m venv .venv 24 | ``` 25 | 26 | After the init process completes and the virtualenv is created, you can use the following 27 | step to activate your virtualenv. 28 | 29 | ``` 30 | $ source .venv/bin/activate 31 | ``` 32 | 33 | If you are on a Windows platform, you would activate the virtualenv like this: 34 | 35 | ``` 36 | % .venv\Scripts\activate.bat 37 | ``` 38 | 39 | Once the virtualenv is activated, you can install the required dependencies. 40 | 41 | ``` 42 | (.venv) $ pip install -r requirements.txt 43 | ``` 44 | 45 | In case of `AWS Glue 3.0`, before synthesizing the CloudFormation template, **you first need to set up the Apache Iceberg connector for AWS Glue in order to use Apache Iceberg with AWS Glue jobs.** (For more information, see [References](#references) (2)) 46 | 47 | Then you should set up the cdk context configuration file, `cdk.context.json`, appropriately. 48 | 49 | For example: 50 |
 51 | {
 52 |   "vpc_name": "default",
 53 |   "msk_cluster_name": "iceberg-demo-stream",
 54 |   "glue_assets_s3_bucket_name": "aws-glue-assets-123456789012-atq4q5u",
 55 |   "glue_job_script_file_name": "spark_sql_merge_into_iceberg_from_msk_serverless.py",
 56 |   "glue_job_name": "streaming_data_from_msk_serverless_into_iceberg_table",
 57 |   "glue_job_input_arguments": {
 58 |     "--catalog": "job_catalog",
 59 |     "--database_name": "iceberg_demo_db",
 60 |     "--table_name": "iceberg_demo_table",
 61 |     "--primary_key": "name",
 62 |     "--kafka_topic_name": "ev_stream_data",
 63 |     "--starting_offsets_of_kafka_topic": "latest",
 64 |     "--iceberg_s3_path": "s3://glue-iceberg-demo-atq4q5u/iceberg_demo_db",
 65 |     "--lock_table_name": "iceberg_lock",
 66 |     "--aws_region": "us-east-1",
 67 |     "--window_size": "100 seconds",
 68 |     "--extra-jars": "s3://aws-glue-assets-123456789012-atq4q5u/extra-jars/aws-sdk-java-2.17.224.jar",
 69 |     "--user-jars-first": "true"
 70 |   },
 71 |   "glue_connections_name": "iceberg-connection"
 72 | }
 73 | 
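MSK Serverless only supports IAM authentication, so the Glue streaming scripts have to pass the MSK IAM options to the Kafka source. Conceptually the reader configuration looks like the sketch below (the bootstrap server is a placeholder, and the actual scripts in `src/main/python/` resolve the broker address and topic from the job arguments and the Glue connection):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# IAM authentication for MSK Serverless; requires the aws-msk-iam-auth jar on the classpath.
kafka_options = {
    "kafka.bootstrap.servers": "boot-xxxxxxxx.c1.kafka-serverless.us-east-1.amazonaws.com:9098",  # placeholder
    "subscribe": "ev_stream_data",
    "startingOffsets": "latest",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "AWS_MSK_IAM",
    "kafka.sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
    "kafka.sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
}

streaming_df = spark.readStream.format("kafka").options(**kafka_options).load()
```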
74 | 75 | :information_source: The `--primary_key` option should be set to the Iceberg table's primary key column name. 76 | 77 | :warning: **You should create an S3 bucket for the glue job script and upload the glue job script file into that S3 bucket.** 78 | 79 | At this point you can now synthesize the CloudFormation template for this code. 80 | 81 |
 82 | (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 83 | (.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
 84 | (.venv) $ cdk synth --all
 85 | 
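These environment variables tell the CDK app which account and region to synthesize for. The usual pattern in `app.py` (shown here as a generic sketch, not a verbatim copy of this project's `app.py`) is:

```
import os

import aws_cdk as cdk

app = cdk.App()

# Resolve the target account/region from the variables exported above.
APP_ENV = cdk.Environment(
    account=os.environ['CDK_DEFAULT_ACCOUNT'],
    region=os.environ['CDK_DEFAULT_REGION']
)

# Each stack in this project is then instantiated with env=APP_ENV before app.synth().
app.synth()
```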
86 | 87 | To add additional dependencies, for example other CDK libraries, just add 88 | them to your `setup.py` file and rerun the `pip install -r requirements.txt` 89 | command. 90 | 91 | ## Run Test 92 | 93 | 1. Set up the **Apache Iceberg connector for AWS Glue** to use Apache Iceberg with AWS Glue jobs. 94 | 2. Create an S3 bucket for the Apache Iceberg table 95 |
 96 |    (.venv) $ cdk deploy MSKServerlessToIcebergS3Path
 97 |    
98 | 3. Create an MSK Serverless cluster 99 |
100 |    (.venv) $ cdk deploy MSKServerlessToIcebergStackVpc MSKServerlessAsGlueStreamingJobDataSource
101 |    
102 | 4. Create an MSK connection for the Glue Streaming Job 103 |
104 |    (.venv) $ cdk deploy GlueMSKServerlessConnection
105 |    
106 | For more information, see [References](#references) (8) 107 | 5. Create an IAM Role for the Glue Streaming Job 108 |
109 |    (.venv) $ cdk deploy GlueStreamingMSKServerlessToIcebergJobRole
110 |    
111 | 6. Set up a Kafka Client Machine 112 |
113 |    (.venv) $ cdk deploy MSKServerlessClientEC2Instance
114 |    
115 | 7. Create a Glue Database for an Apache Iceberg table 116 |
117 |    (.venv) $ cdk deploy GlueIcebergDatabase
118 |    
119 | 8. Upload the **AWS SDK for Java 2.x** jar file into S3 120 |
121 |    (.venv) $ wget https://repo1.maven.org/maven2/software/amazon/awssdk/aws-sdk-java/2.17.224/aws-sdk-java-2.17.224.jar
122 |    (.venv) $ aws s3 cp aws-sdk-java-2.17.224.jar s3://aws-glue-assets-123456789012-atq4q5u/extra-jars/aws-sdk-java-2.17.224.jar
123 |    
124 | A Glue Streaming Job might fail because of the following error: 125 |
126 |    py4j.protocol.Py4JJavaError: An error occurred while calling o135.start.
127 |    : java.lang.NoSuchMethodError: software.amazon.awssdk.utils.SystemSetting.getStringValueFromEnvironmentVariable(Ljava/lang/String;)Ljava/util/Optional
128 |    
129 | We can work around the problem by starting the Glue Job with the additional parameters: 130 |
131 |    --extra-jars s3://path/to/aws-sdk-for-java-v2.jar
132 |    --user-jars-first true
133 |    
134 | In order to do this, we might need to upload the **AWS SDK for Java 2.x** jar file into S3. 135 | 9. Create a Glue Streaming Job 136 | 137 | * (step 1) Select one of the Glue Job scripts and upload it into S3 138 | 139 | **List of Glue Job Scripts** 140 | | File name | Spark Writes | 141 | |-----------|--------------| 142 | | spark_dataframe_insert_iceberg_from_msk_serverless.py | DataFrame append | 143 | | spark_sql_insert_overwrite_iceberg_from_msk_serverless.py | SQL insert overwrite | 144 | | spark_sql_merge_into_iceberg_from_msk_serverless.py | SQL merge into | 145 | 146 |
147 |      (.venv) $ ls src/main/python/
148 |       spark_dataframe_insert_iceberg_from_msk_serverless.py
149 |       spark_sql_insert_overwrite_iceberg_from_msk_serverless.py
150 |       spark_sql_merge_into_iceberg_from_msk_serverless.py
151 |      (.venv) $ aws s3 mb s3://aws-glue-assets-123456789012-atq4q5u --region us-east-1
152 |      (.venv) $ aws s3 cp src/main/python/spark_sql_merge_into_iceberg_from_msk_serverless.py s3://aws-glue-assets-123456789012-atq4q5u/scripts/
153 |      
154 | 155 | * (step 2) Provision the Glue Streaming Job 156 | 157 |
158 |      (.venv) $ cdk deploy GrantLFPermissionsOnGlueJobRole \
159 |                           GlueStreamingJobMSKServerlessToIceberg
160 |      
161 | 162 | 10. Create a table with partitioned data in Amazon Athena 163 | 164 | Go to [Athena](https://console.aws.amazon.com/athena/home) on the AWS Management console. 165 | 166 | * (step 1) Create a database 167 | 168 | In order to create a new database called `iceberg_demo_db`, enter the following statement in the Athena query editor 169 | and click the **Run** button to execute the query. 170 | 171 |
172 |      CREATE DATABASE IF NOT EXISTS iceberg_demo_db
173 |      
174 | 175 | * (step 2) Create a table 176 | 177 | Copy the following query into the Athena query editor, replace the example bucket name in the `LOCATION` clause with your own S3 bucket name, and execute the query to create a new table. 178 | 179 |
180 |       CREATE TABLE iceberg_demo_db.iceberg_demo_table (
181 |         name string,
182 |         age int,
183 |         m_time timestamp
184 |       )
185 |       PARTITIONED BY (`name`)
186 |       LOCATION 's3://glue-iceberg-demo-atq4q5u/iceberg_demo_db/iceberg_demo_table'
187 |       TBLPROPERTIES (
188 |         'table_type'='iceberg'
189 |       );
190 |       
191 | If the query is successful, a table named `iceberg_demo_table` is created and displayed on the left panel under the **Tables** section. 192 | 193 | If you get an error, check if (a) you have updated the `LOCATION` to the correct S3 bucket name, (b) you have `iceberg_demo_db` selected under the **Database** dropdown, and (c) you have `AwsDataCatalog` selected as the **Data source**. 194 | 195 | :information_source: If you fail to create the table, give Athena users access permissions on `iceberg_demo_db` through [AWS Lake Formation](https://console.aws.amazon.com/lakeformation/home), or you can grant any principal using Athena access to `iceberg_demo_db` by running the following command: 196 |
197 |       (.venv) $ aws lakeformation grant-permissions \
198 |                 --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:user/example-user-id \
199 |                 --permissions CREATE_TABLE DESCRIBE ALTER DROP \
200 |                 --resource '{ "Database": { "Name": "iceberg_demo_db" } }'
201 |       (.venv) $ aws lakeformation grant-permissions \
202 |               --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:user/example-user-id \
203 |               --permissions SELECT DESCRIBE ALTER INSERT DELETE DROP \
204 |               --resource '{ "Table": {"DatabaseName": "iceberg_demo_db", "TableWildcard": {}} }'
205 |       
206 | 207 | 11. Make sure the glue job can access the Iceberg table in the Glue Catalog database 208 | 209 | We can check the granted permissions by running the following command: 210 |
211 |     (.venv) $ aws lakeformation list-permissions | jq -r '.PrincipalResourcePermissions[] | select(.Principal.DataLakePrincipalIdentifier | endswith(":role/GlueJobRole-MSKServerless2Iceberg"))'
212 |     
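    The same check can be scripted with `boto3`, mirroring the `jq` filter above (the role name must match the role created by the `GlueStreamingMSKServerlessToIcebergJobRole` stack):

    ```
    import boto3

    lf = boto3.client('lakeformation', region_name='us-east-1')

    role_suffix = ':role/GlueJobRole-MSKServerless2Iceberg'
    found, token = [], None

    # Page through all Lake Formation permissions and keep those granted to the Glue job role.
    while True:
        kwargs = {'NextToken': token} if token else {}
        response = lf.list_permissions(**kwargs)
        for permission in response['PrincipalResourcePermissions']:
            if permission['Principal']['DataLakePrincipalIdentifier'].endswith(role_suffix):
                found.append(permission)
        token = response.get('NextToken')
        if not token:
            break

    print(f"{len(found)} permission(s) granted to the Glue job role")
    ```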
213 | If nothing is found, we need to manually grant the glue job the required permissions by running the following command: 214 |
215 |     (.venv) $ aws lakeformation grant-permissions \
216 |                 --principal DataLakePrincipalIdentifier=arn:aws:iam::{account-id}:role/GlueJobRole-MSKServerless2Iceberg \
217 |                 --permissions SELECT DESCRIBE ALTER INSERT DELETE \
218 |                 --resource '{ "Table": {"DatabaseName": "iceberg_demo_db", "TableWildcard": {}} }'
219 |     
220 | 221 | 12. Run the glue job to load data from MSK Serverless into S3 222 |
223 |     (.venv) $ aws glue start-job-run --job-name streaming_data_from_msk_serverless_into_iceberg_table
224 |     
225 | 226 | 13. Generate streaming data 227 | 228 | 1. Connect to the MSK client EC2 host. 229 | 230 | You can connect to an EC2 instance using the EC2 Instance Connect CLI.
Install the `ec2instanceconnectcli` Python package and use the **mssh** command with the instance ID as follows. 232 |
233 |        $ sudo pip install ec2instanceconnectcli
234 |        $ mssh ec2-user@i-001234a4bf70dec41EXAMPLE
235 |        
236 | 237 | 2. Create an Apache Kafka topic 238 | After connecting to your EC2 host, use the client machine to create a topic on the cluster. 239 | Run the following command to create a topic called `ev_stream_data`. 240 |
241 |        [ec2-user@ip-172-31-0-180 ~]$ export PATH=$HOME/opt/kafka/bin:$PATH
242 |        [ec2-user@ip-172-31-0-180 ~]$ export BS={BootstrapBrokerString}
243 |        [ec2-user@ip-172-31-0-180 ~]$ kafka-topics.sh --bootstrap-server $BS \
244 |           --command-config client.properties \
245 |           --create \
246 |           --topic ev_stream_data \
247 |           --partitions 3 \
248 |           --replication-factor 2
249 |        
250 | 251 | 3. Produce and consume data 252 | 253 | **(1) To produce messages** 254 | 255 | Run the following command to generate messages into the topic on the cluster. 256 | 257 |
258 |        [ec2-user@ip-172-31-0-180 ~]$ python3 gen_fake_data.py | kafka-console-producer.sh \
259 |           --bootstrap-server $BS \
260 |           --producer.config client.properties \
261 |           --topic ev_stream_data \
262 |           --property parse.key=true \
263 |           --property key.separator='\t'
264 |        
265 | 266 | **(2) To consume messages** 267 | 268 | Keep the connection to the client machine open, and then open a second, separate connection to that machine in a new window. 269 | 270 |
271 |        [ec2-user@ip-172-31-0-180 ~]$ kafka-console-consumer.sh --bootstrap-server $BS \
272 |           --consumer.config client.properties \
273 |           --topic ev_stream_data \
274 |           --from-beginning
275 |        
276 | 277 | You should start seeing the messages you entered earlier when you used the console producer command. 278 | Enter more messages in the producer window, and watch them appear in the consumer window. 279 | 280 | We can synthetically generate data in JSON format using a simple Python application. 281 | 282 | Synthetic data example, ordered by `name` and `m_time`: 283 |
284 |     {"name": "Arica", "age": 48, "m_time": "2023-04-11 19:13:21"}
285 |     {"name": "Arica", "age": 32, "m_time": "2023-10-20 17:24:17"}
286 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
287 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
288 |     {"name": "Gonzalo", "age": 37, "m_time": "2023-01-11 06:18:26"}
289 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
290 |     {"name": "Micheal", "age": 45, "m_time": "2023-04-07 06:18:17"}
291 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
292 |     {"name": "Takisha", "age": 48, "m_time": "2023-12-20 16:44:13"}
293 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
294 |     
295 | 296 | Spark Writes using `DataFrame append` inserts all records into the Iceberg table. 297 |
298 |     {"name": "Arica", "age": 48, "m_time": "2023-04-11 19:13:21"}
299 |     {"name": "Arica", "age": 32, "m_time": "2023-10-20 17:24:17"}
300 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
301 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
302 |     {"name": "Gonzalo", "age": 37, "m_time": "2023-01-11 06:18:26"}
303 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
304 |     {"name": "Micheal", "age": 45, "m_time": "2023-04-07 06:18:17"}
305 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
306 |     {"name": "Takisha", "age": 48, "m_time": "2023-12-20 16:44:13"}
307 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
308 |     
309 | 310 | Spark Writes using `SQL insert overwrite` or `SQL merge into` inserts only the last updated record for each primary key into the Iceberg table (a sketch of the merge pattern follows the example below). 311 |
312 |     {"name": "Arica", "age": 45, "m_time": "2023-12-26 01:20:49"}
313 |     {"name": "Fernando", "age": 16, "m_time": "2023-05-22 00:13:55"}
314 |     {"name": "Gonzalo", "age": 60, "m_time": "2023-01-25 16:54:26"}
315 |     {"name": "Micheal", "age": 44, "m_time": "2023-12-14 09:02:57"}
316 |     {"name": "Takisha", "age": 24, "m_time": "2023-12-30 12:38:23"}
317 |     
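    A condensed sketch of that merge pattern (the shipped script `spark_sql_merge_into_iceberg_from_msk_serverless.py` is the authoritative version; catalog, database, table, key, and trigger values are the example values from `cdk.context.json`):

    ```
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    def upsert_to_iceberg(batch_df, batch_id):
        # Keep only the newest record per primary key within the micro-batch.
        w = Window.partitionBy('name').orderBy(col('m_time').desc())
        latest_df = (batch_df.withColumn('_rn', row_number().over(w))
                     .where(col('_rn') == 1).drop('_rn'))
        latest_df.createOrReplaceTempView('kafka_updates')

        spark.sql("""
            MERGE INTO job_catalog.iceberg_demo_db.iceberg_demo_table t
            USING kafka_updates s
            ON t.name = s.name
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)

    # parsed_df is assumed to be the streaming DataFrame parsed from the Kafka value column.
    # query = (parsed_df.writeStream.foreachBatch(upsert_to_iceberg)
    #          .trigger(processingTime='100 seconds').start())
    ```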
318 | 319 | 14. Check streaming data in S3 320 | 321 | After `3~5` minutes, you can see that the streaming data have been delivered from **MSK Serverless** to **S3**. 322 | 323 | ![iceberg-table](./assets/iceberg-table.png) 324 | ![iceberg-table](./assets/iceberg-data-level-01.png) 325 | ![iceberg-table](./assets/iceberg-data-level-02.png) 326 | ![iceberg-table](./assets/iceberg-data-level-03.png) 327 | 328 | 15. Run test query 329 | 330 | Enter the following SQL statement and execute the query. 331 |
332 |     SELECT COUNT(*)
333 |     FROM iceberg_demo_db.iceberg_demo_table;
334 |     
335 | 336 | ## Clean Up 337 | 1. Stop the glue streaming job. 338 |
339 |    (.venv) $ JOB_RUN_IDS=$(aws glue get-job-runs --job-name streaming_data_from_msk_serverless_into_iceberg_table | jq -r '.JobRuns[] | select(.JobRunState == "RUNNING") | .Id')
340 |    (.venv) $ aws glue batch-stop-job-run --job-name streaming_data_from_msk_serverless_into_iceberg_table --job-run-ids $JOB_RUN_IDS
341 |    
342 | 2. Delete the CloudFormation stack by running the below command. 343 |
344 |    (.venv) $ cdk destroy --all
345 |    
346 | 347 | ## Useful commands 348 | 349 | * `cdk ls` list all stacks in the app 350 | * `cdk synth` emits the synthesized CloudFormation template 351 | * `cdk deploy` deploy this stack to your default AWS account/region 352 | * `cdk diff` compare deployed stack with current state 353 | * `cdk docs` open CDK documentation 354 | 355 | ## References 356 | 357 | * (1) [AWS Glue versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html): The AWS Glue version determines the versions of Apache Spark and Python that AWS Glue supports. 358 | * (2) [Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel \(2022-06-21\)](https://aws.amazon.com/ko/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/) 359 | * (3) [Spark Stream Processing with Amazon EMR using Apache Kafka streams running in Amazon MSK (2022-06-30)](https://yogender027mae.medium.com/spark-stream-processing-with-amazon-emr-using-apache-kafka-streams-running-in-amazon-msk-9776036c18d9) 360 | * [yogenderPalChandra/AmazonMSK-EMR-tem-data](https://github.com/yogenderPalChandra/AmazonMSK-EMR-tem-data) - This is repo for medium article "Spark Stream Processing with Amazon EMR using Kafka streams running in Amazon MSK" 361 | * (4) [Amazon Athena Using Iceberg tables](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html) 362 | * (5) [Streaming ETL jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html) 363 | * (6) [AWS Glue job parameters](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html) 364 | * (7) [Connection types and options for ETL in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kafka) 365 | * (8) [Creating an AWS Glue connection for an Apache Kafka data stream](https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html#create-conn-streaming) 366 | * (9) [Crafting serverless streaming ETL jobs with AWS Glue (2020-10-14)](https://aws.amazon.com/ko/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/) 367 | * (10) [Actions, resources, and condition keys for Apache Kafka APIs for Amazon MSK clusters](https://docs.aws.amazon.com/service-authorization/latest/reference/list_apachekafkaapisforamazonmskclusters.html) 368 | * (11) [Apache Iceberg - Spark Writes with SQL (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-writes/) 369 | * (12) [Apache Iceberg - Spark Structured Streaming (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/) 370 | * (13) [Apache Iceberg - Writing against partitioned table (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#writing-against-partitioned-table) 371 | * Iceberg supports append and complete output modes: 372 | * `append`: appends the rows of every micro-batch to the table 373 | * `complete`: replaces the table contents every micro-batch 374 | 375 | Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write against partitioned table.
376 | Otherwise, you might encounter the following error: 377 |
378 |        pyspark.sql.utils.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrame/Datasets;
379 |        
380 | * (14) [Apache Iceberg - Maintenance for streaming tables (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#maintenance-for-streaming-tables) 381 | * (15) [awsglue python package](https://github.com/awslabs/aws-glue-libs): The awsglue Python package contains the Python portion of the AWS Glue library. This library extends PySpark to support serverless ETL on AWS. 382 | * (16) [AWS Glue Notebook Samples](https://github.com/aws-samples/aws-glue-samples/tree/master/examples/notebooks) - sample iPython notebook files which show you how to use open data lake formats; Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue Interactive Sessions and AWS Glue Studio Notebook. 383 | 384 | ## Troubleshooting 385 | 386 | * Granting database or table permissions error using AWS CDK 387 | * Error message: 388 |
389 |      AWS::LakeFormation::PrincipalPermissions | CfnPrincipalPermissions Resource handler returned message: "Resource does not exist or requester is not authorized to access requested permissions. (Service: LakeFormation, Status Code: 400, Request ID: f4d5e58b-29b6-4889-9666-7e38420c9035)" (RequestToken: 4a4bb1d6-b051-032f-dd12-5951d7b4d2a9, HandlerErrorCode: AccessDenied)
390 |      
391 | * Solution: 392 | 393 | The role assumed by cdk is not a data lake administrator. (e.g., `cdk-hnb659fds-deploy-role-12345678912-us-east-1`)
So, deploying `PrincipalPermissions` fails with an error such as: 395 | 396 | `Resource does not exist or requester is not authorized to access requested permissions.` 397 | 398 | In order to solve the error, it is necessary to promote the cdk execution role to a data lake administrator.
399 | For example, https://github.com/aws-samples/data-lake-as-code/blob/mainline/lib/stacks/datalake-stack.ts#L68 400 | 401 | * Reference: 402 | 403 | [https://github.com/aws-samples/data-lake-as-code](https://github.com/aws-samples/data-lake-as-code) - Data Lake as Code 404 | 405 | -------------------------------------------------------------------------------- /msk-to-iceberg/glue-streaming-data-from-kafka-to-iceberg-table.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[architecture diagram text: AWS Cloud, Private subnet, Python Data Generator, MSK, Glue Streaming, Glue Data Catalog, S3, Athena]
-------------------------------------------------------------------------------- /msk-serverless-to-iceberg/glue-streaming-data-from-msk-serverless-to-iceberg-table.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[architecture diagram text: AWS Cloud, Private subnet, Python Data Generator, MSK Serverless, Glue Streaming, Glue Data Catalog, S3, Athena]
--------------------------------------------------------------------------------