├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── docs
│   ├── AWSGlueTableProperties.png
│   ├── BuildOutput.png
│   ├── ConfigurationUpdate.png
│   ├── DagFolder.png
│   └── MwaaFolder.png
├── scripts
│   ├── create_delta_glue_job.py
│   ├── create_delta_s3.py
│   ├── create_hudi_glue_job.py
│   ├── create_hudi_s3.py
│   ├── create_iceberg_glue_job.py
│   ├── create_iceberg_s3.py
│   └── requirements.txt
├── xtable_lambda
│   ├── app.py
│   ├── cdk.json
│   ├── cdk_stack.py
│   ├── requirements.txt
│   └── src
│       ├── Dockerfile
│       ├── __init__.py
│       ├── converter.py
│       ├── detector.py
│       ├── models.py
│       ├── requirements.txt
│       └── xtable.py
└── xtable_operator
    ├── Dockerfile
    ├── __init__.py
    ├── build-airflow-operator.sh
    ├── cmd.py
    ├── files
    │   ├── .airflowignore
    │   ├── DAG.py
    │   ├── requirements.txt
    │   └── startup.sh
    └── operator.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | out
3 | .jar
4 | .java-version
5 | .DS_Store
6 | .pytest_cache
7 | __pycache__
8 | dist
9 | xtable
10 | jars
11 | build
12 | *.egg-info
13 | licenses.txt
14 | probe_out
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. 
Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT No Attribution
2 | 
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
17 | 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # `apache-xtable-on-aws-samples` 2 | 3 | - [`apache-xtable-on-aws-samples`](#apache-xtable-on-aws-samples) 4 | - [AWS Lambda](#aws-lambda) 5 | - [Prerequisites - AWS Lambda](#prerequisites---aws-lambda) 6 | - [Step-by-step-guide - AWS Lambda](#step-by-step-guide---aws-lambda) 7 | - [Clean Up](#clean-up) 8 | - [Amazon MWAA Operator](#amazon-mwaa-operator) 9 | - [Prerequisites - Amazon MWAA Operator](#prerequisites---amazon-mwaa-operator) 10 | - [Step-by-step-guide - Amazon MWAA Operator](#step-by-step-guide---amazon-mwaa-operator) 11 | - [\[Optional step: AWS Glue Data Catalog used as an Iceberg Catalog\]](#optional-step-aws-glue-data-catalog-used-as-an-iceberg-catalog) 12 | - [Clean Up](#clean-up-1) 13 | - [Security](#security) 14 | - [Code of Conduct](#code-of-conduct) 15 | - [License](#license) 16 | 17 | Open table formats (OTFs) like Apache Iceberg are being increasingly adopted, for example, to improve transactional consistency of a data lake or to consolidate batch and streaming data pipelines on a single file format and reduce complexity. In practice, architects need to integrate the chosen format with the various layers of a modern data platform. However, the level of support for the different OTFs varies across common analytical services. 18 | 19 | Commercial vendors and the open source community have recognized this situation and are working on interoperability between table formats. One approach is to make a single physical dataset readable in different formats by translating its metadata and avoiding reprocessing of actual data files. Apache XTable is an open source solution that follows this approach and provides abstractions and tools for the translation of open table format metadata. 20 | 21 | In this repository, we show how to get started with Apache XTable on AWS. This repository is meant to be used in combination with related blog posts on this topic: 22 | 23 | - [Blog - Run Apache XTable in AWS Lambda for background conversion of open table formats](https://aws.amazon.com/blogs/big-data/run-apache-xtable-in-aws-lambda-for-background-conversion-of-open-table-formats/) 24 | - [Blog - Run Apache XTable on Amazon MWAA to translate open table formats](https://aws.amazon.com/blogs/big-data/run-apache-xtable-on-amazon-mwaa-to-translate-open-table-formats/) 25 | 26 | ## AWS Lambda 27 | 28 | Let’s dive deeper into how to deploy the provided AWS CDK stack. 29 | 30 | ### Prerequisites - AWS Lambda 31 | 32 | You need one of the following container runtimes and AWS CDK installed: 33 | 34 | - [Finch](https://runfinch.com/docs/getting-started/installation/) or [Docker](https://docs.docker.com/get-docker/) 35 | - [AWS CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html) 36 | 37 | You also need an existing lakehouse setup or can use the provided [scripts](https://github.com/aws-samples/apache-xtable-on-aws-samples/tree/main/scripts) to set up a test environment. 38 | 39 | ### Step-by-step-guide - AWS Lambda 40 | 41 | (1) To deploy the stack, clone the github repo, change into the folder for this Blog Post (xtable_lambda) and deploy the CDK stack. 
This deploys a set of Lambda functions and an hourly Amazon EventBridge schedule:
42 | 
43 | ``` bash
44 | git clone https://github.com/aws-samples/apache-xtable-on-aws-samples.git
45 | cd xtable_lambda
46 | cdk deploy
47 | ```
48 | 
49 | When using Finch, you need to set the `CDK_DOCKER` environment variable before deployment:
50 | 
51 | ``` bash
52 | export CDK_DOCKER=finch
53 | ```
54 | 
55 | After the deployment, all correctly configured Glue Tables will be transformed every hour.
56 | 
57 | (2) Set required Glue Data Catalog parameters:
58 | 
59 | For the solution to work, you have to set the following parameters in the Glue Data Catalog on each Glue Table you want to transform:
60 | 
61 | - `"xtable_table_type": "<HUDI|DELTA|ICEBERG>"` (the current source format of the table)
62 | - `"xtable_target_formats": "<HUDI|DELTA|ICEBERG>, <HUDI|DELTA|ICEBERG>"` (a comma-separated list of target formats)
63 | 
64 | In the AWS Console, the parameters look like the following and can be set under “Table properties” when editing a Glue Table:
65 | 
66 | ![AWS Glue Table Properties](./docs/AWSGlueTableProperties.png)
67 | 
68 | (3) (Optional) Create a data lake test environment
69 | 
70 | In case you do not have a lakehouse setup, these [scripts](https://github.com/aws-samples/apache-xtable-on-aws-samples/tree/main/scripts) can help you to set up a test environment either on your local machine or in an [AWS Glue for Spark Job](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-intro-tutorial.html).
71 | 
72 | ``` bash
73 | # local: create a Hudi dataset on S3
74 | cd scripts
75 | pip install -r requirements.txt
76 | python ./create_hudi_s3.py
77 | ```
78 | 
79 | ### Clean Up
80 | 
81 | - Delete the deployed CDK stack:
82 | 
83 | ```bash
84 | cdk destroy
85 | ```
86 | 
87 | - Delete the downloaded git repository:
88 | 
89 | ```bash
90 | rm -r apache-xtable-on-aws-samples
91 | ```
92 | 
93 | - Delete the used docker image:
94 | 
95 | ```bash
96 | docker image rm public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64
97 | ```
98 | 
99 | - Revert the AWS Glue configuration in your Glue Tables
100 | - (Optional) Delete test data files created with the test scripts in your S3 Bucket
101 | 
102 | ## Amazon MWAA Operator
103 | 
104 | To deploy the custom operator to Amazon MWAA, we upload it together with DAGs into the configured DAG folder. Besides the operator itself, we also need to upload XTable’s executable JAR. As of writing this post, the JAR needs to be compiled by the user from source code. To simplify this, we provide a container-based build script.
105 | 
106 | ### Prerequisites - Amazon MWAA Operator
107 | 
108 | We assume you have at least an environment consisting of Amazon MWAA itself, an S3 bucket, and an [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) role for Amazon MWAA that has read access to the bucket and optionally write access to the AWS Glue Data Catalog. 
In addition, you need one of the following container runtimes to run the provided build script for XTable:
109 | 
110 | - [Finch](https://runfinch.com/docs/getting-started/installation/)
111 | - [Docker](https://docs.docker.com/get-docker/)
112 | 
113 | ### Step-by-step-guide - Amazon MWAA Operator
114 | 
115 | To compile XTable, you can use the provided build script and complete the following steps:
116 | 
117 | (1) Clone the sample code from GitHub:
118 | 
119 | ``` bash
120 | git clone git@github.com:aws-samples/apache-xtable-on-aws-samples.git
121 | cd ./apache-xtable-on-aws-samples/xtable_operator
122 | ```
123 | 
124 | (2) Run the build script:
125 | 
126 | ```bash
127 | ./build-airflow-operator.sh
128 | ```
129 | 
130 | ![Build output](./docs/BuildOutput.png)
131 | 
132 | (3) Because the Airflow operator uses the library [JPype](https://github.com/jpype-project/jpype) to invoke XTable’s JAR, add the dependency to the Amazon MWAA requirements.txt file:
133 | 
134 | ``` python
135 | # requirements.txt
136 | JPype1==1.5.0
137 | ```
138 | 
139 | For background on installing additional Python libraries on Amazon MWAA, see [Installing Python dependencies](https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags-dependencies.html).
140 | 
141 | (4) Because XTable is Java-based, a Java 11 runtime environment (JRE) is required on Amazon MWAA. You can use Amazon MWAA’s startup script to install a JRE. Add the following lines to an existing startup script or create a new one as provided in the sample code base of this post:
142 | 
143 | ``` bash
144 | # startup.sh
145 | if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
146 | then
147 |     sudo yum install -y java-11-amazon-corretto-headless
148 | fi
149 | ```
150 | 
151 | For more information about this mechanism, see [Using a startup script with Amazon MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/using-startup-script.html).
152 | 
153 | (5) Upload *xtable_operator/*, *requirements.txt*, *startup.sh* and *.airflowignore* to the S3 bucket and the respective paths from which Amazon MWAA reads files. Make sure the IAM role for Amazon MWAA has appropriate read permissions.
154 | 
155 | With regard to the custom operator, make sure to upload the local folder *xtable_operator/* into the configured DAG folder along with the [*.airflowignore*](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#airflowignore) file, for example as sketched below.
156 | 
157 | 
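The build script writes everything that needs to be uploaded into the local *out/* folder. The following is a minimal upload sketch using boto3; the bucket name and the *dags/* prefix are placeholders and must match your Amazon MWAA environment's S3 bucket and DAG folder configuration:

``` python
# Sketch: upload the build output to the MWAA bucket.
# BUCKET and the "dags/" prefix are hypothetical; adjust them to your environment.
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "my-mwaa-bucket"

out = Path("out")
operator_dir = out / "xtable_operator"

# Custom operator, including the compiled JARs, goes into the DAG folder
for file in operator_dir.rglob("*"):
    if file.is_file():
        s3.upload_file(str(file), BUCKET, f"dags/xtable_operator/{file.relative_to(operator_dir)}")

# DAG and .airflowignore also go into the DAG folder
s3.upload_file(str(out / "DAG.py"), BUCKET, "dags/DAG.py")
s3.upload_file(str(out / ".airflowignore"), BUCKET, "dags/.airflowignore")

# requirements.txt and startup.sh are referenced directly by the environment configuration
s3.upload_file(str(out / "requirements.txt"), BUCKET, "requirements.txt")
s3.upload_file(str(out / "startup.sh"), BUCKET, "startup.sh")
```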

158 | ![Amazon MWAA folder in S3](./docs/MwaaFolder.png)
159 | ![DAG folder in S3](./docs/DagFolder.png)
160 | 

161 | 
162 | (6) Update the configuration of your Amazon MWAA environment as follows and start the update process (a programmatic option is sketched after this list):
163 | 
164 | - Add or update the S3 URI to the *requirements.txt* file through the Requirements file configuration option.
165 | - Add or update the S3 URI to the *startup.sh* script through the Startup script configuration option.
166 | 
167 | 
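If you prefer to script this configuration update instead of using the Amazon MWAA console, the following is a minimal sketch with boto3; the environment name is a placeholder, and the two relative paths must match the keys you uploaded to the environment's S3 bucket in step (5):

``` python
# Sketch: point the MWAA environment at the uploaded requirements.txt and startup.sh.
# "my-xtable-environment" is a hypothetical name; replace it with your environment.
import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-xtable-environment",
    RequirementsS3Path="requirements.txt",  # key relative to the environment's S3 bucket
    StartupScriptS3Path="startup.sh",       # key relative to the environment's S3 bucket
)
```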

168 | ![Configuration update](./docs/ConfigurationUpdate.png)
169 | 
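Once the environment update has finished, the operator can be used from a DAG. The following is a minimal sketch based on the sample in *xtable_operator/files/DAG.py*; the table base path and names are illustrative only:

``` python
# Sketch of a DAG using the XtableOperator; paths and names are examples.
from datetime import datetime
from pathlib import Path

from airflow import DAG
from xtable_operator.operator import XtableOperator

with DAG("xtableDag", schedule_interval="@once", start_date=datetime(2024, 1, 1)) as dag:
    # Writable folder where the operator generates the XTable config files on the fly
    tmp_path = Path("/usr/local/airflow/tmp")
    tmp_path.mkdir(exist_ok=True)

    sync = XtableOperator(
        task_id="xtableTask",
        tmp_path=tmp_path,
        dataset_config={
            "sourceFormat": "DELTA",
            "targetFormats": ["ICEBERG"],
            "datasets": [
                {
                    "tableBasePath": "s3://example-bucket/datalake/random_numbers",  # illustrative path
                    "tableName": "random_numbers",
                    "namespace": "table",
                }
            ],
        },
    )
```

The operator also accepts an optional `iceberg_catalog_config` dictionary, which is relevant if you want to register the generated Iceberg metadata in the AWS Glue Data Catalog as described in the optional step below.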

170 | 
171 | ### [Optional step: AWS Glue Data Catalog used as an Iceberg Catalog]
172 | 
173 | Optionally, you can use the AWS Glue Data Catalog as an Iceberg catalog.
174 | 
175 | (7) In case you create Iceberg metadata and want to register it in the AWS Glue Data Catalog, the Amazon MWAA role needs permissions to create or modify tables in AWS Glue. The following listing shows a minimal policy for this. It constrains permissions to a defined database in AWS Glue; replace the `<region>`, `<account-id>`, and `<database-name>` placeholders with your values:
176 | 
177 | ``` json
178 | {
179 |     "Version": "2012-10-17",
180 |     "Statement": [
181 |         {
182 |             "Effect": "Allow",
183 |             "Action": [
184 |                 "glue:GetDatabase",
185 |                 "glue:CreateTable",
186 |                 "glue:GetTables",
187 |                 "glue:UpdateTable",
188 |                 "glue:GetDatabases",
189 |                 "glue:GetTable"
190 |             ],
191 |             "Resource": [
192 |                 "arn:aws:glue:<region>:<account-id>:catalog",
193 |                 "arn:aws:glue:<region>:<account-id>:database/<database-name>",
194 |                 "arn:aws:glue:<region>:<account-id>:table/<database-name>/*"
195 |             ]
196 |         }
197 |     ]
198 | }
199 | ```
200 | 
201 | ### Clean Up
202 | 
203 | - Delete the downloaded git repository:
204 | 
205 | ```bash
206 | rm -r apache-xtable-on-aws-samples
207 | ```
208 | 
209 | - Delete the used docker image:
210 | 
211 | ```bash
212 | docker image rm public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1
213 | ```
214 | 
215 | - Revert the Amazon MWAA configurations in:
216 |   - requirements.txt
217 |   - startup.sh
218 |   - DAG
219 |   - MWAA execution role
220 | 
221 | - Delete files or versions of files in your S3 Bucket:
222 |   - requirements.txt
223 |   - startup.sh
224 |   - DAG
225 |   - .airflowignore
226 | 
227 | ## Security
228 | 
229 | See [CONTRIBUTING](./CONTRIBUTING.md#security-issue-notifications) for more information.
230 | 
231 | ## Code of Conduct
232 | 
233 | See [CODE OF CONDUCT](./CODE_OF_CONDUCT.md) for more information.
234 | 
235 | ## License
236 | 
237 | This library is licensed under the MIT-0 License. See the [LICENSE](./LICENSE) file. 
238 | -------------------------------------------------------------------------------- /docs/AWSGlueTableProperties.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/AWSGlueTableProperties.png -------------------------------------------------------------------------------- /docs/BuildOutput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/BuildOutput.png -------------------------------------------------------------------------------- /docs/ConfigurationUpdate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/ConfigurationUpdate.png -------------------------------------------------------------------------------- /docs/DagFolder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/DagFolder.png -------------------------------------------------------------------------------- /docs/MwaaFolder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/MwaaFolder.png -------------------------------------------------------------------------------- /scripts/create_delta_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore 11 | # --datalake-formats 12 | # delta 13 | # --additional-python-modules 14 | # Faker 15 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 16 | glueContext = GlueContext(SparkContext()) 17 | spark = glueContext.spark_session 18 | job = Job(glueContext) 19 | job.init(args["JOB_NAME"], args) 20 | 21 | 22 | def generate_dataframe(n=100): 23 | import datetime 24 | import numpy as np 25 | import pandas as pd 26 | 27 | rng = np.random.default_rng() 28 | 29 | df = pd.DataFrame( 30 | { 31 | "date": rng.choice( 32 | pd.date_range( 33 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 34 | ).strftime("%Y-%m-%d"), 35 | size=n, 36 | ), 37 | "col1_int": rng.integers(low=0, high=100, size=n), 38 | "col2_float": rng.uniform(low=0, high=1, size=n), 39 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 40 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 41 | "col5_bool": rng.choice([True, False], size=n), 42 | } 43 | ) 44 | 45 | df["id"] = df.index 46 | return spark.createDataFrame(df) 47 | 48 | 49 | additional_options = { 50 | "path": args["DELTA_TABLE_PATH"], # e.g. 
s3://data/delta/table 51 | } 52 | 53 | df = generate_dataframe() 54 | ( 55 | df.write.format("delta") 56 | .options(**additional_options) 57 | .mode("append") 58 | .partitionBy("date") 59 | .saveAsTable("table") 60 | ) 61 | 62 | 63 | job.commit() 64 | -------------------------------------------------------------------------------- /scripts/create_delta_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pyspark 3 | from delta import configure_spark_with_delta_pip 4 | # env 5 | from dotenv import load_dotenv 6 | 7 | load_dotenv() # take environment variables from .env 8 | 9 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 10 | 11 | # session 12 | builder = ( 13 | pyspark.sql.SparkSession.builder.appName("MyApp") 14 | .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 15 | .config( 16 | "spark.sql.catalog.spark_catalog", 17 | "org.apache.spark.sql.delta.catalog.DeltaCatalog", 18 | ) 19 | .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 20 | ) 21 | spark = configure_spark_with_delta_pip( 22 | builder, extra_packages=["org.apache.spark:spark-hadoop-cloud_2.12:3.5.1"] 23 | ).getOrCreate() 24 | 25 | 26 | def generate_dataframe(n=100): 27 | import datetime 28 | import numpy as np 29 | import pandas as pd 30 | 31 | rng = np.random.default_rng() 32 | 33 | df = pd.DataFrame( 34 | { 35 | "date": rng.choice( 36 | pd.date_range( 37 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 38 | ).strftime("%Y-%m-%d"), 39 | size=n, 40 | ), 41 | "col1_int": rng.integers(low=0, high=100, size=n), 42 | "col2_float": rng.uniform(low=0, high=1, size=n), 43 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 44 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 45 | "col5_bool": rng.choice([True, False], size=n), 46 | } 47 | ) 48 | 49 | df["id"] = df.index 50 | return spark.createDataFrame(df) 51 | 52 | 53 | if __name__ == "__main__": 54 | # generate dataframe 55 | dataframe = generate_dataframe() 56 | 57 | # write table 58 | ( 59 | dataframe.write.format("delta") 60 | .option("path", os.environ["DELTA_TABLE_PATH"]) # e.g. 
s3://data/delta/table 61 | .mode("overwrite") 62 | .partitionBy("date") 63 | .saveAsTable("table") 64 | ) 65 | 66 | -------------------------------------------------------------------------------- /scripts/create_hudi_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false 11 | # --datalake-formats 12 | # hudi 13 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 14 | glueContext = GlueContext(SparkContext()) 15 | spark = glueContext.spark_session 16 | job = Job(glueContext) 17 | job.init(args["JOB_NAME"], args) 18 | 19 | 20 | def generate_dataframe(n=100): 21 | import datetime 22 | import numpy as np 23 | import pandas as pd 24 | 25 | rng = np.random.default_rng() 26 | 27 | df = pd.DataFrame( 28 | { 29 | "date": rng.choice( 30 | pd.date_range( 31 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 32 | ).strftime("%Y-%m-%d"), 33 | size=n, 34 | ), 35 | "col1_int": rng.integers(low=0, high=100, size=n), 36 | "col2_float": rng.uniform(low=0, high=1, size=n), 37 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 38 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 39 | "col5_bool": rng.choice([True, False], size=n), 40 | } 41 | ) 42 | 43 | df["id"] = df.index 44 | return spark.createDataFrame(df) 45 | 46 | 47 | additional_options = { 48 | "hoodie.table.name": "table", 49 | "hoodie.datasource.write.table.name": "table", 50 | "hoodie.datasource.write.storage.type": "COPY_ON_WRITE", 51 | "hoodie.datasource.write.operation": "insert", 52 | "hoodie.datasource.write.recordkey.field": "id", 53 | "hoodie.datasource.write.precombine.field": "col4_time", 54 | "hoodie.datasource.write.partitionpath.field": "date", 55 | "hoodie.datasource.write.hive_style_partitioning": "true", 56 | "path": args["HUDI_TABLE_PATH"], # e.g. 
s3://data/hudi/table 57 | } 58 | 59 | df = generate_dataframe() 60 | df.write.format("hudi").options(**additional_options).mode("append").save() 61 | 62 | job.commit() 63 | -------------------------------------------------------------------------------- /scripts/create_hudi_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pyspark.sql import SparkSession 3 | 4 | 5 | # env 6 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 7 | 8 | # session 9 | spark = ( 10 | SparkSession.builder.appName("MyApp") 11 | .config( 12 | "spark.jars.packages", 13 | "org.apache.hudi:hudi-spark3-bundle_2.12:0.15.0,org.apache.spark:spark-hadoop-cloud_2.12:3.5.1", 14 | ) 15 | .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 16 | .config( 17 | "spark.sql.catalog.spark_catalog", 18 | "org.apache.spark.sql.hudi.catalog.HoodieCatalog", 19 | ) 20 | .config( 21 | "spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension" 22 | ) 23 | .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar") 24 | .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 25 | .config("spark.driver.memory", "16g") 26 | .config("spark.hadoop.fs.s3a.fast.upload", "true") 27 | .config("spark.hadoop.fs.s3a.upload.buffer", "bytebuffer") 28 | .getOrCreate() 29 | ) 30 | 31 | 32 | def generate_dataframe(n=100): 33 | import datetime 34 | import numpy as np 35 | import pandas as pd 36 | 37 | rng = np.random.default_rng() 38 | 39 | df = pd.DataFrame( 40 | { 41 | "date": rng.choice( 42 | pd.date_range( 43 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 44 | ).strftime("%Y-%m-%d"), 45 | size=n, 46 | ), 47 | "col1_int": rng.integers(low=0, high=100, size=n), 48 | "col2_float": rng.uniform(low=0, high=1, size=n), 49 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 50 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 51 | "col5_bool": rng.choice([True, False], size=n), 52 | } 53 | ) 54 | 55 | df["id"] = df.index 56 | return spark.createDataFrame(df) 57 | 58 | 59 | if __name__ == "__main__": 60 | hoodie_options = { 61 | "hoodie.table.name": "table", 62 | "hoodie.datasource.write.recordkey.field": "id", 63 | "hoodie.datasource.write.precombine.field": "col4_time", 64 | "hoodie.datasource.write.operation": "insert", 65 | "hoodie.datasource.write.partitionpath.field": "date", 66 | "hoodie.datasource.write.table.name": "table", 67 | "hoodie.deltastreamer.source.dfs.listing.max.fileid": "-1", 68 | "hoodie.datasource.write.hive_style_partitioning": "true", 69 | "path": os.environ["HUDI_TABLE_PATH"], # e.g. s3://data/hudi/table 70 | } 71 | 72 | # generate dataframe 73 | dataframe = generate_dataframe() 74 | 75 | dataframe.write.format("hudi").options(**hoodie_options).mode("append").save() 76 | -------------------------------------------------------------------------------- /scripts/create_iceberg_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.spark_catalog.warehouse=<# e.g. 
s3://data/iceberg/table> --conf spark.sql.catalog.spark_catalog.type=hadoop --conf spark.sql.defaultCatalog=spark_catalog --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 11 | # --datalake-formats 12 | # iceberg 13 | # --additional-python-modules 14 | # Faker 15 | 16 | 17 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 18 | glueContext = GlueContext(SparkContext()) 19 | spark = glueContext.spark_session 20 | job = Job(glueContext) 21 | job.init(args["JOB_NAME"], args) 22 | 23 | 24 | def generate_dataframe(n=100): 25 | import datetime 26 | import numpy as np 27 | import pandas as pd 28 | 29 | rng = np.random.default_rng() 30 | 31 | df = pd.DataFrame( 32 | { 33 | "date": rng.choice( 34 | pd.date_range( 35 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 36 | ).strftime("%Y-%m-%d"), 37 | size=n, 38 | ), 39 | "col1_int": rng.integers(low=0, high=100, size=n), 40 | "col2_float": rng.uniform(low=0, high=1, size=n), 41 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 42 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 43 | "col5_bool": rng.choice([True, False], size=n), 44 | } 45 | ) 46 | 47 | df["id"] = df.index 48 | return spark.createDataFrame(df) 49 | 50 | 51 | # generate dataframe 52 | additional_options = { 53 | "path": args["ICEBERG_TABLE_PATH"], # e.g. s3://data/iceberg/table 54 | } 55 | 56 | # Generate and sort by partitions 57 | df = generate_dataframe() 58 | df = df.repartition("date").sortWithinPartitions("date") 59 | df.createOrReplaceTempView("tmp_table") 60 | 61 | ( 62 | df.write.format("iceberg") 63 | .options(**additional_options) 64 | .mode("append") 65 | .partitionBy("date") 66 | .saveAsTable("iceberg.table") 67 | ) 68 | 69 | # commit 70 | job.commit() 71 | -------------------------------------------------------------------------------- /scripts/create_iceberg_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pyspark.sql import SparkSession 3 | 4 | 5 | # env 6 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 7 | 8 | # session 9 | spark = ( 10 | SparkSession.builder.appName("MyApp") 11 | .config( 12 | "spark.jars.packages", 13 | "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2,org.apache.spark:spark-hadoop-cloud_2.12:3.4.3,software.amazon.awssdk:bundle:2.26.21,software.amazon.awssdk:url-connection-client:2.26.21", 14 | ) 15 | .config( 16 | "spark.sql.extensions", 17 | "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", 18 | ) 19 | .config( 20 | "spark.sql.catalog.spark_catalog", 21 | "org.apache.iceberg.spark.SparkCatalog", 22 | ) 23 | .config( 24 | "spark.sql.catalog.spark_catalog.warehouse", 25 | os.environ["ICEBERG_WAREHOUSE_PATH"], # e.g. 
s3://data/ 26 | ) 27 | .config( 28 | "spark.sql.catalog.spark_catalog.type", 29 | "hadoop", 30 | ) 31 | .config( 32 | "spark.sql.defaultCatalog", 33 | "spark_catalog", 34 | ) 35 | .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 36 | .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 37 | .config("spark.driver.memory", "16g") 38 | .config("spark.hadoop.fs.s3a.fast.upload", "true") 39 | .config("spark.hadoop.fs.s3a.upload.buffer", "bytebuffer") 40 | .getOrCreate() 41 | ) 42 | 43 | 44 | def generate_dataframe(n=100): 45 | import datetime 46 | import numpy as np 47 | import pandas as pd 48 | 49 | rng = np.random.default_rng() 50 | 51 | df = pd.DataFrame( 52 | { 53 | "date": rng.choice( 54 | pd.date_range( 55 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 56 | ).strftime("%Y-%m-%d"), 57 | size=n, 58 | ), 59 | "col1_int": rng.integers(low=0, high=100, size=n), 60 | "col2_float": rng.uniform(low=0, high=1, size=n), 61 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 62 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 63 | "col5_bool": rng.choice([True, False], size=n), 64 | } 65 | ) 66 | 67 | df["id"] = df.index 68 | return spark.createDataFrame(df) 69 | 70 | 71 | if __name__ == "__main__": 72 | # generate dataframe 73 | additional_options = { 74 | "path": os.environ["ICEBERG_TABLE_PATH"], # e.g. s3://data/iceberg/table 75 | } 76 | 77 | # Generate and sort by partitions 78 | df = generate_dataframe() 79 | df = df.repartition("date").sortWithinPartitions("date") 80 | df.createOrReplaceTempView("tmp_table") 81 | 82 | ( 83 | df.write.format("iceberg") 84 | .options(**additional_options) 85 | .mode("append") 86 | .partitionBy("date") 87 | .saveAsTable("iceberg.table") 88 | ) 89 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | delta-spark==3.2.0 2 | pyspark==3.5.1 -------------------------------------------------------------------------------- /xtable_lambda/app.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | #!/usr/bin/env python3 4 | import os 5 | 6 | import aws_cdk as cdk 7 | from aws_cdk import Aspects 8 | from cdk_nag import AwsSolutionsChecks 9 | 10 | 11 | from cdk_stack import XtableLambda 12 | 13 | 14 | app = cdk.App() 15 | XtableLambda( 16 | app, 17 | "XtableLambda", 18 | env=cdk.Environment( 19 | account=os.getenv("CDK_DEFAULT_ACCOUNT"), region=os.getenv("CDK_DEFAULT_REGION") 20 | ), 21 | ) 22 | Aspects.of(app).add(AwsSolutionsChecks(verbose=True)) 23 | 24 | app.synth() 25 | -------------------------------------------------------------------------------- /xtable_lambda/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "**/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true, 37 | "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true, 38 | "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true, 39 | "@aws-cdk/aws-route53-patters:useCertificate": true, 40 | "@aws-cdk/customresources:installLatestAwsSdkDefault": false, 41 | "@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true, 42 | "@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true, 43 | "@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true, 44 | "@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true, 45 | "@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true, 46 | "@aws-cdk/aws-redshift:columnId": true, 47 | "@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true, 48 | "@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true, 49 | "@aws-cdk/aws-apigateway:requestValidatorUniqueId": true, 50 | "@aws-cdk/aws-kms:aliasNameRef": true, 51 | "@aws-cdk/aws-autoscaling:generateLaunchTemplateInsteadOfLaunchConfig": true, 52 | "@aws-cdk/core:includePrefixInUniqueNameGeneration": true, 53 | "@aws-cdk/aws-efs:denyAnonymousAccess": true, 54 | "@aws-cdk/aws-opensearchservice:enableOpensearchMultiAzWithStandby": true, 55 | "@aws-cdk/aws-lambda-nodejs:useLatestRuntimeVersion": true, 56 | "@aws-cdk/aws-efs:mountTargetOrderInsensitiveLogicalId": true, 57 | "@aws-cdk/aws-rds:auroraClusterChangeScopeOfInstanceParameterGroupWithEachParameters": true, 58 | "@aws-cdk/aws-appsync:useArnForSourceApiAssociationIdentifier": true, 59 | "@aws-cdk/aws-rds:preventRenderingDeprecatedCredentials": true, 60 
| "@aws-cdk/aws-codepipeline-actions:useNewDefaultBranchForCodeCommitSource": true, 61 | "@aws-cdk/aws-cloudwatch-actions:changeLambdaPermissionLogicalIdForLambdaAction": true, 62 | "@aws-cdk/aws-codepipeline:crossAccountKeysDefaultValueToFalse": true, 63 | "@aws-cdk/aws-codepipeline:defaultPipelineTypeToV2": true, 64 | "@aws-cdk/aws-kms:reduceCrossAccountRegionPolicyScope": true, 65 | "@aws-cdk/aws-eks:nodegroupNameAttribute": true, 66 | "@aws-cdk/aws-ec2:ebsDefaultGp3Volume": true, 67 | "@aws-cdk/aws-ecs:removeDefaultDeploymentAlarm": true, 68 | "@aws-cdk/custom-resources:logApiResponseDataPropertyTrueDefault": false, 69 | "@aws-cdk/aws-stepfunctions-tasks:ecsReduceRunTaskPermissions": true 70 | } 71 | } 72 | -------------------------------------------------------------------------------- /xtable_lambda/cdk_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | from aws_cdk import ( 4 | Duration, 5 | Stack, 6 | aws_lambda as _lambda, 7 | aws_iam as iam, 8 | aws_events as events, 9 | aws_events_targets as targets, 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class XtableLambda(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | # Converter 20 | converter = _lambda.DockerImageFunction( 21 | scope=self, 22 | id="Converter", 23 | memory_size=10240, 24 | architecture=_lambda.Architecture.ARM_64, 25 | code=_lambda.DockerImageCode.from_image_asset( 26 | directory="src", cmd=["converter.handler"] 27 | ), 28 | timeout=Duration.minutes(15), 29 | environment={ 30 | "SPARK_LOCAL_IP": "127.0.0.1", 31 | }, 32 | ) 33 | converter.add_to_role_policy( 34 | iam.PolicyStatement( 35 | actions=[ 36 | "glue:GetTables", 37 | "glue:GetTable", 38 | "glue:GetDatabases", 39 | "glue:GetDatabase", 40 | "glue:UpdateDatabase", 41 | "glue:CreateDatabase", 42 | "glue:UpdateTable", 43 | "glue:CreateTable", 44 | "s3:Abort*", 45 | "s3:DeleteObject*", 46 | "s3:GetBucket*", 47 | "s3:GetObject*", 48 | "s3:List*", 49 | "s3:PutObject", 50 | "s3:PutObjectLegalHold", 51 | "s3:PutObjectRetention", 52 | "s3:PutObjectTagging", 53 | "s3:PutObjectVersionTagging", 54 | ], 55 | resources=["*"], 56 | ) 57 | ) 58 | 59 | # Detector 60 | detector = _lambda.DockerImageFunction( 61 | scope=self, 62 | id="Detector", 63 | architecture=_lambda.Architecture.ARM_64, 64 | memory_size=1024, 65 | code=_lambda.DockerImageCode.from_image_asset( 66 | directory="src", cmd=["detector.handler"] 67 | ), 68 | timeout=Duration.minutes(3), 69 | environment={"CONVERTER_FUNCTION_NAME": converter.function_name}, 70 | ) 71 | converter.grant_invoke(detector) 72 | detector.add_to_role_policy( 73 | iam.PolicyStatement( 74 | actions=["glue:GetTables", "glue:GetDatabases"], 75 | resources=["*"], 76 | ) 77 | ) 78 | 79 | # Scheduled events 80 | event = events.Rule( 81 | scope=self, 82 | id="DetectorSchedule", 83 | schedule=events.Schedule.rate(Duration.hours(1)), 84 | ) 85 | event.add_target(targets.LambdaFunction(detector)) 86 | -------------------------------------------------------------------------------- /xtable_lambda/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.148.1 2 | constructs>=10.0.0,<11.0.0 3 | cdk_nag==2.34.6 4 | 5 | -------------------------------------------------------------------------------- 
/xtable_lambda/src/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # syntax = docker/dockerfile:1.7.0 5 | # args 6 | ARG MAVEN_VERSION=3.9.6 7 | ARG MAVEN_MAJOR_VERSION=3 8 | ARG XTABLE_VERSION=0.1.0-incubating 9 | ARG XTABLE_BRANCH=0.1.0-incubating 10 | ARG ICEBERG_VERSION=1.4.2 11 | ARG SPARK_VERSION=3.4 12 | ARG SCALA_VERSION=2.12 13 | ARG ARCH=aarch64 14 | 15 | # Build Stage 16 | FROM public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64 AS build-stage 17 | WORKDIR / 18 | 19 | ARG MAVEN_VERSION 20 | ARG MAVEN_MAJOR_VERSION 21 | ARG XTABLE_BRANCH 22 | ARG ICEBERG_VERSION 23 | ARG SPARK_VERSION 24 | ARG SCALA_VERSION 25 | 26 | # install java 27 | RUN dnf install -y java-11-amazon-corretto-headless \ 28 | git \ 29 | unzip \ 30 | wget \ 31 | gcc \ 32 | g++ \ 33 | && dnf clean all 34 | 35 | # install maven 36 | RUN wget https://dlcdn.apache.org/maven/maven-"$MAVEN_MAJOR_VERSION"/"$MAVEN_VERSION"/binaries/apache-maven-"$MAVEN_VERSION"-bin.zip 37 | RUN unzip apache-maven-"$MAVEN_VERSION"-bin.zip 38 | 39 | # clone sources 40 | RUN git clone --depth 1 --branch "$XTABLE_BRANCH" https://github.com/apache/incubator-xtable.git 41 | 42 | # build xtable jar 43 | WORKDIR /incubator-xtable 44 | RUN /apache-maven-"$MAVEN_VERSION"/bin/mvn package -DskipTests=true 45 | WORKDIR / 46 | 47 | # Download jars for iceberg and glue 48 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-aws-bundle/"$ICEBERG_VERSION"/iceberg-aws-bundle-"$ICEBERG_VERSION".jar 49 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"/"$ICEBERG_VERSION"/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar 50 | 51 | # Copy requirements.txt 52 | COPY requirements.txt . 53 | 54 | # Install the specified packages 55 | RUN pip install -r requirements.txt -t requirements/ 56 | 57 | USER 1000 58 | 59 | # Run stage 60 | FROM public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64 61 | 62 | # args 63 | ARG XTABLE_VERSION 64 | ARG ICEBERG_VERSION 65 | ARG SPARK_VERSION 66 | ARG SCALA_VERSION 67 | ARG ARCH 68 | 69 | # copy java 70 | COPY --from=build-stage /usr/lib/jvm/java-11-amazon-corretto."$ARCH" /usr/lib/jvm/java-11-amazon-corretto."$ARCH" 71 | 72 | # Copy jar files 73 | COPY --from=build-stage /incubator-xtable/xtable-utilities/target/xtable-utilities-"$XTABLE_VERSION"-bundled.jar "$LAMBDA_TASK_ROOT"/jars/ 74 | COPY --from=build-stage /iceberg-aws-bundle-"$ICEBERG_VERSION".jar "$LAMBDA_TASK_ROOT"/jars_iceberg/ 75 | COPY --from=build-stage /iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar "$LAMBDA_TASK_ROOT"/jars_iceberg/ 76 | 77 | # Copy python requirements 78 | COPY --from=build-stage /requirements "$LAMBDA_TASK_ROOT" 79 | 80 | # Copy src code 81 | COPY . "$LAMBDA_TASK_ROOT" -------------------------------------------------------------------------------- /xtable_lambda/src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/xtable_lambda/src/__init__.py -------------------------------------------------------------------------------- /xtable_lambda/src/converter.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. 
or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import boto3 4 | from pathlib import Path 5 | 6 | from aws_lambda_powertools.utilities.typing import LambdaContext 7 | from aws_lambda_powertools.utilities.parser import event_parser 8 | 9 | import xtable 10 | from models import ConversionTask 11 | 12 | # clients 13 | lambda_client = boto3.client("lambda") 14 | 15 | 16 | @event_parser(model=ConversionTask) 17 | def handler(event: ConversionTask, context: LambdaContext): 18 | 19 | # sync with passed Dataset Config 20 | if "ICEBERG" not in event.dataset_config.targetFormats: 21 | 22 | xtable.sync( 23 | dataset_config=event.dataset_config, 24 | tmp_path=Path("/tmp"), 25 | jars=[ 26 | Path(__file__).resolve().parent / "jars/*", 27 | ], 28 | ) 29 | 30 | # if iceberg is target also register the table in the glue data catalog 31 | else: 32 | # sync with passed Dataset and Catalog Config 33 | xtable.sync( 34 | dataset_config=event.dataset_config, 35 | catalog_config=event.catalog_config, 36 | tmp_path=Path("/tmp"), 37 | jars=[ 38 | Path(__file__).resolve().parent / "jars/*", 39 | Path(__file__).resolve().parent / "jars_iceberg/*", 40 | ], 41 | ) 42 | -------------------------------------------------------------------------------- /xtable_lambda/src/detector.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import re 5 | from typing import Any, Dict 6 | 7 | import boto3 8 | from aws_lambda_powertools.utilities.typing import LambdaContext 9 | 10 | from models import ( 11 | ConversionTask, 12 | DatasetConfig, 13 | CatalogConfig, 14 | Dataset, 15 | CatalogOptions, 16 | S3Path, 17 | ) 18 | 19 | # clients 20 | glue_client = boto3.client("glue") 21 | lambda_client = boto3.client("lambda") 22 | 23 | # vars 24 | CONVERTER_FUNCTION_NAME = os.environ["CONVERTER_FUNCTION_NAME"] 25 | 26 | 27 | def get_convertable_tables(): 28 | """ 29 | Generator for all Glue tables in this accounts Glue Catalog 30 | Yields: 31 | dict: table dict 32 | """ 33 | # Get paginator for databases 34 | databases = glue_client.get_paginator("get_databases").paginate() 35 | for database_list in databases: 36 | database_list = database_list["DatabaseList"] 37 | for database in database_list: 38 | # Get paginator for tables 39 | tables = glue_client.get_paginator("get_tables").paginate( 40 | DatabaseName=database["Name"] 41 | ) 42 | for table_list in tables: 43 | table_list = table_list["TableList"] 44 | for table in table_list: 45 | # if required table parameters exist 46 | if {"xtable_target_formats", "xtable_table_type"} <= table["Parameters"].keys(): 47 | yield table 48 | 49 | 50 | def handler(event: Dict[str, Any], context: LambdaContext): 51 | # iterate over databases and tables in Glue 52 | for table in get_convertable_tables(): 53 | # vars 54 | s3_path = S3Path(table["StorageDescriptor"]["Location"]) 55 | 56 | # create task payload 57 | task = ConversionTask( 58 | dataset_config=DatasetConfig( 59 | sourceFormat=table["Parameters"]["xtable_table_type"].upper(), 60 | targetFormats=re.findall( 61 | r"\w+", 62 | table["Parameters"]["xtable_target_formats"].upper(), 63 | ), 64 | datasets=[ 65 | Dataset( 66 | tableBasePath=f"{s3_path}", 67 | tableName=f"{s3_path.name}_converted", 68 | namespace=f"{s3_path.parent.name}", 69 | ) 70 | ], 71 | ) 72 | ) 73 | 74 | # if ICEBERG add Glue Table Warehouse registration 75 | if "ICEBERG" in 
task.dataset_config.targetFormats: 76 | task.catalog_config = CatalogConfig( 77 | catalogOptions=CatalogOptions(warehouse=f"{s3_path}") 78 | ) 79 | 80 | # send to converter lambda 81 | lambda_client.invoke( 82 | FunctionName=CONVERTER_FUNCTION_NAME, 83 | InvocationType="Event", 84 | Payload=task.model_dump_json(by_alias=True), 85 | ) 86 | -------------------------------------------------------------------------------- /xtable_lambda/src/models.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import re 4 | from typing import List, Literal, Optional 5 | from pydantic import BaseModel, Field 6 | from pathlib import PurePath 7 | 8 | 9 | class Dataset(BaseModel): 10 | tableBasePath: str 11 | tableName: str 12 | namespace: Optional[str] = None 13 | partitionSpec: Optional[str] = None 14 | 15 | 16 | class DatasetConfig(BaseModel): 17 | sourceFormat: Literal["DELTA", "ICEBERG", "HUDI"] 18 | targetFormats: List[Literal["DELTA", "ICEBERG", "HUDI"]] 19 | datasets: List[Dataset] 20 | 21 | 22 | class CatalogOptions(BaseModel): 23 | warehouse: str 24 | catalog_impl: str = Field( 25 | alias="catalog-impl", default="org.apache.iceberg.aws.glue.GlueCatalog" 26 | ) 27 | io_impl: str = Field(alias="io-impl", default="org.apache.iceberg.aws.s3.S3FileIO") 28 | 29 | 30 | class CatalogConfig(BaseModel): 31 | catalogImpl: str = Field(default="org.apache.iceberg.aws.glue.GlueCatalog") 32 | catalogName: str = Field(default="glue") 33 | catalogOptions: CatalogOptions 34 | 35 | 36 | class ConversionTask(BaseModel): 37 | dataset_config: DatasetConfig 38 | catalog_config: Optional[CatalogConfig] = None 39 | 40 | 41 | class S3Path(PurePath): 42 | s3_uri_regex = re.compile(r"s3://(?P.+)") 43 | path_regex = re.compile(r"/*(?P.+)") 44 | drive = "s3://" 45 | 46 | def __init__(self, path: str): 47 | s3_match = re.match(self.s3_uri_regex, path) 48 | path_match = re.match(self.path_regex, path) 49 | if s3_match: 50 | super().__init__(s3_match.group("path")) 51 | elif path_match: 52 | super().__init__(path_match.group("path")) 53 | -------------------------------------------------------------------------------- /xtable_lambda/src/requirements.txt: -------------------------------------------------------------------------------- 1 | jpype1==1.5.0 2 | pydantic==2.8.2 3 | aws-lambda-powertools==2.36.0 4 | pyyaml==6.0.1 5 | boto3==1.34.74 -------------------------------------------------------------------------------- /xtable_lambda/src/xtable.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import platform 4 | from pathlib import Path 5 | from typing import List 6 | 7 | import yaml 8 | import jpype 9 | import jpype.imports 10 | import jpype.types 11 | 12 | from models import DatasetConfig, CatalogConfig 13 | 14 | def sync( 15 | dataset_config: DatasetConfig, 16 | tmp_path: Path, 17 | catalog_config: CatalogConfig = None, 18 | java11: Path = Path( 19 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 20 | ), 21 | jars: List[Path] = [Path(__file__).resolve().parent / "jars/*"], 22 | ): 23 | """ 24 | Sync a dataset metadata from source to target format. Optionally writes it to a catalog. 25 | 26 | Args: 27 | dataset_config (dict): Dataset configuration. 28 | tmp_path (Path): Path to a temporary directory. 
29 | catalog_config (dict, optional): Iceberg catalog configuration. Defaults to None. 30 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 31 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 32 | ). 33 | jars (Path(s), optional): Path(s) to the jar files. Defaults to Path(__file__).resolve().parent / "jars". 34 | 35 | Returns: 36 | None 37 | 38 | """ 39 | 40 | # write config file 41 | config_path = tmp_path / "config.yaml" 42 | with config_path.open("w") as file: 43 | yaml.dump(dataset_config.model_dump(by_alias=True), file) 44 | 45 | # write catalog file 46 | if catalog_config: 47 | catalog_path = tmp_path / "catalog.yaml" 48 | with catalog_path.open("w") as file: 49 | yaml.dump(catalog_config.model_dump(by_alias=True), file) 50 | else: 51 | catalog_path = None 52 | 53 | # start a jvm in the background 54 | if jpype.isJVMStarted() is False: 55 | jpype.startJVM(java11.absolute().as_posix(), classpath=jars) 56 | 57 | # call java class with or without catalog config 58 | run_sync = jpype.JPackage("org").apache.xtable.utilities.RunSync.main 59 | if catalog_path: 60 | run_sync( 61 | [ 62 | "--datasetConfig", 63 | config_path.absolute().as_posix(), 64 | "--icebergCatalogConfig", 65 | catalog_path.absolute().as_posix(), 66 | ] 67 | ) 68 | 69 | else: 70 | run_sync(["--datasetConfig", config_path.absolute().as_posix()]) 71 | -------------------------------------------------------------------------------- /xtable_operator/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # syntax = docker/dockerfile:1.7.0 5 | # args 6 | ARG MAVEN_VERSION=3.9.6 7 | ARG MAVEN_MAJOR_VERSION=3 8 | ARG XTABLE_VERSION=0.1.0-SNAPSHOT 9 | ARG XTABLE_BRANCH=v0.1.0-beta1 10 | ARG ICEBERG_VERSION=1.4.2 11 | ARG SPARK_VERSION=3.4 12 | ARG SCALA_VERSION=2.12 13 | 14 | # Build Stage 15 | from public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1 as build-stage 16 | 17 | ARG MAVEN_VERSION 18 | ARG MAVEN_MAJOR_VERSION 19 | ARG XTABLE_BRANCH 20 | ARG ICEBERG_VERSION 21 | ARG SPARK_VERSION 22 | ARG SCALA_VERSION 23 | 24 | # install java 25 | RUN yum install -y java-11-amazon-corretto-headless \ 26 | git \ 27 | unzip \ 28 | wget \ 29 | && yum clean all 30 | 31 | # install maven 32 | RUN wget https://dlcdn.apache.org/maven/maven-"$MAVEN_MAJOR_VERSION"/"$MAVEN_VERSION"/binaries/apache-maven-"$MAVEN_VERSION"-bin.zip 33 | RUN unzip apache-maven-"$MAVEN_VERSION"-bin.zip 34 | 35 | # clone sources 36 | RUN git clone --depth 1 --branch "$XTABLE_BRANCH" https://github.com/apache/incubator-xtable.git 37 | 38 | # build xtable jar 39 | WORKDIR /incubator-xtable 40 | RUN /apache-maven-"$MAVEN_VERSION"/bin/mvn package -DskipTests=true 41 | WORKDIR / 42 | 43 | # Download jars for iceberg and glue 44 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-aws-bundle/"$ICEBERG_VERSION"/iceberg-aws-bundle-"$ICEBERG_VERSION".jar 45 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"/"$ICEBERG_VERSION"/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar 46 | 47 | USER 1000 48 | 49 | # Run Stage 50 | from scratch as out-stage 51 | 52 | # args 53 | ARG XTABLE_VERSION 54 | ARG ICEBERG_VERSION 55 | ARG SPARK_VERSION 56 | ARG SCALA_VERSION 57 | 58 | # Copy jar files 59 | COPY --from=build-stage 
/incubator-xtable/utilities/target/utilities-"$XTABLE_VERSION"-bundled.jar /xtable_operator/jars/ 60 | COPY --from=build-stage /iceberg-aws-bundle-"$ICEBERG_VERSION".jar /xtable_operator/jars/ 61 | COPY --from=build-stage /iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar /xtable_operator/jars/ 62 | 63 | # Add python operator 64 | COPY operator.py /xtable_operator/ 65 | COPY cmd.py /xtable_operator/ -------------------------------------------------------------------------------- /xtable_operator/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/xtable_operator/__init__.py -------------------------------------------------------------------------------- /xtable_operator/build-airflow-operator.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # Builds the MWAA or Airflow Operator and outputs it to the out folder. This can be used to upload to MWAA. 5 | 6 | # End script if one subcommand fails 7 | set -e 8 | 9 | # Defining colours for output 10 | GREEN='\033[1;32m' 11 | OFF='\033[0m' 12 | 13 | # Check if finch is installed, if not fall back to docker 14 | printf "${GREEN}Checking if finch is installed...${OFF}\n" 15 | if ! command -v finch &> /dev/null 16 | then 17 | finch=docker 18 | else 19 | finch=finch 20 | fi 21 | 22 | # Build Apache Xtable 23 | printf "${GREEN}Building...${OFF}\n" 24 | $finch build -f Dockerfile -o type=tar,dest=out.tar . 25 | 26 | # Copy other helper files to the out directory 27 | printf "${GREEN}Copying build artifacts and other helper files to the out directory...${OFF}\n" 28 | mkdir -p ./out/ \ 29 | && mv ./out.tar ./out/out.tar \ 30 | && cp ./files/requirements.txt ./out/ \ 31 | && cp ./files/DAG.py ./out/ \ 32 | && cp ./files/startup.sh ./out/ \ 33 | && cp ./files/.airflowignore ./out/ 34 | 35 | # Extract docker output 36 | cd out/ \ 37 | && tar xf out.tar \ 38 | && rm ./out.tar 39 | -------------------------------------------------------------------------------- /xtable_operator/cmd.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import jpype 5 | import jpype.imports 6 | import jpype.types 7 | from pathlib import Path 8 | import platform 9 | 10 | def sync( 11 | config_path: Path, 12 | jars: Path, 13 | catalog_path: Path = None, 14 | java11: Path = Path( 15 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 16 | ), 17 | ): 18 | """Sync a dataset metadata from source to target format. Optionally writes it to a catalog. 19 | 20 | Args: 21 | config_path (Path): Path to the dataset config file. 22 | jars (Path): Path to the jars folder. 23 | catalog_path (Path, optional): Path to the catalog config file. Defaults to None. 24 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 25 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 26 | ). 
27 | 28 | """ 29 | 30 | # start a jvm in the background 31 | jpype.startJVM(java11.absolute().as_posix(), classpath=jars / "*") 32 | run_sync = jpype.JPackage("io").onetable.utilities.RunSync.main 33 | 34 | # call java class with or without catalog config 35 | if catalog_path: 36 | run_sync( 37 | [ 38 | "--datasetConfig", 39 | config_path.absolute().as_posix(), 40 | "--icebergCatalogConfig", 41 | catalog_path.absolute().as_posix(), 42 | ] 43 | ) 44 | 45 | else: 46 | run_sync(["--datasetConfig", config_path.absolute().as_posix()]) 47 | 48 | # shutdown 49 | jpype.shutdownJVM() 50 | -------------------------------------------------------------------------------- /xtable_operator/files/.airflowignore: -------------------------------------------------------------------------------- 1 | xtable_operator -------------------------------------------------------------------------------- /xtable_operator/files/DAG.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from airflow import DAG 5 | from datetime import datetime, timedelta 6 | from xtable_operator.operator import XtableOperator 7 | from pathlib import Path 8 | 9 | default_args = { 10 | "owner": "airflow", 11 | "depends_on_past": False, 12 | "start_date": datetime(2024, 1, 1), 13 | "email_on_failure": False, 14 | "email_on_retry": False, 15 | "retries": 1, 16 | "retry_delay": timedelta(minutes=5), 17 | } 18 | 19 | 20 | with DAG( 21 | "xtableDag", max_active_runs=3, schedule_interval="@once", default_args=default_args 22 | ) as dag: 23 | path = Path("/usr/local/airflow/tmp") 24 | path.mkdir(exist_ok=True) 25 | op = XtableOperator( 26 | task_id="xtableTask", 27 | tmp_path=path, 28 | dataset_config={ 29 | "sourceFormat": "DELTA", 30 | "targetFormats": ["ICEBERG"], 31 | "datasets": [ 32 | { 33 | "tableBasePath": "/example/datalake/table/random_numbers", 34 | "tableName": "random_numbers", 35 | "namespace": "table", 36 | } 37 | ], 38 | }, 39 | ) 40 | -------------------------------------------------------------------------------- /xtable_operator/files/requirements.txt: -------------------------------------------------------------------------------- 1 | jpype1 ==1.5.0 -------------------------------------------------------------------------------- /xtable_operator/files/startup.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | #!/bin/sh 5 | if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]] 6 | then 7 | sudo yum install -y java-11-amazon-corretto-headless 8 | fi -------------------------------------------------------------------------------- /xtable_operator/operator.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from pathlib import Path 5 | import yaml 6 | 7 | from airflow.models import BaseOperator 8 | from airflow.utils.decorators import apply_defaults 9 | 10 | from xtable_operator import cmd 11 | 12 | 13 | class XtableOperator(BaseOperator): 14 | """ 15 | Xtable Operator 16 | 17 | This Operator is meant to be executed in an Amazon MWAA or Airflow environment. It capsulates the Apache Xtable java library calls. 
18 | """ 19 | 20 | @apply_defaults 21 | def __init__( 22 | self, 23 | dataset_config: dict, 24 | tmp_path: Path, 25 | iceberg_catalog_config: dict = None, 26 | java11: Path = None, 27 | *args, 28 | **kwargs 29 | ): 30 | """ 31 | Initialize all the relevant paths. We need the xtable dataset and catalog configurations, 32 | as well as a writeable temporary path to generate the config files on the fly. For executing the 33 | xtable command, we need the java 11 library in the jars folder. 34 | 35 | Args: 36 | dataset_config (dict): Dataset configuration. 37 | tmp_path (Path): Path to the temporary folder. 38 | iceberg_catalog_config (dict, optional): Iceberg catalog configuration. Defaults to None. 39 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 40 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 41 | ). 42 | """ 43 | super(XtableOperator, self).__init__(*args, **kwargs) 44 | 45 | # write config file 46 | config_path = tmp_path / "config.yaml" 47 | with config_path.open("w") as file: 48 | yaml.dump(dataset_config, file) 49 | 50 | # write catalog file 51 | if iceberg_catalog_config: 52 | catalog_path = tmp_path / "catalog.yaml" 53 | with catalog_path.open("w") as file: 54 | yaml.dump(iceberg_catalog_config, file) 55 | else: 56 | catalog_path = None 57 | 58 | # self 59 | self.config_path: Path = config_path 60 | self.catalog_path: Path = catalog_path 61 | self.jars: Path = Path(__file__).resolve().parent / "jars" 62 | self.java11: Path = java11 63 | 64 | def execute(self, context): 65 | """Execute the xtable command.""" 66 | cmd.sync( 67 | config_path=self.config_path, 68 | catalog_path=self.catalog_path, 69 | jars=self.jars, 70 | ) 71 | --------------------------------------------------------------------------------