├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── docs
│   ├── AWSGlueTableProperties.png
│   ├── BuildOutput.png
│   ├── ConfigurationUpdate.png
│   ├── DagFolder.png
│   └── MwaaFolder.png
├── scripts
│   ├── create_delta_glue_job.py
│   ├── create_delta_s3.py
│   ├── create_hudi_glue_job.py
│   ├── create_hudi_s3.py
│   ├── create_iceberg_glue_job.py
│   ├── create_iceberg_s3.py
│   └── requirements.txt
├── xtable_lambda
│   ├── app.py
│   ├── cdk.json
│   ├── cdk_stack.py
│   ├── requirements.txt
│   └── src
│       ├── Dockerfile
│       ├── __init__.py
│       ├── converter.py
│       ├── detector.py
│       ├── models.py
│       ├── requirements.txt
│       └── xtable.py
└── xtable_operator
    ├── Dockerfile
    ├── __init__.py
    ├── build-airflow-operator.sh
    ├── cmd.py
    ├── files
    │   ├── .airflowignore
    │   ├── DAG.py
    │   ├── requirements.txt
    │   └── startup.sh
    └── operator.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | out
3 | .jar
4 | .java-version
5 | .DS_Store
6 | .pytest_cache
7 | __pycache__
8 | dist
9 | xtable
10 | jars
11 | build
12 | *.egg-info
13 | licenses.txt
14 | probe_out
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. 
Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT No Attribution
2 | 
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
17 | 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # `apache-xtable-on-aws-samples` 2 | 3 | - [`apache-xtable-on-aws-samples`](#apache-xtable-on-aws-samples) 4 | - [AWS Lambda](#aws-lambda) 5 | - [Prerequisites - AWS Lambda](#prerequisites---aws-lambda) 6 | - [Step-by-step-guide - AWS Lambda](#step-by-step-guide---aws-lambda) 7 | - [Clean Up](#clean-up) 8 | - [Amazon MWAA Operator](#amazon-mwaa-operator) 9 | - [Prerequisites - Amazon MWAA Operator](#prerequisites---amazon-mwaa-operator) 10 | - [Step-by-step-guide - Amazon MWAA Operator](#step-by-step-guide---amazon-mwaa-operator) 11 | - [\[Optional step: AWS Glue Data Catalog used as an Iceberg Catalog\]](#optional-step-aws-glue-data-catalog-used-as-an-iceberg-catalog) 12 | - [Clean Up](#clean-up-1) 13 | - [Security](#security) 14 | - [Code of Conduct](#code-of-conduct) 15 | - [License](#license) 16 | 17 | Open table formats (OTFs) like Apache Iceberg are being increasingly adopted, for example, to improve transactional consistency of a data lake or to consolidate batch and streaming data pipelines on a single file format and reduce complexity. In practice, architects need to integrate the chosen format with the various layers of a modern data platform. However, the level of support for the different OTFs varies across common analytical services. 18 | 19 | Commercial vendors and the open source community have recognized this situation and are working on interoperability between table formats. One approach is to make a single physical dataset readable in different formats by translating its metadata and avoiding reprocessing of actual data files. Apache XTable is an open source solution that follows this approach and provides abstractions and tools for the translation of open table format metadata. 20 | 21 | In this repository, we show how to get started with Apache XTable on AWS. This repository is meant to be used in combination with related blog posts on this topic: 22 | 23 | - [Blog - Run Apache XTable in AWS Lambda for background conversion of open table formats](https://aws.amazon.com/blogs/big-data/run-apache-xtable-in-aws-lambda-for-background-conversion-of-open-table-formats/) 24 | - [Blog - Run Apache XTable on Amazon MWAA to translate open table formats](https://aws.amazon.com/blogs/big-data/run-apache-xtable-on-amazon-mwaa-to-translate-open-table-formats/) 25 | 26 | ## AWS Lambda 27 | 28 | Let’s dive deeper into how to deploy the provided AWS CDK stack. 29 | 30 | ### Prerequisites - AWS Lambda 31 | 32 | You need one of the following container runtimes and AWS CDK installed: 33 | 34 | - [Finch](https://runfinch.com/docs/getting-started/installation/) or [Docker](https://docs.docker.com/get-docker/) 35 | - [AWS CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html) 36 | 37 | You also need an existing lakehouse setup or can use the provided [scripts](https://github.com/aws-samples/apache-xtable-on-aws-samples/tree/main/scripts) to set up a test environment. 38 | 39 | ### Step-by-step-guide - AWS Lambda 40 | 41 | (1) To deploy the stack, clone the github repo, change into the folder for this Blog Post (xtable_lambda) and deploy the CDK stack. 
This deploys a set of Lambda functions and an hourly Amazon EventBridge schedule:
42 | 
43 | ``` bash
44 | git clone https://github.com/aws-samples/apache-xtable-on-aws-samples.git
45 | cd xtable_lambda
46 | cdk deploy
47 | ```
48 | 
49 | When using Finch, you need to set the `CDK_DOCKER` environment variable before deployment:
50 | 
51 | ``` bash
52 | export CDK_DOCKER=finch
53 | ```
54 | 
55 | After the deployment, all correctly configured Glue Tables will be transformed every hour.
56 | 
57 | (2) Set required Glue Data Catalog parameters:
58 | 
59 | For the solution to work, you have to set the following parameters in the Glue Data Catalog on each Glue Table you want to transform:
60 | 
61 | - `"xtable_table_type": "<HUDI|DELTA|ICEBERG>"` (the current source format of the table)
62 | - `"xtable_target_formats": "<HUDI|DELTA|ICEBERG>, <HUDI|DELTA|ICEBERG>"` (a comma-separated list of target formats)
63 | 
64 | In the AWS Console, the parameters look like the following and can be set under “Table properties” when editing a Glue Table:
65 | 
66 | ![AWS Glue Table Properties](./docs/AWSGlueTableProperties.png)
67 | 
68 | (3) (Optional) Create a data lake test environment
69 | 
70 | In case you do not have a lakehouse setup, these [scripts](https://github.com/aws-samples/apache-xtable-on-aws-samples/tree/main/scripts) can help you to set up a test environment either on your local machine or in an [AWS Glue for Spark Job](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-intro-tutorial.html).
71 | 
72 | ``` bash
73 | # local: create a Hudi dataset on S3
74 | cd scripts
75 | pip install -r requirements.txt
76 | python ./create_hudi_s3.py
77 | ```
78 | 
79 | ### Clean Up
80 | 
81 | - Delete the deployed CDK stack:
82 | 
83 | ```bash
84 | cdk destroy
85 | ```
86 | 
87 | - Delete the downloaded git repository:
88 | 
89 | ```bash
90 | rm -r apache-xtable-on-aws-samples
91 | ```
92 | 
93 | - Delete the used docker image:
94 | 
95 | ```bash
96 | docker image rm public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64
97 | ```
98 | 
99 | - Revert the AWS Glue configuration in your Glue Tables
100 | - (Optional) Delete test data files created with the test scripts in your S3 Bucket
101 | 
102 | ## Amazon MWAA Operator
103 | 
104 | To deploy the custom operator to Amazon MWAA, we upload it together with DAGs into the configured DAG folder. Besides the operator itself, we also need to upload XTable’s executable JAR. As of writing this post, the JAR needs to be compiled by the user from source code. To simplify this, we provide a container-based build script.
105 | 
106 | ### Prerequisites - Amazon MWAA Operator
107 | 
108 | We assume you have at least an environment consisting of Amazon MWAA itself, an S3 bucket, and an [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) role for Amazon MWAA that has read access to the bucket and optionally write access to the AWS Glue Data Catalog. 
In addition, you need one of the following container runtimes to run the provided build script for XTable:
109 | 
110 | - [Finch](https://runfinch.com/docs/getting-started/installation/)
111 | - [Docker](https://docs.docker.com/get-docker/)
112 | 
113 | ### Step-by-step-guide - Amazon MWAA Operator
114 | 
115 | To compile XTable, you can use the provided build script and complete the following steps:
116 | 
117 | (1) Clone the sample code from GitHub:
118 | 
119 | ``` bash
120 | git clone git@github.com:aws-samples/apache-xtable-on-aws-samples.git
121 | cd ./apache-xtable-on-aws-samples/xtable_operator
122 | ```
123 | 
124 | (2) Run the build script:
125 | 
126 | ```bash
127 | ./build-airflow-operator.sh
128 | ```
129 | 
130 | ![Build output](./docs/BuildOutput.png)
131 | 
132 | (3) Because the Airflow operator uses the library [JPype](https://github.com/jpype-project/jpype) to invoke XTable’s JAR, add the dependency to the Amazon MWAA requirements.txt file:
133 | 
134 | ``` python
135 | # requirements.txt
136 | JPype1==1.5.0
137 | ```
138 | 
139 | For background on installing additional Python libraries on Amazon MWAA, see [Installing Python dependencies](https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags-dependencies.html).
140 | 
141 | (4) Because XTable is Java-based, a Java 11 runtime environment (JRE) is required on Amazon MWAA. You can use Amazon MWAA’s startup script to install a JRE. Add the following lines to an existing startup script or create a new one as provided in the sample code base of this post:
142 | 
143 | ``` bash
144 | # startup.sh
145 | if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
146 | then
147 |     sudo yum install -y java-11-amazon-corretto-headless
148 | fi
149 | ```
150 | 
151 | For more information about this mechanism, see [Using a startup script with Amazon MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/using-startup-script.html).
152 | 
153 | (5) Upload *xtable_operator/*, *requirements.txt*, *startup.sh* and *.airflowignore* to the S3 bucket and the respective paths from which Amazon MWAA reads files. Make sure the IAM role for Amazon MWAA has appropriate read permissions.
154 | 
155 | With regard to the custom operator, make sure to upload the local folder *xtable_operator/* into the configured DAG folder along with the [*.airflowignore*](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#airflowignore) file, for example as sketched below.
156 | 
157 | 
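The build script writes everything that needs to be uploaded into the local *out/* folder. The following is a minimal upload sketch using boto3; the bucket name and the *dags/* prefix are placeholders and must match your Amazon MWAA environment's S3 bucket and DAG folder configuration:

``` python
# Sketch: upload the build output to the MWAA bucket.
# BUCKET and the "dags/" prefix are hypothetical; adjust them to your environment.
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "my-mwaa-bucket"

out = Path("out")
operator_dir = out / "xtable_operator"

# Custom operator, including the compiled JARs, goes into the DAG folder
for file in operator_dir.rglob("*"):
    if file.is_file():
        s3.upload_file(str(file), BUCKET, f"dags/xtable_operator/{file.relative_to(operator_dir)}")

# DAG and .airflowignore also go into the DAG folder
s3.upload_file(str(out / "DAG.py"), BUCKET, "dags/DAG.py")
s3.upload_file(str(out / ".airflowignore"), BUCKET, "dags/.airflowignore")

# requirements.txt and startup.sh are referenced directly by the environment configuration
s3.upload_file(str(out / "requirements.txt"), BUCKET, "requirements.txt")
s3.upload_file(str(out / "startup.sh"), BUCKET, "startup.sh")
```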

158 | ![Amazon MWAA folder in S3](./docs/MwaaFolder.png)
159 | ![DAG folder in S3](./docs/DagFolder.png)
160 | 

161 | 
162 | (6) Update the configuration of your Amazon MWAA environment as follows and start the update process (a programmatic option is sketched after this list):
163 | 
164 | - Add or update the S3 URI to the *requirements.txt* file through the Requirements file configuration option.
165 | - Add or update the S3 URI to the *startup.sh* script through the Startup script configuration option.
166 | 
167 | 
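If you prefer to script this configuration update instead of using the Amazon MWAA console, the following is a minimal sketch with boto3; the environment name is a placeholder, and the two relative paths must match the keys you uploaded to the environment's S3 bucket in step (5):

``` python
# Sketch: point the MWAA environment at the uploaded requirements.txt and startup.sh.
# "my-xtable-environment" is a hypothetical name; replace it with your environment.
import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-xtable-environment",
    RequirementsS3Path="requirements.txt",  # key relative to the environment's S3 bucket
    StartupScriptS3Path="startup.sh",       # key relative to the environment's S3 bucket
)
```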

168 | ![Configuration update](./docs/ConfigurationUpdate.png)
169 | 
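Once the environment update has finished, the operator can be used from a DAG. The following is a minimal sketch based on the sample in *xtable_operator/files/DAG.py*; the table base path and names are illustrative only:

``` python
# Sketch of a DAG using the XtableOperator; paths and names are examples.
from datetime import datetime
from pathlib import Path

from airflow import DAG
from xtable_operator.operator import XtableOperator

with DAG("xtableDag", schedule_interval="@once", start_date=datetime(2024, 1, 1)) as dag:
    # Writable folder where the operator generates the XTable config files on the fly
    tmp_path = Path("/usr/local/airflow/tmp")
    tmp_path.mkdir(exist_ok=True)

    sync = XtableOperator(
        task_id="xtableTask",
        tmp_path=tmp_path,
        dataset_config={
            "sourceFormat": "DELTA",
            "targetFormats": ["ICEBERG"],
            "datasets": [
                {
                    "tableBasePath": "s3://example-bucket/datalake/random_numbers",  # illustrative path
                    "tableName": "random_numbers",
                    "namespace": "table",
                }
            ],
        },
    )
```

The operator also accepts an optional `iceberg_catalog_config` dictionary, which is relevant if you want to register the generated Iceberg metadata in the AWS Glue Data Catalog as described in the optional step below.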

170 | 
171 | ### [Optional step: AWS Glue Data Catalog used as an Iceberg Catalog]
172 | 
173 | Optionally, you can use the AWS Glue Data Catalog as an Iceberg catalog.
174 | 
175 | (7) In case you create Iceberg metadata and want to register it in the AWS Glue Data Catalog, the Amazon MWAA role needs permissions to create or modify tables in AWS Glue. The following listing shows a minimal policy for this. It constrains permissions to a defined database in AWS Glue; replace the `<region>`, `<account-id>`, and `<database-name>` placeholders with your values:
176 | 
177 | ``` json
178 | {
179 |     "Version": "2012-10-17",
180 |     "Statement": [
181 |         {
182 |             "Effect": "Allow",
183 |             "Action": [
184 |                 "glue:GetDatabase",
185 |                 "glue:CreateTable",
186 |                 "glue:GetTables",
187 |                 "glue:UpdateTable",
188 |                 "glue:GetDatabases",
189 |                 "glue:GetTable"
190 |             ],
191 |             "Resource": [
192 |                 "arn:aws:glue:<region>:<account-id>:catalog",
193 |                 "arn:aws:glue:<region>:<account-id>:database/<database-name>",
194 |                 "arn:aws:glue:<region>:<account-id>:table/<database-name>/*"
195 |             ]
196 |         }
197 |     ]
198 | }
199 | ```
200 | 
201 | ### Clean Up
202 | 
203 | - Delete the downloaded git repository:
204 | 
205 | ```bash
206 | rm -r apache-xtable-on-aws-samples
207 | ```
208 | 
209 | - Delete the used docker image:
210 | 
211 | ```bash
212 | docker image rm public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1
213 | ```
214 | 
215 | - Revert the Amazon MWAA configurations in:
216 |   - requirements.txt
217 |   - startup.sh
218 |   - DAG
219 |   - MWAA execution role
220 | 
221 | - Delete files or versions of files in your S3 Bucket:
222 |   - requirements.txt
223 |   - startup.sh
224 |   - DAG
225 |   - .airflowignore
226 | 
227 | ## Security
228 | 
229 | See [CONTRIBUTING](./CONTRIBUTING.md#security-issue-notifications) for more information.
230 | 
231 | ## Code of Conduct
232 | 
233 | See [CODE OF CONDUCT](./CODE_OF_CONDUCT.md) for more information.
234 | 
235 | ## License
236 | 
237 | This library is licensed under the MIT-0 License. See the [LICENSE](./LICENSE) file. 
238 | -------------------------------------------------------------------------------- /docs/AWSGlueTableProperties.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/AWSGlueTableProperties.png -------------------------------------------------------------------------------- /docs/BuildOutput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/BuildOutput.png -------------------------------------------------------------------------------- /docs/ConfigurationUpdate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/ConfigurationUpdate.png -------------------------------------------------------------------------------- /docs/DagFolder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/DagFolder.png -------------------------------------------------------------------------------- /docs/MwaaFolder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/docs/MwaaFolder.png -------------------------------------------------------------------------------- /scripts/create_delta_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore 11 | # --datalake-formats 12 | # delta 13 | # --additional-python-modules 14 | # Faker 15 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 16 | glueContext = GlueContext(SparkContext()) 17 | spark = glueContext.spark_session 18 | job = Job(glueContext) 19 | job.init(args["JOB_NAME"], args) 20 | 21 | 22 | def generate_dataframe(n=100): 23 | import datetime 24 | import numpy as np 25 | import pandas as pd 26 | 27 | rng = np.random.default_rng() 28 | 29 | df = pd.DataFrame( 30 | { 31 | "date": rng.choice( 32 | pd.date_range( 33 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 34 | ).strftime("%Y-%m-%d"), 35 | size=n, 36 | ), 37 | "col1_int": rng.integers(low=0, high=100, size=n), 38 | "col2_float": rng.uniform(low=0, high=1, size=n), 39 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 40 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 41 | "col5_bool": rng.choice([True, False], size=n), 42 | } 43 | ) 44 | 45 | df["id"] = df.index 46 | return spark.createDataFrame(df) 47 | 48 | 49 | additional_options = { 50 | "path": args["DELTA_TABLE_PATH"], # e.g. 
s3://data/delta/table 51 | } 52 | 53 | df = generate_dataframe() 54 | ( 55 | df.write.format("delta") 56 | .options(**additional_options) 57 | .mode("append") 58 | .partitionBy("date") 59 | .saveAsTable("table") 60 | ) 61 | 62 | 63 | job.commit() 64 | -------------------------------------------------------------------------------- /scripts/create_delta_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pyspark 3 | from delta import configure_spark_with_delta_pip 4 | # env 5 | from dotenv import load_dotenv 6 | 7 | load_dotenv() # take environment variables from .env 8 | 9 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 10 | 11 | # session 12 | builder = ( 13 | pyspark.sql.SparkSession.builder.appName("MyApp") 14 | .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 15 | .config( 16 | "spark.sql.catalog.spark_catalog", 17 | "org.apache.spark.sql.delta.catalog.DeltaCatalog", 18 | ) 19 | .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 20 | ) 21 | spark = configure_spark_with_delta_pip( 22 | builder, extra_packages=["org.apache.spark:spark-hadoop-cloud_2.12:3.5.1"] 23 | ).getOrCreate() 24 | 25 | 26 | def generate_dataframe(n=100): 27 | import datetime 28 | import numpy as np 29 | import pandas as pd 30 | 31 | rng = np.random.default_rng() 32 | 33 | df = pd.DataFrame( 34 | { 35 | "date": rng.choice( 36 | pd.date_range( 37 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 38 | ).strftime("%Y-%m-%d"), 39 | size=n, 40 | ), 41 | "col1_int": rng.integers(low=0, high=100, size=n), 42 | "col2_float": rng.uniform(low=0, high=1, size=n), 43 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 44 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 45 | "col5_bool": rng.choice([True, False], size=n), 46 | } 47 | ) 48 | 49 | df["id"] = df.index 50 | return spark.createDataFrame(df) 51 | 52 | 53 | if __name__ == "__main__": 54 | # generate dataframe 55 | dataframe = generate_dataframe() 56 | 57 | # write table 58 | ( 59 | dataframe.write.format("delta") 60 | .option("path", os.environ["DELTA_TABLE_PATH"]) # e.g. 
s3://data/delta/table 61 | .mode("overwrite") 62 | .partitionBy("date") 63 | .saveAsTable("table") 64 | ) 65 | 66 | -------------------------------------------------------------------------------- /scripts/create_hudi_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false 11 | # --datalake-formats 12 | # hudi 13 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 14 | glueContext = GlueContext(SparkContext()) 15 | spark = glueContext.spark_session 16 | job = Job(glueContext) 17 | job.init(args["JOB_NAME"], args) 18 | 19 | 20 | def generate_dataframe(n=100): 21 | import datetime 22 | import numpy as np 23 | import pandas as pd 24 | 25 | rng = np.random.default_rng() 26 | 27 | df = pd.DataFrame( 28 | { 29 | "date": rng.choice( 30 | pd.date_range( 31 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 32 | ).strftime("%Y-%m-%d"), 33 | size=n, 34 | ), 35 | "col1_int": rng.integers(low=0, high=100, size=n), 36 | "col2_float": rng.uniform(low=0, high=1, size=n), 37 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 38 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 39 | "col5_bool": rng.choice([True, False], size=n), 40 | } 41 | ) 42 | 43 | df["id"] = df.index 44 | return spark.createDataFrame(df) 45 | 46 | 47 | additional_options = { 48 | "hoodie.table.name": "table", 49 | "hoodie.datasource.write.table.name": "table", 50 | "hoodie.datasource.write.storage.type": "COPY_ON_WRITE", 51 | "hoodie.datasource.write.operation": "insert", 52 | "hoodie.datasource.write.recordkey.field": "id", 53 | "hoodie.datasource.write.precombine.field": "col4_time", 54 | "hoodie.datasource.write.partitionpath.field": "date", 55 | "hoodie.datasource.write.hive_style_partitioning": "true", 56 | "path": args["HUDI_TABLE_PATH"], # e.g. 
s3://data/hudi/table 57 | } 58 | 59 | df = generate_dataframe() 60 | df.write.format("hudi").options(**additional_options).mode("append").save() 61 | 62 | job.commit() 63 | -------------------------------------------------------------------------------- /scripts/create_hudi_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pyspark.sql import SparkSession 3 | 4 | 5 | # env 6 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 7 | 8 | # session 9 | spark = ( 10 | SparkSession.builder.appName("MyApp") 11 | .config( 12 | "spark.jars.packages", 13 | "org.apache.hudi:hudi-spark3-bundle_2.12:0.15.0,org.apache.spark:spark-hadoop-cloud_2.12:3.5.1", 14 | ) 15 | .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 16 | .config( 17 | "spark.sql.catalog.spark_catalog", 18 | "org.apache.spark.sql.hudi.catalog.HoodieCatalog", 19 | ) 20 | .config( 21 | "spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension" 22 | ) 23 | .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar") 24 | .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 25 | .config("spark.driver.memory", "16g") 26 | .config("spark.hadoop.fs.s3a.fast.upload", "true") 27 | .config("spark.hadoop.fs.s3a.upload.buffer", "bytebuffer") 28 | .getOrCreate() 29 | ) 30 | 31 | 32 | def generate_dataframe(n=100): 33 | import datetime 34 | import numpy as np 35 | import pandas as pd 36 | 37 | rng = np.random.default_rng() 38 | 39 | df = pd.DataFrame( 40 | { 41 | "date": rng.choice( 42 | pd.date_range( 43 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 44 | ).strftime("%Y-%m-%d"), 45 | size=n, 46 | ), 47 | "col1_int": rng.integers(low=0, high=100, size=n), 48 | "col2_float": rng.uniform(low=0, high=1, size=n), 49 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 50 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 51 | "col5_bool": rng.choice([True, False], size=n), 52 | } 53 | ) 54 | 55 | df["id"] = df.index 56 | return spark.createDataFrame(df) 57 | 58 | 59 | if __name__ == "__main__": 60 | hoodie_options = { 61 | "hoodie.table.name": "table", 62 | "hoodie.datasource.write.recordkey.field": "id", 63 | "hoodie.datasource.write.precombine.field": "col4_time", 64 | "hoodie.datasource.write.operation": "insert", 65 | "hoodie.datasource.write.partitionpath.field": "date", 66 | "hoodie.datasource.write.table.name": "table", 67 | "hoodie.deltastreamer.source.dfs.listing.max.fileid": "-1", 68 | "hoodie.datasource.write.hive_style_partitioning": "true", 69 | "path": os.environ["HUDI_TABLE_PATH"], # e.g. s3://data/hudi/table 70 | } 71 | 72 | # generate dataframe 73 | dataframe = generate_dataframe() 74 | 75 | dataframe.write.format("hudi").options(**hoodie_options).mode("append").save() 76 | -------------------------------------------------------------------------------- /scripts/create_iceberg_glue_job.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | 8 | # Glue Setup 9 | # --conf 10 | # spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.spark_catalog.warehouse=<# e.g. 
s3://data/iceberg/table> --conf spark.sql.catalog.spark_catalog.type=hadoop --conf spark.sql.defaultCatalog=spark_catalog --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 11 | # --datalake-formats 12 | # iceberg 13 | # --additional-python-modules 14 | # Faker 15 | 16 | 17 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 18 | glueContext = GlueContext(SparkContext()) 19 | spark = glueContext.spark_session 20 | job = Job(glueContext) 21 | job.init(args["JOB_NAME"], args) 22 | 23 | 24 | def generate_dataframe(n=100): 25 | import datetime 26 | import numpy as np 27 | import pandas as pd 28 | 29 | rng = np.random.default_rng() 30 | 31 | df = pd.DataFrame( 32 | { 33 | "date": rng.choice( 34 | pd.date_range( 35 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 36 | ).strftime("%Y-%m-%d"), 37 | size=n, 38 | ), 39 | "col1_int": rng.integers(low=0, high=100, size=n), 40 | "col2_float": rng.uniform(low=0, high=1, size=n), 41 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 42 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 43 | "col5_bool": rng.choice([True, False], size=n), 44 | } 45 | ) 46 | 47 | df["id"] = df.index 48 | return spark.createDataFrame(df) 49 | 50 | 51 | # generate dataframe 52 | additional_options = { 53 | "path": args["ICEBERG_TABLE_PATH"], # e.g. s3://data/iceberg/table 54 | } 55 | 56 | # Generate and sort by partitions 57 | df = generate_dataframe() 58 | df = df.repartition("date").sortWithinPartitions("date") 59 | df.createOrReplaceTempView("tmp_table") 60 | 61 | ( 62 | df.write.format("iceberg") 63 | .options(**additional_options) 64 | .mode("append") 65 | .partitionBy("date") 66 | .saveAsTable("iceberg.table") 67 | ) 68 | 69 | # commit 70 | job.commit() 71 | -------------------------------------------------------------------------------- /scripts/create_iceberg_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pyspark.sql import SparkSession 3 | 4 | 5 | # env 6 | os.environ["SPARK_LOCAL_IP"] = "127.0.0.1" 7 | 8 | # session 9 | spark = ( 10 | SparkSession.builder.appName("MyApp") 11 | .config( 12 | "spark.jars.packages", 13 | "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2,org.apache.spark:spark-hadoop-cloud_2.12:3.4.3,software.amazon.awssdk:bundle:2.26.21,software.amazon.awssdk:url-connection-client:2.26.21", 14 | ) 15 | .config( 16 | "spark.sql.extensions", 17 | "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", 18 | ) 19 | .config( 20 | "spark.sql.catalog.spark_catalog", 21 | "org.apache.iceberg.spark.SparkCatalog", 22 | ) 23 | .config( 24 | "spark.sql.catalog.spark_catalog.warehouse", 25 | os.environ["ICEBERG_WAREHOUSE_PATH"], # e.g. 
s3://data/ 26 | ) 27 | .config( 28 | "spark.sql.catalog.spark_catalog.type", 29 | "hadoop", 30 | ) 31 | .config( 32 | "spark.sql.defaultCatalog", 33 | "spark_catalog", 34 | ) 35 | .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 36 | .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 37 | .config("spark.driver.memory", "16g") 38 | .config("spark.hadoop.fs.s3a.fast.upload", "true") 39 | .config("spark.hadoop.fs.s3a.upload.buffer", "bytebuffer") 40 | .getOrCreate() 41 | ) 42 | 43 | 44 | def generate_dataframe(n=100): 45 | import datetime 46 | import numpy as np 47 | import pandas as pd 48 | 49 | rng = np.random.default_rng() 50 | 51 | df = pd.DataFrame( 52 | { 53 | "date": rng.choice( 54 | pd.date_range( 55 | datetime.datetime(2024, 1, 1), datetime.datetime(2025, 1, 1) 56 | ).strftime("%Y-%m-%d"), 57 | size=n, 58 | ), 59 | "col1_int": rng.integers(low=0, high=100, size=n), 60 | "col2_float": rng.uniform(low=0, high=1, size=n), 61 | "col3_str": rng.choice([f"str_{i}" for i in range(10)], size=n), 62 | "col4_time": pd.Timestamp.now(tz=datetime.timezone.utc).timestamp(), 63 | "col5_bool": rng.choice([True, False], size=n), 64 | } 65 | ) 66 | 67 | df["id"] = df.index 68 | return spark.createDataFrame(df) 69 | 70 | 71 | if __name__ == "__main__": 72 | # generate dataframe 73 | additional_options = { 74 | "path": os.environ["ICEBERG_TABLE_PATH"], # e.g. s3://data/iceberg/table 75 | } 76 | 77 | # Generate and sort by partitions 78 | df = generate_dataframe() 79 | df = df.repartition("date").sortWithinPartitions("date") 80 | df.createOrReplaceTempView("tmp_table") 81 | 82 | ( 83 | df.write.format("iceberg") 84 | .options(**additional_options) 85 | .mode("append") 86 | .partitionBy("date") 87 | .saveAsTable("iceberg.table") 88 | ) 89 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | delta-spark==3.2.0 2 | pyspark==3.5.1 -------------------------------------------------------------------------------- /xtable_lambda/app.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | #!/usr/bin/env python3 4 | import os 5 | 6 | import aws_cdk as cdk 7 | from aws_cdk import Aspects 8 | from cdk_nag import AwsSolutionsChecks 9 | 10 | 11 | from cdk_stack import XtableLambda 12 | 13 | 14 | app = cdk.App() 15 | XtableLambda( 16 | app, 17 | "XtableLambda", 18 | env=cdk.Environment( 19 | account=os.getenv("CDK_DEFAULT_ACCOUNT"), region=os.getenv("CDK_DEFAULT_REGION") 20 | ), 21 | ) 22 | Aspects.of(app).add(AwsSolutionsChecks(verbose=True)) 23 | 24 | app.synth() 25 | -------------------------------------------------------------------------------- /xtable_lambda/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "**/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-lambda:recognizeLayerVersion": true, 19 | "@aws-cdk/core:checkSecretUsage": true, 20 | "@aws-cdk/core:target-partitions": [ 21 | "aws", 22 | "aws-cn" 23 | ], 24 | "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true, 25 | "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true, 26 | "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true, 27 | "@aws-cdk/aws-iam:minimizePolicies": true, 28 | "@aws-cdk/core:validateSnapshotRemovalPolicy": true, 29 | "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true, 30 | "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true, 31 | "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true, 32 | "@aws-cdk/aws-apigateway:disableCloudWatchRole": true, 33 | "@aws-cdk/core:enablePartitionLiterals": true, 34 | "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true, 35 | "@aws-cdk/aws-iam:standardizedServicePrincipals": true, 36 | "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true, 37 | "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true, 38 | "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true, 39 | "@aws-cdk/aws-route53-patters:useCertificate": true, 40 | "@aws-cdk/customresources:installLatestAwsSdkDefault": false, 41 | "@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true, 42 | "@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true, 43 | "@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true, 44 | "@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true, 45 | "@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true, 46 | "@aws-cdk/aws-redshift:columnId": true, 47 | "@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true, 48 | "@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true, 49 | "@aws-cdk/aws-apigateway:requestValidatorUniqueId": true, 50 | "@aws-cdk/aws-kms:aliasNameRef": true, 51 | "@aws-cdk/aws-autoscaling:generateLaunchTemplateInsteadOfLaunchConfig": true, 52 | "@aws-cdk/core:includePrefixInUniqueNameGeneration": true, 53 | "@aws-cdk/aws-efs:denyAnonymousAccess": true, 54 | "@aws-cdk/aws-opensearchservice:enableOpensearchMultiAzWithStandby": true, 55 | "@aws-cdk/aws-lambda-nodejs:useLatestRuntimeVersion": true, 56 | "@aws-cdk/aws-efs:mountTargetOrderInsensitiveLogicalId": true, 57 | "@aws-cdk/aws-rds:auroraClusterChangeScopeOfInstanceParameterGroupWithEachParameters": true, 58 | "@aws-cdk/aws-appsync:useArnForSourceApiAssociationIdentifier": true, 59 | "@aws-cdk/aws-rds:preventRenderingDeprecatedCredentials": true, 60 
| "@aws-cdk/aws-codepipeline-actions:useNewDefaultBranchForCodeCommitSource": true, 61 | "@aws-cdk/aws-cloudwatch-actions:changeLambdaPermissionLogicalIdForLambdaAction": true, 62 | "@aws-cdk/aws-codepipeline:crossAccountKeysDefaultValueToFalse": true, 63 | "@aws-cdk/aws-codepipeline:defaultPipelineTypeToV2": true, 64 | "@aws-cdk/aws-kms:reduceCrossAccountRegionPolicyScope": true, 65 | "@aws-cdk/aws-eks:nodegroupNameAttribute": true, 66 | "@aws-cdk/aws-ec2:ebsDefaultGp3Volume": true, 67 | "@aws-cdk/aws-ecs:removeDefaultDeploymentAlarm": true, 68 | "@aws-cdk/custom-resources:logApiResponseDataPropertyTrueDefault": false, 69 | "@aws-cdk/aws-stepfunctions-tasks:ecsReduceRunTaskPermissions": true 70 | } 71 | } 72 | -------------------------------------------------------------------------------- /xtable_lambda/cdk_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | from aws_cdk import ( 4 | Duration, 5 | Stack, 6 | aws_lambda as _lambda, 7 | aws_iam as iam, 8 | aws_events as events, 9 | aws_events_targets as targets, 10 | ) 11 | from constructs import Construct 12 | 13 | 14 | class XtableLambda(Stack): 15 | 16 | def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None: 17 | super().__init__(scope, construct_id, **kwargs) 18 | 19 | # Converter 20 | converter = _lambda.DockerImageFunction( 21 | scope=self, 22 | id="Converter", 23 | memory_size=10240, 24 | architecture=_lambda.Architecture.ARM_64, 25 | code=_lambda.DockerImageCode.from_image_asset( 26 | directory="src", cmd=["converter.handler"] 27 | ), 28 | timeout=Duration.minutes(15), 29 | environment={ 30 | "SPARK_LOCAL_IP": "127.0.0.1", 31 | }, 32 | ) 33 | converter.add_to_role_policy( 34 | iam.PolicyStatement( 35 | actions=[ 36 | "glue:GetTables", 37 | "glue:GetTable", 38 | "glue:GetDatabases", 39 | "glue:GetDatabase", 40 | "glue:UpdateDatabase", 41 | "glue:CreateDatabase", 42 | "glue:UpdateTable", 43 | "glue:CreateTable", 44 | "s3:Abort*", 45 | "s3:DeleteObject*", 46 | "s3:GetBucket*", 47 | "s3:GetObject*", 48 | "s3:List*", 49 | "s3:PutObject", 50 | "s3:PutObjectLegalHold", 51 | "s3:PutObjectRetention", 52 | "s3:PutObjectTagging", 53 | "s3:PutObjectVersionTagging", 54 | ], 55 | resources=["*"], 56 | ) 57 | ) 58 | 59 | # Detector 60 | detector = _lambda.DockerImageFunction( 61 | scope=self, 62 | id="Detector", 63 | architecture=_lambda.Architecture.ARM_64, 64 | memory_size=1024, 65 | code=_lambda.DockerImageCode.from_image_asset( 66 | directory="src", cmd=["detector.handler"] 67 | ), 68 | timeout=Duration.minutes(3), 69 | environment={"CONVERTER_FUNCTION_NAME": converter.function_name}, 70 | ) 71 | converter.grant_invoke(detector) 72 | detector.add_to_role_policy( 73 | iam.PolicyStatement( 74 | actions=["glue:GetTables", "glue:GetDatabases"], 75 | resources=["*"], 76 | ) 77 | ) 78 | 79 | # Scheduled events 80 | event = events.Rule( 81 | scope=self, 82 | id="DetectorSchedule", 83 | schedule=events.Schedule.rate(Duration.hours(1)), 84 | ) 85 | event.add_target(targets.LambdaFunction(detector)) 86 | -------------------------------------------------------------------------------- /xtable_lambda/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.148.1 2 | constructs>=10.0.0,<11.0.0 3 | cdk_nag==2.34.6 4 | 5 | -------------------------------------------------------------------------------- 
/xtable_lambda/src/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # syntax = docker/dockerfile:1.7.0 5 | # args 6 | ARG MAVEN_VERSION=3.9.6 7 | ARG MAVEN_MAJOR_VERSION=3 8 | ARG XTABLE_VERSION=0.1.0-incubating 9 | ARG XTABLE_BRANCH=0.1.0-incubating 10 | ARG ICEBERG_VERSION=1.4.2 11 | ARG SPARK_VERSION=3.4 12 | ARG SCALA_VERSION=2.12 13 | ARG ARCH=aarch64 14 | 15 | # Build Stage 16 | FROM public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64 AS build-stage 17 | WORKDIR / 18 | 19 | ARG MAVEN_VERSION 20 | ARG MAVEN_MAJOR_VERSION 21 | ARG XTABLE_BRANCH 22 | ARG ICEBERG_VERSION 23 | ARG SPARK_VERSION 24 | ARG SCALA_VERSION 25 | 26 | # install java 27 | RUN dnf install -y java-11-amazon-corretto-headless \ 28 | git \ 29 | unzip \ 30 | wget \ 31 | gcc \ 32 | g++ \ 33 | && dnf clean all 34 | 35 | # install maven 36 | RUN wget https://dlcdn.apache.org/maven/maven-"$MAVEN_MAJOR_VERSION"/"$MAVEN_VERSION"/binaries/apache-maven-"$MAVEN_VERSION"-bin.zip 37 | RUN unzip apache-maven-"$MAVEN_VERSION"-bin.zip 38 | 39 | # clone sources 40 | RUN git clone --depth 1 --branch "$XTABLE_BRANCH" https://github.com/apache/incubator-xtable.git 41 | 42 | # build xtable jar 43 | WORKDIR /incubator-xtable 44 | RUN /apache-maven-"$MAVEN_VERSION"/bin/mvn package -DskipTests=true 45 | WORKDIR / 46 | 47 | # Download jars for iceberg and glue 48 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-aws-bundle/"$ICEBERG_VERSION"/iceberg-aws-bundle-"$ICEBERG_VERSION".jar 49 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"/"$ICEBERG_VERSION"/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar 50 | 51 | # Copy requirements.txt 52 | COPY requirements.txt . 53 | 54 | # Install the specified packages 55 | RUN pip install -r requirements.txt -t requirements/ 56 | 57 | USER 1000 58 | 59 | # Run stage 60 | FROM public.ecr.aws/lambda/python:3.12.2024.07.10.11-arm64 61 | 62 | # args 63 | ARG XTABLE_VERSION 64 | ARG ICEBERG_VERSION 65 | ARG SPARK_VERSION 66 | ARG SCALA_VERSION 67 | ARG ARCH 68 | 69 | # copy java 70 | COPY --from=build-stage /usr/lib/jvm/java-11-amazon-corretto."$ARCH" /usr/lib/jvm/java-11-amazon-corretto."$ARCH" 71 | 72 | # Copy jar files 73 | COPY --from=build-stage /incubator-xtable/xtable-utilities/target/xtable-utilities-"$XTABLE_VERSION"-bundled.jar "$LAMBDA_TASK_ROOT"/jars/ 74 | COPY --from=build-stage /iceberg-aws-bundle-"$ICEBERG_VERSION".jar "$LAMBDA_TASK_ROOT"/jars_iceberg/ 75 | COPY --from=build-stage /iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar "$LAMBDA_TASK_ROOT"/jars_iceberg/ 76 | 77 | # Copy python requirements 78 | COPY --from=build-stage /requirements "$LAMBDA_TASK_ROOT" 79 | 80 | # Copy src code 81 | COPY . "$LAMBDA_TASK_ROOT" -------------------------------------------------------------------------------- /xtable_lambda/src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/xtable_lambda/src/__init__.py -------------------------------------------------------------------------------- /xtable_lambda/src/converter.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. 
or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import boto3 4 | from pathlib import Path 5 | 6 | from aws_lambda_powertools.utilities.typing import LambdaContext 7 | from aws_lambda_powertools.utilities.parser import event_parser 8 | 9 | import xtable 10 | from models import ConversionTask 11 | 12 | # clients 13 | lambda_client = boto3.client("lambda") 14 | 15 | 16 | @event_parser(model=ConversionTask) 17 | def handler(event: ConversionTask, context: LambdaContext): 18 | 19 | # sync with passed Dataset Config 20 | if "ICEBERG" not in event.dataset_config.targetFormats: 21 | 22 | xtable.sync( 23 | dataset_config=event.dataset_config, 24 | tmp_path=Path("/tmp"), 25 | jars=[ 26 | Path(__file__).resolve().parent / "jars/*", 27 | ], 28 | ) 29 | 30 | # if iceberg is target also register the table in the glue data catalog 31 | else: 32 | # sync with passed Dataset and Catalog Config 33 | xtable.sync( 34 | dataset_config=event.dataset_config, 35 | catalog_config=event.catalog_config, 36 | tmp_path=Path("/tmp"), 37 | jars=[ 38 | Path(__file__).resolve().parent / "jars/*", 39 | Path(__file__).resolve().parent / "jars_iceberg/*", 40 | ], 41 | ) 42 | -------------------------------------------------------------------------------- /xtable_lambda/src/detector.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import re 5 | from typing import Any, Dict 6 | 7 | import boto3 8 | from aws_lambda_powertools.utilities.typing import LambdaContext 9 | 10 | from models import ( 11 | ConversionTask, 12 | DatasetConfig, 13 | CatalogConfig, 14 | Dataset, 15 | CatalogOptions, 16 | S3Path, 17 | ) 18 | 19 | # clients 20 | glue_client = boto3.client("glue") 21 | lambda_client = boto3.client("lambda") 22 | 23 | # vars 24 | CONVERTER_FUNCTION_NAME = os.environ["CONVERTER_FUNCTION_NAME"] 25 | 26 | 27 | def get_convertable_tables(): 28 | """ 29 | Generator for all Glue tables in this accounts Glue Catalog 30 | Yields: 31 | dict: table dict 32 | """ 33 | # Get paginator for databases 34 | databases = glue_client.get_paginator("get_databases").paginate() 35 | for database_list in databases: 36 | database_list = database_list["DatabaseList"] 37 | for database in database_list: 38 | # Get paginator for tables 39 | tables = glue_client.get_paginator("get_tables").paginate( 40 | DatabaseName=database["Name"] 41 | ) 42 | for table_list in tables: 43 | table_list = table_list["TableList"] 44 | for table in table_list: 45 | # if required table parameters exist 46 | if {"xtable_target_formats", "xtable_table_type"} <= table["Parameters"].keys(): 47 | yield table 48 | 49 | 50 | def handler(event: Dict[str, Any], context: LambdaContext): 51 | # iterate over databases and tables in Glue 52 | for table in get_convertable_tables(): 53 | # vars 54 | s3_path = S3Path(table["StorageDescriptor"]["Location"]) 55 | 56 | # create task payload 57 | task = ConversionTask( 58 | dataset_config=DatasetConfig( 59 | sourceFormat=table["Parameters"]["xtable_table_type"].upper(), 60 | targetFormats=re.findall( 61 | r"\w+", 62 | table["Parameters"]["xtable_target_formats"].upper(), 63 | ), 64 | datasets=[ 65 | Dataset( 66 | tableBasePath=f"{s3_path}", 67 | tableName=f"{s3_path.name}_converted", 68 | namespace=f"{s3_path.parent.name}", 69 | ) 70 | ], 71 | ) 72 | ) 73 | 74 | # if ICEBERG add Glue Table Warehouse registration 75 | if "ICEBERG" in 
task.dataset_config.targetFormats: 76 | task.catalog_config = CatalogConfig( 77 | catalogOptions=CatalogOptions(warehouse=f"{s3_path}") 78 | ) 79 | 80 | # send to converter lambda 81 | lambda_client.invoke( 82 | FunctionName=CONVERTER_FUNCTION_NAME, 83 | InvocationType="Event", 84 | Payload=task.model_dump_json(by_alias=True), 85 | ) 86 | -------------------------------------------------------------------------------- /xtable_lambda/src/models.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import re 4 | from typing import List, Literal, Optional 5 | from pydantic import BaseModel, Field 6 | from pathlib import PurePath 7 | 8 | 9 | class Dataset(BaseModel): 10 | tableBasePath: str 11 | tableName: str 12 | namespace: Optional[str] = None 13 | partitionSpec: Optional[str] = None 14 | 15 | 16 | class DatasetConfig(BaseModel): 17 | sourceFormat: Literal["DELTA", "ICEBERG", "HUDI"] 18 | targetFormats: List[Literal["DELTA", "ICEBERG", "HUDI"]] 19 | datasets: List[Dataset] 20 | 21 | 22 | class CatalogOptions(BaseModel): 23 | warehouse: str 24 | catalog_impl: str = Field( 25 | alias="catalog-impl", default="org.apache.iceberg.aws.glue.GlueCatalog" 26 | ) 27 | io_impl: str = Field(alias="io-impl", default="org.apache.iceberg.aws.s3.S3FileIO") 28 | 29 | 30 | class CatalogConfig(BaseModel): 31 | catalogImpl: str = Field(default="org.apache.iceberg.aws.glue.GlueCatalog") 32 | catalogName: str = Field(default="glue") 33 | catalogOptions: CatalogOptions 34 | 35 | 36 | class ConversionTask(BaseModel): 37 | dataset_config: DatasetConfig 38 | catalog_config: Optional[CatalogConfig] = None 39 | 40 | 41 | class S3Path(PurePath): 42 | s3_uri_regex = re.compile(r"s3://(?P.+)") 43 | path_regex = re.compile(r"/*(?P.+)") 44 | drive = "s3://" 45 | 46 | def __init__(self, path: str): 47 | s3_match = re.match(self.s3_uri_regex, path) 48 | path_match = re.match(self.path_regex, path) 49 | if s3_match: 50 | super().__init__(s3_match.group("path")) 51 | elif path_match: 52 | super().__init__(path_match.group("path")) 53 | -------------------------------------------------------------------------------- /xtable_lambda/src/requirements.txt: -------------------------------------------------------------------------------- 1 | jpype1==1.5.0 2 | pydantic==2.8.2 3 | aws-lambda-powertools==2.36.0 4 | pyyaml==6.0.1 5 | boto3==1.34.74 -------------------------------------------------------------------------------- /xtable_lambda/src/xtable.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import platform 4 | from pathlib import Path 5 | from typing import List 6 | 7 | import yaml 8 | import jpype 9 | import jpype.imports 10 | import jpype.types 11 | 12 | from models import DatasetConfig, CatalogConfig 13 | 14 | def sync( 15 | dataset_config: DatasetConfig, 16 | tmp_path: Path, 17 | catalog_config: CatalogConfig = None, 18 | java11: Path = Path( 19 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 20 | ), 21 | jars: List[Path] = [Path(__file__).resolve().parent / "jars/*"], 22 | ): 23 | """ 24 | Sync a dataset metadata from source to target format. Optionally writes it to a catalog. 25 | 26 | Args: 27 | dataset_config (dict): Dataset configuration. 28 | tmp_path (Path): Path to a temporary directory. 
29 | catalog_config (dict, optional): Iceberg catalog configuration. Defaults to None. 30 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 31 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 32 | ). 33 | jars (Path(s), optional): Path(s) to the jar files. Defaults to Path(__file__).resolve().parent / "jars". 34 | 35 | Returns: 36 | None 37 | 38 | """ 39 | 40 | # write config file 41 | config_path = tmp_path / "config.yaml" 42 | with config_path.open("w") as file: 43 | yaml.dump(dataset_config.model_dump(by_alias=True), file) 44 | 45 | # write catalog file 46 | if catalog_config: 47 | catalog_path = tmp_path / "catalog.yaml" 48 | with catalog_path.open("w") as file: 49 | yaml.dump(catalog_config.model_dump(by_alias=True), file) 50 | else: 51 | catalog_path = None 52 | 53 | # start a jvm in the background 54 | if jpype.isJVMStarted() is False: 55 | jpype.startJVM(java11.absolute().as_posix(), classpath=jars) 56 | 57 | # call java class with or without catalog config 58 | run_sync = jpype.JPackage("org").apache.xtable.utilities.RunSync.main 59 | if catalog_path: 60 | run_sync( 61 | [ 62 | "--datasetConfig", 63 | config_path.absolute().as_posix(), 64 | "--icebergCatalogConfig", 65 | catalog_path.absolute().as_posix(), 66 | ] 67 | ) 68 | 69 | else: 70 | run_sync(["--datasetConfig", config_path.absolute().as_posix()]) 71 | -------------------------------------------------------------------------------- /xtable_operator/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # syntax = docker/dockerfile:1.7.0 5 | # args 6 | ARG MAVEN_VERSION=3.9.6 7 | ARG MAVEN_MAJOR_VERSION=3 8 | ARG XTABLE_VERSION=0.1.0-SNAPSHOT 9 | ARG XTABLE_BRANCH=v0.1.0-beta1 10 | ARG ICEBERG_VERSION=1.4.2 11 | ARG SPARK_VERSION=3.4 12 | ARG SCALA_VERSION=2.12 13 | 14 | # Build Stage 15 | from public.ecr.aws/amazonlinux/amazonlinux:2023.4.20240319.1 as build-stage 16 | 17 | ARG MAVEN_VERSION 18 | ARG MAVEN_MAJOR_VERSION 19 | ARG XTABLE_BRANCH 20 | ARG ICEBERG_VERSION 21 | ARG SPARK_VERSION 22 | ARG SCALA_VERSION 23 | 24 | # install java 25 | RUN yum install -y java-11-amazon-corretto-headless \ 26 | git \ 27 | unzip \ 28 | wget \ 29 | && yum clean all 30 | 31 | # install maven 32 | RUN wget https://dlcdn.apache.org/maven/maven-"$MAVEN_MAJOR_VERSION"/"$MAVEN_VERSION"/binaries/apache-maven-"$MAVEN_VERSION"-bin.zip 33 | RUN unzip apache-maven-"$MAVEN_VERSION"-bin.zip 34 | 35 | # clone sources 36 | RUN git clone --depth 1 --branch "$XTABLE_BRANCH" https://github.com/apache/incubator-xtable.git 37 | 38 | # build xtable jar 39 | WORKDIR /incubator-xtable 40 | RUN /apache-maven-"$MAVEN_VERSION"/bin/mvn package -DskipTests=true 41 | WORKDIR / 42 | 43 | # Download jars for iceberg and glue 44 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-aws-bundle/"$ICEBERG_VERSION"/iceberg-aws-bundle-"$ICEBERG_VERSION".jar 45 | RUN wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"/"$ICEBERG_VERSION"/iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar 46 | 47 | USER 1000 48 | 49 | # Run Stage 50 | from scratch as out-stage 51 | 52 | # args 53 | ARG XTABLE_VERSION 54 | ARG ICEBERG_VERSION 55 | ARG SPARK_VERSION 56 | ARG SCALA_VERSION 57 | 58 | # Copy jar files 59 | COPY --from=build-stage 
/incubator-xtable/utilities/target/utilities-"$XTABLE_VERSION"-bundled.jar /xtable_operator/jars/ 60 | COPY --from=build-stage /iceberg-aws-bundle-"$ICEBERG_VERSION".jar /xtable_operator/jars/ 61 | COPY --from=build-stage /iceberg-spark-runtime-"$SPARK_VERSION"_"$SCALA_VERSION"-"$ICEBERG_VERSION".jar /xtable_operator/jars/ 62 | 63 | # Add python operator 64 | COPY operator.py /xtable_operator/ 65 | COPY cmd.py /xtable_operator/ -------------------------------------------------------------------------------- /xtable_operator/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/apache-xtable-on-aws-samples/b87cd02d936daa3a3f507bcad5e6e32aafa96f4a/xtable_operator/__init__.py -------------------------------------------------------------------------------- /xtable_operator/build-airflow-operator.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | # Builds the MWAA or Airflow Operator and outputs it to the out folder. This can be used to upload to MWAA. 5 | 6 | # End script if one subcommand fails 7 | set -e 8 | 9 | # Defining colours for output 10 | GREEN='\033[1;32m' 11 | OFF='\033[0m' 12 | 13 | # Check if finch is installed, if not fall back to docker 14 | printf "${GREEN}Checking if finch is installed...${OFF}\n" 15 | if ! command -v finch &> /dev/null 16 | then 17 | finch=docker 18 | else 19 | finch=finch 20 | fi 21 | 22 | # Build Apache Xtable 23 | printf "${GREEN}Building...${OFF}\n" 24 | $finch build -f Dockerfile -o type=tar,dest=out.tar . 25 | 26 | # Copy other helper files to the out directory 27 | printf "${GREEN}Copying build artifacts and other helper files to the out directory...${OFF}\n" 28 | mkdir -p ./out/ \ 29 | && mv ./out.tar ./out/out.tar \ 30 | && cp ./files/requirements.txt ./out/ \ 31 | && cp ./files/DAG.py ./out/ \ 32 | && cp ./files/startup.sh ./out/ \ 33 | && cp ./files/.airflowignore ./out/ 34 | 35 | # Extract docker output 36 | cd out/ \ 37 | && tar xf out.tar \ 38 | && rm ./out.tar 39 | -------------------------------------------------------------------------------- /xtable_operator/cmd.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import jpype 5 | import jpype.imports 6 | import jpype.types 7 | from pathlib import Path 8 | import platform 9 | 10 | def sync( 11 | config_path: Path, 12 | jars: Path, 13 | catalog_path: Path = None, 14 | java11: Path = Path( 15 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 16 | ), 17 | ): 18 | """Sync a dataset metadata from source to target format. Optionally writes it to a catalog. 19 | 20 | Args: 21 | config_path (Path): Path to the dataset config file. 22 | jars (Path): Path to the jars folder. 23 | catalog_path (Path, optional): Path to the catalog config file. Defaults to None. 24 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 25 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 26 | ). 
27 | 28 | """ 29 | 30 | # start a jvm in the background 31 | jpype.startJVM(java11.absolute().as_posix(), classpath=jars / "*") 32 | run_sync = jpype.JPackage("io").onetable.utilities.RunSync.main 33 | 34 | # call java class with or without catalog config 35 | if catalog_path: 36 | run_sync( 37 | [ 38 | "--datasetConfig", 39 | config_path.absolute().as_posix(), 40 | "--icebergCatalogConfig", 41 | catalog_path.absolute().as_posix(), 42 | ] 43 | ) 44 | 45 | else: 46 | run_sync(["--datasetConfig", config_path.absolute().as_posix()]) 47 | 48 | # shutdown 49 | jpype.shutdownJVM() 50 | -------------------------------------------------------------------------------- /xtable_operator/files/.airflowignore: -------------------------------------------------------------------------------- 1 | xtable_operator -------------------------------------------------------------------------------- /xtable_operator/files/DAG.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from airflow import DAG 5 | from datetime import datetime, timedelta 6 | from xtable_operator.operator import XtableOperator 7 | from pathlib import Path 8 | 9 | default_args = { 10 | "owner": "airflow", 11 | "depends_on_past": False, 12 | "start_date": datetime(2024, 1, 1), 13 | "email_on_failure": False, 14 | "email_on_retry": False, 15 | "retries": 1, 16 | "retry_delay": timedelta(minutes=5), 17 | } 18 | 19 | 20 | with DAG( 21 | "xtableDag", max_active_runs=3, schedule_interval="@once", default_args=default_args 22 | ) as dag: 23 | path = Path("/usr/local/airflow/tmp") 24 | path.mkdir(exist_ok=True) 25 | op = XtableOperator( 26 | task_id="xtableTask", 27 | tmp_path=path, 28 | dataset_config={ 29 | "sourceFormat": "DELTA", 30 | "targetFormats": ["ICEBERG"], 31 | "datasets": [ 32 | { 33 | "tableBasePath": "/example/datalake/table/random_numbers", 34 | "tableName": "random_numbers", 35 | "namespace": "table", 36 | } 37 | ], 38 | }, 39 | ) 40 | -------------------------------------------------------------------------------- /xtable_operator/files/requirements.txt: -------------------------------------------------------------------------------- 1 | jpype1 ==1.5.0 -------------------------------------------------------------------------------- /xtable_operator/files/startup.sh: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | #!/bin/sh 5 | if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]] 6 | then 7 | sudo yum install -y java-11-amazon-corretto-headless 8 | fi -------------------------------------------------------------------------------- /xtable_operator/operator.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from pathlib import Path 5 | import yaml 6 | 7 | from airflow.models import BaseOperator 8 | from airflow.utils.decorators import apply_defaults 9 | 10 | from xtable_operator import cmd 11 | 12 | 13 | class XtableOperator(BaseOperator): 14 | """ 15 | Xtable Operator 16 | 17 | This Operator is meant to be executed in an Amazon MWAA or Airflow environment. It capsulates the Apache Xtable java library calls. 
18 | """ 19 | 20 | @apply_defaults 21 | def __init__( 22 | self, 23 | dataset_config: dict, 24 | tmp_path: Path, 25 | iceberg_catalog_config: dict = None, 26 | java11: Path = None, 27 | *args, 28 | **kwargs 29 | ): 30 | """ 31 | Initialize all the relevant paths. We need the xtable dataset and catalog configurations, 32 | as well as a writeable temporary path to generate the config files on the fly. For executing the 33 | xtable command, we need the java 11 library in the jars folder. 34 | 35 | Args: 36 | dataset_config (dict): Dataset configuration. 37 | tmp_path (Path): Path to the temporary folder. 38 | iceberg_catalog_config (dict, optional): Iceberg catalog configuration. Defaults to None. 39 | java11 (Path, optional): Path to the java 11 library. Defaults to Path( 40 | f"/usr/lib/jvm/java-11-amazon-corretto.{platform.machine()}/lib/server/libjvm.so" 41 | ). 42 | """ 43 | super(XtableOperator, self).__init__(*args, **kwargs) 44 | 45 | # write config file 46 | config_path = tmp_path / "config.yaml" 47 | with config_path.open("w") as file: 48 | yaml.dump(dataset_config, file) 49 | 50 | # write catalog file 51 | if iceberg_catalog_config: 52 | catalog_path = tmp_path / "catalog.yaml" 53 | with catalog_path.open("w") as file: 54 | yaml.dump(iceberg_catalog_config, file) 55 | else: 56 | catalog_path = None 57 | 58 | # self 59 | self.config_path: Path = config_path 60 | self.catalog_path: Path = catalog_path 61 | self.jars: Path = Path(__file__).resolve().parent / "jars" 62 | self.java11: Path = java11 63 | 64 | def execute(self, context): 65 | """Execute the xtable command.""" 66 | cmd.sync( 67 | config_path=self.config_path, 68 | catalog_path=self.catalog_path, 69 | jars=self.jars, 70 | ) 71 | --------------------------------------------------------------------------------