├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── app.py
├── assets
│   ├── iceberg-data-level-01.png
│   ├── iceberg-data-level-02.png
│   ├── iceberg-data-level-03.png
│   └── iceberg-table.png
├── cdk.json
├── cdk_stacks
│   ├── __init__.py
│   ├── glue_job_role.py
│   ├── glue_stream_data_schema.py
│   ├── glue_streaming_job.py
│   ├── kds.py
│   ├── lakeformation_permissions.py
│   └── s3.py
├── glue-streaming-data-to-iceberg-table.svg
├── requirements-dev.txt
├── requirements.txt
├── source.bat
└── src
    ├── main
    │   └── python
    │       ├── spark_iceberg_writes_with_dataframe.py
    │       ├── spark_iceberg_writes_with_sql_insert_overwrite.py
    │       └── spark_iceberg_writes_with_sql_merge_into.py
    └── utils
        └── gen_fake_kinesis_stream_data.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 | Untitled*.ipynb
75 |
76 | # pyenv
77 | .python-version
78 |
79 | # celery beat schedule file
80 | celerybeat-schedule
81 |
82 | # SageMath parsed files
83 | *.sage.py
84 |
85 | # Environments
86 | .env
87 | .venv
88 | env/
89 | venv/
90 | ENV/
91 | env.bak/
92 | venv.bak/
93 |
94 | # Spyder project settings
95 | .spyderproject
96 | .spyproject
97 |
98 | # Rope project settings
99 | .ropeproject
100 |
101 | # mkdocs documentation
102 | /site
103 |
104 | # mypy
105 | .mypy_cache/
106 |
107 | .DS_Store
108 | .idea/
109 | bin/
110 | lib64
111 | pyvenv.cfg
112 | *.bak
113 | share/
114 | cdk.out/
115 | cdk.context.json*
116 | zap/
117 |
118 | */.gitignore
119 | */setup.py
120 | */source.bat
121 |
122 | */*/.gitignore
123 | */*/setup.py
124 | */*/source.bat
125 |
126 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # AWS Glue Streaming ETL Job with Apache Iceberg CDK Python project!
3 |
4 | 
5 |
6 | In this project, we create a streaming ETL job in AWS Glue to integrate Iceberg with a streaming use case and create an in-place updatable data lake on Amazon S3.
7 |
8 | After the data is ingested into Amazon S3, you can query it with [Amazon Athena](http://aws.amazon.com/athena).
9 |
10 | This project can be deployed with [AWS CDK Python](https://docs.aws.amazon.com/cdk/api/v2/).
11 | The `cdk.json` file tells the CDK Toolkit how to execute your app.
12 |
13 | This project is set up like a standard Python project. The initialization
14 | process also creates a virtualenv within this project, stored under the `.venv`
15 | directory. To create the virtualenv it assumes that there is a `python3`
16 | (or `python` for Windows) executable in your path with access to the `venv`
17 | package. If for any reason the automatic creation of the virtualenv fails,
18 | you can create the virtualenv manually.
19 |
20 | To manually create a virtualenv on MacOS and Linux:
21 |
22 | ```
23 | $ python3 -m venv .venv
24 | ```
25 |
26 | After the init process completes and the virtualenv is created, you can use the following
27 | step to activate your virtualenv.
28 |
29 | ```
30 | $ source .venv/bin/activate
31 | ```
32 |
33 | If you are on a Windows platform, you would activate the virtualenv like this:
34 |
35 | ```
36 | % .venv\Scripts\activate.bat
37 | ```
38 |
39 | Once the virtualenv is activated, you can install the required dependencies.
40 |
41 | ```
42 | (.venv) $ pip install -r requirements.txt
43 | ```
44 |
45 | In the case of `AWS Glue 3.0`, before synthesizing the CloudFormation template, **you must first set up the Apache Iceberg connector for AWS Glue in order to use Apache Iceberg with AWS Glue jobs.** (For more information, see [References](#references) (2))
46 |
47 | Then you should properly set the CDK context configuration file, `cdk.context.json`.
48 |
49 | For example:
50 |
91 |
92 | :information_source: The `--primary_key` option should be set to the Iceberg table's primary key column name.
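
The streaming job upserts incoming records into the Iceberg table keyed on this column. As a purely illustrative sketch (the catalog, database, table, and column names are assumptions taken from the examples in this README, not the project's actual job script), the kind of `MERGE INTO` statement such an upsert implies looks like this:

```
# Hypothetical upsert keyed on the --primary_key column (assumed here to be "name").
# Assumes a SparkSession with the Iceberg Spark SQL extensions and a "glue_catalog" catalog configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-into-sketch").getOrCreate()

# "updates" stands in for a temporary view registered from the current micro-batch.
spark.sql("""
    MERGE INTO glue_catalog.iceberg_demo_db.iceberg_demo_table AS t
    USING updates AS s
    ON t.name = s.name  -- join on the primary key column
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```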
93 |
94 | :warning: **You should create an S3 bucket for the Glue job script and upload the script file into that S3 bucket.**
95 |
96 | At this point you can now synthesize the CloudFormation template for this code.
97 |
98 |
103 |
104 | To add additional dependencies, for example other CDK libraries, just add
105 | them to your `setup.py` file and rerun the `pip install -r requirements.txt`
106 | command.
107 |
108 | ## Run Test
109 |
110 | 1. Set up the **Apache Iceberg connector for AWS Glue** to use Apache Iceberg with AWS Glue jobs.
111 | 2. Create an S3 bucket for the Apache Iceberg table
112 |
123 |
124 | Running the `cdk deploy GlueSchemaOnKinesisStream` command is equivalent to creating the schema manually in the AWS Glue Data Catalog with the following steps (a minimal CDK sketch of the equivalent declaration follows the list):
125 |
126 | (1) On the AWS Glue console, choose **Data Catalog**.
127 | (2) Choose **Databases**, and click **Add database**.
128 | (3) Create a database with the name `iceberg_demo_db`.
129 | (4) On the **Data Catalog** menu, Choose **Tables**, and click **Add Table**.
130 | (5) For the table name, enter `iceberg_demo_kinesis_stream_table`.
131 | (6) Select `iceberg_demo_db` as a database.
132 | (7) Choose **Kinesis** as the type of source.
133 | (8) Enter the name of the stream.
134 | (9) For the classification, choose **JSON**.
135 | (10) Define the schema according to the following table.
136 | | Column name | Data type | Example |
137 | |-------------|-----------|---------|
138 | | name | string | "Ricky" |
139 | | age | int | 23 |
140 | | m_time | string | "2023-06-13 07:24:26" |
141 |
142 | (11) Choose **Finish**
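
For reference, a minimal CDK Python sketch of declaring the same database and Kinesis-backed table in the Glue Data Catalog is shown below. This is only an illustrative sketch, not the project's actual `glue_stream_data_schema.py`; the construct IDs, the stream ARN placeholder, and the exact Kinesis source parameters are assumptions.

```
import aws_cdk as cdk
from aws_cdk import aws_glue as glue
from constructs import Construct

class GlueSchemaOnKinesisStreamSketch(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Glue database: iceberg_demo_db
        database = glue.CfnDatabase(self, "IcebergDemoDB",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(name="iceberg_demo_db"))

        # Kinesis-backed table with the schema from the table above (name, age, m_time)
        table = glue.CfnTable(self, "KinesisStreamTable",
            catalog_id=self.account,
            database_name="iceberg_demo_db",
            table_input=glue.CfnTable.TableInputProperty(
                name="iceberg_demo_kinesis_stream_table",
                table_type="EXTERNAL_TABLE",
                parameters={"classification": "json"},
                storage_descriptor=glue.CfnTable.StorageDescriptorProperty(
                    columns=[
                        glue.CfnTable.ColumnProperty(name="name", type="string"),
                        glue.CfnTable.ColumnProperty(name="age", type="int"),
                        glue.CfnTable.ColumnProperty(name="m_time", type="string"),
                    ],
                    # The exact parameters for a Kinesis source are an assumption in this sketch.
                    parameters={"typeOfData": "kinesis",
                                "streamARN": "arn:aws:kinesis:region:account-id:stream/your-stream-name"})))
        table.node.add_dependency(database)
```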
143 |
144 | 5. Upload the **AWS SDK for Java 2.x** jar file to S3
145 |
187 | 7. Make sure the Glue job can access the Kinesis Data Streams table in the Glue Catalog database; otherwise, grant the Glue job the necessary permissions.
188 |
189 | We can grant these permissions by running the following command:
190 |
200 | 8. Create a table with partitioned data in Amazon Athena
201 |
202 | Go to [Athena](https://console.aws.amazon.com/athena/home) on the AWS Management console.
203 | * (step 1) Create a database
204 |
205 | In order to create a new database called `iceberg_demo_db`, enter the following statement in the Athena query editor
206 | and click the **Run** button to execute the query.
207 |
208 |
209 | CREATE DATABASE IF NOT EXISTS iceberg_demo_db
210 |
211 |
212 | * (step 2) Create a table
213 |
214 | Copy the following query into the Athena query editor, replace the `xxxxxxx` in the last line under `LOCATION` with your S3 bucket name, and execute the query to create a new table.
215 |
227 | If the query is successful, a table named `iceberg_demo_table` is created and displayed on the left panel under the **Tables** section.
228 |
229 | If you get an error, check if (a) you have updated the `LOCATION` to the correct S3 bucket name, (b) you have `iceberg_demo_db` selected under the **Database** dropdown, and (c) you have `AwsDataCatalog` selected as the **Data source**.
230 |
231 | :information_source: If you fail to create the table, give Athena users access permissions on `iceberg_demo_db` through [AWS Lake Formation](https://console.aws.amazon.com/lakeformation/home), or you can grant any principal using Athena access to `iceberg_demo_db` by running the following command:
232 |
294 | 11. Check streaming data in S3
295 |
296 | After `3~5` minutes, you can see that the streaming data has been delivered from **Kinesis Data Streams** to **S3**.
297 |
298 | 
299 | 
300 | 
301 | 
302 |
303 | 12. Run test query
304 |
305 | Enter the following SQL statement and execute the query.
306 |
307 | SELECT COUNT(*)
308 | FROM iceberg_demo_db.iceberg_demo_table;
309 |
310 |
311 | ## Clean Up
312 |
313 | 1. Stop the Glue job by replacing the job name in the command below.
314 |
315 |
323 |
324 | 2. Delete the CloudFormation stack by running the command below.
325 |
326 |
327 | (.venv) $ cdk destroy --all
328 |
329 |
330 | ## Useful commands
331 |
332 | * `cdk ls` list all stacks in the app
333 | * `cdk synth` emits the synthesized CloudFormation template
334 | * `cdk deploy` deploy this stack to your default AWS account/region
335 | * `cdk diff` compare deployed stack with current state
336 | * `cdk docs` open CDK documentation
337 |
338 | ## References
339 |
340 | * (1) [AWS Glue versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html): The AWS Glue version determines the versions of Apache Spark and Python that AWS Glue supports.
341 | * (2) [Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel \(2022-06-21\)](https://aws.amazon.com/ko/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/)
342 | * (3) [Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue (2022-09-26)](https://www.dremio.com/subsurface/streaming-data-into-apache-iceberg-tables-using-aws-kinesis-and-aws-glue/)
343 | * (4) [Amazon Athena Using Iceberg tables](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html)
344 | * (5) [Streaming ETL jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html)
345 | * (6) [AWS Glue job parameters](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html)
346 | * (7) [Crafting serverless streaming ETL jobs with AWS Glue](https://aws.amazon.com/ko/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/)
347 | * (8) [Apache Iceberg - Spark Writes with SQL (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-writes/)
348 | * (9) [Apache Iceberg - Spark Structured Streaming (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/)
349 | * (10) [Apache Iceberg - Writing against partitioned table (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#writing-against-partitioned-table)
350 | * Iceberg supports append and complete output modes:
351 | * `append`: appends the rows of every micro-batch to the table
352 | * `complete`: replaces the table contents every micro-batch
353 |
354 |      Iceberg requires the data to be sorted according to the partition spec per task (i.e., per Spark partition) prior to writing against a partitioned table.
355 | Otherwise, you might encounter the following error:
356 |
357 | pyspark.sql.utils.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrame/Datasets;
358 |
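      As a purely illustrative sketch (the source, table name, and checkpoint path are placeholders, not the project's actual job code), a streaming append to a partitioned Iceberg table can enable Iceberg's fanout writer instead of pre-sorting, since streaming DataFrames cannot be globally sorted:

      ```
      # Hypothetical streaming append to a partitioned Iceberg table.
      # The fanout writer relaxes the per-task sort requirement for partitioned writes.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("iceberg-partitioned-stream-sketch").getOrCreate()

      stream_df = (spark.readStream
                   .format("rate")      # placeholder source; the real job reads from Kinesis
                   .load())

      query = (stream_df.writeStream
               .format("iceberg")
               .outputMode("append")
               .option("fanout-enabled", "true")                               # avoids the sort requirement
               .option("checkpointLocation", "s3://your-bucket/checkpoints/")  # placeholder path
               .toTable("glue_catalog.iceberg_demo_db.iceberg_demo_table"))    # requires Spark 3.1+

      query.awaitTermination()
      ```
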
359 | * (11) [Apache Iceberg - Maintenance for streaming tables (v0.14.0)](https://iceberg.apache.org/docs/0.14.0/spark-structured-streaming/#maintenance-for-streaming-tables)
360 | * (12) [awsglue python package](https://github.com/awslabs/aws-glue-libs): The awsglue Python package contains the Python portion of the AWS Glue library. This library extends PySpark to support serverless ETL on AWS.
361 | * (13) [AWS Glue Notebook Samples](https://github.com/aws-samples/aws-glue-samples/tree/master/examples/notebooks) - sample iPython notebook files that show you how to use open data lake formats: Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue Interactive Sessions and AWS Glue Studio Notebook.
362 |
363 | ## Troubleshooting
364 |
365 | * Granting database or table permissions error using AWS CDK
366 | * Error message:
367 |
368 | AWS::LakeFormation::PrincipalPermissions | CfnPrincipalPermissions Resource handler returned message: "Resource does not exist or requester is not authorized to access requested permissions. (Service: LakeFormation, Status Code: 400, Request ID: f4d5e58b-29b6-4889-9666-7e38420c9035)" (RequestToken: 4a4bb1d6-b051-032f-dd12-5951d7b4d2a9, HandlerErrorCode: AccessDenied)
369 |