├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── Readme.md ├── architecture.drawio ├── architecture.png ├── awswrangler-2.5.0-py3-none-any.whl ├── glue-enrich-cur.py ├── requirements.txt └── setup-glue-enrichment.yml /.gitignore: -------------------------------------------------------------------------------- 1 | *venv* 2 | .env* 3 | .vscode 4 | glue-enrich-cur.zip 5 | test.py -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. 
As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | software and associated documentation files (the "Software"), to deal in the Software 5 | without restriction, including without limitation the rights to use, copy, modify, 6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Glue Cost and Usage Report Enrichment 2 | 3 | This package creates a Glue Python Shell Job that will enrich [Cost and Usage Report](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html) data by creating additional columns with [AWS Organizations Account tags](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_tagging.html). Tag column values are set by joining on the `line_item_usage_account_id` column. This makes it possible to filter/group CUR data by account-level tags. An example use case is to create an Organizations tag, BudgetCode, for each AWS account and then use this script to add the BudgetCode to every line item in the report so that chargeback reports can be generated. 4 | 5 | This script uses the [AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) package to perform the bulk of the ETL work, and also to create the table definition in Glue so that the enriched reports can be queried using Athena.
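To make the join concrete, here is a minimal sketch in plain pandas (illustrative only, not the repository script; the account IDs, costs, and the BudgetCode tag below are made-up example values) showing how account-level tags become additional columns on each CUR line item:

```python
# Minimal sketch of the enrichment join using plain pandas.
# The data below is illustrative only; the real job reads CUR Parquet files
# and Organizations account tags from S3.
import pandas as pd

# A few example CUR line items (only the columns needed to show the idea).
cur = pd.DataFrame({
    "line_item_usage_account_id": ["111111111111", "222222222222", "111111111111"],
    "line_item_unblended_cost": [12.50, 3.00, 0.75],
})

# Account-level tags, one row per account, using the account_tag_ column prefix
# that the script applies to Organizations tags.
account_tags = pd.DataFrame({
    "account_id": ["111111111111", "222222222222"],
    "account_tag_BudgetCode": ["BC-100", "BC-200"],
})

# Left join keeps every CUR line item; accounts without a tag get empty columns.
enriched = cur.merge(
    account_tags,
    how="left",
    left_on="line_item_usage_account_id",
    right_on="account_id",
).drop(columns=["account_id"])

print(enriched)
```

The actual Glue job performs the same left join at scale (in current versions it is pushed down to Athena), so line items from accounts that do not have a given tag simply end up with an empty value in that tag column.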
6 | 7 | ![Diagram](architecture.png) 8 | 9 | ## Functionality 10 | - The script uses a Glue Python Shell Job to process the CUR data, merge the account tags, and then rewrite the "enriched" data in Parquet to another S3 bucket and/or path. 11 | - The AWS account information, including account tags, is collected from the AWS Organizations API and written to S3. A table is defined in the Glue data catalog so that it can be queried. 12 | - The script can also create a table in Glue that can be used to perform queries in Athena. You could also use a Glue crawler for this purpose instead. 13 | - Optionally, it can also further partition the data by line_item_usage_account_id. 14 | - The Glue job can be configured to process all CUR files in S3, or it can perform "Incremental" processing, where it will only process CUR files from the last x months. It is recommended that you do a one-time full processing/enrichment of all CUR files and then schedule periodic "Incremental" runs. 15 | - The Python Shell script uses the [AWS Data Wrangler](https://aws-data-wrangler.readthedocs.io/en/latest/) package. This package extends the Pandas library, making it easy to work with Athena, Glue, S3, Redshift, CloudWatch Logs, Aurora, SageMaker, DynamoDB, and EMR. 16 | 17 | ### Technical Note 18 | The Python Shell script originally used Pandas DataFrames to perform the ETL operations. In later releases, this work has been offloaded to Athena CREATE TABLE AS SELECT statements with temporary Glue data catalog tables. One motivation for this approach was to work around Spark and pyarrow's limited support for Delta encoded columns, which CUR data written in Parquet format [occasionally contains](https://github.com/awslabs/aws-data-wrangler/issues/442). If you want to see the original script, you can find it in [this commit](https://github.com/aws-samples/glue-enrich-cost-and-usage/tree/333e8af620bea2dde41e5126fc558ba6eecc94bc). 19 | 20 | ## Prerequisites 21 | 1. *An AWS Account and AWS Organizations enabled:* In order to use Organizations tags with accounts, AWS Organizations must be enabled. **The resources used in this project must be deployed in the master account.** 22 | 2. *Cost & Usage Report*: A Cost & Usage Report must be defined in the Master account. The report must be written in Apache Parquet format to an S3 bucket. Instructions for creating a Cost & Usage Report are available in the AWS Documentation (https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html). *Please note:* Once you’ve set up a Cost & Usage Report, it can take up to 24 hours for the first report to be delivered to S3. You will not be able to use this solution until you have at least one report delivered. 23 | 24 | 25 | ## How it works 26 | 27 | The CloudFormation template used in the setup below creates the following resources: 28 | 29 | * **Glue Python Shell Jobs**- These shell jobs contain Python scripts that will perform the ETL operations, loading the data from the existing Cost & Usage Report files and then writing the processed dataset back to S3. One job will perform a “Full” ETL on all existing report data, while the other job will perform an “Incremental” ETL on data from the past 2 months. 30 | 31 | **Note**: Glue Jobs are billed per second (https://aws.amazon.com/glue/pricing/), with a 10-minute minimum. In most environments, the incremental job should take less than 10 minutes to complete and is estimated to cost less than $0.15 per day to run.
Customers that wish to process all historical Cost & Usage Data or that have dozens or hundreds of AWS accounts can monitor the job duration and logs for a more accurate estimate. 32 | 33 | * **Glue Database**- The database will be used to store metadata about our enriched Cost & Usage Report data. This is primarily so we can perform queries in Athena later. 34 | 35 | * **S3 Bucket**- This bucket will store Python shell scripts and also our enriched Cost & Usage Data. Glue Python Shell Jobs execute a Python script that is stored in S3. Additional Python libraries can also be loaded from .whl and .egg files located in S3. 36 | 37 | * **IAM Role for Glue Jobs**- Glue Jobs require an IAM role with permissions to access the resources they will interact with. 38 | 39 | The Python script used by Glue leverages the [AWS Data Wrangler](https://aws-data-wrangler.readthedocs.io/en/latest/what.html) library. This package extends the popular Pandas library to AWS services, making it easy to connect to, load, and save [Pandas](https://github.com/pandas-dev/pandas) dataframes with many AWS services, including S3, Glue, Redshift, EMR, Athena, and CloudWatch Logs Insights. In a few lines of code, the script performs the following steps: 40 | 41 | 1. Collects all AWS accounts and their tags via the Organizations API, writes them to S3 in Parquet, and registers an account_tags table in the Glue Database 42 | 2. Registers each month of Cost & Usage Report data as a temporary table in the Glue data catalog 43 | 3. Runs an Athena CREATE TABLE AS SELECT query that joins the line items with the account tags on the line_item_usage_account_id column, writing the enriched data to the S3 Bucket partitioned by year, month, and (optionally) line_item_usage_account_id for potentially [more efficient Athena queries](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/) 44 | 4. Creates a table in the Glue Database with the enriched Cost & Usage Report data and refreshes the table partitions 45 | 46 | When the Glue job completes, the enriched reports can be queried via Athena, as demonstrated below. 47 | 48 | ## Setup 49 | 50 | **Note:** This will NOT modify your existing CUR data files. Instead, it creates a copy of the data in a separate bucket. It is recommended that you retain the original CUR files for backup. 51 | 52 | 1. First, use the [AWS Organizations](https://console.aws.amazon.com/organizations/home?region=us-east-1#/accounts) console to add the tags you’d like to have included in your enriched Cost & Usage Reports. 53 | 54 | 2. Deploy the CloudFormation template, setup-glue-enrichment.yml. Specify the source bucket name and prefix (path) where the current CUR files are located. 55 | 56 | 3. After the CloudFormation stack is created, go to outputs and click the link to the S3 bucket that the stack created. 57 | 58 | 4. Upload awswrangler-2.5.0-py3-none-any.whl and glue-enrich-cur.py to that bucket. 59 | 60 | 5. **To add the tags to all historical CUR data**, go to the [AWS Glue console](https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=jobs) and run the `aws-cur-enrichment-full` job. If you would prefer to limit this, you can use the `aws-cur-enrichment-incremental` job instead. 61 | 62 | 6. (Optional) Schedule `aws-cur-enrichment-incremental` to run once a day. You could also make this event-driven by using S3 events so that it runs each time a new CUR file is delivered. This incremental job will update the account tags for the current month AND the previous month.
This makes the job run faster, process less data, and avoid altering account tags in previous months' reports, so that the behavior is similar to cost allocation tags (where tags are not retroactively modified in historical CUR data). 63 | 64 | 7. Go to the [Athena console](https://console.aws.amazon.com/athena/home?region=us-east-1#preview/cost_and_usage_enriched/cur_enriched) and run the following query: 65 | ```sql 66 | SELECT * FROM "cost_and_usage_enriched"."cur_enriched" limit 10; 67 | ``` 68 | 69 | Search for the column names that begin with `account_tag_` in the results and you will see your account tags (if they exist for the account in `line_item_usage_account_id`; otherwise the field for that record will be blank). 70 | 71 | This is an example query that will return results where the account has a `CostCenter` tag: 72 | ```sql 73 | SELECT * FROM "cost_and_usage_enriched"."cur_enriched" 74 | WHERE account_tag_costcenter <> '' 75 | limit 10; 76 | ``` 77 | ## Additional Features 78 | The Glue jobs are configured to pass arguments to the script that determine how it will process the CUR. This can be modified by customizing the arguments passed to the script. Additionally, the script can be used outside of a Glue job as a command line utility. In particular, the `--include_fields`, `--exclude_fields`, `--include_account_tags`, and `--exclude_account_tags` arguments could be useful for filtering columns or tags in the enriched report. Below are the options that the script accepts as input: 79 | 80 | ``` 81 | usage: glue-enrich-cur.py [-h] --s3_source_bucket S3_SOURCE_BUCKET --s3_source_prefix S3_SOURCE_PREFIX --s3_target_bucket S3_TARGET_BUCKET [--s3_target_prefix S3_TARGET_PREFIX] 82 | [--incremental_mode_months INCREMENTAL_MODE_MONTHS] [--create_table] [--overwrite_existing_table] [--partition_by_account] [--database_name DATABASE_NAME] 83 | [--table_name TABLE_NAME] [--exclude_fields [EXCLUDE_FIELDS ...]] [--include_fields [INCLUDE_FIELDS ...]] [--include_account_tags [INCLUDE_ACCOUNT_TAGS ...]] 84 | [--exclude_account_tags [EXCLUDE_ACCOUNT_TAGS ...]] [--extra-py-files EXTRA_PY_FILES] [--scriptLocation SCRIPTLOCATION] [--job-bookmark-option JOB_BOOKMARK_OPTION] 85 | [--job-language JOB_LANGUAGE] [--connection-names CONNECTION_NAMES] 86 | 87 | Enriches CUR data stored in S3 by adding organizational account tags as columns to line items. 88 | 89 | optional arguments: 90 | -h, --help show this help message and exit 91 | --s3_source_bucket S3_SOURCE_BUCKET 92 | The source bucket where the CUR data is located 93 | --s3_source_prefix S3_SOURCE_PREFIX 94 | The source prefix (path) where the CUR data is located. This should be the prefix immediately preceding the partition prefixes (i.e. path before /year=2020) 95 | --s3_target_bucket S3_TARGET_BUCKET 96 | The destination bucket to put enriched CUR data. 97 | --s3_target_prefix S3_TARGET_PREFIX 98 | The destination prefix (path) where enriched CUR data will be placed. Put this in a prefix that is OUTSIDE of the original CUR data prefix. 99 | --incremental_mode_months INCREMENTAL_MODE_MONTHS 100 | When set, last x months of CUR data is processed. When not set or set to 0, all data is processed. For incremental updates, recommend setting this to 2. 101 | --create_table When set, a table will be created in Glue automatically and made available through Athena. 102 | --overwrite_existing_table 103 | When set, the existing table will be overwritten. This should be set when performing updates.
104 | --partition_by_account 105 | When set, an additional partition for line_item_usage_account_id will be created. Example: /year=2020/month=1/line_item_usage_account_id=123456789012 106 | --database_name DATABASE_NAME 107 | The name of the Glue database to create the table in. Must be set when --create_table is set 108 | --table_name TABLE_NAME 109 | The name of the Glue table to create or overwrite. Must be set when --create_table is set 110 | --exclude_fields [EXCLUDE_FIELDS ...] 111 | Columns or fields in the CUR data to exclude from the enriched report. Useful for creating reports with fewer columns. This does not affect AWS account tags, use 112 | --exclude_account_tags for that. 113 | --include_fields [INCLUDE_FIELDS ...] 114 | Columns or fields in the CUR data to include. All other CUR fields will be dropped. This does not affect AWS account tags, use 115 | --include_account_tags for that. 116 | --include_account_tags [INCLUDE_ACCOUNT_TAGS ...] 117 | Organizations account tags to include in the enriched report. Useful when you have a lot of Organizations account tags and only want to include a filtered set. 118 | --exclude_account_tags [EXCLUDE_ACCOUNT_TAGS ...] 119 | Organizations account tags to exclude from the enriched report. Useful when you want to exclude specific tags. 120 | --extra-py-files EXTRA_PY_FILES 121 | NOT USED 122 | --scriptLocation SCRIPTLOCATION 123 | NOT USED 124 | --job-bookmark-option JOB_BOOKMARK_OPTION 125 | NOT USED 126 | --job-language JOB_LANGUAGE 127 | NOT USED 128 | --connection-names CONNECTION_NAMES 129 | NOT USED 130 | ``` 131 | 132 | ## Alternatives 133 | This approach demonstrates the power/utility of Glue Jobs and Python Shell Scripts for ETL, but it's not the only way to solve this problem. It's worth highlighting a few alternatives to using this script: 134 | 1. With the launch of [Cost Categories](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/create-cost-categories.html) it is now possible to achieve similar capabilities, as the cost categories and values are also included in CUR reports. This approach also has the added benefit of working in Cost Explorer. However, you may already have Organizations account tags and prefer using those to manage account metadata. 135 | 2. Instead of using a Python Shell job to join the account tags data with the CUR data, you could instead use a job to extract the account tags from the Organizations API and put them into a separate file in S3, then create an "account_tags" table in Glue. You could then use this table in Athena to perform a JOIN on "account_tags" and your CUR data table (assuming you've already created it separately). You would update the "account_tags" table on a schedule to get the latest tags. However, this would mean that any historical tags/values are lost and/or not reflected in prior-month CUR data. 136 | 137 | **Note**: The Python Shell job creates a table with this data in the Glue data catalog that could be reused. 138 | 3. It is also possible to use an external reporting/BI tool, such as QuickSight, to merge the account tags with CUR data. 139 | 140 | 141 | ## Possible future enhancements 142 | Below are other ideas for enhancing this script in the future. These may be implemented if time permits. 143 | - Create additional columns to show RI/SP utilization and effective charges to the linked account that used it so that it's easier to handle chargeback for shared RIs.
144 | - Create a separate script/utility that loads account details in to an Glue/Athena table (see #2 under Alternatives) -------------------------------------------------------------------------------- /architecture.drawio: -------------------------------------------------------------------------------- 1 | 7Vpdb+I4FP01SLsPRXE+4bGk7cyuuppq0Wi0T8gkJvE0xKlxCuyv3+vEARKbDt2hpYygqNjHzrVzfY+P7aTnhPPVJ46L9C8Wk6xnW/Gq59z0bBshy4cfiaxrZOC6NZBwGqtKW2BM/yUKtBRa0pgsWhUFY5mgRRuMWJ6TSLQwzDlbtqvNWNZutcAJ0YBxhDMd/UZjkW7uy9oWfCY0SVXTA08VzHFTWQGLFMdsuQM5tz0n5IyJOjVfhSSTzmv8Ul93t6d00zFOcnHIBV+T5O6fxZc1uo8fnmbOqCzipytl5Rlnpbrh629jAB7wmnD4vY4iVoL9+g7EunFLwWguKtd6I/hCy6HV86AklLm+7XWAbj5oA0jPSRttoJsP2gDqmked9lG3gzuAlmuZtzrtWzsdhK8zYqXIaE7CTRBaACYcxxQGJ2QZ44DlLAfvjVIxzyCHILlMqSDjAkfSq0sgEGAzlgtFA2Q3eeV4aRXCqJDp+SqRjOvj5cLtJ5yVRdXkH0AEY+kEkpMoY2U8wZmQhgRnj6TpXM924O9Ohs5oRrOs0+lnwgUFVlxnNJH2BZPNYZXLyKyyCHdC8+S+yt04luq9qYkYL1ISq1tSMQhNkNXe4EYbysBcQ9icCL6GKs0FQ8WydSe/3JLW9xWW7vDVbYiM1USRbGxvuQQJRadXUCvQSENimFpUlnGRsoTlOLvdoiMYqDze+GVb555Jf1cx850IsVYBgkvB2hFVtykbetmR0C9W8oi80H9bzbaYJ0S8UM8zDwwnGRb0ud2Pozu5kYlf3svDU3rZ1lTik0yBTKzBdzkkxmlt5k82PUwrjHph0gyjbuja0apWzeaGFrqgCQt0EOnVGgHQQRNmUrvu1chwNepcvV9r9s2zXQ2CsqHr3dzZO2U3lIMhWulGLtnQkQC4xrt2rJFnEo1Z9enO6I1c3OMpyR7YgirzUyYEm/9QTyLoFSw/Wpz7kS7iRVG7Y0ZXsh9moeSkZmQtkyPIGgWzivNj6JJn2S1dcnxf0yV3YJClwRtR2dEXfCIlOb6w9sLaM2ctrgP5KLwNHOtj8dY/83WOe+A6xz7lOsfVJscvnCYUXAZo+PVv+H+DBcyVfrWDmsIO2U9k6rcHzJ9KIn7XRumneK82Xx1a20GAkK/RWlX+cIyuhITw22dS6wnax/JpGT0SMVlSkU7Y9DtYWbyNCMOmTyez30feO/J5cOZ89g7ks3NKPnv79i1mFoeAZix5Dw7vkeaz57BcOU9i8OMkUs48CoP9Q+TY7TvoHRmMkBYo50Xh4TlQeKhR+DbnNJL0uEjyWUtyf9hWZTc4vSqjcz+0RYeeJ6KTsrrp5g6tQ/loYsb4HFeEuBxHvOlxxMWdH+p0585x3cErT3dGIXI8XXN+1dOdqD0/HEWFbH/QlqBGbFrLSoP+uG81L577MQ869JwHnfSgB+knPQsiyuJK7l+uSLXCnMPN99fgpHdYP+7h8tmvHwWZFzCeR3qYYgedXaBlWDGanvG/2XJRP2DQosXwMoU2YXvXfjjwd0ca7R3I7uB0Am9j6gj+dlHH36ZzM8vgcLvZrh/f4/rzq7HA0eOFpP+fpIvKgUeJGGdg94PBcPvxO/rq6gEU9Jt2Wpz1Xx1BkN2+TVeV7byT6Nz+Bw== -------------------------------------------------------------------------------- /architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/glue-enrich-cost-and-usage/864f63d16f80830bf79ba140fd3f861806012110/architecture.png -------------------------------------------------------------------------------- /awswrangler-2.5.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/glue-enrich-cost-and-usage/864f63d16f80830bf79ba140fd3f861806012110/awswrangler-2.5.0-py3-none-any.whl -------------------------------------------------------------------------------- /glue-enrich-cur.py: -------------------------------------------------------------------------------- 1 | """ /* 2 | * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | * SPDX-License-Identifier: MIT-0 4 | * 5 | * Permission is hereby granted, free of charge, to any person obtaining a copy of this 6 | * software and associated documentation files (the "Software"), to deal in the Software 7 | * without restriction, including without limitation the rights to use, copy, modify, 8 | * merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 9 | * permit persons to whom the Software is furnished to do so. 
10 | * 11 | * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 12 | * INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 13 | * PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 14 | * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 15 | * OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 16 | * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | */ """ 18 | 19 | import argparse 20 | import logging 21 | import os 22 | import re 23 | import sys 24 | 25 | import awswrangler as wr 26 | import boto3 27 | import pandas as pd 28 | 29 | logging.basicConfig(stream=sys.stdout, level=logging.INFO) 30 | 31 | log = logging.getLogger(__name__) 32 | log.setLevel(logging.INFO) 33 | handler = logging.StreamHandler(sys.stdout) 34 | handler.setLevel(logging.DEBUG) 35 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 36 | handler.setFormatter(formatter) 37 | log.addHandler(handler) 38 | 39 | orgs_client = boto3.client('organizations') 40 | 41 | # Fetches all accounts from the organizations API 42 | def get_accounts (): 43 | accounts={ 'Accounts':[] } 44 | paginator = orgs_client.get_paginator('list_accounts') 45 | page_iterator = paginator.paginate() 46 | for page in page_iterator: 47 | accounts['Accounts'] = accounts['Accounts'] + page['Accounts'] 48 | return accounts 49 | 50 | # Fetches all tags for an AWS Account ID 51 | def get_tags_for_account (account): 52 | tags={ 'Tags':[] } 53 | paginator = orgs_client.get_paginator('list_tags_for_resource') 54 | page_iterator = paginator.paginate( 55 | ResourceId=account 56 | ) 57 | for page in page_iterator: 58 | tags['Tags'] = tags['Tags'] + page['Tags'] 59 | return tags 60 | 61 | 62 | # OPTIONAL ARGUMENT PARSER (getResolvedOptions doesn't support optional arguments at the moment) 63 | arg_parser = argparse.ArgumentParser( 64 | description="Enriches CUR data stored in S3 by adding organizational account tags as columns to line items.", 65 | epilog="NOTES: The script expects the source CUR data to be stored in parquet. The Cost and Usage Report configured to generate this data should be set to output for ATHENA. Organizational account tags are merged based on line_item_usage_account_id. Because this queries the Organizations API, it must be run from the Master Account." 66 | ) 67 | arg_parser.add_argument('--s3_source_bucket', required=True, help="The source bucket where the CUR data is located") 68 | arg_parser.add_argument('--s3_source_prefix', required=True, help="The source prefix (path) where the CUR data is located. This should be the prefix immediately preceding the partition prefixes (i.e. path before /year=2020)") 69 | arg_parser.add_argument('--s3_target_bucket', required=True, help="The destination bucket to put enriched CUR data.") 70 | arg_parser.add_argument('--s3_target_prefix', required=False, default="", help="The destination prefix (path) where enriched CUR data will be placed. Put this in a prefix that is OUTSIDE of the original CUR data prefix.") 71 | arg_parser.add_argument('--incremental_mode_months', type=int, required=False, default=0, help="When set, last x months of CUR data is processed. When not set or set to 0, all data is processed.
For incremental updates, recommend setting this to 2.") 72 | arg_parser.add_argument('--create_table', action='store_true', required=False, default=False, help="When set, a table will be created in Glue automatically and made available through Athena.") 73 | arg_parser.add_argument('--overwrite_existing_table', action='store_true', required=False, default=False, help="When set, the existing table will be overwritten. This should be set when performing updates.") 74 | arg_parser.add_argument('--partition_by_account', action='store_true', required=False, default=False, help="When set, an additional partition for line_item_usage_account_id will be created. Example: /year=2020/month=1/line_item_usage_account_id=123456789012") 75 | arg_parser.add_argument('--database_name', type=str, required=False, default=None, help="The name of the Glue database to create the table in. Must be set when --create_table is set") 76 | arg_parser.add_argument('--table_name', type=str, required=False, default=None, help="The name of the Glue table to create or overwrite. Must be set when --create_table is set") 77 | arg_parser.add_argument('--exclude_fields', type=str, nargs="*", required=False, default=[], help="Columns or fields in the CUR data to exclude from the enriched report. Useful for creating reports with fewer columns. This does not affect AWS account tags, use --exclude_account_tags for that.") 78 | arg_parser.add_argument('--include_fields', type=str, nargs="*", required=False, default=[], help="Columns or fields in the CUR data to include. All other CUR fields will be dropped. This does not affect AWS account tags, use --include_account_tags for that.") 79 | arg_parser.add_argument('--include_account_tags', type=str, nargs="*", required=False, default=[], help="Organizations account tags to include in the enriched report. Useful when you have a lot of Organizations account tags and only want to include a filtered set.") 80 | arg_parser.add_argument('--exclude_account_tags', type=str, nargs="*", required=False, default=[], help="Organizations account tags to exclude from the enriched report.
Useful when you want to exclude specific tags.") 81 | 82 | # Not used, but included because Glue passes these arguments in 83 | arg_parser.add_argument('--extra-py-files', type=str, required=False, default=None, help="NOT USED") 84 | arg_parser.add_argument('--scriptLocation', type=str, required=False, default=None, help="NOT USED") 85 | arg_parser.add_argument('--job-bookmark-option', type=str, required=False, default=None, help="NOT USED") 86 | arg_parser.add_argument('--job-language', type=str, required=False, default=None, help="NOT USED") 87 | arg_parser.add_argument('--connection-names', type=str, required=False, default=None, help="NOT USED") 88 | 89 | log.info(vars(arg_parser.parse_args() )) 90 | args = vars(arg_parser.parse_args()) 91 | 92 | 93 | S3_SOURCE_BUCKET = args["s3_source_bucket"] 94 | S3_SOURCE_PREFIX = args["s3_source_prefix"] 95 | S3_TARGET_BUCKET = args["s3_target_bucket"] 96 | S3_TARGET_PREFIX = args["s3_target_prefix"] 97 | INCREMENTAL_MODE_MONTHS = args["incremental_mode_months"] 98 | PARTITION_BY_ACCOUNT = args['partition_by_account'] 99 | CREATE_TABLE = args["create_table"] 100 | DATABASE_NAME = args["database_name"] 101 | TABLE_NAME = args["table_name"] 102 | OVERWRITE_EXISTING_TABLE = args["overwrite_existing_table"] 103 | INCLUDE_FIELDS = [item.lower().strip() for item in args["include_fields"]] 104 | EXCLUDE_FIELDS = [item.lower().strip() for item in args["exclude_fields"]] 105 | INCLUDE_ACCOUNT_TAGS = [item.lower().strip() for item in args["include_account_tags"]] 106 | EXCLUDE_ACCOUNT_TAGS = [item.lower().strip() for item in args["exclude_account_tags"]] 107 | 108 | if TABLE_NAME == None or DATABASE_NAME == None: raise Exception('Must specify Glue Database and Table when "--create_table" is set') 109 | if wr.catalog.does_table_exist(DATABASE_NAME,TABLE_NAME) and not OVERWRITE_EXISTING_TABLE: raise Exception('The table {}.{} already exists but OVERWRITE_EXISTING_TABLE isn''t set.'.format(DATABASE_NAME, TABLE_NAME)) 110 | 111 | 112 | # First we retrieve the accounts through the organizations API 113 | log.info("Fetching accounts and organizations account tags") 114 | accounts = get_accounts() 115 | account_tags_list = [] 116 | for account in accounts.get('Accounts', {}): 117 | account_dict = {} 118 | account_dict["account_id"] = account["Id"] 119 | account_dict["name"] = account["Name"] 120 | account_dict["email"] = account["Email"] 121 | account_dict["arn"] = account["Arn"] 122 | tags = get_tags_for_account(account["Id"]) 123 | for tag in tags['Tags']: 124 | account_dict['account_tag_'+tag["Key"]] = tag["Value"] 125 | account_tags_list.append(account_dict) 126 | 127 | log.info("Creating account tags DataFrame") 128 | # Create DataFrame with accounts and associated tags 129 | account_tags_df = pd.DataFrame(account_tags_list).convert_dtypes() 130 | account_tags_df.info() 131 | 132 | # Write account_tags_def to S3 in Parquet format 133 | s3_written_objs = wr.s3.to_parquet( 134 | df = account_tags_df, 135 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+"account_tags", 136 | mode = "overwrite", 137 | compression="snappy", 138 | index=False, 139 | dataset = True 140 | ) 141 | 142 | # Create a table in Glue data catalog for account tags 143 | wr.s3.store_parquet_metadata( 144 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+"account_tags", 145 | database=DATABASE_NAME, 146 | table="account_tags", 147 | dataset=True, 148 | mode="overwrite" 149 | ) 150 | 151 | # Fetch catalog 
column names, drop account_id 152 | account_tags_columns = list(wr.catalog.get_table_types( 153 | database=DATABASE_NAME, 154 | table='account_tags' 155 | ).keys()) 156 | account_tags_columns.remove('account_id') 157 | 158 | # Get a list of objects 159 | log.info("Listing objects in path: {}".format("s3://"+S3_SOURCE_BUCKET+"/"+S3_SOURCE_PREFIX)) 160 | 161 | s3_objects = wr.s3.list_objects("s3://"+S3_SOURCE_BUCKET+"/"+S3_SOURCE_PREFIX) 162 | log.info("Objects found: {}".format(s3_objects)) 163 | 164 | s3_prefixes = set() 165 | for object in s3_objects: 166 | s3_prefixes.add(object[:object.rfind('/')]) 167 | s3_prefixes = list(s3_prefixes) 168 | 169 | # Sort objects by month, adding leading zeros to the months to sort correctly 170 | s3_objects.sort(key=lambda x: re.sub(r'(month=)(\d+)', lambda m : m.group(1)+m.group(2).zfill(2),x)) 171 | s3_prefixes.sort(key=lambda x: re.sub(r'(month=)(\d+)', lambda m : m.group(1)+m.group(2).zfill(2),x)) 172 | log.info("Prefixes found: {}".format(s3_prefixes)) 173 | 174 | # If INCREMENTAL_MODE_MONTHS is greater than zero, only use the last INCREMENTAL_MODE_MONTHS items in the list. This assumes everything is partitioned by '/year=xxxx/month=xx/' 175 | if INCREMENTAL_MODE_MONTHS > 0: 176 | s3_prefixes = s3_prefixes[-INCREMENTAL_MODE_MONTHS:] 177 | log.info("Incremental mode set. Filtered objects: {}".format(s3_prefixes)) 178 | 179 | # Rather than trying to do a join on all the data at once, we're breaking it up by partition (prefix). his also allows us to perform incremental updates, leaving prior months 180 | # CUR data unaltered for historical purposes if desired. It can also easier to troubleshoot with subsets of data. 181 | for s3_prefix in s3_prefixes: 182 | # Make sure garbage collection happens 183 | #del df 184 | #gc.collect() 185 | log.info("=============") 186 | log.info('READING METADATA: {}'.format(s3_prefix+"/")) 187 | cur_column_types, partitions = wr.s3.read_parquet_metadata( 188 | path=s3_prefix+"/", 189 | dataset=True 190 | ) 191 | log.info('CLEANING UP TEMP TABLES IF THEY EXIST') 192 | wr.catalog.delete_table_if_exists( 193 | database=DATABASE_NAME, 194 | table='temp_cur_enriched' 195 | ) 196 | wr.catalog.delete_table_if_exists( 197 | database=DATABASE_NAME, 198 | table='temp_cur_original' 199 | ) 200 | 201 | # Create a temporary table in the data catalog from the source CUR data. 202 | log.info('CREATING TEMP TABLE FOR: {}'.format(s3_prefix+"/")) 203 | wr.catalog.create_parquet_table( 204 | path = s3_prefix+"/", 205 | database = DATABASE_NAME, 206 | table = 'temp_cur_original', 207 | columns_types= cur_column_types, 208 | #partitions_types=partitions, 209 | mode = 'overwrite' 210 | ) 211 | 212 | # Fetch catalog column names 213 | temp_cur_original_columns = wr.catalog.get_table_types( 214 | database=DATABASE_NAME, 215 | table='temp_cur_original' 216 | ) 217 | log.info('CLEANING UP {}'.format("s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3])+"/") 218 | wr.s3.delete_objects( 219 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3]+"/*" 220 | ) 221 | 222 | 223 | 224 | # Note: Using a SELECT * statement won't work if we want to partition by line_item_usage_account_id because it has to be the last column in the list. 225 | # The lines below build a list of column names to use in the CREATE TABLE AS SELECT (CTAS) statement. The 'account_id' column from account_tags is 226 | # also dropped since it's redundant. 
227 | log.info('Merging account tags') 228 | temp_cur_original_columns = list(temp_cur_original_columns.keys()) 229 | 230 | temp_cur_original_columns = [item for item in temp_cur_original_columns if item.lower().strip() not in EXCLUDE_FIELDS] 231 | account_tags_columns = [item for item in account_tags_columns if item.lower().strip().replace('account_tag_', '') not in EXCLUDE_ACCOUNT_TAGS] 232 | 233 | if len(INCLUDE_FIELDS) > 0: temp_cur_original_columns = [item for item in temp_cur_original_columns if item.lower().strip() in INCLUDE_FIELDS] 234 | if len(INCLUDE_ACCOUNT_TAGS) > 0: account_tags_columns = [item for item in account_tags_columns if item.lower().strip().replace('account_tag_', '') in INCLUDE_ACCOUNT_TAGS] 235 | 236 | if PARTITION_BY_ACCOUNT: 237 | temp_cur_original_columns = [item for item in temp_cur_original_columns if item not in ['line_item_usage_account_id']] 238 | column_string = "\""+"\",\"".join(temp_cur_original_columns)+"\","+"\""+"\",\"".join(filter(None,account_tags_columns))+"\""+',"line_item_usage_account_id"' 239 | else: 240 | column_string = column_string = "\""+"\",\"".join(temp_cur_original_columns)+"\","+"\""+"\",\"".join(account_tags_columns)+"\"" 241 | 242 | column_string = column_string.replace(",\"\"", "").replace("\"\",", "") 243 | log.debug("Final list of columns: {}".format(column_string)) 244 | 245 | # Rather than loading the CUR data into a dataframe and merging, which can be memory and CPU intensive, the command below uses an Athena CTAS statement to 246 | # create a new table with the joined data. The data is written to S3 in Parquet format. 247 | # Note: Another reason for using Athena is that CUR Parquet files use the Parquet V2 format and occasionally uses DELTA encoded columns. These columns 248 | # are not currently readable with Pandas - and more specifically the pyarrow package. 249 | # https://github.com/awslabs/aws-data-wrangler/issues/442 250 | # 251 | log.info('Writing output to {}'.format("s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3])) 252 | wr.athena.read_sql_query( 253 | database=DATABASE_NAME, 254 | ctas_approach=False, 255 | sql=""" 256 | CREATE TABLE temp_cur_enriched WITH 257 | ( 258 | format='PARQUET', 259 | parquet_compression='SNAPPY', 260 | external_location = ':s3_target;' 261 | :partition_clause; 262 | ) 263 | AS 264 | SELECT :columns; FROM temp_cur_original LEFT JOIN account_tags ON temp_cur_original.line_item_usage_account_id = account_tags.account_id 265 | """, 266 | params={ 267 | "s3_target": "s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3], 268 | "partition_clause" : ",partitioned_by=array['line_item_usage_account_id']" if PARTITION_BY_ACCOUNT else "", 269 | "columns": column_string 270 | } 271 | ) 272 | 273 | # Once the enriched CUR data is written to S3 by Athena, we can remove the table definition the CTAS statement created in the catalog. 
274 | log.info('CLEANING UP TEMP TABLES IF THEY EXIST') 275 | wr.catalog.delete_table_if_exists( 276 | database=DATABASE_NAME, 277 | table='temp_cur_enriched' 278 | ) 279 | # Also remote the table containing the original CUR data 280 | wr.catalog.delete_table_if_exists( 281 | database=DATABASE_NAME, 282 | table='temp_cur_original' 283 | ) 284 | log.info("=============") 285 | 286 | 287 | if CREATE_TABLE: 288 | if OVERWRITE_EXISTING_TABLE: 289 | log.info("Deleting {}.{} if it exists".format(DATABASE_NAME, TABLE_NAME)) 290 | wr.catalog.delete_table_if_exists( 291 | database=DATABASE_NAME, 292 | table=TABLE_NAME 293 | ) 294 | 295 | 296 | log.info("Extracting parquet metadata from {}".format( "s3://"+S3_TARGET_BUCKET+"/"+s3_prefix.split('/', 3)[3]+"/")) 297 | cur_column_types, partitions = wr.s3.read_parquet_metadata( 298 | path="s3://"+S3_TARGET_BUCKET+"/"+s3_prefix.split('/', 3)[3]+"/", 299 | dataset=True, 300 | sampling=1 301 | ) 302 | 303 | log.info("Creating Table {}.{} with path {}".format(DATABASE_NAME, TABLE_NAME, "s3://"+S3_TARGET_BUCKET+"/"+(S3_SOURCE_PREFIX+"/" if S3_SOURCE_PREFIX != "" else ""))) 304 | wr.catalog.create_parquet_table( 305 | path = "s3://"+S3_TARGET_BUCKET+"/"+(S3_SOURCE_PREFIX+"/" if S3_SOURCE_PREFIX != "" else ""), 306 | database = DATABASE_NAME, 307 | table = TABLE_NAME, 308 | columns_types= cur_column_types, 309 | partitions_types=partitions 310 | ) 311 | 312 | 313 | log.info("Updated Table Partitions {}.{}".format(DATABASE_NAME, TABLE_NAME)) 314 | wr.athena.repair_table( 315 | database=DATABASE_NAME, 316 | table=TABLE_NAME 317 | ) 318 | 319 | 320 | log.info ("Finished") 321 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | awswrangler==2.0.5 2 | -------------------------------------------------------------------------------- /setup-glue-enrichment.yml: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | # 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 5 | # software and associated documentation files (the "Software"), to deal in the Software 6 | # without restriction, including without limitation the rights to use, copy, modify, 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so. 9 | # 10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 16 | # 17 | AWSTemplateFormatVersion: "2010-09-09" 18 | Description: Creates Glue Jobs, a Glue Catalog, an S3 Bucket, and an IAM role for the glue-cur-enricher example. 
19 | Parameters: 20 | S3SourceBucket: 21 | Type: String 22 | Description: "Name of the S3 bucket containing the existing Cost & Usage Reports" 23 | AllowedPattern: "^[0-9a-zA-Z]+([0-9a-zA-Z-]*[0-9a-zA-Z])*$" 24 | ConstraintDescription: "Bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-). It cannot start or end with a hyphen (-)." 25 | 26 | S3SourcePrefix: 27 | Type: String 28 | Description: "Prefix (path) within S3 Bucket where the existing Cost & Usage Reports can be found. Example: cur/cur_report" 29 | Default: "" 30 | Metadata: 31 | 'AWS::CloudFormation::Interface': 32 | ParameterGroups: 33 | - Label: 34 | default: "Cost & Usage Report Parameters" 35 | Parameters: 36 | - S3SourceBucket 37 | - S3SourcePrefix 38 | ParameterLabels: 39 | S3SourceBucket: 40 | default: Cost and Usage Report Source Bucket 41 | S3SourcePrefix: 42 | default: Cost and Usage Report S3 Prefix 43 | Resources: 44 | EnrichedCURBucket: 45 | Type: AWS::S3::Bucket 46 | 47 | EnrichedCURRole: 48 | Type: 'AWS::IAM::Role' 49 | Properties: 50 | AssumeRolePolicyDocument: 51 | Version: '2012-10-17' 52 | Statement: 53 | - Effect: Allow 54 | Principal: 55 | Service: 56 | - glue.amazonaws.com 57 | Action: 58 | - 'sts:AssumeRole' 59 | Path: / 60 | ManagedPolicyArns: 61 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 62 | - arn:aws:iam::aws:policy/AWSOrganizationsReadOnlyAccess 63 | Policies: 64 | - PolicyName: AccessAthena 65 | PolicyDocument: 66 | Version: 2012-10-17 67 | Statement: 68 | - Effect: Allow 69 | Action: 70 | - s3:GetBucketLocation 71 | - s3:GetObject 72 | - s3:ListBucket 73 | - s3:ListBucketMultipartUploads 74 | - s3:ListMultipartUploadParts 75 | - s3:AbortMultipartUpload 76 | - s3:CreateBucket 77 | - s3:PutObject 78 | Resource: 'arn:aws:s3:::aws-athena-query-results-*' 79 | - Effect: Allow 80 | Action: 81 | - athena:StartQueryExecution 82 | - athena:GetQueryExecution 83 | - athena:GetQueryResults 84 | Resource: !Sub 'arn:aws:athena:*:${AWS::AccountId}:workgroup/*' 85 | - Effect: Allow 86 | Action: athena:GetQueryExecutions 87 | Resource: '*' 88 | - PolicyName: EnrichedCURS3SourceReadOnly 89 | PolicyDocument: 90 | Version: 2012-10-17 91 | Statement: 92 | - Effect: Allow 93 | Action: 94 | - s3:GetObject 95 | - s3:ListBucket 96 | Resource: 97 | - !Sub 'arn:aws:s3:::${S3SourceBucket}/${S3SourcePrefix}/*' 98 | - PolicyName: EnrichedCURS3TargetReadWrite 99 | PolicyDocument: 100 | Version: 2012-10-17 101 | Statement: 102 | - Effect: Allow 103 | Action: 104 | - s3:GetObject 105 | - s3:ListBucket 106 | - s3:PutObject 107 | - s3:DeleteObject 108 | Resource: 109 | - !Sub 'arn:aws:s3:::${EnrichedCURBucket}/*' 110 | - PolicyName: GlueListAllBuckets 111 | PolicyDocument: 112 | Version: 2012-10-17 113 | Statement: 114 | - Effect: Allow 115 | Action: 116 | - s3:ListAllMyBuckets 117 | - s3:headBucket 118 | Resource: '*' 119 | EnrichedCURGlueDB: 120 | Type: AWS::Glue::Database 121 | Properties: 122 | CatalogId: !Ref AWS::AccountId 123 | DatabaseInput: 124 | Name: 'cost_and_usage_enriched' 125 | Description: Contains enriched CUR data 126 | 127 | EnrichedCURFull: 128 | Type: AWS::Glue::Job 129 | Properties: 130 | Name: aws-cur-enrichment-full 131 | Command: 132 | Name: pythonshell 133 | PythonVersion: "3" 134 | ScriptLocation: !Sub 's3://${EnrichedCURBucket}/glue-enrich-cur.py' 135 | DefaultArguments: 136 | "--s3_source_bucket": !Ref S3SourceBucket 137 | "--s3_target_bucket": !Ref EnrichedCURBucket 138 | "--s3_source_prefix": !Ref S3SourcePrefix 139 | "--create_table": "" 140 | 
"--database_name": !Ref EnrichedCURGlueDB 141 | "--table_name": "cur_enriched" 142 | "--overwrite_existing_table": "" 143 | "--partition_by_account": "" 144 | "--extra-py-files": !Sub 's3://${EnrichedCURBucket}/awswrangler-2.5.0-py3-none-any.whl' 145 | GlueVersion: "1.0" 146 | ExecutionProperty: 147 | MaxConcurrentRuns: 1 148 | MaxCapacity: 1 149 | MaxRetries: 0 150 | Role: !Ref EnrichedCURRole 151 | EnrichedCURIncremental: 152 | Type: AWS::Glue::Job 153 | Properties: 154 | Name: aws-cur-enrichment-incremental 155 | Command: 156 | Name: pythonshell 157 | PythonVersion: "3" 158 | ScriptLocation: !Sub 's3://${EnrichedCURBucket}/glue-enrich-cur.py' 159 | DefaultArguments: 160 | "--s3_source_bucket": !Ref S3SourceBucket 161 | "--s3_target_bucket": !Ref EnrichedCURBucket 162 | "--s3_source_prefix": !Ref S3SourcePrefix 163 | "--create_table": "" 164 | "--database_name": !Ref EnrichedCURGlueDB 165 | "--table_name": "cur_enriched" 166 | "--overwrite_existing_table": "" 167 | "--partition_by_account": "" 168 | "--incremental_mode_months": "2" 169 | "--extra-py-files": !Sub 's3://${EnrichedCURBucket}/awswrangler-2.5.0-py3-none-any.whl' 170 | GlueVersion: "1.0" 171 | ExecutionProperty: 172 | MaxConcurrentRuns: 1 173 | MaxCapacity: 1 174 | MaxRetries: 0 175 | Role: !Ref EnrichedCURRole 176 | Outputs: 177 | EnrichedCURBucketUrl: 178 | Description: Use this link to go to the AWS S3 Console and upload your the .py script and .whl file. The enriched CUR files will also be written to this bucket. 179 | Value: !Join ['', ['https://s3.console.aws.amazon.com/s3/buckets/', !Ref EnrichedCURBucket ]] --------------------------------------------------------------------------------