├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── Readme.md ├── architecture.drawio ├── architecture.png ├── awswrangler-2.5.0-py3-none-any.whl ├── glue-enrich-cur.py ├── requirements.txt └── setup-glue-enrichment.yml /.gitignore: -------------------------------------------------------------------------------- 1 | *venv* 2 | .env* 3 | .vscode 4 | glue-enrich-cur.zip 5 | test.py -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. 
As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this 4 | software and associated documentation files (the "Software"), to deal in the Software 5 | without restriction, including without limitation the rights to use, copy, modify, 6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 7 | permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Glue Cost and Usage Report Enrichment 2 | 3 | This package creates a Glue Python Shell Job that will enrich [Cost and Usage Report](https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html) data by creating additional columns with [AWS Organizations Account tags](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_tagging.html). Tag column values are set by joining on the `line_item_usage_account_id` column. This makes it possible to filter/group CUR data by account-level tags. An example use case is to create an Organizations tag, BudgetCode, for each AWS account and then use this script to add the BudgetCode to every line item in the report so that chargeback reports can be generated. 4 | 5 | This script uses the [AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) package to perform the bulk of the ETL work, and also to create the table definition in Glue so that the enriched reports can be queried using Athena.
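To make the join concrete, here is a minimal sketch in plain pandas (illustrative only, not the repository script; the account IDs, costs, and the BudgetCode tag below are made-up example values) showing how account-level tags become additional columns on each CUR line item:

```python
# Minimal sketch of the enrichment join using plain pandas.
# The data below is illustrative only; the real job reads CUR Parquet files
# and Organizations account tags from S3.
import pandas as pd

# A few example CUR line items (only the columns needed to show the idea).
cur = pd.DataFrame({
    "line_item_usage_account_id": ["111111111111", "222222222222", "111111111111"],
    "line_item_unblended_cost": [12.50, 3.00, 0.75],
})

# Account-level tags, one row per account, using the account_tag_ column prefix
# that the script applies to Organizations tags.
account_tags = pd.DataFrame({
    "account_id": ["111111111111", "222222222222"],
    "account_tag_BudgetCode": ["BC-100", "BC-200"],
})

# Left join keeps every CUR line item; accounts without a tag get empty columns.
enriched = cur.merge(
    account_tags,
    how="left",
    left_on="line_item_usage_account_id",
    right_on="account_id",
).drop(columns=["account_id"])

print(enriched)
```

The actual Glue job performs the same left join at scale (in current versions it is pushed down to Athena), so line items from accounts that do not have a given tag simply end up with an empty value in that tag column.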
6 | 7 | ![Diagram](architecture.png) 8 | 9 | ## Functionality 10 | - The script uses a Glue Python Shell Job to process the CUR data, merge the account tags, and then rewrite the "enriched" data in Parquet to another S3 bucket and/or path. 11 | - The AWS account information, including account tags, is collected from the AWS Organizations API and written to S3. A table is defined in the Glue data catalog so that it can be queried. 12 | - The script can also create a table in Glue that can be used to perform queries in Athena. You could also use a Glue crawler for this purpose instead. 13 | - Optionally, it can also further partition the data by line_item_usage_account_id. 14 | - The Glue job can be configured to process all CUR files in S3, or it can perform "Incremental" processing, where it will only process CUR files from the last x months. It is recommended that you do a one-time full processing/enrichment of all CUR files and then schedule periodic "Incremental" runs. 15 | - The Python Shell script uses the [AWS Data Wrangler](https://aws-data-wrangler.readthedocs.io/en/latest/) package. This package extends the Pandas library, making it easy to work with Athena, Glue, S3, Redshift, CloudWatch Logs, Aurora, SageMaker, DynamoDB, and EMR. 16 | 17 | ### Technical Note 18 | The Python Shell script originally used Pandas DataFrames to perform the ETL operations. In later releases, this work has been offloaded to Athena CREATE TABLE AS SELECT statements with temporary Glue data catalog tables. One motivation for this approach was to work around Spark and pyarrow's limited support for Delta encoded columns, which CUR data written in Parquet format [occasionally contains](https://github.com/awslabs/aws-data-wrangler/issues/442). If you want to see the original script, you can find it in [this commit](https://github.com/aws-samples/glue-enrich-cost-and-usage/tree/333e8af620bea2dde41e5126fc558ba6eecc94bc). 19 | 20 | ## Prerequisites 21 | 1. *An AWS Account and AWS Organizations enabled:* In order to use Organizations tags with accounts, AWS Organizations must be enabled. **The resources used in this project must be deployed in the master account.** 22 | 2. *Cost & Usage Report*: A Cost & Usage Report must be defined in the Master account. The report must be written in Apache Parquet format to an S3 bucket. Instructions for creating a Cost & Usage Report are available in the AWS Documentation (https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html). *Please note:* Once you’ve set up a Cost & Usage Report, it can take up to 24 hours for the first report to be delivered to S3. You will not be able to use this solution until you have at least one report delivered. 23 | 24 | 25 | ## How it works 26 | 27 | The CloudFormation template used in the setup below creates the following resources: 28 | 29 | * **Glue Python Shell Jobs**- These shell jobs contain Python scripts that will perform the ETL operations, loading the data from the existing Cost & Usage Report files and then writing the processed dataset back to S3. One job will perform a “Full” ETL on all existing report data, while the other job will perform an “Incremental” ETL on data from the past 2 months. 30 | 31 | **Note**: Glue Jobs are billed per second (https://aws.amazon.com/glue/pricing/), with a 10-minute minimum. In most environments, the incremental job should take less than 10 minutes to complete and is estimated to cost less than $0.15 per day to run.
Customers that wish to process all historical Cost & Usage Data or that have dozens or hundreds of AWS accounts can monitor the job duration and logs for a more accurate estimate. 32 | 33 | * **Glue Database**- The database will be used to store metadata about our enriched Cost & Usage Report data. This is primarily so we can perform queries in Athena later. 34 | 35 | * **S3 Bucket**- This bucket will store Python shell scripts and also our enriched Cost & Usage Data. Glue Python Shell Jobs execute a Python script that is stored in S3. Additional Python libraries can also be loaded from .whl and .egg files located in S3. 36 | 37 | * **IAM Role for Glue Jobs**- Glue Jobs require an IAM role with permissions to access the resources they will interact with. 38 | 39 | The Python script used by Glue leverages the [AWS Data Wrangler](https://aws-data-wrangler.readthedocs.io/en/latest/what.html) library. This package extends the popular Pandas library to AWS services, making it easy to connect to, load, and save [Pandas](https://github.com/pandas-dev/pandas) dataframes with many AWS services, including S3, Glue, Redshift, EMR, Athena, and CloudWatch Logs Insights. In a few lines of code, the script performs the following steps: 40 | 41 | 1. Collects all AWS accounts and their tags via the Organizations API, writes them to S3 in Parquet, and registers an account_tags table in the Glue Database 42 | 2. Registers each month of Cost & Usage Report data as a temporary table in the Glue data catalog 43 | 3. Runs an Athena CREATE TABLE AS SELECT query that joins the line items with the account tags on the line_item_usage_account_id column, writing the enriched data to the S3 Bucket partitioned by year, month, and (optionally) line_item_usage_account_id for potentially [more efficient Athena queries](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/) 44 | 4. Creates a table in the Glue Database with the enriched Cost & Usage Report data and refreshes the table partitions 45 | 46 | When the Glue job completes, the enriched reports can be queried via Athena, as demonstrated below. 47 | 48 | ## Setup 49 | 50 | **Note:** This will NOT modify your existing CUR data files. Instead, it creates a copy of the data in a separate bucket. It is recommended that you retain the original CUR files for backup. 51 | 52 | 1. First, use the [AWS Organizations](https://console.aws.amazon.com/organizations/home?region=us-east-1#/accounts) console to add the tags you’d like to have included in your enriched Cost & Usage Reports. 53 | 54 | 2. Deploy the CloudFormation template, setup-glue-enrichment.yml. Specify the source bucket name and prefix (path) where the current CUR files are located. 55 | 56 | 3. After the CloudFormation stack is created, go to outputs and click the link to the S3 bucket that the stack created. 57 | 58 | 4. Upload awswrangler-2.5.0-py3-none-any.whl and glue-enrich-cur.py to that bucket. 59 | 60 | 5. **To add the tags to all historical CUR data**, go to the [AWS Glue console](https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=jobs) and run the `aws-cur-enrichment-full` job. If you would prefer to limit this, you can use the `aws-cur-enrichment-incremental` job instead. 61 | 62 | 6. (Optional) Schedule `aws-cur-enrichment-incremental` to run once a day. You could also make this event-driven by using S3 events so that it runs each time a new CUR file is delivered. This incremental job will update the account tags for the current month AND the previous month.
This makes the job run faster, process less data, and avoid altering account tags in previous months' reports, so that the behavior is similar to cost allocation tags (where tags are not retroactively modified in historical CUR data). 63 | 64 | 7. Go to the [Athena console](https://console.aws.amazon.com/athena/home?region=us-east-1#preview/cost_and_usage_enriched/cur_enriched) and run the following query: 65 | ```sql 66 | SELECT * FROM "cost_and_usage_enriched"."cur_enriched" limit 10; 67 | ``` 68 | 69 | Search for the column names that begin with `account_tag_` in the results and you will see your account tags (if they exist for the account in `line_item_usage_account_id`; otherwise the field for that record will be blank). 70 | 71 | This is an example query that will return results where the account has a `CostCenter` tag: 72 | ```sql 73 | SELECT * FROM "cost_and_usage_enriched"."cur_enriched" 74 | WHERE account_tag_costcenter <> '' 75 | limit 10; 76 | ``` 77 | ## Additional Features 78 | The Glue jobs are configured to pass arguments to the script that determine how it will process the CUR. This can be modified by customizing the arguments passed to the script. Additionally, the script can be used outside of a Glue job as a command line utility. In particular, the `--include_fields`, `--exclude_fields`, `--include_account_tags`, and `--exclude_account_tags` arguments could be useful for filtering columns or tags in the enriched report. Below are the options that the script accepts as input: 79 | 80 | ``` 81 | usage: glue-enrich-cur.py [-h] --s3_source_bucket S3_SOURCE_BUCKET --s3_source_prefix S3_SOURCE_PREFIX --s3_target_bucket S3_TARGET_BUCKET [--s3_target_prefix S3_TARGET_PREFIX] 82 | [--incremental_mode_months INCREMENTAL_MODE_MONTHS] [--create_table] [--overwrite_existing_table] [--partition_by_account] [--database_name DATABASE_NAME] 83 | [--table_name TABLE_NAME] [--exclude_fields [EXCLUDE_FIELDS ...]] [--include_fields [INCLUDE_FIELDS ...]] [--include_account_tags [INCLUDE_ACCOUNT_TAGS ...]] 84 | [--exclude_account_tags [EXCLUDE_ACCOUNT_TAGS ...]] [--extra-py-files EXTRA_PY_FILES] [--scriptLocation SCRIPTLOCATION] [--job-bookmark-option JOB_BOOKMARK_OPTION] 85 | [--job-language JOB_LANGUAGE] [--connection-names CONNECTION_NAMES] 86 | 87 | Enriches CUR data stored in S3 by adding organizational account tags as columns to line items. 88 | 89 | optional arguments: 90 | -h, --help show this help message and exit 91 | --s3_source_bucket S3_SOURCE_BUCKET 92 | The source bucket where the CUR data is located 93 | --s3_source_prefix S3_SOURCE_PREFIX 94 | The source prefix (path) where the CUR data is located. This should be the prefix immediately preceding the partition prefixes (i.e. path before /year=2020) 95 | --s3_target_bucket S3_TARGET_BUCKET 96 | The destination bucket to put enriched CUR data. 97 | --s3_target_prefix S3_TARGET_PREFIX 98 | The destination prefix (path) where enriched CUR data will be placed. Put this in a prefix that is OUTSIDE of the original CUR data prefix. 99 | --incremental_mode_months INCREMENTAL_MODE_MONTHS 100 | When set, last x months of CUR data is processed. When not set or set to 0, all data is processed. For incremental updates, recommend setting this to 2. 101 | --create_table When set, a table will be created in Glue automatically and made available through Athena. 102 | --overwrite_existing_table 103 | When set, the existing table will be overwritten. This should be set when performing updates.
104 | --partition_by_account 105 | When set, an additional partition for line_item_usage_account_id will be created. Example: /year=2020/month=1/line_item_usage_account_id=123456789012 106 | --database_name DATABASE_NAME 107 | The name of the Glue database to create the table in. Must be set when --create_table is set 108 | --table_name TABLE_NAME 109 | The name of the Glue table to create or overwrite. Must be set when --create_table is set 110 | --exclude_fields [EXCLUDE_FIELDS ...] 111 | Columns or fields in the CUR data to exclude from the enriched report. Useful for creating reports with fewer columns. This does not affect AWS account tags, use 112 | --exclude_account_tags for that. 113 | --include_fields [INCLUDE_FIELDS ...] 114 | Columns or fields in the CUR data to include. All other CUR fields will be dropped. This does not affect AWS account tags, use 115 | --include_account_tags for that. 116 | --include_account_tags [INCLUDE_ACCOUNT_TAGS ...] 117 | Organizations account tags to include in the enriched report. Useful when you have a lot of Organizations account tags and only want to include a filtered set. 118 | --exclude_account_tags [EXCLUDE_ACCOUNT_TAGS ...] 119 | Organizations account tags to exclude from the enriched report. Useful when you want to exclude specific tags. 120 | --extra-py-files EXTRA_PY_FILES 121 | NOT USED 122 | --scriptLocation SCRIPTLOCATION 123 | NOT USED 124 | --job-bookmark-option JOB_BOOKMARK_OPTION 125 | NOT USED 126 | --job-language JOB_LANGUAGE 127 | NOT USED 128 | --connection-names CONNECTION_NAMES 129 | NOT USED 130 | ``` 131 | 132 | ## Alternatives 133 | This approach demonstrates the power/utility of Glue Jobs and Python Shell Scripts for ETL, but it's not the only way to solve this problem. It's worth highlighting a few alternatives to using this script: 134 | 1. With the launch of [Cost Categories](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/create-cost-categories.html) it is now possible to achieve similar capabilities, as the cost categories and values are also included in CUR reports. This approach also has the added benefit of working in Cost Explorer. However, you may already have Organizations account tags and prefer using those to manage account metadata. 135 | 2. Instead of using a Python Shell job to join the account tags data with the CUR data, you could instead use a job to extract the account tags from the Organizations API and put them into a separate file in S3, then create an "account_tags" table in Glue. You could then use this table in Athena to perform a JOIN on "account_tags" and your CUR data table (assuming you've already created it separately). You would update the "account_tags" table on a schedule to get the latest tags. However, this would mean that any historical tags/values are lost and/or not reflected in prior-month CUR data. 136 | 137 | **Note**: The Python Shell job creates a table with this data in the Glue data catalog that could be reused. 138 | 3. It is also possible to use an external reporting/BI tool, such as QuickSight, to merge the account tags with CUR data. 139 | 140 | 141 | ## Possible future enhancements 142 | Below are other ideas for enhancing this script in the future. These may be implemented if time permits. 143 | - Create additional columns to show RI/SP utilization and effective charges to the linked account that used it so that it's easier to handle chargeback for shared RIs.
144 | - Create a separate script/utility that loads account details in to an Glue/Athena table (see #2 under Alternatives) -------------------------------------------------------------------------------- /architecture.drawio: -------------------------------------------------------------------------------- 1 | 7Vpdb+I4FP01SLsPRXE+4bGk7cyuuppq0Wi0T8gkJvE0xKlxCuyv3+vEARKbDt2hpYygqNjHzrVzfY+P7aTnhPPVJ46L9C8Wk6xnW/Gq59z0bBshy4cfiaxrZOC6NZBwGqtKW2BM/yUKtBRa0pgsWhUFY5mgRRuMWJ6TSLQwzDlbtqvNWNZutcAJ0YBxhDMd/UZjkW7uy9oWfCY0SVXTA08VzHFTWQGLFMdsuQM5tz0n5IyJOjVfhSSTzmv8Ul93t6d00zFOcnHIBV+T5O6fxZc1uo8fnmbOqCzipytl5Rlnpbrh629jAB7wmnD4vY4iVoL9+g7EunFLwWguKtd6I/hCy6HV86AklLm+7XWAbj5oA0jPSRttoJsP2gDqmked9lG3gzuAlmuZtzrtWzsdhK8zYqXIaE7CTRBaACYcxxQGJ2QZ44DlLAfvjVIxzyCHILlMqSDjAkfSq0sgEGAzlgtFA2Q3eeV4aRXCqJDp+SqRjOvj5cLtJ5yVRdXkH0AEY+kEkpMoY2U8wZmQhgRnj6TpXM924O9Ohs5oRrOs0+lnwgUFVlxnNJH2BZPNYZXLyKyyCHdC8+S+yt04luq9qYkYL1ISq1tSMQhNkNXe4EYbysBcQ9icCL6GKs0FQ8WydSe/3JLW9xWW7vDVbYiM1USRbGxvuQQJRadXUCvQSENimFpUlnGRsoTlOLvdoiMYqDze+GVb555Jf1cx850IsVYBgkvB2hFVtykbetmR0C9W8oi80H9bzbaYJ0S8UM8zDwwnGRb0ud2Pozu5kYlf3svDU3rZ1lTik0yBTKzBdzkkxmlt5k82PUwrjHph0gyjbuja0apWzeaGFrqgCQt0EOnVGgHQQRNmUrvu1chwNepcvV9r9s2zXQ2CsqHr3dzZO2U3lIMhWulGLtnQkQC4xrt2rJFnEo1Z9enO6I1c3OMpyR7YgirzUyYEm/9QTyLoFSw/Wpz7kS7iRVG7Y0ZXsh9moeSkZmQtkyPIGgWzivNj6JJn2S1dcnxf0yV3YJClwRtR2dEXfCIlOb6w9sLaM2ctrgP5KLwNHOtj8dY/83WOe+A6xz7lOsfVJscvnCYUXAZo+PVv+H+DBcyVfrWDmsIO2U9k6rcHzJ9KIn7XRumneK82Xx1a20GAkK/RWlX+cIyuhITw22dS6wnax/JpGT0SMVlSkU7Y9DtYWbyNCMOmTyez30feO/J5cOZ89g7ks3NKPnv79i1mFoeAZix5Dw7vkeaz57BcOU9i8OMkUs48CoP9Q+TY7TvoHRmMkBYo50Xh4TlQeKhR+DbnNJL0uEjyWUtyf9hWZTc4vSqjcz+0RYeeJ6KTsrrp5g6tQ/loYsb4HFeEuBxHvOlxxMWdH+p0585x3cErT3dGIXI8XXN+1dOdqD0/HEWFbH/QlqBGbFrLSoP+uG81L577MQ869JwHnfSgB+knPQsiyuJK7l+uSLXCnMPN99fgpHdYP+7h8tmvHwWZFzCeR3qYYgedXaBlWDGanvG/2XJRP2DQosXwMoU2YXvXfjjwd0ca7R3I7uB0Am9j6gj+dlHH36ZzM8vgcLvZrh/f4/rzq7HA0eOFpP+fpIvKgUeJGGdg94PBcPvxO/rq6gEU9Jt2Wpz1Xx1BkN2+TVeV7byT6Nz+Bw== -------------------------------------------------------------------------------- /architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/glue-enrich-cost-and-usage/864f63d16f80830bf79ba140fd3f861806012110/architecture.png -------------------------------------------------------------------------------- /awswrangler-2.5.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/glue-enrich-cost-and-usage/864f63d16f80830bf79ba140fd3f861806012110/awswrangler-2.5.0-py3-none-any.whl -------------------------------------------------------------------------------- /glue-enrich-cur.py: -------------------------------------------------------------------------------- 1 | """ /* 2 | * Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | * SPDX-License-Identifier: MIT-0 4 | * 5 | * Permission is hereby granted, free of charge, to any person obtaining a copy of this 6 | * software and associated documentation files (the "Software"), to deal in the Software 7 | * without restriction, including without limitation the rights to use, copy, modify, 8 | * merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 9 | * permit persons to whom the Software is furnished to do so. 
10 | * 11 | * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 12 | * INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 13 | * PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 14 | * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 15 | * OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 16 | * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | */ """ 18 | 19 | import argparse 20 | import logging 21 | import os 22 | import re 23 | import sys 24 | 25 | import awswrangler as wr 26 | import boto3 27 | import pandas as pd 28 | 29 | logging.basicConfig(stream=sys.stdout, level=logging.INFO) 30 | 31 | log = logging.getLogger(__name__) 32 | log.setLevel(logging.INFO) 33 | handler = logging.StreamHandler(sys.stdout) 34 | handler.setLevel(logging.DEBUG) 35 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 36 | handler.setFormatter(formatter) 37 | log.addHandler(handler) 38 | 39 | orgs_client = boto3.client('organizations') 40 | 41 | # Fetches all accounts from the organizations API 42 | def get_accounts (): 43 | accounts={ 'Accounts':[] } 44 | paginator = orgs_client.get_paginator('list_accounts') 45 | page_iterator = paginator.paginate() 46 | for page in page_iterator: 47 | accounts['Accounts'] = accounts['Accounts'] + page['Accounts'] 48 | return accounts 49 | 50 | # Fetches all tags for an AWS Account ID 51 | def get_tags_for_account (account): 52 | tags={ 'Tags':[] } 53 | paginator = orgs_client.get_paginator('list_tags_for_resource') 54 | page_iterator = paginator.paginate( 55 | ResourceId=account 56 | ) 57 | for page in page_iterator: 58 | tags['Tags'] = tags['Tags'] + page['Tags'] 59 | return tags 60 | 61 | 62 | # OPTIONAL ARGUMENT PARSER (getResolvedOptions doesn't support optional arguments at the moment) 63 | arg_parser = argparse.ArgumentParser( 64 | description="Enriches CUR data stored in S3 by adding organizational account tags as columns to line items.", 65 | epilog="NOTES: The script expects the source CUR data to be stored in parquet. The Cost and Usage Report configured to generate this data should be set to output for ATHENA. Organizational account tags are merged based on line_item_usage_account_id. Because this queries the Organizations API, it must be run from the Master Account." 66 | ) 67 | arg_parser.add_argument('--s3_source_bucket', required=True, help="The source bucket where the CUR data is located") 68 | arg_parser.add_argument('--s3_source_prefix', required=True, help="The source prefix (path) where the CUR data is located. This should be the prefix immediately preceding the partition prefixes (i.e. path before /year=2020)") 69 | arg_parser.add_argument('--s3_target_bucket', required=True, help="The destination bucket to put enriched CUR data.") 70 | arg_parser.add_argument('--s3_target_prefix', required=False, default="", help="The destination prefix (path) where enriched CUR data will be placed. Put this in a prefix that is OUTSIDE of the original CUR data prefix.") 71 | arg_parser.add_argument('--incremental_mode_months', type=int, required=False, default=0, help="When set, last x months of CUR data is processed. When not set or set to 0, all data is processed.
For incremental updates, recommend setting this to 2.") 72 | arg_parser.add_argument('--create_table', action='store_true', required=False, default=False, help="When set, a table will be created in Glue automatically and made available through Athena.") 73 | arg_parser.add_argument('--overwrite_existing_table', action='store_true', required=False, default=False, help="When set, the existing table will be overwritten. This should be set when performing updates.") 74 | arg_parser.add_argument('--partition_by_account', action='store_true', required=False, default=False, help="When set, an additional partition for line_item_usage_account_id will be created. Example: /year=2020/month=1/line_item_usage_account_id=123456789012") 75 | arg_parser.add_argument('--database_name', type=str, required=False, default=None, help="The name of the Glue database to create the table in. Must be set when --create_table is set") 76 | arg_parser.add_argument('--table_name', type=str, required=False, default=None, help="The name of the Glue table to create or overwrite. Must be set when --create_table is set") 77 | arg_parser.add_argument('--exclude_fields', type=str, nargs="*", required=False, default=[], help="Columns or fields in the CUR data to exclude from the enriched report. Useful for creating reports with fewer columns. This does not affect AWS account tags, use --exclude_account_tags for that.") 78 | arg_parser.add_argument('--include_fields', type=str, nargs="*", required=False, default=[], help="Columns or fields in the CUR data to include. All other CUR fields will be dropped. This does not affect AWS account tags, use --include_account_tags for that.") 79 | arg_parser.add_argument('--include_account_tags', type=str, nargs="*", required=False, default=[], help="Organizations account tags to include in the enriched report. Useful when you have a lot of Organizations account tags and only want to include a filtered set.") 80 | arg_parser.add_argument('--exclude_account_tags', type=str, nargs="*", required=False, default=[], help="Organizations account tags to exclude from the enriched report.
Useful when you want to exclude specific tags.") 81 | 82 | # Not used, but included because Glue passes these arguments in 83 | arg_parser.add_argument('--extra-py-files', type=str, required=False, default=None, help="NOT USED") 84 | arg_parser.add_argument('--scriptLocation', type=str, required=False, default=None, help="NOT USED") 85 | arg_parser.add_argument('--job-bookmark-option', type=str, required=False, default=None, help="NOT USED") 86 | arg_parser.add_argument('--job-language', type=str, required=False, default=None, help="NOT USED") 87 | arg_parser.add_argument('--connection-names', type=str, required=False, default=None, help="NOT USED") 88 | 89 | log.info(vars(arg_parser.parse_args() )) 90 | args = vars(arg_parser.parse_args()) 91 | 92 | 93 | S3_SOURCE_BUCKET = args["s3_source_bucket"] 94 | S3_SOURCE_PREFIX = args["s3_source_prefix"] 95 | S3_TARGET_BUCKET = args["s3_target_bucket"] 96 | S3_TARGET_PREFIX = args["s3_target_prefix"] 97 | INCREMENTAL_MODE_MONTHS = args["incremental_mode_months"] 98 | PARTITION_BY_ACCOUNT = args['partition_by_account'] 99 | CREATE_TABLE = args["create_table"] 100 | DATABASE_NAME = args["database_name"] 101 | TABLE_NAME = args["table_name"] 102 | OVERWRITE_EXISTING_TABLE = args["overwrite_existing_table"] 103 | INCLUDE_FIELDS = [item.lower().strip() for item in args["include_fields"]] 104 | EXCLUDE_FIELDS = [item.lower().strip() for item in args["exclude_fields"]] 105 | INCLUDE_ACCOUNT_TAGS = [item.lower().strip() for item in args["include_account_tags"]] 106 | EXCLUDE_ACCOUNT_TAGS = [item.lower().strip() for item in args["exclude_account_tags"]] 107 | 108 | if TABLE_NAME == None or DATABASE_NAME == None: raise Exception('Must specify Glue Database and Table when "--create_table" is set') 109 | if wr.catalog.does_table_exist(DATABASE_NAME,TABLE_NAME) and not OVERWRITE_EXISTING_TABLE: raise Exception('The table {}.{} already exists but OVERWRITE_EXISTING_TABLE isn''t set.'.format(DATABASE_NAME, TABLE_NAME)) 110 | 111 | 112 | # First we retrieve the accounts through the organizations API 113 | log.info("Fetching accounts and organizations account tags") 114 | accounts = get_accounts() 115 | account_tags_list = [] 116 | for account in accounts.get('Accounts', {}): 117 | account_dict = {} 118 | account_dict["account_id"] = account["Id"] 119 | account_dict["name"] = account["Name"] 120 | account_dict["email"] = account["Email"] 121 | account_dict["arn"] = account["Arn"] 122 | tags = get_tags_for_account(account["Id"]) 123 | for tag in tags['Tags']: 124 | account_dict['account_tag_'+tag["Key"]] = tag["Value"] 125 | account_tags_list.append(account_dict) 126 | 127 | log.info("Creating account tags DataFrame") 128 | # Create DataFrame with accounts and associated tags 129 | account_tags_df = pd.DataFrame(account_tags_list).convert_dtypes() 130 | account_tags_df.info() 131 | 132 | # Write account_tags_def to S3 in Parquet format 133 | s3_written_objs = wr.s3.to_parquet( 134 | df = account_tags_df, 135 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+"account_tags", 136 | mode = "overwrite", 137 | compression="snappy", 138 | index=False, 139 | dataset = True 140 | ) 141 | 142 | # Create a table in Glue data catalog for account tags 143 | wr.s3.store_parquet_metadata( 144 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+"account_tags", 145 | database=DATABASE_NAME, 146 | table="account_tags", 147 | dataset=True, 148 | mode="overwrite" 149 | ) 150 | 151 | # Fetch catalog 
column names, drop account_id 152 | account_tags_columns = list(wr.catalog.get_table_types( 153 | database=DATABASE_NAME, 154 | table='account_tags' 155 | ).keys()) 156 | account_tags_columns.remove('account_id') 157 | 158 | # Get a list of objects 159 | log.info("Listing objects in path: {}".format("s3://"+S3_SOURCE_BUCKET+"/"+S3_SOURCE_PREFIX)) 160 | 161 | s3_objects = wr.s3.list_objects("s3://"+S3_SOURCE_BUCKET+"/"+S3_SOURCE_PREFIX) 162 | log.info("Objects found: {}".format(s3_objects)) 163 | 164 | s3_prefixes = set() 165 | for object in s3_objects: 166 | s3_prefixes.add(object[:object.rfind('/')]) 167 | s3_prefixes = list(s3_prefixes) 168 | 169 | # Sort objects by month, adding leading zeros to the months to sort correctly 170 | s3_objects.sort(key=lambda x: re.sub(r'(month=)(\d+)', lambda m : m.group(1)+m.group(2).zfill(2),x)) 171 | s3_prefixes.sort(key=lambda x: re.sub(r'(month=)(\d+)', lambda m : m.group(1)+m.group(2).zfill(2),x)) 172 | log.info("Prefixes found: {}".format(s3_prefixes)) 173 | 174 | # If INCREMENTAL_MODE_MONTHS is greater than zero, only use the last INCREMENTAL_MODE_MONTHS items in the list. This assumes everything is partitioned by '/year=xxxx/month=xx/' 175 | if INCREMENTAL_MODE_MONTHS > 0: 176 | s3_prefixes = s3_prefixes[-INCREMENTAL_MODE_MONTHS:] 177 | log.info("Incremental mode set. Filtered objects: {}".format(s3_prefixes)) 178 | 179 | # Rather than trying to do a join on all the data at once, we're breaking it up by partition (prefix). his also allows us to perform incremental updates, leaving prior months 180 | # CUR data unaltered for historical purposes if desired. It can also easier to troubleshoot with subsets of data. 181 | for s3_prefix in s3_prefixes: 182 | # Make sure garbage collection happens 183 | #del df 184 | #gc.collect() 185 | log.info("=============") 186 | log.info('READING METADATA: {}'.format(s3_prefix+"/")) 187 | cur_column_types, partitions = wr.s3.read_parquet_metadata( 188 | path=s3_prefix+"/", 189 | dataset=True 190 | ) 191 | log.info('CLEANING UP TEMP TABLES IF THEY EXIST') 192 | wr.catalog.delete_table_if_exists( 193 | database=DATABASE_NAME, 194 | table='temp_cur_enriched' 195 | ) 196 | wr.catalog.delete_table_if_exists( 197 | database=DATABASE_NAME, 198 | table='temp_cur_original' 199 | ) 200 | 201 | # Create a temporary table in the data catalog from the source CUR data. 202 | log.info('CREATING TEMP TABLE FOR: {}'.format(s3_prefix+"/")) 203 | wr.catalog.create_parquet_table( 204 | path = s3_prefix+"/", 205 | database = DATABASE_NAME, 206 | table = 'temp_cur_original', 207 | columns_types= cur_column_types, 208 | #partitions_types=partitions, 209 | mode = 'overwrite' 210 | ) 211 | 212 | # Fetch catalog column names 213 | temp_cur_original_columns = wr.catalog.get_table_types( 214 | database=DATABASE_NAME, 215 | table='temp_cur_original' 216 | ) 217 | log.info('CLEANING UP {}'.format("s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3])+"/") 218 | wr.s3.delete_objects( 219 | path="s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3]+"/*" 220 | ) 221 | 222 | 223 | 224 | # Note: Using a SELECT * statement won't work if we want to partition by line_item_usage_account_id because it has to be the last column in the list. 225 | # The lines below build a list of column names to use in the CREATE TABLE AS SELECT (CTAS) statement. The 'account_id' column from account_tags is 226 | # also dropped since it's redundant. 
227 | log.info('Merging account tags') 228 | temp_cur_original_columns = list(temp_cur_original_columns.keys()) 229 | 230 | temp_cur_original_columns = [item for item in temp_cur_original_columns if item.lower().strip() not in EXCLUDE_FIELDS] 231 | account_tags_columns = [item for item in account_tags_columns if item.lower().strip().replace('account_tag_', '') not in EXCLUDE_ACCOUNT_TAGS] 232 | 233 | if len(INCLUDE_FIELDS) > 0: temp_cur_original_columns = [item for item in temp_cur_original_columns if item.lower().strip() in INCLUDE_FIELDS] 234 | if len(INCLUDE_ACCOUNT_TAGS) > 0: account_tags_columns = [item for item in account_tags_columns if item.lower().strip().replace('account_tag_', '') in INCLUDE_ACCOUNT_TAGS] 235 | 236 | if PARTITION_BY_ACCOUNT: 237 | temp_cur_original_columns = [item for item in temp_cur_original_columns if item not in ['line_item_usage_account_id']] 238 | column_string = "\""+"\",\"".join(temp_cur_original_columns)+"\","+"\""+"\",\"".join(filter(None,account_tags_columns))+"\""+',"line_item_usage_account_id"' 239 | else: 240 | column_string = column_string = "\""+"\",\"".join(temp_cur_original_columns)+"\","+"\""+"\",\"".join(account_tags_columns)+"\"" 241 | 242 | column_string = column_string.replace(",\"\"", "").replace("\"\",", "") 243 | log.debug("Final list of columns: {}".format(column_string)) 244 | 245 | # Rather than loading the CUR data into a dataframe and merging, which can be memory and CPU intensive, the command below uses an Athena CTAS statement to 246 | # create a new table with the joined data. The data is written to S3 in Parquet format. 247 | # Note: Another reason for using Athena is that CUR Parquet files use the Parquet V2 format and occasionally uses DELTA encoded columns. These columns 248 | # are not currently readable with Pandas - and more specifically the pyarrow package. 249 | # https://github.com/awslabs/aws-data-wrangler/issues/442 250 | # 251 | log.info('Writing output to {}'.format("s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3])) 252 | wr.athena.read_sql_query( 253 | database=DATABASE_NAME, 254 | ctas_approach=False, 255 | sql=""" 256 | CREATE TABLE temp_cur_enriched WITH 257 | ( 258 | format='PARQUET', 259 | parquet_compression='SNAPPY', 260 | external_location = ':s3_target;' 261 | :partition_clause; 262 | ) 263 | AS 264 | SELECT :columns; FROM temp_cur_original LEFT JOIN account_tags ON temp_cur_original.line_item_usage_account_id = account_tags.account_id 265 | """, 266 | params={ 267 | "s3_target": "s3://"+S3_TARGET_BUCKET+"/"+(S3_TARGET_PREFIX+"/" if S3_TARGET_PREFIX != "" else "")+ s3_prefix.split('/', 3)[3], 268 | "partition_clause" : ",partitioned_by=array['line_item_usage_account_id']" if PARTITION_BY_ACCOUNT else "", 269 | "columns": column_string 270 | } 271 | ) 272 | 273 | # Once the enriched CUR data is written to S3 by Athena, we can remove the table definition the CTAS statement created in the catalog. 
274 | log.info('CLEANING UP TEMP TABLES IF THEY EXIST') 275 | wr.catalog.delete_table_if_exists( 276 | database=DATABASE_NAME, 277 | table='temp_cur_enriched' 278 | ) 279 | # Also remote the table containing the original CUR data 280 | wr.catalog.delete_table_if_exists( 281 | database=DATABASE_NAME, 282 | table='temp_cur_original' 283 | ) 284 | log.info("=============") 285 | 286 | 287 | if CREATE_TABLE: 288 | if OVERWRITE_EXISTING_TABLE: 289 | log.info("Deleting {}.{} if it exists".format(DATABASE_NAME, TABLE_NAME)) 290 | wr.catalog.delete_table_if_exists( 291 | database=DATABASE_NAME, 292 | table=TABLE_NAME 293 | ) 294 | 295 | 296 | log.info("Extracting parquet metadata from {}".format( "s3://"+S3_TARGET_BUCKET+"/"+s3_prefix.split('/', 3)[3]+"/")) 297 | cur_column_types, partitions = wr.s3.read_parquet_metadata( 298 | path="s3://"+S3_TARGET_BUCKET+"/"+s3_prefix.split('/', 3)[3]+"/", 299 | dataset=True, 300 | sampling=1 301 | ) 302 | 303 | log.info("Creating Table {}.{} with path {}".format(DATABASE_NAME, TABLE_NAME, "s3://"+S3_TARGET_BUCKET+"/"+(S3_SOURCE_PREFIX+"/" if S3_SOURCE_PREFIX != "" else ""))) 304 | wr.catalog.create_parquet_table( 305 | path = "s3://"+S3_TARGET_BUCKET+"/"+(S3_SOURCE_PREFIX+"/" if S3_SOURCE_PREFIX != "" else ""), 306 | database = DATABASE_NAME, 307 | table = TABLE_NAME, 308 | columns_types= cur_column_types, 309 | partitions_types=partitions 310 | ) 311 | 312 | 313 | log.info("Updated Table Partitions {}.{}".format(DATABASE_NAME, TABLE_NAME)) 314 | wr.athena.repair_table( 315 | database=DATABASE_NAME, 316 | table=TABLE_NAME 317 | ) 318 | 319 | 320 | log.info ("Finished") 321 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | awswrangler==2.0.5 2 | -------------------------------------------------------------------------------- /setup-glue-enrichment.yml: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | # 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this 5 | # software and associated documentation files (the "Software"), to deal in the Software 6 | # without restriction, including without limitation the rights to use, copy, modify, 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 8 | # permit persons to whom the Software is furnished to do so. 9 | # 10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A 12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 16 | # 17 | AWSTemplateFormatVersion: "2010-09-09" 18 | Description: Creates Glue Jobs, a Glue Catalog, an S3 Bucket, and an IAM role for the glue-cur-enricher example. 
19 | Parameters: 20 | S3SourceBucket: 21 | Type: String 22 | Description: "Name of the S3 bucket containing the existing Cost & Usage Reports" 23 | AllowedPattern: "^[0-9a-zA-Z]+([0-9a-zA-Z-]*[0-9a-zA-Z])*$" 24 | ConstraintDescription: "Bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-). It cannot start or end with a hyphen (-)." 25 | 26 | S3SourcePrefix: 27 | Type: String 28 | Description: "Prefix (path) within S3 Bucket where the existing Cost & Usage Reports can be found. Example: cur/cur_report" 29 | Default: "" 30 | Metadata: 31 | 'AWS::CloudFormation::Interface': 32 | ParameterGroups: 33 | - Label: 34 | default: "Cost & Usage Report Parameters" 35 | Parameters: 36 | - S3SourceBucket 37 | - S3SourcePrefix 38 | ParameterLabels: 39 | S3SourceBucket: 40 | default: Cost and Usage Report Source Bucket 41 | S3SourcePrefix: 42 | default: Cost and Usage Report S3 Prefix 43 | Resources: 44 | EnrichedCURBucket: 45 | Type: AWS::S3::Bucket 46 | 47 | EnrichedCURRole: 48 | Type: 'AWS::IAM::Role' 49 | Properties: 50 | AssumeRolePolicyDocument: 51 | Version: '2012-10-17' 52 | Statement: 53 | - Effect: Allow 54 | Principal: 55 | Service: 56 | - glue.amazonaws.com 57 | Action: 58 | - 'sts:AssumeRole' 59 | Path: / 60 | ManagedPolicyArns: 61 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 62 | - arn:aws:iam::aws:policy/AWSOrganizationsReadOnlyAccess 63 | Policies: 64 | - PolicyName: AccessAthena 65 | PolicyDocument: 66 | Version: 2012-10-17 67 | Statement: 68 | - Effect: Allow 69 | Action: 70 | - s3:GetBucketLocation 71 | - s3:GetObject 72 | - s3:ListBucket 73 | - s3:ListBucketMultipartUploads 74 | - s3:ListMultipartUploadParts 75 | - s3:AbortMultipartUpload 76 | - s3:CreateBucket 77 | - s3:PutObject 78 | Resource: 'arn:aws:s3:::aws-athena-query-results-*' 79 | - Effect: Allow 80 | Action: 81 | - athena:StartQueryExecution 82 | - athena:GetQueryExecution 83 | - athena:GetQueryResults 84 | Resource: !Sub 'arn:aws:athena:*:${AWS::AccountId}:workgroup/*' 85 | - Effect: Allow 86 | Action: athena:GetQueryExecutions 87 | Resource: '*' 88 | - PolicyName: EnrichedCURS3SourceReadOnly 89 | PolicyDocument: 90 | Version: 2012-10-17 91 | Statement: 92 | - Effect: Allow 93 | Action: 94 | - s3:GetObject 95 | - s3:ListBucket 96 | Resource: 97 | - !Sub 'arn:aws:s3:::${S3SourceBucket}/${S3SourcePrefix}/*' 98 | - PolicyName: EnrichedCURS3TargetReadWrite 99 | PolicyDocument: 100 | Version: 2012-10-17 101 | Statement: 102 | - Effect: Allow 103 | Action: 104 | - s3:GetObject 105 | - s3:ListBucket 106 | - s3:PutObject 107 | - s3:DeleteObject 108 | Resource: 109 | - !Sub 'arn:aws:s3:::${EnrichedCURBucket}/*' 110 | - PolicyName: GlueListAllBuckets 111 | PolicyDocument: 112 | Version: 2012-10-17 113 | Statement: 114 | - Effect: Allow 115 | Action: 116 | - s3:ListAllMyBuckets 117 | - s3:headBucket 118 | Resource: '*' 119 | EnrichedCURGlueDB: 120 | Type: AWS::Glue::Database 121 | Properties: 122 | CatalogId: !Ref AWS::AccountId 123 | DatabaseInput: 124 | Name: 'cost_and_usage_enriched' 125 | Description: Contains enriched CUR data 126 | 127 | EnrichedCURFull: 128 | Type: AWS::Glue::Job 129 | Properties: 130 | Name: aws-cur-enrichment-full 131 | Command: 132 | Name: pythonshell 133 | PythonVersion: "3" 134 | ScriptLocation: !Sub 's3://${EnrichedCURBucket}/glue-enrich-cur.py' 135 | DefaultArguments: 136 | "--s3_source_bucket": !Ref S3SourceBucket 137 | "--s3_target_bucket": !Ref EnrichedCURBucket 138 | "--s3_source_prefix": !Ref S3SourcePrefix 139 | "--create_table": "" 140 | 
"--database_name": !Ref EnrichedCURGlueDB 141 | "--table_name": "cur_enriched" 142 | "--overwrite_existing_table": "" 143 | "--partition_by_account": "" 144 | "--extra-py-files": !Sub 's3://${EnrichedCURBucket}/awswrangler-2.5.0-py3-none-any.whl' 145 | GlueVersion: "1.0" 146 | ExecutionProperty: 147 | MaxConcurrentRuns: 1 148 | MaxCapacity: 1 149 | MaxRetries: 0 150 | Role: !Ref EnrichedCURRole 151 | EnrichedCURIncremental: 152 | Type: AWS::Glue::Job 153 | Properties: 154 | Name: aws-cur-enrichment-incremental 155 | Command: 156 | Name: pythonshell 157 | PythonVersion: "3" 158 | ScriptLocation: !Sub 's3://${EnrichedCURBucket}/glue-enrich-cur.py' 159 | DefaultArguments: 160 | "--s3_source_bucket": !Ref S3SourceBucket 161 | "--s3_target_bucket": !Ref EnrichedCURBucket 162 | "--s3_source_prefix": !Ref S3SourcePrefix 163 | "--create_table": "" 164 | "--database_name": !Ref EnrichedCURGlueDB 165 | "--table_name": "cur_enriched" 166 | "--overwrite_existing_table": "" 167 | "--partition_by_account": "" 168 | "--incremental_mode_months": "2" 169 | "--extra-py-files": !Sub 's3://${EnrichedCURBucket}/awswrangler-2.5.0-py3-none-any.whl' 170 | GlueVersion: "1.0" 171 | ExecutionProperty: 172 | MaxConcurrentRuns: 1 173 | MaxCapacity: 1 174 | MaxRetries: 0 175 | Role: !Ref EnrichedCURRole 176 | Outputs: 177 | EnrichedCURBucketUrl: 178 | Description: Use this link to go to the AWS S3 Console and upload your the .py script and .whl file. The enriched CUR files will also be written to this bucket. 179 | Value: !Join ['', ['https://s3.console.aws.amazon.com/s3/buckets/', !Ref EnrichedCURBucket ]] --------------------------------------------------------------------------------