├── .gitignore
├── README.md
├── analyzeDBR.config
├── athenaproxy
│   └── athenaproxy.jar
├── bin
│   └── analyzeDBR
├── cf
│   ├── dbr_app.yaml
│   └── dbr_network.yaml
├── csv2parquet
├── dbr_dashboard.png
├── go
│   └── src
│       └── analyzeDBR
│           └── analyzeDBR.go
├── run.sh
└── userdata.sh

/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | go/src/github*
3 | go/pkg*
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DBRdashboard
2 | 
3 | 
4 | DBRdashboard is an automated AWS detailed billing record (DBR) analyzer. You can use it to produce dashboards of your AWS spend and reserved instance utilization. Graphs are updated as frequently as AWS updates the DBR files themselves (multiple times per day).
5 | 
6 | ![DBRdashboard Screenshot](https://raw.githubusercontent.com/andyfase/awsDBRanalysis/master/dbr_dashboard.png)
7 | 
8 | DBRdashboard queries the detailed billing record using AWS Athena. The queries and metrics that it produces are completely customizable to your own needs and requirements.
9 | 
10 | Currently the system relies on Cloudwatch to produce the dashboards (dashboard setup is a manual step). With small code modifications any metrics / dashboard system could be utilized.
11 | 
12 | In addition, the system maintains a set of AWS Athena tables for you, so you can query your detailed billing record as you wish. A table per month is created and kept up to date as new billing data is added to the underlying DBR CSV file.
13 | 
14 | ## How does this work?
15 | 
16 | AWS publishes [detailed billing reports](http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-reports.html#other-reports) periodically during the day. These reports contain very detailed line-by-line billing details for every AWS charge.
17 | 
18 | DBRdashboard periodically spins up an EC2 instance that checks for new DBR data, converts these CSV-based reports into [Parquet format](https://parquet.apache.org/) files (for performance purposes), and re-uploads the converted files to S3. It then utilizes AWS Athena and standard SQL to create database tables and query specific billing metrics within them. The results of the queries are then reported to AWS Cloudwatch as custom metrics.
19 | 
20 | Once the metrics are in Cloudwatch, it is very easy to produce graphs and create a billing dashboard customized to your exact requirements.
21 | 
22 | In addition to querying the detailed billing report, DBRdashboard also queries any [reserved instances](https://aws.amazon.com/ec2/pricing/reserved-instances/) on the account and correlates them against actual usage to generate utilization metrics. Overall utilization and per-instance-type under-utilization metrics are available.
23 | 
24 | ## Setup
25 | 
26 | Setup of DBRdashboard should take ~15 minutes if detailed billing records are already enabled.
27 | 
28 | **NOTE: DBRdashboard can currently only be spun up in the us-east-1 or us-west-2 AWS regions.**
29 | 
30 | ### Step 1
31 | 
32 | If you have not already, turn on detailed billing records in your AWS account and configure a DBR S3 bucket as per the instructions [here](http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-reports.html#turnonreports) (look under the section _"To turn on detailed billing reports"_).
33 | 
34 | ### Step 2
35 | 
36 | Fork this GIT repo.
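For example, customization is done by cloning your fork, editing the configuration, and pushing the change back so the next instance spin-up picks it up. A minimal sketch, assuming your fork lives under the placeholder `<your-github-user>`:

```
# illustrative only: <your-github-user> is a placeholder for your own GitHub account
git clone https://github.com/<your-github-user>/DBRdashboard.git
cd DBRdashboard
# edit analyzeDBR.config (metrics / queries) and/or cf/*.yaml (infrastructure) as needed, then:
git commit -am "Customize DBRdashboard configuration"
git push origin master
```

The steps below explain how the forked repo is used by the EC2 instance when it boots.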
37 | 
38 | The EC2 instance bootstraps itself by cloning a configurable GIT repo and then runs scripts and custom binaries to generate and upload the custom metrics. The custom binary uses a configuration file which will need to be edited to enable/disable certain functionality and to customize the metrics and queries that are run.
39 | 
40 | Therefore, forking this repo allows you to commit configuration modifications which will automatically come into effect the next time the EC2 instance spins up.
41 | 
42 | ### Step 3
43 | 
44 | Using CloudFormation, bring up both of the stacks within the `cf` directory of your newly forked repo.
45 | 
46 | The network stack creates exports which are then referenced by the app stack. To allow multiple stacks to remain operational, a `ResourcePrefix` is used, which is prepended to the exports. The `ResourcePrefix` can be any text string **but must be identical across both stacks**.
47 | 
48 | `dbr_network.yaml` is a CF template that sets up the VPC and general networking required. It is recommended to use a small CIDR block, as DBRdashboard will only ever spin up a single EC2 instance.
49 | 
50 | `dbr_app.yaml` is a CF template that sets up the required IAM role and Athena user, as well as an auto-scale group with a configured time-based scale-up policy. It is recommended to set the schedule parameter to ~4-6 hours, as this is roughly how often the original DBR CSV file is updated by AWS.
51 | 
52 | Ensure you specify the GIT clone URL for your own repo. This will allow you to push configuration (or code) changes which will then be automatically picked up the next time the EC2 instance spins up.
53 | 
54 | ### Step 4
55 | 
56 | Once this is configured, you will need to wait for the first auto-scale spin-up to occur. To speed this up, you can manually set the `desired` capacity of the ASG to `1` so that an instance spins up immediately. It's safe to leave this running, as it will be shut down automatically by the schedule before the end of the hour.
57 | 
58 | Once the instance has spun up, it will bootstrap itself and run the code to generate the custom metrics. These should start to appear in Cloudwatch, typically within 15 minutes of the instance coming up.
59 | 
60 | Using the custom metrics, generate the graphs that you would like and start creating your very own DBRdashboard. Instructions on creating a Cloudwatch dashboard can be viewed [here](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html).
61 | 
62 | ## How much will this cost?
63 | 
64 | DBRdashboard aims to provide significant insight into your overall costs with very little cost overhead.
65 | 
66 | To that end, DBRdashboard uses the following AWS services, with these approximate costs:
67 | 
68 | 
69 | 
70 | AWS Service | Description | Expected Cost
71 | -------------- | ----------------------------------------- | -------------
72 | **EC2 Spot** | Small m1 or m3 instances are used. | ~ 2c / hour, $2.50 / month*
73 | **Auto Scale** | Including scheduled scaling | Free
74 | **S3** | Storage cost depends on size of DBR | ~ 10c / month
75 | **Athena** | Costs vary per query and size of DBR | ~ 50c / month
76 | **Cloudwatch** | Costs depend on number of metrics** | $10 - $20 / month
77 | | **Expected Total** | $15 - $35 / month
78 | 
79 | \* Based on a schedule of 6 hours, with the instance running for 4 hours per day
80 | 
81 | \** *Cloudwatch custom metrics are 30c per metric per month beyond the first 10 free metrics. Dashboards are $5 per month if they contain over 50 metrics.*
82 | 
83 | ## Configuration
84 | 
85 | Configuration for DBRdashboard is performed by editing the configuration file `analyzeDBR.config`. This file is passed into the binary `analyzeDBR` as a command line option.
86 | 
87 | The configuration file is in the [TOML](https://toml.io/) format and has a number of sections, which are described below.
88 | 
89 | You can choose to edit this file in place (within your forked repo) or to create a new configuration file, in which case you will need to alter the file `run.sh` - look for the line that contains:
90 | 
91 | ```
92 | ./bin/analyzeDBR -config ./analyzeDBR.config -key $4 -secret $5 -region $3 -account $2 -bucket $1 -date $(date +%Y%m) -blended=$DBR_BLENDED
93 | ```
94 | 
95 | Change the `-config` parameter to the new configuration filename you have created.
96 | 
97 | ### General Configuration options
98 | 
99 | These options are held within the `[general]` TOML section.
100 | 
101 | Option Name | Description | Default Value
102 | --------------- | --------------------------------------------- | -------------
103 | `namespace` | The Cloudwatch namespace used for all metrics | `DBR`
104 | 
105 | ### RI Configuration options
106 | 
107 | Option Name | Description | Default Value
108 | ---------------- | ---------------------------------------------------- | -------------
109 | `enableRIanalysis` | Determines if DBRdashboard outputs RI metrics or not | `true`
110 | `enableRITotalUtilization` | Determines if a "total" RI % utilization figure is generated or not | `true`
111 | `riPercentageThreshold` | Determines the low boundary of % under-utilization not to report on (i.e. to ignore) | `5`
112 | `riTotalThreshold` | Determines the low boundary of "number of RI's" not to report on (i.e. ignore instance types that have fewer than X total RIs) | `5`
113 | `cwNameTotal` | Cloudwatch metric name for total utilization % | `riTotalUtilization`
114 | `cwName` | Cloudwatch metric name for instance level under-utilization % | `riUnderUtilization`
115 | `cwDimensionTotal` | Cloudwatch default total utilization dimension name | `total`
116 | `cwDimension` | Cloudwatch default instance level dimension name | `instance`
117 | `cwType` | Cloudwatch "type" for RI metrics | `Percent`
118 | `sql` | Athena SQL query used to retrieve RI instance usage hours | `see config file`
119 | `ri.ignore` | TOML map of instances that are ignored when calculating % under-utilization | `see config file`
120 | 
121 | ### Metric Configuration
122 | 
123 | Each DBR metric is held within a TOML array in the configuration file. This array is iterated over to query Athena and then send the results as metrics to Cloudwatch.
124 | 
125 | To add new metrics, simply copy-and-paste an existing `[[metrics]]` entry and then modify the various attributes, which are:
126 | 
127 | Metric Attribute | Description
128 | ----------------- | ------------
129 | `enabled` | Determines if the metric is used or not
130 | `type` | Reserved for future use; `dimension-per-row` is currently the only accepted value
131 | `cwName` | The metric name that will be sent to Cloudwatch
132 | `cwDimension` | The dimension name that will be sent to Cloudwatch (the value of the dimension is taken from the "dimension" row value; see below)
133 | `cwType` | The Cloudwatch metric type that will be sent to Cloudwatch
134 | `sql` | The SQL that will be executed on the Athena DBR table to fetch the metric information (see below)
135 | 
136 | ### Athena Metric SQL
137 | 
138 | Each metric that you wish to display on the dashboard is obtained by querying the DBR Athena table. Each row that is returned is considered a new metric value. The `date` column is used as the time-series "divider" and is converted to a timestamp which is sent for this row.
139 | 
140 | Default useful metrics are pre-configured within the original configuration file. These can be disabled if required or even completely removed. New metrics can be added as described above.
141 | 
142 | **Effectively, if you can write SQL that fetches the data you need, it can be turned into a metric and graphed on Cloudwatch, with no custom coding required.**
143 | 
144 | 
145 | Each row in the query results **MUST** contain the following aliased columns:
146 | 
147 | Column Name | Description
148 | ----------- | -----------
149 | `date` | The time period for the metric. Typically the hour (format `YYYY-MM-DD HH`) or day
150 | `value` | The metric value for this time period (normally a `count(*)` in SQL)
151 | `dimension` | The dimension value that will be sent for this row.
152 | 
153 | For example, if a query returns a row with
154 | 
155 | `date` | `dimension` | `value`
156 | ------ | ----------- | ------
157 | 2017-02-01 17 | m3.xlarge | 50
158 | 
159 | Then a custom metric (named using the `cwName` parameter) will be sent to Cloudwatch as follows:
160 | 
161 | * The **timestamp** will be set to `2017-02-01 17:00:00`
162 | * The **dimension name** will be set to the parameter value `cwDimension`
163 | * The **dimension value** will be set to `m3.xlarge`
164 | * The **value** will be set to `50`
165 | 
166 | Every row returned will send a metric using `put-metric-data`.
167 | 
168 | Note: Athena uses Presto under the hood, hence all Presto SQL functions are available for you to utilize. These can be found [here](https://prestodb.io/docs/current/functions.html).
169 | 
170 | ### Limitations
171 | 
172 | * Only RI's under the "running" account are fetched and used to generate % RI utilization. If RI's exist under linked accounts they are not currently included and will cause incorrect results. A temporary workaround is to move all RI's into the payer account if possible.
173 | 
174 | ### Things left to do
175 | 
176 | 1. Add Cloudwatch Logs support so that the code logs to Cloudwatch
177 | 1. Add configuration to fetch RI data from multiple linked accounts. This will require additional permissions across all accounts to be able to make the API call.
178 | 1. Document the process for multi-dimension metrics (the code supports it, it just needs documenting)
179 | 1. Modify the code to not store the user key/secret in ASG user-data and potentially use KMS instead.
Hopefully support for Athena to use IAM roles will make this un-necessary 180 | 181 | 182 | 183 | 184 | -------------------------------------------------------------------------------- /analyzeDBR.config: -------------------------------------------------------------------------------- 1 | [general] 2 | namespace = "DBR" 3 | 4 | [ri] 5 | enableRIanalysis = true 6 | enableRITotalUtilization = true # Set this to true to get a total RI percentage utilization value. 7 | riPercentageThreshold = 5 # Ignore un-used RI's where percentage of under-use lower than this value 8 | riTotalThreshold = 5 # Ignore un-used RI's where total number of RI's (per instance type) is below this. 9 | cwNameTotal = "riTotalUtilization" 10 | cwName = "riUnderUtilization" 11 | cwDimension = "instance" 12 | cwDimensionTotal = "total" 13 | cwType = "Percent" 14 | sql = """ 15 | SELECT distinct 16 | COALESCE( 17 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 1), 18 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',2) 19 | ) AS platform, 20 | COALESCE( 21 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 2), 22 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',1) 23 | ) AS instance, 24 | substr(usagestartdate, 1, 13) AS date, 25 | availabilityzone AS az, 26 | count(*) AS hours 27 | FROM dbr.autodbr_**DATE** 28 | WHERE productname = 'Amazon Elastic Compute Cloud' 29 | AND operation like '%RunInstances%' 30 | AND usagetype like '%Usage%' 31 | AND reservedinstance = 'Y' 32 | AND split_part(usagetype, ':', 2) is not NULL 33 | AND length(availabilityzone) > 1 34 | AND length(usagestartdate) > 1 35 | AND try_cast(usagestartdate as timestamp) IS NOT NULL 36 | AND try_cast(usagestartdate as timestamp) > now() - interval '72' hour 37 | AND try_cast(usagestartdate as timestamp) < now() 38 | GROUP BY 39 | COALESCE( 40 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 1), 41 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',2) 42 | ), 43 | COALESCE( 44 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 2), 45 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',1) 46 | ), 47 | substr(usagestartdate, 1, 13), 48 | availabilityzone 49 | """ 50 | [ri.ignore] ## Ignore un-used RI's in this map/hash 51 | "t2.micro" = 1 52 | "m1.small" = 1 # This has to be ignored as RI usage in DBR file for this instance type is not accurate 53 | 54 | [[metrics]] 55 | ## Count of Instance purchase types (RI, Spot, onDemand) per hour" 56 | enabled = true 57 | type = "dimension-per-row" 58 | cwName = "InstancePurchaseType" 59 | cwDimension = "type" 60 | cwType = "Count" 61 | sql = """ 62 | SELECT distinct 63 | substr(split_part(usagetype, ':', 1), strpos(split_part(usagetype, ':', 1), '-') + 1, 10) AS dimension, 64 | substr(usagestartdate, 1, 13) AS date, 65 | count(*) AS value 66 | FROM dbr.autodbr_**DATE** 67 | WHERE productname = 'Amazon Elastic Compute Cloud' 68 | AND operation like '%RunInstances%' 69 | AND usagetype like '%Usage%' 70 | AND try_cast(usagestartdate as timestamp) IS NOT NULL 71 | AND try_cast(usagestartdate as timestamp) > now() - interval '24' hour 72 | AND try_cast(usagestartdate as 
timestamp) < now() 73 | GROUP BY 74 | substr(split_part(usagetype, ':', 1), strpos(split_part(usagetype, ':', 1), '-') + 1, 10), 75 | substr(usagestartdate, 1, 13) 76 | ORDER BY substr(usagestartdate, 1, 13) desc \ 77 | """ 78 | 79 | [[metrics]] 80 | ## Summary of Overall Cost per hour 81 | enabled = true 82 | type = "dimension-per-row" 83 | cwName = "TotalCost" 84 | cwDimension = "cost" 85 | cwType = "None" 86 | sql = """ 87 | SELECT 88 | 'total' as dimension, 89 | substr(usagestartdate, 1, 13) AS date, 90 | sum(cast(**COST** as double)) AS value 91 | FROM dbr.autodbr_**DATE** 92 | WHERE length(usagestartdate) >= 19 93 | AND try_cast(usagestartdate as timestamp) IS NOT NULL 94 | AND try_cast(usagestartdate as timestamp) > now() - interval '24' hour 95 | AND try_cast(usagestartdate as timestamp) < now() 96 | GROUP BY substr(usagestartdate, 1, 13) 97 | ORDER BY substr(usagestartdate, 1, 13) desc \ 98 | """ 99 | 100 | 101 | [[metrics]] 102 | ## Summary of Cost per service per hour 103 | enabled = true 104 | type = "dimension-per-row" 105 | cwName = "ServiceCost" 106 | cwDimension = "service" 107 | cwType = "None" 108 | sql = """ 109 | SELECT 110 | productname AS dimension, 111 | substr(usagestartdate, 1, 13) AS date, 112 | sum(cast(**COST** as double)) AS value 113 | FROM dbr.autodbr_**DATE** 114 | WHERE length(usagestartdate) >= 19 115 | AND try_cast(usagestartdate as timestamp) IS NOT NULL 116 | AND try_cast(usagestartdate as timestamp) > now() - interval '24' hour 117 | AND try_cast(usagestartdate as timestamp) < now() 118 | GROUP BY 119 | productname, 120 | substr(usagestartdate, 1, 13) 121 | HAVING sum(cast(**COST** as double)) > 0 122 | ORDER BY substr(usagestartdate, 1, 13), productname desc \ 123 | """ 124 | 125 | [[metrics]] 126 | ## Count of Instance Types per Hour 127 | enabled = true 128 | type = "dimension-per-row" 129 | cwName = "InstanceType" 130 | cwDimension = "instance" 131 | cwType = "Count" 132 | sql = """ 133 | SELECT distinct 134 | COALESCE( 135 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 2), 136 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',1) 137 | ) AS dimension, 138 | substr(usagestartdate, 1, 13) AS date, 139 | count(*) AS value 140 | FROM dbr.autodbr_**DATE** 141 | WHERE productname = 'Amazon Elastic Compute Cloud' 142 | AND operation like '%RunInstances%' 143 | AND usagetype like '%Usage%' 144 | AND try_cast(usagestartdate as timestamp) IS NOT NULL 145 | AND try_cast(usagestartdate as timestamp) > now() - interval '24' hour 146 | AND try_cast(usagestartdate as timestamp) < now() 147 | GROUP BY 148 | COALESCE( 149 | regexp_extract(itemdescription,'per(?:\\sOn Demand)*\\s(.*?)(?:\\s\\(Amazon VPC\\))*,*\\s([a-z]\\d\\.\\d*\\w+)\\s', 2), 150 | regexp_extract(itemdescription,'^([a-z]\\d\\.\\d*\\w+)\\s(.*?)(?:\\s\\(Amazon VPC\\))*\\sSpot',1) 151 | ), 152 | substr(usagestartdate, 1, 13) 153 | ORDER BY substr(usagestartdate, 1, 13) desc 154 | """ 155 | 156 | [[metrics]] 157 | ## Count of Linked Account per hour 158 | ## Only enable this if you have linked accounts 159 | enabled = false 160 | type = "dimension-per-row" 161 | cwName = "AccountCost" 162 | cwDimension = "accountid" 163 | cwType = "None" 164 | sql = """ 165 | SELECT distinct 166 | linkedaccountid AS dimension, 167 | substr(usagestartdate, 1, 13) AS date, 168 | sum(cast(blendedcost as double)) AS value 169 | FROM dbr.autodbr_**DATE** 170 | WHERE length(usagestartdate) >= 19 171 | AND 
try_cast(usagestartdate as timestamp) IS NOT NULL 172 | AND try_cast(usagestartdate as timestamp) > now() - interval '24' hour 173 | AND try_cast(usagestartdate as timestamp) < now() 174 | AND length(linkedaccountid) > 1 175 | GROUP BY 176 | linkedaccountid, 177 | substr(usagestartdate, 1, 13) 178 | ORDER BY 179 | substr(usagestartdate, 1, 13), 180 | sum(cast(blendedcost as double)) desc 181 | """ 182 | 183 | [athena] 184 | create_database = "create database if not exists `dbr` comment \"AutoDBR Athena Database\"" 185 | create_table = """ 186 | create external table if not exists `dbr.autodbr_**DATE**` ( 187 | `InvoiceID` string, 188 | `PayerAccountId` string, 189 | `LinkedAccountId` string, 190 | `RecordType` string, 191 | `RecordId` string, 192 | `ProductName` string, 193 | `RateId` string, 194 | `SubscriptionId` string, 195 | `PricingPlanId` string, 196 | `UsageType` string, 197 | `Operation` string, 198 | `AvailabilityZone` string, 199 | `ReservedInstance` string, 200 | `ItemDescription` string, 201 | `UsageStartDate` string, 202 | `UsageEndDate` string, 203 | `UsageQuantity` string, 204 | `Rate` string, 205 | `Cost` string 206 | ) 207 | STORED AS PARQUET 208 | LOCATION 's3://**BUCKET**/dbr-parquet/**ACCOUNT**-**DATE**/' \ 209 | """ 210 | create_table_blended = """ 211 | create external table if not exists `dbr.autodbr_**DATE**` ( 212 | `InvoiceID` string, 213 | `PayerAccountId` string, 214 | `LinkedAccountId` string, 215 | `RecordType` string, 216 | `RecordId` string, 217 | `ProductName` string, 218 | `RateId` string, 219 | `SubscriptionId` string, 220 | `PricingPlanId` string, 221 | `UsageType` string, 222 | `Operation` string, 223 | `AvailabilityZone` string, 224 | `ReservedInstance` string, 225 | `ItemDescription` string, 226 | `UsageStartDate` string, 227 | `UsageEndDate` string, 228 | `UsageQuantity` string, 229 | `BlendedRate` string, 230 | `BlendedCost` string, 231 | `UnBlendedRate` string, 232 | `UnBlendedCost` string 233 | ) 234 | STORED AS PARQUET 235 | LOCATION 's3://**BUCKET**/dbr-parquet/**ACCOUNT**-**DATE**/' \ 236 | """ 237 | -------------------------------------------------------------------------------- /athenaproxy/athenaproxy.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andyfase/DBRdashboard/d4be41b13473d6cde86f99adc41f28e5207024bc/athenaproxy/athenaproxy.jar -------------------------------------------------------------------------------- /bin/analyzeDBR: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andyfase/DBRdashboard/d4be41b13473d6cde86f99adc41f28e5207024bc/bin/analyzeDBR -------------------------------------------------------------------------------- /cf/dbr_app.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | 3 | Description: DBR Dashboard app stack including ASG with timed scale-up / scale-down policy. 4 | 5 | Parameters: 6 | ResourcePrefix: 7 | Type: String 8 | Description: A description to identify resources. Use an identical value across stacks (e.g. 
"dbr-dashboard") 9 | Default: dbr-dashboard 10 | MinLength: 2 11 | 12 | KeypairName: 13 | Type: AWS::EC2::KeyPair::KeyName 14 | 15 | DBRBucket: 16 | Type: String 17 | Description: S3 bucket configured for Detailed Billing Records 18 | MinLength: 2 19 | 20 | DBRUploadBucket: 21 | Type: String 22 | Description: OPTIONAL S3 bucket for storing converted DBR files (only use this is your DBR bucket is invalid for use with Athena i.e. contains a underscore) 23 | 24 | InstanceType: 25 | Type: String 26 | Description: Instance Type to run (for cost savings limited to M1, M3 or C3 instances - instance store backed, with Moderate networking) 27 | Default: m3.medium 28 | AllowedValues: 29 | - m1.medium 30 | - m1.large 31 | - m3.medium 32 | - m3.large 33 | - c3.large 34 | 35 | SpotPrice: 36 | Type: Number 37 | Default: 0.05 38 | MinValue: 0.01 39 | Description: The Maximum price for the instance you want to pay (Note you will be charged based on the current price) 40 | 41 | Schedule: 42 | Type: Number 43 | Default: 4 44 | MinValue: 1 45 | Description: Time gap (in hours) between DBR automatic processing (the lower, the more frequent and costly) 46 | 47 | GitURL: 48 | Type: String 49 | Default: "https://github.com/andyfase/DBRdashboard.git" 50 | MinLength: 2 51 | Description: "URL of the git repo that will be checked out and used when configuring the EC2 instance, use your cloned repo." 52 | 53 | GitBranch: 54 | Type: String 55 | Default: "master" 56 | MinLength: 2 57 | Description: "Branch of your GIT repo that the EC2 instance will checkout. Change this from 'master' to test out seperate code branches before they are merged to master" 58 | 59 | Mappings: 60 | RegionMap: 61 | us-east-1: 62 | "ami": "ami-55065e42" 63 | us-west-2: 64 | "ami": "ami-51c56331" 65 | 66 | Conditions: 67 | NoUploadBucket: 68 | "Fn::Equals": 69 | - { Ref: DBRUploadBucket } 70 | - "" 71 | 72 | Resources: 73 | autodbrAthena: 74 | Type: AWS::IAM::User 75 | Properties: 76 | UserName: "auto-dbr-athena" 77 | Policies: 78 | - PolicyName: "athena-dbr-access" 79 | PolicyDocument: 80 | Version: "2012-10-17" 81 | Statement: 82 | - Effect: Allow 83 | Action: 84 | - athena:CancelQueryExecution 85 | - athena:GetCatalogs 86 | - athena:GetExecutionEngine 87 | - athena:GetExecutionEngines 88 | - athena:GetNamespace 89 | - athena:GetNamespaces 90 | - athena:GetQueryExecution 91 | - athena:GetQueryExecutions 92 | - athena:GetQueryResults 93 | - athena:GetTable 94 | - athena:GetTables 95 | - athena:RunQuery 96 | Resource: 97 | - "*" 98 | - Effect: Allow 99 | Action: 100 | - s3:GetBucketLocation 101 | - s3:GetObject 102 | - s3:ListBucket 103 | - s3:ListBucketMultipartUploads 104 | - s3:ListMultipartUploadParts 105 | - s3:AbortMultipartUpload 106 | - s3:CreateBucket 107 | - s3:PutObject 108 | Resource: 109 | - "arn:aws:s3:::aws-athena-query-results-*" 110 | - Effect: Allow 111 | Action: 112 | - s3:GetObject 113 | - s3:ListBucket 114 | - s3:ListObjects 115 | Resource: 116 | - "Fn::Join": 117 | - "" 118 | - 119 | - "arn:aws:s3:::" 120 | - "Fn::If": 121 | - NoUploadBucket 122 | - { Ref: DBRBucket } 123 | - { Ref: DBRUploadBucket } 124 | - "*" 125 | autodbrAthenaAccessKey: 126 | Type: AWS::IAM::AccessKey 127 | Properties: 128 | UserName: 129 | { Ref: autodbrAthena } 130 | 131 | IamRole: 132 | Type: AWS::IAM::Role 133 | Properties: 134 | AssumeRolePolicyDocument: 135 | Version: "2012-10-17" 136 | Statement: 137 | - Sid: 'PermitAssumeRoleEc2' 138 | Action: 139 | - sts:AssumeRole 140 | Effect: Allow 141 | Principal: 142 | Service: 143 | - ec2.amazonaws.com 
144 | Path: / 145 | Policies: 146 | - PolicyName: autoDBR_S3_and_CloudWatch 147 | PolicyDocument: 148 | Version: "2012-10-17" 149 | Statement: 150 | - Effect: Allow 151 | Action: 152 | - s3:ListBucket 153 | - s3:GetObject 154 | - s3:ListObjects 155 | Resource: 156 | - "Fn::Join": 157 | - "" 158 | - 159 | - "arn:aws:s3:::" 160 | - { Ref: DBRBucket } 161 | - "/*" 162 | - "Fn::Join": 163 | - "" 164 | - 165 | - "arn:aws:s3:::" 166 | - { Ref: DBRBucket } 167 | - "Fn::If": 168 | - NoUploadBucket 169 | - { Ref: "AWS::NoValue" } 170 | - "Fn::Join": 171 | - "" 172 | - 173 | - "arn:aws:s3:::" 174 | - { Ref: DBRUploadBucket } 175 | - Effect: Allow 176 | Action: 177 | - s3:GetObject 178 | - s3:ListObjects 179 | - s3:PutObject 180 | Resource: 181 | - "Fn::Join": 182 | - "" 183 | - 184 | - "arn:aws:s3:::" 185 | - "Fn::If": 186 | - NoUploadBucket 187 | - { Ref: DBRBucket } 188 | - { Ref: DBRUploadBucket } 189 | - "/dbr-parquet" 190 | - "/*" 191 | - "Fn::Join": 192 | - "" 193 | - 194 | - "arn:aws:s3:::" 195 | - "Fn::If": 196 | - NoUploadBucket 197 | - { Ref: DBRBucket } 198 | - { Ref: DBRUploadBucket } 199 | - "/dbr-parquet" 200 | - Effect: Allow 201 | Action: 202 | - cloudwatch:PutMetricData 203 | - ec2:DescribeReservedInstances 204 | Resource: 205 | - "*" 206 | IamProfile: 207 | Type: AWS::IAM::InstanceProfile 208 | Properties: 209 | Path: / 210 | Roles: 211 | - Ref: IamRole 212 | 213 | SecurityGroup: 214 | Type: AWS::EC2::SecurityGroup 215 | Properties: 216 | VpcId: 217 | "Fn::ImportValue": 218 | "Fn::Sub": "${ResourcePrefix}-VpcId" 219 | GroupDescription: 220 | "Fn::Join": 221 | - 222 | "-" 223 | - 224 | - { Ref: ResourcePrefix } 225 | - "sg" 226 | Tags: 227 | - Key: Name 228 | Value: 229 | "Fn::Join": 230 | - 231 | "-" 232 | - 233 | - { Ref: ResourcePrefix } 234 | - "sg" 235 | 236 | SecurityGroupSshIngress: 237 | Type: AWS::EC2::SecurityGroupIngress 238 | Properties: 239 | GroupId: { Ref: SecurityGroup } 240 | CidrIp: "0.0.0.0/0" 241 | IpProtocol: tcp 242 | FromPort: 22 243 | ToPort: 22 244 | 245 | LaunchConfig: 246 | Type: AWS::AutoScaling::LaunchConfiguration 247 | Properties: 248 | ImageId: 249 | "Fn::FindInMap": 250 | - RegionMap 251 | - Ref: "AWS::Region" 252 | - "ami" 253 | KeyName: { Ref: KeypairName } 254 | IamInstanceProfile: { Ref: IamProfile } 255 | InstanceType: { Ref: InstanceType } 256 | SpotPrice: { Ref: SpotPrice } 257 | SecurityGroups: 258 | - { Ref: SecurityGroup } 259 | UserData: 260 | "Fn::Base64": 261 | "Fn::Join": 262 | - "\n" 263 | - 264 | - "#!/bin/bash" 265 | - "" 266 | - "chmod 777 /media/ephemeral0" 267 | - "yum update -y" 268 | - "yum install -y git" 269 | - "Fn::Sub": "git clone ${GitURL}" 270 | - "cd DBRdashboard" 271 | - "Fn::Sub": "git checkout ${GitBranch}" 272 | - "Fn::Sub": "chmod 755 ./run.sh && ./run.sh ${DBRBucket} ${AWS::AccountId} ${AWS::Region} ${autodbrAthenaAccessKey} ${autodbrAthenaAccessKey.SecretAccessKey} ${DBRUploadBucket}" 273 | - "" 274 | 275 | Asg: 276 | Type: AWS::AutoScaling::AutoScalingGroup 277 | Properties: 278 | Tags: 279 | - Key: Name 280 | Value: 281 | "Fn::Join": 282 | - "-" 283 | - 284 | - { Ref: ResourcePrefix } 285 | - "asg" 286 | PropagateAtLaunch: true 287 | 288 | MinSize: 0 289 | MaxSize: 1 290 | DesiredCapacity: 0 291 | LaunchConfigurationName: { Ref: LaunchConfig } 292 | VPCZoneIdentifier: 293 | - "Fn::ImportValue": 294 | "Fn::Sub": "${ResourcePrefix}-PublicSubnetAZ0" 295 | - "Fn::ImportValue": 296 | "Fn::Sub": "${ResourcePrefix}-PublicSubnetAZ1" 297 | 298 | AsgScheduleUp: 299 | Type: AWS::AutoScaling::ScheduledAction 300 | 
Properties: 301 | AutoScalingGroupName: { Ref: Asg } 302 | DesiredCapacity: 1 303 | Recurrence: 304 | "Fn::Sub": "1 */${Schedule} * * *" 305 | 306 | AsgScheduleDown: 307 | Type: AWS::AutoScaling::ScheduledAction 308 | Properties: 309 | AutoScalingGroupName: { Ref: Asg } 310 | DesiredCapacity: 0 311 | Recurrence: 312 | "Fn::Sub": "55 */1 * * *" 313 | 314 | Outputs: 315 | SgIdExport: 316 | Value: { Ref: SecurityGroup } 317 | Export: 318 | Name: 319 | "Fn::Join": 320 | - "-" 321 | - 322 | - { Ref: ResourcePrefix } 323 | - "SgId" 324 | -------------------------------------------------------------------------------- /cf/dbr_network.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | 3 | Description: Network stack for DBR Dashboard 4 | 5 | Parameters: 6 | ResourcePrefix: 7 | Type: String 8 | Description: A description to identify resources. Use an identical value across stacks (e.g. "dbr-dashboard") 9 | Default: dbr-dashboard 10 | MinLength: 2 11 | 12 | VpcCidr: 13 | Type: String 14 | Description: A network CIDR e.g. "10.0.0.0/25" 15 | AllowedPattern: ^\d+\.\d+\.\d+\.\d+\/\d+$ 16 | Default: 10.0.0.0/25 17 | 18 | PublicSubnetAZ0Cidr: 19 | Type: String 20 | Description: A subnet CIDR e.g. "10.0.0.0/27" 21 | AllowedPattern: ^\d+\.\d+\.\d+\.\d+\/\d+$ 22 | Default: 10.0.0.0/27 23 | 24 | PrivateSubnetAZ0Cidr: 25 | Type: String 26 | Description: A subnet CIDR e.g. "10.0.0.32/27" 27 | AllowedPattern: ^\d+\.\d+\.\d+\.\d+\/\d+$ 28 | Default: 10.0.0.32/27 29 | 30 | 31 | PublicSubnetAZ1Cidr: 32 | Type: String 33 | Description: A subnet CIDR e.g. "10.0.0.64/27" 34 | AllowedPattern: ^\d+\.\d+\.\d+\.\d+\/\d+$ 35 | Default: 10.0.0.64/27 36 | 37 | 38 | PrivateSubnetAZ1Cidr: 39 | Type: String 40 | Description: A subnet CIDR e.g. 
"10.0.0.96/27 41 | AllowedPattern: ^\d+\.\d+\.\d+\.\d+\/\d+$ 42 | Default: 10.0.0.96/27 43 | 44 | Resources: 45 | Vpc: 46 | Type: AWS::EC2::VPC 47 | Properties: 48 | EnableDnsSupport: True 49 | EnableDnsHostnames: True 50 | InstanceTenancy: default 51 | CidrBlock: { Ref: VpcCidr } 52 | Tags: [ { Key: Name, Value: { Ref: ResourcePrefix } } ] 53 | 54 | DHCPSettings: 55 | Type: AWS::EC2::DHCPOptions 56 | Properties: 57 | DomainNameServers: [ "AmazonProvidedDNS" ] 58 | DomainName: ec2-internal 59 | 60 | DHCPSettingsAssociation: 61 | Type: AWS::EC2::VPCDHCPOptionsAssociation 62 | Properties: 63 | VpcId: { "Ref" : "Vpc" } 64 | DhcpOptionsId: { "Ref" : "DHCPSettings" } 65 | 66 | InternetGateway: 67 | Type: AWS::EC2::InternetGateway 68 | Properties: 69 | Tags: 70 | - Key: Name 71 | Value: 72 | "Fn::Join": 73 | - "-" 74 | - - { Ref: ResourcePrefix } 75 | - "internetgw" 76 | 77 | InternetGatewayAttachment: 78 | Type: AWS::EC2::VPCGatewayAttachment 79 | Properties: 80 | VpcId: { "Ref": "Vpc" } 81 | InternetGatewayId : { "Ref" : "InternetGateway" } 82 | 83 | PublicRouteTable: 84 | Type: AWS::EC2::RouteTable 85 | Properties: 86 | VpcId: { Ref : Vpc } 87 | Tags: 88 | - Key: Name 89 | Value: 90 | "Fn::Join": 91 | - "-" 92 | - - { Ref: ResourcePrefix } 93 | - "rtb" 94 | - "public" 95 | 96 | PrivateRouteTable: 97 | Type: AWS::EC2::RouteTable 98 | Properties: 99 | VpcId: { "Ref": "Vpc" } 100 | Tags: 101 | - Key: Name 102 | Value: 103 | "Fn::Join": 104 | - "-" 105 | - - { Ref: ResourcePrefix } 106 | - "rtb" 107 | - "private" 108 | 109 | RoutePublicToInternet: 110 | Type: AWS::EC2::Route 111 | Properties: 112 | DestinationCidrBlock: "0.0.0.0/0" 113 | RouteTableId: { Ref: PublicRouteTable } 114 | GatewayId: { Ref: InternetGateway } 115 | 116 | PublicSubnetAZ0: 117 | Type: AWS::EC2::Subnet 118 | Properties: 119 | VpcId: { Ref: Vpc } 120 | AvailabilityZone: { "Fn::Select": [ "0", { "Fn::GetAZs": "" } ] } 121 | MapPublicIpOnLaunch: "true" 122 | CidrBlock: { Ref: PublicSubnetAZ0Cidr } 123 | Tags: 124 | - Key: Name 125 | Value: 126 | "Fn::Join": 127 | - "-" 128 | - - { Ref: ResourcePrefix } 129 | - "public" 130 | - { "Fn::Select": [ "0", { "Fn::GetAZs": "" } ] } 131 | SubnetRouteTableAssociationPublicAZ0: 132 | Type: AWS::EC2::SubnetRouteTableAssociation 133 | Properties: 134 | SubnetId: { Ref: PublicSubnetAZ0 } 135 | RouteTableId: { Ref: PublicRouteTable } 136 | 137 | PrivateSubnetAZ0: 138 | Type: AWS::EC2::Subnet 139 | Properties: 140 | VpcId: { Ref: Vpc } 141 | AvailabilityZone: { "Fn::Select": [ "0", { "Fn::GetAZs": "" } ] } 142 | CidrBlock: { Ref: PrivateSubnetAZ0Cidr } 143 | Tags: 144 | - Key: Name 145 | Value: 146 | "Fn::Join": 147 | - "-" 148 | - - { Ref: ResourcePrefix } 149 | - "private" 150 | - { "Fn::Select": [ "0", { "Fn::GetAZs": "" } ] } 151 | SubnetRouteTableAssociationPrivateAZ0: 152 | Type: AWS::EC2::SubnetRouteTableAssociation 153 | Properties: 154 | SubnetId: { Ref: PrivateSubnetAZ0 } 155 | RouteTableId: { Ref: PrivateRouteTable } 156 | 157 | PublicSubnetAZ1: 158 | Type: AWS::EC2::Subnet 159 | Properties: 160 | VpcId: { Ref: Vpc } 161 | AvailabilityZone: { "Fn::Select": [ "1", { "Fn::GetAZs": "" } ] } 162 | MapPublicIpOnLaunch: "true" 163 | CidrBlock: { Ref: PublicSubnetAZ1Cidr } 164 | Tags: 165 | - Key: Name 166 | Value: 167 | "Fn::Join": 168 | - "-" 169 | - - { Ref: ResourcePrefix } 170 | - "public" 171 | - { "Fn::Select": [ "1", { "Fn::GetAZs": "" } ] } 172 | SubnetRouteTableAssociationPublicAZ1: 173 | Type: AWS::EC2::SubnetRouteTableAssociation 174 | Properties: 175 | SubnetId: { Ref: 
PublicSubnetAZ1 } 176 | RouteTableId: { Ref: PublicRouteTable } 177 | 178 | PrivateSubnetAZ1: 179 | Type: AWS::EC2::Subnet 180 | Properties: 181 | VpcId: { Ref: Vpc } 182 | AvailabilityZone: { "Fn::Select": [ "1", { "Fn::GetAZs": "" } ] } 183 | CidrBlock: { Ref: PrivateSubnetAZ1Cidr } 184 | Tags: 185 | - Key: Name 186 | Value: 187 | "Fn::Join": 188 | - "-" 189 | - - { Ref: ResourcePrefix } 190 | - "private" 191 | - { "Fn::Select": [ "1", { "Fn::GetAZs": "" } ] } 192 | SubnetRouteTableAssociationPrivateAZ1: 193 | Type: AWS::EC2::SubnetRouteTableAssociation 194 | Properties: 195 | SubnetId: { Ref: PrivateSubnetAZ1 } 196 | RouteTableId: { Ref: PrivateRouteTable } 197 | 198 | 199 | Outputs: 200 | VpcIdExport: 201 | Value: { Ref: Vpc } 202 | Export: 203 | Name: 204 | "Fn::Join": 205 | - "-" 206 | - - { Ref: ResourcePrefix } 207 | - "VpcId" 208 | 209 | PublicSubnetAZ0Export: 210 | Value: { Ref: PublicSubnetAZ0 } 211 | Export: 212 | Name: 213 | "Fn::Join": 214 | - "-" 215 | - - { Ref: ResourcePrefix } 216 | - PublicSubnetAZ0 217 | 218 | PrivateSubnetAZ0Export: 219 | Value: { Ref: PrivateSubnetAZ0 } 220 | Export: 221 | Name: 222 | "Fn::Join": 223 | - "-" 224 | - - { Ref: ResourcePrefix } 225 | - PrivateSubnetAZ0 226 | 227 | PublicSubnetAZ1Export: 228 | Value: { Ref: PublicSubnetAZ1 } 229 | Export: 230 | Name: 231 | "Fn::Join": 232 | - "-" 233 | - - { Ref: ResourcePrefix } 234 | - PublicSubnetAZ1 235 | 236 | PrivateSubnetAZ1Export: 237 | Value: { Ref: PrivateSubnetAZ1 } 238 | Export: 239 | Name: 240 | "Fn::Join": 241 | - "-" 242 | - - { Ref: ResourcePrefix } 243 | - PrivateSubnetAZ1 244 | -------------------------------------------------------------------------------- /csv2parquet: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import atexit 5 | import csv 6 | import os 7 | import shutil 8 | import string 9 | import subprocess 10 | import sys 11 | import tempfile 12 | 13 | HELP=''' 14 | csv_input is a CSV file, whose first line defines the column names. 15 | parquet_output is the Parquet output (i.e., directory in which one or 16 | more Parquet files are written.) 17 | 18 | When used, --column-map must be followed by an even number of 19 | strings, constituting the key-value pairs mapping CSV to Parquet 20 | column names: 21 | csv2parquet data.csv data.parquet --column-map "CSV Column Name" "Parquet Column Name" 22 | 23 | To provide types for columns, use --types: 24 | csv2parquet data.csv data.parquet --types "CSV Column Name" "INT" 25 | 26 | See documentation (README.md in the source repo) for more information. 27 | '''.strip() 28 | 29 | # True iff we are to preserve temporary files. 30 | # Globally set to true in debug mode. 
31 | global preserve 32 | preserve = False 33 | 34 | DRILL_OVERRIDE_TEMPLATE = ''' 35 | drill.exec: { 36 | sys.store.provider: { 37 | # The following section is only required by LocalPStoreProvider 38 | local: { 39 | path: "${local_base}", 40 | write: true 41 | } 42 | }, 43 | tmp: { 44 | directories: ["${local_base}"], 45 | filesystem: "drill-local:///" 46 | }, 47 | sort: { 48 | purge.threshold : 100, 49 | external: { 50 | batch.size : 4000, 51 | spill: { 52 | batch.size : 4000, 53 | group.size : 100, 54 | threshold : 200, 55 | directories : [ "${local_base}/spill" ], 56 | fs : "file:///" 57 | } 58 | } 59 | }, 60 | } 61 | ''' 62 | 63 | # exceptions 64 | class CsvSourceError(Exception): 65 | def __init__(self, message): 66 | super().__init__(message) 67 | self.message = message 68 | 69 | class DrillScriptError(CsvSourceError): 70 | def __init__(self, returncode): 71 | super().__init__(returncode) 72 | self.returncode = returncode 73 | 74 | class InvalidColumnNames(CsvSourceError): 75 | pass 76 | 77 | # classes 78 | class Column: 79 | def __init__(self, csv, parquet, type): 80 | self.csv = csv 81 | self.parquet = parquet 82 | self.type = type 83 | def __eq__(self, other): 84 | return \ 85 | self.csv == other.csv and \ 86 | self.parquet == other.parquet and \ 87 | self.type == other.type 88 | def line(self, index): 89 | if self.type is None: 90 | return 'columns[{}] as `{}`'.format(index, self.parquet) 91 | # In Drill, if a SELECT query has both an OFFSET with a CAST, 92 | # Drill will apply that cast even to columns that are 93 | # skipped. For a headerless CSV file, we could just use 94 | # something like: 95 | # 96 | # CAST(columns[{index}] as {type}) as `{parquet_name}` 97 | # 98 | # But if the header line is present, this causes the entire 99 | # conversion to fail, because Drill attempts to cast the 100 | # header (e.g., "Price") to the type (e.g., INT), triggering a 101 | # fatal error. So instead we must do: 102 | # 103 | # CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else ... 104 | # 105 | # I really don't like this, because it makes it possible for 106 | # data corruption to hide. If a cell should contain a number, 107 | # but instead contains a non-numeric string, that should be a 108 | # loud, noisy error which is impossible to ignore. However, if 109 | # that happens here, and you are so unlucky that the corrupted 110 | # value happens to equal the CSV column name, then it is 111 | # silently nulled out. This is admittedly very unlikely, but 112 | # that's not the same as impossible. If you are reading this 113 | # and have an idea for a better solution, please contact the 114 | # author (see README.md). 
115 | return "CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else CAST(columns[{index}] as {type}) end as `{parquet_name}`".format( 116 | index=index, type=self.type, parquet_name=self.parquet, csv_name=self.csv) 117 | 118 | class Columns: 119 | def __init__(self, csv_columns: list, name_map: dict, type_map: dict): 120 | self.csv_columns = csv_columns 121 | self.name_map = name_map 122 | self.type_map = type_map 123 | 124 | self.items = [] 125 | invalid_names = [] 126 | for csv_name in self.csv_columns: 127 | parquet_name = self.name_map.get(csv_name, csv_name) 128 | type_name = self.type_map.get(csv_name, None) 129 | if not is_valid_parquet_column_name(parquet_name): 130 | invalid_names.append(parquet_name) 131 | self.items.append(Column(csv_name, parquet_name, type_name)) 132 | if len(invalid_names) > 0: 133 | raise InvalidColumnNames(invalid_names) 134 | 135 | def __iter__(self): 136 | return iter(self.items) 137 | 138 | class CsvSource: 139 | def __init__(self, path: str, name_map: dict = None, type_map: dict = None): 140 | if name_map is None: 141 | name_map = {} 142 | if type_map is None: 143 | type_map = {} 144 | self.path = os.path.realpath(path) 145 | self.headers = self._init_headers() 146 | self.columns = Columns(self.headers, name_map, type_map) 147 | def _init_headers(self): 148 | with open(self.path, newline='') as handle: 149 | csv_data = csv.reader(handle) 150 | return next(csv_data) 151 | 152 | class TempLocation: 153 | _tempdir = None 154 | def __init__(self): 155 | drive, path = os.path.splitdrive(self.tempdir) 156 | assert drive == '', 'Windows support not provided yet' 157 | assert path.startswith('/tmp/'), self.tempdir 158 | self.dfs_tmp_base = path[len('/tmp'):] 159 | def dfs_tmp_path(self, path: str): 160 | return os.path.join(self.dfs_tmp_base, path) 161 | def full_path(self, path: str): 162 | return os.path.join(self.tempdir, path) 163 | @property 164 | def tempdir(self): 165 | if self._tempdir is None: 166 | self._tempdir = tempfile.mkdtemp(prefix='/tmp/') 167 | if preserve: 168 | print('Preserving logs and intermediate files: ' + self._tempdir) 169 | else: 170 | atexit.register(shutil.rmtree, self._tempdir) 171 | return self._tempdir 172 | 173 | class DrillInstallation: 174 | '''Create a temporary, custom Drill installation 175 | 176 | Even in embedded mode, Drill runs in a stateful fashion, storing 177 | its state in (by default) /tmp/drill. This poses a few problems 178 | for running csv2parquet in ways that are (a) robust, and (b) 179 | certain not to affect any other Drill users, human or machine. 180 | 181 | This would be an easy problem to solve if I could create a custom 182 | conf/drill-overrides.conf, and pass it to drill-embedded as a 183 | command line option. However, as of Drill 1.4, the only way to do 184 | such customization is to modify the actual drill-overrides.conf 185 | file itself, before starting drill-embedded. 186 | 187 | This class gets around this through an admittedly unorthodox hack: 188 | creating a whole parallel Drill installation under a temporary 189 | directory, with a customized conf/ directory, but reusing (via 190 | symlinks) everything else in the main installation. This lets us 191 | safely construct our own drill configuration, using its own 192 | separate equivalent of /tmp/drill (etc.), and which is all cleaned 193 | up after the script exits. 194 | 195 | (Feature request to anyone on the Drill team reading this: Please 196 | give drill-embedded a --with-override-file option!) 
197 | 198 | ''' 199 | def __init__(self, reference_executable:str=None): 200 | self.location = TempLocation() 201 | if reference_executable is None: 202 | reference_executable = shutil.which('drill-embedded') 203 | assert reference_executable is not None 204 | self.reference_executable = reference_executable 205 | self.reference_base, self.bindir = os.path.split(os.path.dirname(reference_executable)) 206 | self.install() 207 | @property 208 | def base(self): 209 | return os.path.join(self.location.tempdir, 'drill') 210 | @property 211 | def local_base(self): 212 | return os.path.join(self.location.tempdir, 'drill-local-base') 213 | @property 214 | def executable(self): 215 | return os.path.join(self.base, self.bindir, 'drill-embedded') 216 | def install(self): 217 | # create required subdirs 218 | for dirname in (self.base, self.local_base): 219 | os.makedirs(dirname) 220 | # link to reference 221 | for item in os.scandir(self.reference_base): 222 | if item.name == 'conf': 223 | assert item.is_dir(), os.path.realpath(item) 224 | continue 225 | os.symlink(item.path, os.path.join(self.base, item.name)) 226 | # install config 227 | conf_dir = os.path.join(self.base, 'conf') 228 | os.makedirs(conf_dir) 229 | with open(os.path.join(conf_dir, 'drill-override.conf'), 'w') as handle: 230 | handle.write(string.Template(DRILL_OVERRIDE_TEMPLATE).substitute( 231 | local_base=self.local_base)) 232 | 233 | def build_script(self, csv_source: CsvSource, parquet_output: str): 234 | return DrillScript(self, csv_source, parquet_output) 235 | 236 | class DrillScript: 237 | def __init__(self, drill: DrillInstallation, csv_source: CsvSource, parquet_output: str): 238 | self.drill = drill 239 | self.csv_source = csv_source 240 | self.parquet_output = parquet_output 241 | def render(self): 242 | return render_drill_script( 243 | self.csv_source.columns, 244 | self.drill.location.dfs_tmp_path('parquet_tmp_output'), 245 | self.csv_source.path, 246 | ) 247 | def run(self): 248 | # execute drill script 249 | script_path = os.path.join(self.drill.location.tempdir, 'script') 250 | script_stdout = os.path.join(self.drill.location.tempdir, 'script_stdout') 251 | script_stderr = os.path.join(self.drill.location.tempdir, 'script_stderr') 252 | cmd = [ 253 | self.drill.executable, 254 | '--run={}'.format(script_path), 255 | ] 256 | with open(script_path, 'w') as handle: 257 | handle.write(self.render()) 258 | with open(script_stdout, 'w') as stdout, open(script_stderr, 'w') as stderr: 259 | proc = subprocess.Popen(cmd, stdout=stdout, stderr=stderr) 260 | proc.wait() 261 | if proc.returncode != 0: 262 | raise DrillScriptError(proc.returncode) 263 | 264 | # publish resulting output parquet file 265 | shutil.move(self.drill.location.full_path('parquet_tmp_output'), self.parquet_output) 266 | 267 | # helper functions 268 | def get_args(): 269 | parser = argparse.ArgumentParser( 270 | description='', 271 | epilog=HELP, 272 | formatter_class=argparse.RawDescriptionHelpFormatter, 273 | ) 274 | parser.add_argument('csv_input', 275 | help='Path to input CSV file') 276 | parser.add_argument('parquet_output', 277 | help='Path to Parquet output') 278 | parser.add_argument('--debug', default=False, action='store_true', 279 | help='Preserve intermediate files and logs') 280 | parser.add_argument('--column-map', nargs='*', 281 | help='Map CSV header names to Parquet column names') 282 | parser.add_argument('--types', nargs='*', 283 | help='Map CSV header names to Parquet types') 284 | args = parser.parse_args() 285 | try: 286 | 
args.column_map = list2dict(args.column_map) 287 | except ValueError: 288 | parser.error('--column-map requires an even number of arguments, as key-value pairs') 289 | try: 290 | args.types = list2dict(args.types) 291 | except ValueError: 292 | parser.error('--types requires an even number of arguments, as key-value pairs') 293 | return args 294 | 295 | def list2dict(items): 296 | '''convert [a, b, c, d] to {a:b, c:d}''' 297 | if items is None: 298 | return {} 299 | if len(items) % 2 != 0: 300 | raise ValueError 301 | return dict( (items[n], items[n+1]) 302 | for n in range(0, len(items)-1, 2) ) 303 | 304 | def is_valid_parquet_column_name(val): 305 | return '.' not in val 306 | 307 | def render_drill_script(columns: Columns, parquet_output: str, csv_input: str): 308 | script = '''alter session set `store.format`='parquet'; 309 | CREATE TABLE dfs.tmp.`{}` AS 310 | SELECT 311 | '''.format(parquet_output) 312 | column_lines = [column.line(n) for n, column in enumerate(columns)] 313 | script += ',\n'.join(column_lines) + '\n' 314 | script += 'FROM dfs.`{}`\n'.format(csv_input) 315 | script += 'OFFSET 1\n' 316 | return script 317 | 318 | if __name__ == "__main__": 319 | args = get_args() 320 | if args.debug: 321 | preserve = True 322 | # Quick pre-check whether destination exists, so user doesn't have 323 | # to wait long before we abort with a write error. There's a race 324 | # condition because it can still be created between now and when 325 | # we eventually try to write it, but this will catch the common case. 326 | if os.path.exists(args.parquet_output): 327 | sys.stderr.write('Output location "{}" already exists. Rename or delete before running again.\n'.format(args.parquet_output)) 328 | sys.exit(1) 329 | csv_source = CsvSource(args.csv_input, args.column_map, args.types) 330 | drill = DrillInstallation() 331 | drill_script = drill.build_script(csv_source, args.parquet_output) 332 | try: 333 | drill_script.run() 334 | except DrillScriptError as err: 335 | sys.stderr.write('''FATAL: Drill script failed with error code {}. To troubleshoot, run 336 | with --debug and inspect files script, script_stderr and script_stdout. 
337 | '''.format(err.returncode)) 338 | sys.exit(2) 339 | -------------------------------------------------------------------------------- /dbr_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andyfase/DBRdashboard/d4be41b13473d6cde86f99adc41f28e5207024bc/dbr_dashboard.png -------------------------------------------------------------------------------- /go/src/analyzeDBR/analyzeDBR.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | /* 4 | Imports of interest: 5 | "github.com/BurntSushi/toml" - TOML parser 6 | "flag" inbuilt package for commmand line parameter parsing 7 | "github.com/mohae/deepcopy" package to copy any arbitary struct / map etc 8 | "Numerous AWS packages" offical AWS SDK's 9 | */ 10 | import ( 11 | "log" 12 | "github.com/BurntSushi/toml" 13 | "flag" 14 | "io/ioutil" 15 | "os" 16 | "errors" 17 | "regexp" 18 | "encoding/json" 19 | "net/http" 20 | "bytes" 21 | "strings" 22 | "time" 23 | "strconv" 24 | "github.com/aws/aws-sdk-go/aws" 25 | "github.com/aws/aws-sdk-go/aws/session" 26 | "github.com/aws/aws-sdk-go/service/cloudwatch" 27 | "github.com/aws/aws-sdk-go/service/ec2" 28 | "github.com/mohae/deepcopy" 29 | ) 30 | /* 31 | Structs Below are used to contain configuration parsed in 32 | */ 33 | type General struct { 34 | Namespace string 35 | } 36 | 37 | type RI struct { 38 | Enabled bool `toml:"enableRIanalysis"` 39 | TotalUtilization bool `toml:"enableRITotalUtilization"` 40 | PercentThreshold int `toml:"riPercentageThreshold"` 41 | TotalThreshold int `toml:"riTotalThreshold"` 42 | CwName string 43 | CwNameTotal string 44 | CwDimension string 45 | CwDimensionTotal string 46 | CwType string 47 | Sql string 48 | Ignore map[string]int 49 | } 50 | 51 | type Athena struct { 52 | DbSQL string `toml:"create_database"` 53 | TableSQL string `toml:"create_table"` 54 | TableBlendedSQL string `toml:"create_table_blended"` 55 | Test string 56 | } 57 | 58 | type Metric struct { 59 | Enabled bool 60 | Type string 61 | SQL string 62 | CwName string 63 | CwDimension string 64 | CwType string 65 | } 66 | 67 | type Config struct { 68 | General General 69 | RI RI 70 | Athena Athena 71 | Metrics []Metric 72 | } 73 | /* 74 | End of configuraton structs 75 | */ 76 | 77 | /* 78 | Structs for Athena requests / responses. Requests go through a proxy on the same server 79 | Proxy accepts and gives back JSON. 
80 | */ 81 | type AthenaRequest struct { 82 | AthenaUrl string `json:"athenaUrl"` 83 | S3StagingDir string `json:"s3StagingDir"` 84 | AwsSecretKey string `json:"awsSecretKey"` 85 | AwsAccessKey string `json:"awsAccessKey"` 86 | Query string `json:"query"` 87 | } 88 | 89 | type AthenaResponse struct { 90 | Columns []map[string]string 91 | Rows []map[string]string 92 | } 93 | 94 | var defaultConfigPath = "./analyzeDBR.config" 95 | 96 | /* 97 | Function reads in and validates command line parameters 98 | */ 99 | func getParams(configFile *string, account *string, region *string, key *string, secret *string, date *string, bucket *string, blended *bool) error { 100 | 101 | // Define input command line config parameter and parse it 102 | flag.StringVar(configFile, "config", defaultConfigPath, "Input config file for analyzeDBR") 103 | flag.StringVar(key, "key", "", "Athena IAM access key") 104 | flag.StringVar(secret, "secret", "", "Athena IAM secret key") 105 | flag.StringVar(region, "region", "", "Athena Region") 106 | flag.StringVar(account, "account", "", "AWS Account #") 107 | flag.StringVar(date, "date", "", "Current month in YYYY-MM format") 108 | flag.StringVar(bucket, "bucket", "", "AWS Bucket where DBR files sit") 109 | flag.BoolVar(blended, "blended", false, "Set to 1 if DBR file contains blended costs") 110 | 111 | flag.Parse() 112 | 113 | // check input against defined regex's 114 | r_empty := regexp.MustCompile(`^$`) 115 | r_region := regexp.MustCompile(`^\w+-\w+-\d$`) 116 | r_account := regexp.MustCompile(`^\d+$`) 117 | r_date := regexp.MustCompile(`^\d{6}$`) 118 | 119 | if r_empty.MatchString(*key) { 120 | return errors.New("Must provide Athena access key") 121 | } 122 | if r_empty.MatchString(*secret) { 123 | return errors.New("Must provide Athena secret key") 124 | } 125 | if r_empty.MatchString(*bucket) { 126 | return errors.New("Must provide valid AWS DBR bucket") 127 | } 128 | if ! r_region.MatchString(*region) { 129 | return errors.New("Must provide valid AWS region") 130 | } 131 | if ! r_account.MatchString(*account) { 132 | return errors.New("Must provide valid AWS account number") 133 | } 134 | if ! r_date.MatchString(*date) { 135 | return errors.New("Must provide valid date (YYYY-MM)") 136 | } 137 | 138 | return nil 139 | } 140 | 141 | /* 142 | Function reads in configuration file provided in configFile input 143 | Config file is stored in TOML format 144 | */ 145 | func getConfig(conf *Config, configFile string) error { 146 | 147 | // check for existance of file 148 | if _, err := os.Stat(configFile); err != nil { 149 | return errors.New("Config File " + configFile + " does not exist") 150 | } 151 | 152 | // read file 153 | b, err := ioutil.ReadFile(configFile) 154 | if err != nil { 155 | return err 156 | } 157 | 158 | // parse TOML config file into struct 159 | if _, err := toml.Decode(string(b), &conf); err != nil { 160 | return err 161 | } 162 | 163 | return nil 164 | } 165 | 166 | /* 167 | Function substitutes parameters into SQL command. 168 | Input map contains key (thing to look for in input sql) and if found replaces with given value) 169 | */ 170 | func substituteParams(sql string, params map[string]string) string { 171 | 172 | for sub, value := range params { 173 | sql = strings.Replace(sql, sub, value, -1) 174 | } 175 | 176 | return sql 177 | } 178 | 179 | /* 180 | Function takes SQL to send to Athena converts into JSON to send to Athena HTTP proxy and then sends it. 
181 | Then recieves responses in JSON which is converted back into a struct and returned 182 | */ 183 | func sendQuery(key string, secret string, region string, account string, sql string) (AthenaResponse, error) { 184 | 185 | // construct json 186 | req := AthenaRequest { 187 | AwsAccessKey: key, 188 | AwsSecretKey: secret, 189 | AthenaUrl: "jdbc:awsathena://athena." + region + ".amazonaws.com:443", 190 | S3StagingDir: "s3://aws-athena-query-results-" + account + "-" + region + "/", 191 | Query: sql } 192 | 193 | // encode into JSON 194 | b := new(bytes.Buffer) 195 | err := json.NewEncoder(b).Encode(req) 196 | if err != nil { 197 | return AthenaResponse{}, err 198 | } 199 | 200 | // send request through proxy 201 | resp, err := http.Post("http://127.0.0.1:10000/query", "application/json", b) 202 | if err != nil { 203 | return AthenaResponse{}, err 204 | } 205 | 206 | // check status code 207 | if resp.StatusCode != 200 { 208 | respBytes, _ := ioutil.ReadAll(resp.Body) 209 | return AthenaResponse{}, errors.New(string(respBytes)) 210 | } 211 | 212 | // decode json into response struct 213 | var results AthenaResponse 214 | err = json.NewDecoder(resp.Body).Decode(&results) 215 | if err != nil { 216 | respBytes, _ := ioutil.ReadAll(resp.Body) 217 | return AthenaResponse{}, errors.New(string(respBytes)) 218 | } 219 | 220 | return results, nil 221 | } 222 | 223 | /* 224 | Function takes metric data (from Athena etal) and sends through to cloudwatch. 225 | */ 226 | func sendMetric(svc *cloudwatch.CloudWatch, data AthenaResponse, cwNameSpace string, cwName string, cwType string, cwDimensionName string) error { 227 | 228 | input := cloudwatch.PutMetricDataInput{} 229 | input.Namespace = aws.String(cwNameSpace) 230 | i := 0 231 | for row := range data.Rows { 232 | // skip metric if dimension or value is empty 233 | if len(data.Rows[row]["dimension"]) < 1 || len(data.Rows[row]["value"]) < 1 { 234 | continue 235 | } 236 | 237 | // send Metric Data as we have reached 20 records, and clear MetricData Array 238 | if i >= 20 { 239 | _, err := svc.PutMetricData(&input) 240 | if err != nil { 241 | return err 242 | } 243 | input.MetricData = nil 244 | i = 0 245 | } 246 | 247 | t, _ := time.Parse("2006-01-02 15", data.Rows[row]["date"]) 248 | v, _ := strconv.ParseFloat(data.Rows[row]["value"], 64) 249 | metric := cloudwatch.MetricDatum{ 250 | MetricName: aws.String(cwName), 251 | Timestamp: aws.Time(t), 252 | Unit: aws.String(cwType), 253 | Value: aws.Float64(v), 254 | } 255 | 256 | // Dimension can be a single or comma seperated list of values, or key/values 257 | // presence of "=" sign in value designates key=value. 
223 | /*
224 | Function takes metric data (from Athena et al.) and sends it through to CloudWatch.
225 | */
226 | func sendMetric(svc *cloudwatch.CloudWatch, data AthenaResponse, cwNameSpace string, cwName string, cwType string, cwDimensionName string) error {
227 |
228 |   input := cloudwatch.PutMetricDataInput{}
229 |   input.Namespace = aws.String(cwNameSpace)
230 |   i := 0
231 |   for row := range data.Rows {
232 |     // skip metric if dimension or value is empty
233 |     if len(data.Rows[row]["dimension"]) < 1 || len(data.Rows[row]["value"]) < 1 {
234 |       continue
235 |     }
236 |
237 |     // send the MetricData once we have reached 20 records, then clear the MetricData array
238 |     if i >= 20 {
239 |       _, err := svc.PutMetricData(&input)
240 |       if err != nil {
241 |         return err
242 |       }
243 |       input.MetricData = nil
244 |       i = 0
245 |     }
246 |
247 |     t, _ := time.Parse("2006-01-02 15", data.Rows[row]["date"])
248 |     v, _ := strconv.ParseFloat(data.Rows[row]["value"], 64)
249 |     metric := cloudwatch.MetricDatum{
250 |       MetricName: aws.String(cwName),
251 |       Timestamp:  aws.Time(t),
252 |       Unit:       aws.String(cwType),
253 |       Value:      aws.Float64(v),
254 |     }
255 |
256 |     // Dimension can be a single value, a comma-separated list of values, or key/value pairs.
257 |     // The presence of an "=" sign in a value designates key=value; otherwise the input cwDimensionName is used as the key
258 |     d := strings.Split(data.Rows[row]["dimension"], ",")
259 |     for i := range d {
260 |       var dname, dvalue string
261 |       if strings.Contains(d[i], "=") {
262 |         dTuple := strings.Split(d[i], "=")
263 |         dname = dTuple[0]
264 |         dvalue = dTuple[1]
265 |       } else {
266 |         dname = cwDimensionName
267 |         dvalue = d[i]
268 |       }
269 |       cwD := cloudwatch.Dimension{
270 |         Name:  aws.String(dname),
271 |         Value: aws.String(dvalue),
272 |       }
273 |       metric.Dimensions = append(metric.Dimensions, &cwD)
274 |     }
275 |
276 |     input.MetricData = append(input.MetricData, &metric)
277 |     i++
278 |   }
279 |
280 |   // if we still have data to send - send it
281 |   if len(input.MetricData) > 0 {
282 |     _, err := svc.PutMetricData(&input)
283 |     if err != nil {
284 |       return err
285 |     }
286 |   }
287 |
288 |   return nil
289 | }
290 |
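// Illustrative input sketch (hypothetical values). Each row passed to sendMetric
// carries "date", "value" and "dimension" keys; the dimension field may be either
// plain values (cwDimensionName becomes the dimension name) or explicit key=value
// pairs, e.g.:
//
//   {"date": "2017-01-03 14", "value": "42.5", "dimension": "AmazonEC2"}
//   {"date": "2017-01-03 14", "value": "20",   "dimension": "instance=m3.large,platform=Linux/UNIX"}
//
// The second form is exactly what riUtilizationHour below builds for the
// per-instance RI under-utilization metric.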
291 | /*
292 | Function processes a single hour's worth of RI usage and compares it against the available RIs to produce % utilization / under-utilization
293 | */
294 | func riUtilizationHour(svc *cloudwatch.CloudWatch, date string, used map[string]map[string]map[string]int, azRI map[string]map[string]map[string]int, regionRI map[string]map[string]int, conf Config, region string) error {
295 |
296 |   // Perform a deep copy of both RI maps.
297 |   // We need a copy of the maps as we decrement the RIs available by the hourly usage, and a map is a reference type,
298 |   // hence decrementing the original maps would affect the caller's data
299 |   cpy := deepcopy.Copy(azRI)
300 |   t_azRI, ok := cpy.(map[string]map[string]map[string]int)
301 |   if ! ok {
302 |     return errors.New("could not copy AZ RI map")
303 |   }
304 |
305 |   cpy = deepcopy.Copy(regionRI)
306 |   t_regionRI, ok := cpy.(map[string]map[string]int)
307 |   if ! ok {
308 |     return errors.New("could not copy Regional RI map")
309 |   }
310 |
311 |   // Iterate through the used hours, decrementing the available RIs by the hours that were used.
312 |   // AZ-specific RIs are checked first and then regional RIs
313 |   for az := range used {
314 |     for instance := range used[az] {
315 |       // check if an AZ RI for this AZ and instance type even exists
316 |       _, ok := t_azRI[az][instance]
317 |       if ok {
318 |         for platform := range used[az][instance] {
319 |           // check if an AZ RI for this AZ, instance type and platform even exists
320 |           _, ok2 := t_azRI[az][instance][platform]
321 |           if ok2 {
322 |             // More RIs than we used
323 |             if t_azRI[az][instance][platform] >= used[az][instance][platform] {
324 |               t_azRI[az][instance][platform] -= used[az][instance][platform]
325 |               used[az][instance][platform] = 0
326 |             } else {
327 |               // Fewer RIs than we used
328 |               used[az][instance][platform] -= t_azRI[az][instance][platform]
329 |               t_azRI[az][instance][platform] = 0
330 |             }
331 |           }
332 |         }
333 |       }
334 |
335 |       // check if a regional RI even exists and that the instance used is in the right region
336 |       _, ok = t_regionRI[instance]
337 |       if ok && az[:len(az)-1] == region {
338 |         for platform := range used[az][instance] {
339 |           // if we still have used instances left over, check them against regional RIs
340 |           if used[az][instance][platform] > 0 && t_regionRI[instance][platform] > 0 {
341 |             if t_regionRI[instance][platform] >= used[az][instance][platform] {
342 |               t_regionRI[instance][platform] -= used[az][instance][platform]
343 |               used[az][instance][platform] = 0
344 |             } else {
345 |               used[az][instance][platform] -= t_regionRI[instance][platform]
346 |               t_regionRI[instance][platform] = 0
347 |             }
348 |           }
349 |         }
350 |       }
351 |     }
352 |   }
353 |
354 |   // Now loop through the temp RI data to check if any RIs are still available.
355 |   // If they are, and the percentage unused is above the configured threshold, collate them for sending to CloudWatch.
356 |   // We sum the totals of regional and AZ-specific RIs so that we get one instance-based metric regardless of regional or AZ RI
357 |   i_unused := make(map[string]map[string]int)
358 |   i_total := make(map[string]map[string]int)
359 |   var unused int
360 |   var total int
361 |
362 |   for az := range t_azRI {
363 |     for instance := range t_azRI[az] {
364 |       _, ok := i_unused[instance]
365 |       if ! ok {
366 |         i_unused[instance] = make(map[string]int)
367 |         i_total[instance] = make(map[string]int)
368 |       }
369 |       for platform := range t_azRI[az][instance] {
370 |         i_total[instance][platform] = azRI[az][instance][platform]
371 |         i_unused[instance][platform] = t_azRI[az][instance][platform]
372 |         total += azRI[az][instance][platform]
373 |         unused += t_azRI[az][instance][platform]
374 |       }
375 |     }
376 |   }
377 |
378 |   for instance := range t_regionRI {
379 |     for platform := range t_regionRI[instance] {
380 |       _, ok := i_unused[instance]
381 |       if ! ok {
382 |         i_unused[instance] = make(map[string]int)
383 |         i_total[instance] = make(map[string]int)
384 |       }
385 |       i_total[instance][platform] += regionRI[instance][platform]
386 |       i_unused[instance][platform] += t_regionRI[instance][platform]
387 |       total += regionRI[instance][platform]
388 |       unused += t_regionRI[instance][platform]
389 |     }
390 |   }
391 |
392 |   // loop over per-instance utilization and build metrics to send
393 |   metrics := AthenaResponse{}
394 |   for instance := range i_unused {
395 |     _, ok := conf.RI.Ignore[instance]
396 |     if !ok { // instance not on ignore list
397 |       for platform := range i_unused[instance] {
398 |         percent := (float64(i_unused[instance][platform]) / float64(i_total[instance][platform])) * 100
399 |         if int(percent) > conf.RI.PercentThreshold && i_total[instance][platform] > conf.RI.TotalThreshold {
400 |           metrics.Rows = append(metrics.Rows, map[string]string{"dimension": "instance=" + instance + ",platform=" + platform, "date": date, "value": strconv.FormatInt(int64(percent), 10)})
401 |         }
402 |       }
403 |     }
404 |   }
405 |
406 |
407 |   // send per-instance-type under-utilization
408 |   if len(metrics.Rows) > 0 {
409 |     if err := sendMetric(svc, metrics, conf.General.Namespace, conf.RI.CwName, conf.RI.CwType, conf.RI.CwDimension); err != nil {
410 |       log.Fatal(err)
411 |     }
412 |   }
413 |
414 |   // If configured, send the overall total utilization
415 |   if conf.RI.TotalUtilization {
416 |     percent := 100 - ((float64(unused) / float64(total)) * 100)
417 |     total := AthenaResponse{}
418 |     total.Rows = append(total.Rows, map[string]string{"dimension": "hourly", "date": date, "value": strconv.FormatInt(int64(percent), 10)})
419 |     if err := sendMetric(svc, total, conf.General.Namespace, conf.RI.CwNameTotal, conf.RI.CwType, conf.RI.CwDimensionTotal); err != nil {
420 |       log.Fatal(err)
421 |     }
422 |   }
423 |
424 |   return nil
425 | }
426 |
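// Worked example (hypothetical numbers). Suppose for one hour there are 10 AZ-scoped
// plus 5 regional m3.large/Linux RIs (i_total = 15) and 12 matching instance hours
// were used in the same region, leaving i_unused = 3. The per-instance
// under-utilization is 3/15 = 20%, which is sent only if 20 exceeds
// conf.RI.PercentThreshold and 15 exceeds conf.RI.TotalThreshold. With
// conf.RI.TotalUtilization enabled, the overall metric for that hour would be
// 100 - 20 = 80% utilization.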
427 | /*
428 | Main RI function. Gets the RI and usage data (from Athena).
429 | Then loops through every hour and calls riUtilizationHour to process each hour's worth of data
430 | */
431 | func riUtilization(sess *session.Session, conf Config, key string, secret string, region string, account string, date string) error {
432 |
433 |   svc := ec2.New(sess)
434 |
435 |   params := &ec2.DescribeReservedInstancesInput{
436 |     DryRun: aws.Bool(false),
437 |     Filters: []*ec2.Filter{
438 |       {
439 |         Name: aws.String("state"),
440 |         Values: []*string{
441 |           aws.String("active"),
442 |         },
443 |       },
444 |     },
445 |   }
446 |
447 |   resp, err := svc.DescribeReservedInstances(params)
448 |   if err != nil {
449 |     return err
450 |   }
451 |
452 |   az_ri := make(map[string]map[string]map[string]int)
453 |   region_ri := make(map[string]map[string]int)
454 |
455 |   // map in the number of RIs available, both AZ-specific and regional
456 |   for i := range resp.ReservedInstances {
457 |     ri := resp.ReservedInstances[i]
458 |
459 |     // Trim the VPC identifier off the Platform type as it's not relevant for RI utilization calculations
460 |     platform := strings.TrimSuffix(*ri.ProductDescription, " (Amazon VPC)")
461 |
462 |     if *ri.Scope == "Availability Zone" {
463 |       _, ok := az_ri[*ri.AvailabilityZone]
464 |       if ! ok {
465 |         az_ri[*ri.AvailabilityZone] = make(map[string]map[string]int)
466 |       }
467 |       _, ok = az_ri[*ri.AvailabilityZone][*ri.InstanceType]
468 |       if ! ok {
469 |         az_ri[*ri.AvailabilityZone][*ri.InstanceType] = make(map[string]int)
470 |       }
471 |       az_ri[*ri.AvailabilityZone][*ri.InstanceType][platform] += int(*ri.InstanceCount)
472 |     } else if *ri.Scope == "Region" {
473 |       _, ok := region_ri[*ri.InstanceType]
474 |       if ! ok {
475 |         region_ri[*ri.InstanceType] = make(map[string]int)
476 |       }
477 |       region_ri[*ri.InstanceType][platform] += int(*ri.InstanceCount)
478 |     }
479 |   }
480 |
481 |   // Fetch RI hours used
482 |   data, err := sendQuery(key, secret, region, account, substituteParams(conf.RI.Sql, map[string]string{"**DATE**": date }))
483 |   if err != nil {
484 |     log.Fatal(err)
485 |   }
486 |
487 |   // loop through the response data and generate a map of hourly usage, per AZ, per instance type, per platform
488 |   hours := make(map[string]map[string]map[string]map[string]int)
489 |   for row := range data.Rows {
490 |     _, ok := hours[data.Rows[row]["date"]]
491 |     if ! ok {
492 |       hours[data.Rows[row]["date"]] = make(map[string]map[string]map[string]int)
493 |     }
494 |     _, ok = hours[data.Rows[row]["date"]][data.Rows[row]["az"]]
495 |     if ! ok {
496 |       hours[data.Rows[row]["date"]][data.Rows[row]["az"]] = make(map[string]map[string]int)
497 |     }
498 |     _, ok = hours[data.Rows[row]["date"]][data.Rows[row]["az"]][data.Rows[row]["instance"]]
499 |     if ! ok {
500 |       hours[data.Rows[row]["date"]][data.Rows[row]["az"]][data.Rows[row]["instance"]] = make(map[string]int)
501 |     }
502 |
503 |     v, _ := strconv.ParseInt(data.Rows[row]["hours"], 10, 64)
504 |     hours[data.Rows[row]["date"]][data.Rows[row]["az"]][data.Rows[row]["instance"]][data.Rows[row]["platform"]] += int(v)
505 |   }
506 |
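  // Illustrative shape of the map built above (hypothetical values):
  //
  //   hours["2017-01-03 14"]["us-east-1a"]["m3.large"]["Linux/UNIX"] = 3
  //
  // i.e. for each hour, AZ, instance type and platform, the number of instance hours
  // consumed, which riUtilizationHour below compares against the az_ri / region_ri maps.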
507 |   // Create new CloudWatch client.
508 |   svcCloudwatch := cloudwatch.New(sess)
509 |
510 |   // Iterate through each hour and compare the number of instances used vs the number of RIs available.
511 |   // If the RI leftover percentage is above the configured threshold, push it to CloudWatch
512 |   for hour := range hours {
513 |     if err := riUtilizationHour(svcCloudwatch, hour, hours[hour], az_ri, region_ri, conf, region); err != nil {
514 |       return err
515 |     }
516 |   }
517 |   return nil
518 | }
519 |
520 | func main() {
521 |
522 |   var configFile, region, key, secret, account, bucket, date, costColumn string
523 |   var blendedDBR bool
524 |   if err := getParams(&configFile, &account, &region, &key, &secret, &date, &bucket, &blendedDBR); err != nil {
525 |     log.Fatal(err)
526 |   }
527 |
528 |   var conf Config
529 |   if err := getConfig(&conf, configFile); err != nil {
530 |     log.Fatal(err)
531 |   }
532 |
533 |   // make sure the Athena DB exists - don't care about results
534 |   if _, err := sendQuery(key, secret, region, account, conf.Athena.DbSQL); err != nil {
535 |     log.Fatal(err)
536 |   }
537 |
538 |   // make sure the current Athena table exists - don't care about results.
539 |   // Depending on the type of DBR (blended or not blended) the table we create is slightly different
540 |   if blendedDBR {
541 |     costColumn = "blendedcost"
542 |     if _, err := sendQuery(key, secret, region, account, substituteParams(conf.Athena.TableBlendedSQL, map[string]string{"**BUCKET**": bucket, "**DATE**": date, "**ACCOUNT**": account})); err != nil {
543 |       log.Fatal(err)
544 |     }
545 |   } else {
546 |     costColumn = "cost"
547 |     if _, err := sendQuery(key, secret, region, account, substituteParams(conf.Athena.TableSQL, map[string]string{"**BUCKET**": bucket, "**DATE**": date, "**ACCOUNT**": account})); err != nil {
548 |       log.Fatal(err)
549 |     }
550 |   }
551 |
552 |   // initialize the AWS Go SDK session
553 |   sess, err := session.NewSession(&aws.Config{Region: aws.String(region)})
554 |   if err != nil {
555 |     log.Fatal(err)
556 |   }
557 |
558 |   // Create new CloudWatch client.
559 |   svc := cloudwatch.New(sess)
560 |
561 |   // If RI analysis is enabled - do it
562 |   if conf.RI.Enabled {
563 |     if err := riUtilization(sess, conf, key, secret, region, account, date); err != nil {
564 |       log.Fatal(err)
565 |     }
566 |   }
567 |
568 |   // iterate through metrics - perform each query then send the data to CloudWatch
569 |   for metric := range conf.Metrics {
570 |     if ! conf.Metrics[metric].Enabled {
571 |       continue
572 |     }
573 |     results, err := sendQuery(key, secret, region, account, substituteParams(conf.Metrics[metric].SQL, map[string]string{"**DATE**": date, "**COST**": costColumn}))
574 |     if err != nil {
575 |       log.Fatal(err)
576 |     }
577 |     if conf.Metrics[metric].Type == "dimension-per-row" {
578 |       if err := sendMetric(svc, results, conf.General.Namespace, conf.Metrics[metric].CwName, conf.Metrics[metric].CwType, conf.Metrics[metric].CwDimension); err != nil {
579 |         log.Fatal(err)
580 |       }
581 |     }
582 |   }
583 | }
584 |
--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # INPUT Parameters
4 | #
5 | # $1 - S3 bucket to use
6 | # $2 - AWS Account ID
7 | # $3 - region
8 | # $4 - user-key
9 | # $5 - user-secret
10 | # $6 - Optional Upload bucket
11 |
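# Example invocation (illustrative, hypothetical values only):
#   ./run.sh my-dbr-bucket 123456789012 us-west-2 AKIAXXXXEXAMPLE examplesecretkey my-upload-bucket
# In this repo the script is invoked by userdata.sh after the instance clones the repo at boot.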
12 | #
13 | # Helper function that runs whatever command is passed to it and exits if the command does not return a zero status.
14 | # This could be extended to clean up if required
15 | #
16 | function run {
17 |   "$@"
18 |   local status=$?
19 |   if [ $status -ne 0 ]; then
20 |     echo "$1 errored with $status" >&2
21 |     exit $status
22 |   fi
23 |   return $status
24 | }
25 |
26 | # INSTALL PRE-REQS
27 | sudo mkdir /opt/drill
28 | sudo mkdir /opt/drill/log
29 | sudo chmod 777 /opt/drill/log
30 | sudo curl -s "http://download.nextag.com/apache/drill/drill-1.9.0/apache-drill-1.9.0.tar.gz" | sudo tar xz --strip=1 -C /opt/drill
31 | sudo yum install -y java-1.8.0-openjdk
32 | sudo yum install -y python35
33 | sudo yum install -y aws-cli
34 | sudo yum install -y unzip
35 |
36 | # Add Drill to PATH
37 | export PATH=/opt/drill/bin:$PATH
38 |
39 | # Start Athena Proxy
40 | PORT=10000 java -cp ./athenaproxy/athenaproxy.jar com.getredash.awsathena_proxy.API . &
41 |
42 | # Check if the optional upload bucket parameter has been provided - if it has, use it
43 | if [ -z "${6}" ]; then
44 |   UPLOAD_BUCKET=$1
45 | else
46 |   UPLOAD_BUCKET=$6
47 | fi
48 |
49 | DBRFILES3="s3://${1}/${2}-aws-billing-detailed-line-items-$(date +%Y-%m).csv.zip"
50 | DBRFILEFS="${2}-aws-billing-detailed-line-items-$(date +%Y-%m).csv.zip"
51 | DBRFILEFS_CSV="${2}-aws-billing-detailed-line-items-$(date +%Y-%m).csv"
52 | DBRFILEFS_PARQUET="${2}-aws-billing-detailed-line-items-$(date +%Y-%m).parquet"
53 |
54 | ## Fetch the current DBR file and unzip it
55 | run aws s3 cp $DBRFILES3 /media/ephemeral0/ --quiet
56 | run unzip -qq /media/ephemeral0/$DBRFILEFS -d /media/ephemeral0/
57 |
58 | ## Check if the DBR file contains Blended / UnBlended rates
59 | DBR_BLENDED=`head -1 /media/ephemeral0/$DBRFILEFS_CSV | grep UnBlended | wc -l`
60 |
61 | run hostname localhost
62 | ## Column map required as Athena only works with lowercase column names.
63 | ## Also the DBR columns differ depending on whether linked accounts are in use, hence alter the column map based on that
64 | if [ $DBR_BLENDED -eq 1 ]; then
65 |   run ./csv2parquet /media/ephemeral0/$DBRFILEFS_CSV /media/ephemeral0/$DBRFILEFS_PARQUET --column-map "InvoiceID" "invoiceid" "PayerAccountId" "payeraccountid" "LinkedAccountId" "linkedaccountid" "RecordType" "recordtype" "RecordId" "recordid" "ProductName" "productname" "RateId" "rateid" "SubscriptionId" "subscriptionid" "PricingPlanId" "pricingplanid" "UsageType" "usagetype" "Operation" "operation" "AvailabilityZone" "availabilityzone" "ReservedInstance" "reservedinstance" "ItemDescription" "itemdescription" "UsageStartDate" "usagestartdate" "UsageEndDate" "usageenddate" "UsageQuantity" "usagequantity" "BlendedRate" "blendedrate" "BlendedCost" "blendedcost" "UnBlendedRate" "unblendedrate" "UnBlendedCost" "unblendedcost"
66 | else
67 |   run ./csv2parquet /media/ephemeral0/$DBRFILEFS_CSV /media/ephemeral0/$DBRFILEFS_PARQUET --column-map "InvoiceID" "invoiceid" "PayerAccountId" "payeraccountid" "LinkedAccountId" "linkedaccountid" "RecordType" "recordtype" "RecordId" "recordid" "ProductName" "productname" "RateId" "rateid" "SubscriptionId" "subscriptionid" "PricingPlanId" "pricingplanid" "UsageType" "usagetype" "Operation" "operation" "AvailabilityZone" "availabilityzone" "ReservedInstance" "reservedinstance" "ItemDescription" "itemdescription" "UsageStartDate" "usagestartdate" "UsageEndDate" "usageenddate" "UsageQuantity" "usagequantity" "Rate" "rate" "Cost" "cost"
68 | fi
69 |
70 | ## Upload the Parquet DBR back to the bucket
71 | run aws s3 sync /media/ephemeral0/$DBRFILEFS_PARQUET s3://${UPLOAD_BUCKET}/dbr-parquet/${2}-$(date +%Y%m) --quiet
72 |
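## Illustrative result (hypothetical account/month): for account 123456789012 in January 2017
## the converted files land under s3://$UPLOAD_BUCKET/dbr-parquet/123456789012-201701/,
## which is presumably the prefix the Athena table-creation SQL in analyzeDBR.config points at
## (it receives the same bucket, account and date substitutions).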
73 | ## run the go program to query Athena and send metrics to CloudWatch
74 | ./bin/analyzeDBR -config ./analyzeDBR.config -key $4 -secret $5 -region $3 -account $2 -bucket $UPLOAD_BUCKET -date $(date +%Y%m) -blended=$DBR_BLENDED
75 |
76 | ## done
77 |
--------------------------------------------------------------------------------
/userdata.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | chmod 777 /media/ephemeral0
4 | yum install -y git
5 | git clone https://github.com/andyfase/awsDBRanalysis.git
6 | cd awsDBRanalysis
7 | ./run.sh
8 |
--------------------------------------------------------------------------------