├── .github
│   └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── build
│   └── transform.py
├── buildspec.yml
├── functions
│   ├── createPartitions.mjs
│   ├── moveAccessLogs.mjs
│   ├── transformPartition.mjs
│   └── util.mjs
├── images
│   ├── cloudformation-launch-stack.png
│   ├── moveAccessLogs.png
│   └── transformPartition.png
├── package-lock.json
├── package.json
└── template.yaml

/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Issue #, if available:*
2 | 
3 | *Description of changes:*
4 | 
5 | 
6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | packaged.yaml
2 | .DS_Store
3 | samconfig.toml
4 | packaged*.yaml
5 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/issues), or [recently closed](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *mainline* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. 
Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/labels/help%20wanted) issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/blob/mainline/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this
4 | software and associated documentation files (the "Software"), to deal in the Software
5 | without restriction, including without limitation the rights to use, copy, modify,
6 | merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
7 | permit persons to whom the Software is furnished to do so.
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
10 | INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
11 | PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
12 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
13 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
14 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Analyzing your Amazon CloudFront access logs at scale
2 | 
3 | This is a sample implementation for the concepts described in the AWS blog post [_Analyze your Amazon CloudFront access logs at scale_](https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/) using
4 | [AWS CloudFormation](https://aws.amazon.com/cloudformation/),
5 | [Amazon Athena](https://aws.amazon.com/athena/),
6 | [AWS Glue](https://aws.amazon.com/glue/),
7 | [AWS Lambda](https://aws.amazon.com/lambda/), and
8 | [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (S3).
9 | 
10 | This application is available in the AWS Serverless Application Repository. You can deploy it to your account from there:
11 | 
12 | [![cloudformation-launch-button](https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png)](https://serverlessrepo.aws.amazon.com/#/applications/arn:aws:serverlessrepo:us-east-1:387304072572:applications~amazon-cloudfront-access-logs-queries)
13 | 
14 | ## Overview
15 | 
16 | The application has two main parts:
17 | 
18 | - An S3 bucket `<ResourcePrefix>-<AccountId>-cf-access-logs` that serves as a log bucket for Amazon CloudFront access logs. As soon as Amazon CloudFront delivers a new access log file, an event triggers the AWS Lambda function `moveAccessLogs`. This moves the file to an [Apache Hive style](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition) prefix.
19 | 
20 | ![infrastructure-overview](images/moveAccessLogs.png)
21 | 
22 | - An hourly scheduled AWS Lambda function `transformPartition` that runs an [INSERT INTO](https://docs.aws.amazon.com/athena/latest/ug/insert-into.html) query on a single partition per run, taking one hour of data into account. It writes the content of the partition in Apache Parquet format to the `<ResourcePrefix>-<AccountId>-cf-access-logs` S3 bucket.
23 | 
24 | ![infrastructure-overview](images/transformPartition.png)
25 | 
26 | ## FAQs
27 | 
28 | ### Q: How can I get started?
29 | 
30 | Use the _Launch Stack_ button above to start the deployment of the application to your account. The AWS Management Console will guide you through the process. You can override the following parameters during deployment:
31 | 
32 | - The `NewKeyPrefix` (default: `new/`) is the S3 prefix that is used in the configuration of your Amazon CloudFront distribution for log storage. The AWS Lambda function will move the files from here.
33 | - The `GzKeyPrefix` (default: `partitioned-gz/`) and `ParquetKeyPrefix` (default: `partitioned-parquet/`) are the S3 prefixes for partitions that contain gzip or Apache Parquet files.
34 | - `ResourcePrefix` (default: `myapp`) is a prefix that is used for the S3 bucket and the AWS Glue database to prevent naming collisions.
35 | 
36 | The stack contains a single S3 bucket called `<ResourcePrefix>-<AccountId>-cf-access-logs`. After the deployment you can modify your existing Amazon CloudFront distribution configuration to deliver access logs to this bucket with the `new/` log prefix.
37 | 
38 | As soon as Amazon CloudFront delivers new access logs, files will be moved to `GzKeyPrefix`. After 1-2 hours, they will be transformed to files in `ParquetKeyPrefix`.
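
For illustration, here is how a single, hypothetical access log object travels through the bucket (the distribution ID, file name, and timestamps are made up; the name of the Parquet file is chosen by Amazon Athena):

```
new/EMLARXS9EXAMPLE.2024-05-17-09.RT4KCN4SGK9.gz
partitioned-gz/year=2024/month=05/day=17/hour=09/EMLARXS9EXAMPLE.2024-05-17-09.RT4KCN4SGK9.gz
partitioned-parquet/year=2024/month=05/day=17/hour=09/<file written by the INSERT INTO query>
```

The first move happens on delivery via the `moveAccessLogs` function; the second step happens when `transformPartition` processes this partition about two hours later.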
39 | 
40 | You can query your access logs at any time in the [Amazon Athena Query editor](https://console.aws.amazon.com/athena/home#query) using the AWS Glue view called `combined` in the database called `<ResourcePrefix>_cf_access_logs_db` (the query below uses the default `myapp` prefix):
41 | 
42 | ```sql
43 | SELECT * FROM myapp_cf_access_logs_db.combined LIMIT 10;
44 | ```
45 | 
46 | ### Q: How can I customize and deploy the template?
47 | 
48 | 1. [Fork](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries#fork-destination-box) this GitHub repository.
49 | 2. Clone the forked GitHub repository to your local machine.
50 | 3. Modify the templates.
51 | 4. Install the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/installing.html)
52 | & [AWS Serverless Application Model (SAM) CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html).
53 | 5. Validate your template:
54 | 
55 | ```sh
56 | $ sam validate -t template.yaml
57 | ```
58 | 
59 | 6. Package the files for deployment with SAM (see [SAM docs](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-deploying.html) for details) to a bucket of your choice. The bucket must be in the region you want to deploy the sample application to:
60 | 
61 | ```sh
62 | $ sam package \
63 |     --template-file template.yaml \
64 |     --output-template-file packaged.yaml \
65 |     --s3-bucket <your-bucket>
66 | ```
67 | 
68 | 7. Deploy the packaged application to your account:
69 | 
70 | ```sh
71 | $ aws cloudformation deploy \
72 |     --template-file packaged.yaml \
73 |     --stack-name my-stack \
74 |     --capabilities CAPABILITY_IAM
75 | ```
76 | 
77 | ### Q: How can I use the sample application for multiple Amazon CloudFront distributions?
78 | 
79 | If your data does not need to be partitioned by Amazon CloudFront distribution, you can use the same bucket and path (`new/`) for more than one distribution. You can then query the data by the `host` column. If you need to speed up the Parquet transformation (it must finish within the 15-minute AWS Lambda timeout) or reduce query duration, deploy another AWS CloudFormation stack from the same template for each distribution. The stack name is added to all resource names (e.g. AWS Lambda functions, S3 bucket etc.) so you can distinguish the different stacks in the AWS Management Console.
80 | 
81 | ### Q: In which region can I deploy the sample application?
82 | 
83 | The _Launch Stack_ button above opens the AWS Serverless Application Repository in the US East (N. Virginia) region (us-east-1). You may switch to other regions from there before deployment.
84 | 
85 | ### Q: How can I add a new question to this list?
86 | 
87 | If you found yourself wishing this set of frequently asked questions had an answer for a particular problem, please [submit a pull request](https://help.github.com/articles/creating-a-pull-request-from-a-fork/). The chances are good that others will also benefit from having the answer listed here.
88 | 
89 | ### Q: How can I contribute?
90 | 
91 | See the [Contributing Guidelines](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/blob/mainline/CONTRIBUTING.md) for details.
92 | 
93 | ## License Summary
94 | 
95 | This sample code is made available under the MIT-0 license. See the [LICENSE](https://github.com/aws-samples/amazon-cloudfront-access-logs-queries/blob/mainline/LICENSE) file.
96 | -------------------------------------------------------------------------------- /build/transform.py: -------------------------------------------------------------------------------- 1 | import yaml, sys 2 | from os import path 3 | 4 | 5 | def main(template, package, build_no, commit): 6 | 7 | # read template file 8 | with open(template) as input_file: 9 | template_doc = yaml.safe_load(input_file) 10 | 11 | 12 | print(template_doc) 13 | 14 | # read package file 15 | with open(package) as package_file: 16 | package_doc = yaml.safe_load(package_file) 17 | 18 | commit_short = commit[:6] 19 | sem_ver = ("%s+%s.%s" % (package_doc["version"], build_no, commit_short)) 20 | 21 | template_doc["Metadata"] = { 22 | "AWS::ServerlessRepo::Application": { 23 | "Name": package_doc["name"], 24 | "Description": package_doc["description"], 25 | "Author": package_doc["author"]["name"], 26 | "SpdxLicenseId": package_doc["license"], 27 | "HomePageUrl": package_doc["homepage"], 28 | "SourceCodeUrl": ("%s/tree/%s" % (package_doc["homepage"], commit_short)), 29 | "SemanticVersion": sem_ver 30 | } 31 | } 32 | 33 | with open(path.join(path.dirname(input_file.name), 'packaged-versioned.yaml'), 'w') as output_file: 34 | yaml.dump(template_doc, output_file, sort_keys=False) 35 | 36 | if __name__ == "__main__": 37 | template = sys.argv[1] 38 | package = sys.argv[2] 39 | build_no = sys.argv[3] 40 | commit = sys.argv[4] 41 | main(template, package, build_no, commit) 42 | -------------------------------------------------------------------------------- /buildspec.yml: -------------------------------------------------------------------------------- 1 | version: 0.2 2 | 3 | phases: 4 | build: 5 | commands: 6 | - aws cloudformation package 7 | --s3-prefix ${CODEBUILD_RESOLVED_SOURCE_VERSION} 8 | --template-file template.yaml 9 | --s3-bucket ${deploymentbucket} 10 | --output-template-file SAM-template.yaml 11 | --region us-east-1 12 | - aws s3 cp 13 | SAM-template.yaml 14 | s3://${deploymentbucket}/${CODEBUILD_RESOLVED_SOURCE_VERSION}/ 15 | --region us-east-1 16 | 17 | artifacts: 18 | type: zip 19 | files: 20 | - SAM-template.yaml 21 | -------------------------------------------------------------------------------- /functions/createPartitions.mjs: -------------------------------------------------------------------------------- 1 | // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
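// createPartitions is scheduled in template.yaml at cron(55 * * * ? *), i.e.
// five minutes before each full hour, with TABLE pointing at the
// partitioned_gz AWS Glue table. It pre-creates the partition for the
// upcoming hour so that access log files moved there by moveAccessLogs are
// immediately queryable. With hypothetical values, the generated DDL reads:
//
//   ALTER TABLE myapp_cf_access_logs_db.partitioned_gz
//   ADD IF NOT EXISTS
//   PARTITION ( year = '2024', month = '05', day = '17', hour = '11' );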
2 | // SPDX-License-Identifier: MIT-0 3 | import { runQuery } from './util.mjs' 4 | 5 | // AWS Glue Data Catalog database and table 6 | const table = process.env.TABLE; 7 | const database = process.env.DATABASE; 8 | 9 | // creates partitions for the hour after the current hour 10 | export const handler = async () => { 11 | const nextHour = new Date(Date.now() + 60 * 60 * 1000); 12 | const year = nextHour.getUTCFullYear(); 13 | const month = (nextHour.getUTCMonth() + 1).toString().padStart(2, '0'); 14 | const day = nextHour.getUTCDate().toString().padStart(2, '0'); 15 | const hour = nextHour.getUTCHours().toString().padStart(2, '0'); 16 | console.log('Creating Partition', { year, month, day, hour }); 17 | 18 | const createPartitionStatement = ` 19 | ALTER TABLE ${database}.${table} 20 | ADD IF NOT EXISTS 21 | PARTITION ( 22 | year = '${year}', 23 | month = '${month}', 24 | day = '${day}', 25 | hour = '${hour}' );`; 26 | 27 | await runQuery(createPartitionStatement); 28 | } 29 | -------------------------------------------------------------------------------- /functions/moveAccessLogs.mjs: -------------------------------------------------------------------------------- 1 | // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | // SPDX-License-Identifier: MIT-0 3 | import { S3Client } from "@aws-sdk/client-s3"; 4 | import { CopyObjectCommand } from '@aws-sdk/client-s3'; 5 | import { DeleteObjectCommand } from '@aws-sdk/client-s3'; 6 | const s3 = new S3Client(); 7 | 8 | // prefix to copy partitioned data to w/o leading but w/ trailing slash 9 | const targetKeyPrefix = process.env.TARGET_KEY_PREFIX; 10 | 11 | // regex for filenames by Amazon CloudFront access logs. Groups: 12 | // - 1. year 13 | // - 2. month 14 | // - 3. day 15 | // - 4. hour 16 | const datePattern = '[^\\d](\\d{4})-(\\d{2})-(\\d{2})-(\\d{2})[^\\d]'; 17 | const filenamePattern = '[^/]+$'; 18 | 19 | export const handler = async (event, _, callback) => { 20 | const moves = event.Records.map(record => { 21 | const bucket = record.s3.bucket.name; 22 | const sourceKey = record.s3.object.key; 23 | 24 | const sourceRegex = new RegExp(datePattern, 'g'); 25 | const match = sourceRegex.exec(sourceKey); 26 | if (match == null) { 27 | console.log(`Object key ${sourceKey} does not look like an access log file, so it will not be moved.`); 28 | } else { 29 | const [, year, month, day, hour] = match; 30 | 31 | const filenameRegex = new RegExp(filenamePattern, 'g'); 32 | const filename = filenameRegex.exec(sourceKey)[0]; 33 | 34 | const targetKey = `${targetKeyPrefix}year=${year}/month=${month}/day=${day}/hour=${hour}/${filename}`; 35 | console.log(`Copying ${sourceKey} to ${targetKey}.`); 36 | 37 | const copyParams = { 38 | CopySource: bucket + '/' + sourceKey, 39 | Bucket: bucket, 40 | Key: targetKey 41 | }; 42 | const copy = s3.send(new CopyObjectCommand(copyParams)); 43 | 44 | const deleteParams = { Bucket: bucket, Key: sourceKey }; 45 | 46 | return copy.then(function () { 47 | console.log(`Copied. 
Now deleting ${sourceKey}.`);
48 |         const del = s3.send(new DeleteObjectCommand(deleteParams));
49 |         // resolve only after the delete has actually completed
50 |         return del.then(() => console.log(`Deleted ${sourceKey}.`));
51 |       }, function (reason) {
52 |         // throw so the rejected promise fails the awaited Promise.all below
53 |         throw new Error(`Error while copying ${sourceKey}: ${reason}`);
54 |       });
55 | 
56 |     }
57 |   });
58 |   await Promise.all(moves);
59 | };
60 | 
--------------------------------------------------------------------------------
/functions/transformPartition.mjs:
--------------------------------------------------------------------------------
1 | // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | // SPDX-License-Identifier: MIT-0
3 | import { runQuery } from './util.mjs'
4 | 
5 | // AWS Glue Data Catalog database and tables
6 | const sourceTable = process.env.SOURCE_TABLE;
7 | const targetTable = process.env.TARGET_TABLE;
8 | const database = process.env.DATABASE;
9 | 
10 | // transforms the partition from two hours ago
11 | export const handler = async () => {
12 |   const partitionHour = new Date(Date.now() - 120 * 60 * 1000);
13 |   const year = partitionHour.getUTCFullYear();
14 |   const month = (partitionHour.getUTCMonth() + 1).toString().padStart(2, '0');
15 |   const day = partitionHour.getUTCDate().toString().padStart(2, '0');
16 |   const hour = partitionHour.getUTCHours().toString().padStart(2, '0');
17 | 
18 |   console.log('Transforming Partition', { year, month, day, hour });
19 | 
20 |   const insertStatement = `
21 |     INSERT INTO ${database}.${targetTable}
22 |     SELECT *
23 |     FROM ${database}.${sourceTable}
24 |     WHERE year = '${year}'
25 |         AND month = '${month}'
26 |         AND day = '${day}'
27 |         AND hour = '${hour}';`;
28 | 
29 |   await runQuery(insertStatement);
30 | }
31 | 
--------------------------------------------------------------------------------
/functions/util.mjs:
--------------------------------------------------------------------------------
1 | // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
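// util.mjs is the Athena helper shared by createPartitions.mjs and
// transformPartition.mjs: runQuery(query) starts a query execution whose
// results land under ATHENA_QUERY_RESULTS_LOCATION and resolves once the
// query has SUCCEEDED, throwing if Athena reports FAILED or CANCELLED.
// A minimal usage sketch (the statement itself is illustrative only):
//
//   import { runQuery } from './util.mjs';
//   await runQuery('SELECT 1');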
2 | // SPDX-License-Identifier: MIT-0 3 | import { AthenaClient, GetQueryExecutionCommand, StartQueryExecutionCommand } from '@aws-sdk/client-athena'; 4 | const athena = new AthenaClient(); 5 | 6 | // s3 URL of the query results (without trailing slash) 7 | const athenaQueryResultsLocation = process.env.ATHENA_QUERY_RESULTS_LOCATION; 8 | 9 | async function waitForQueryExecution(queryExecutionId) { 10 | while (true) { 11 | const data = await athena.send(new GetQueryExecutionCommand({ 12 | QueryExecutionId: queryExecutionId 13 | })); 14 | const state = data.QueryExecution.Status.State; 15 | if (state === 'SUCCEEDED') { 16 | return; 17 | } else if (state === 'FAILED' || state === 'CANCELLED') { 18 | throw Error(`Query ${queryExecutionId} failed: ${data.QueryExecution.Status.StateChangeReason}`); 19 | } 20 | await new Promise(resolve => setTimeout(resolve, 100)); 21 | } 22 | } 23 | 24 | 25 | export function runQuery(query) { 26 | const params = { 27 | QueryString: query, 28 | ResultConfiguration: { 29 | OutputLocation: athenaQueryResultsLocation 30 | } 31 | }; 32 | return athena.send(new StartQueryExecutionCommand(params)) 33 | .then(data => waitForQueryExecution(data.QueryExecutionId)); 34 | } 35 | -------------------------------------------------------------------------------- /images/cloudformation-launch-stack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-cloudfront-access-logs-queries/23a6265a09671ae40b4bebb9a4ef71beed56a0bd/images/cloudformation-launch-stack.png -------------------------------------------------------------------------------- /images/moveAccessLogs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-cloudfront-access-logs-queries/23a6265a09671ae40b4bebb9a4ef71beed56a0bd/images/moveAccessLogs.png -------------------------------------------------------------------------------- /images/transformPartition.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-cloudfront-access-logs-queries/23a6265a09671ae40b4bebb9a4ef71beed56a0bd/images/transformPartition.png -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "amazon-cloudfront-access-logs-queries", 3 | "version": "1.1.1", 4 | "lockfileVersion": 3, 5 | "requires": true, 6 | "packages": { 7 | "": { 8 | "name": "amazon-cloudfront-access-logs-queries", 9 | "version": "1.1.1", 10 | "license": "MIT-0" 11 | } 12 | } 13 | } 14 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "amazon-cloudfront-access-logs-queries", 3 | "version": "1.1.1", 4 | "description": "This is a sample implementation for the concepts described in the AWS blog post \"Analyzing your Amazon CloudFront access logs at scale\".", 5 | "author": { 6 | "name": "Steffen Grunwald" 7 | }, 8 | "license": "MIT-0", 9 | "homepage": "https://github.com/aws-samples/amazon-cloudfront-access-logs-queries" 10 | } 11 | -------------------------------------------------------------------------------- /template.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 
'2010-09-09'
2 | Transform: AWS::Serverless-2016-10-31
3 | Description: >
4 |   Stack that deploys a bucket which you can use as a target for your
5 |   Amazon CloudFront access logs (use the prefix 'new/'). An event
6 |   notification is configured so that new objects created will fire an
7 |   AWS Lambda function that moves the objects to prefixes (under
8 |   'partitioned-gz/') that adhere to the Apache Hive partitioning format.
9 |   This way the data is easier to consume for big data tools (such as
10 |   Amazon Athena and AWS Glue).
11 | 
12 |   Read the blog post
13 |   https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/
14 |   for further details about this sample asset (uksb-1tupboc62).
15 | 
16 | Parameters:
17 |   ResourcePrefix:
18 |     Type: String
19 |     Default: 'myapp'
20 |     AllowedPattern: '[a-z0-9]+'
21 |     MinLength: 1
22 |     MaxLength: 20
23 |     Description: >
24 |       Prefix that is used for the created resources (20 chars, a-z and 0-9 only)
25 |   NewKeyPrefix:
26 |     Type: String
27 |     Default: 'new/'
28 |     AllowedPattern: '[A-Za-z0-9\-]+/'
29 |     Description: >
30 |       Prefix of new access log files that are written by Amazon CloudFront.
31 |       Including the trailing slash.
32 |   GzKeyPrefix:
33 |     Type: String
34 |     Default: 'partitioned-gz/'
35 |     AllowedPattern: '[A-Za-z0-9\-]+/'
36 |     Description: >
37 |       Prefix of gzip'ed access log files that are moved to the Apache Hive-like
38 |       style. Including the trailing slash.
39 |   ParquetKeyPrefix:
40 |     Type: String
41 |     Default: 'partitioned-parquet/'
42 |     AllowedPattern: '[A-Za-z0-9\-]+/'
43 |     Description: >
44 |       Prefix of parquet files that are created in Apache Hive-like
45 |       style by the INSERT INTO query. Including the trailing slash.
46 | Resources:
47 |   TransformPartFn:
48 |     Type: AWS::Serverless::Function
49 |     Properties:
50 |       CodeUri: functions/
51 |       Handler: transformPartition.handler
52 |       Runtime: nodejs20.x
53 |       Timeout: 900
54 |       Policies:
55 |         - Version: '2012-10-17'
56 |           Statement:
57 |             - Effect: Allow
58 |               Action:
59 |                 - athena:StartQueryExecution
60 |                 - athena:GetQueryExecution
61 |               Resource: '*'
62 |             - Effect: Allow
63 |               Action:
64 |                 - s3:ListBucket
65 |                 - s3:GetBucketLocation
66 |               Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs"
67 |             - Effect: Allow
68 |               Action:
69 |                 - s3:PutObject
70 |                 - s3:GetObject
71 |               Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/*"
72 |             - Effect: Allow
73 |               Action:
74 |                 - glue:CreatePartition
75 |                 - glue:GetDatabase
76 |                 - glue:GetTable
77 |                 - glue:BatchCreatePartition
78 |                 - glue:GetPartition
79 |                 - glue:GetPartitions
80 |                 - glue:CreateTable
81 |                 - glue:DeleteTable
82 |                 - glue:DeletePartition
83 |               Resource: '*'
84 |       Environment:
85 |         Variables:
86 |           SOURCE_TABLE: !Ref PartitionedGzTable
87 |           TARGET_TABLE: !Ref PartitionedParquetTable
88 |           DATABASE: !Ref CfLogsDatabase
89 |           ATHENA_QUERY_RESULTS_LOCATION: !Sub "s3://${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/athena-query-results"
90 |       Events:
91 |         HourlyEvt:
92 |           Type: Schedule
93 |           Properties:
94 |             Schedule: cron(1 * * * ?
*) 95 | CreatePartFn: 96 | Type: AWS::Serverless::Function 97 | Properties: 98 | CodeUri: functions/ 99 | Handler: createPartitions.handler 100 | Runtime: nodejs20.x 101 | Timeout: 5 102 | Policies: 103 | - Version: '2012-10-17' 104 | Statement: 105 | - Effect: Allow 106 | Action: 107 | - athena:StartQueryExecution 108 | - athena:GetQueryExecution 109 | Resource: '*' 110 | - Effect: Allow 111 | Action: 112 | - s3:ListBucket 113 | - s3:GetBucketLocation 114 | Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs" 115 | - Effect: Allow 116 | Action: 117 | - s3:PutObject 118 | Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/*" 119 | - Effect: Allow 120 | Action: 121 | - glue:CreatePartition 122 | - glue:GetDatabase 123 | - glue:GetTable 124 | - glue:BatchCreatePartition 125 | Resource: '*' 126 | Environment: 127 | Variables: 128 | TABLE: !Ref PartitionedGzTable 129 | DATABASE: !Ref CfLogsDatabase 130 | ATHENA_QUERY_RESULTS_LOCATION: !Sub "s3://${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/athena-query-results" 131 | Events: 132 | HourlyEvt: 133 | Type: Schedule 134 | Properties: 135 | Schedule: cron(55 * * * ? *) 136 | MoveNewAccessLogsFn: 137 | Type: AWS::Serverless::Function 138 | Properties: 139 | CodeUri: functions/ 140 | Handler: moveAccessLogs.handler 141 | Runtime: nodejs20.x 142 | Timeout: 30 143 | Policies: 144 | - Version: '2012-10-17' 145 | Statement: 146 | - Effect: Allow 147 | Action: 148 | - s3:GetObject 149 | - s3:DeleteObject 150 | Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/${NewKeyPrefix}*" 151 | - Effect: Allow 152 | Action: 153 | - s3:PutObject 154 | Resource: !Sub "arn:${AWS::Partition}:s3:::${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/${GzKeyPrefix}*" 155 | Environment: 156 | Variables: 157 | TARGET_KEY_PREFIX: !Ref GzKeyPrefix 158 | Events: 159 | AccessLogsUploadedEvent: 160 | Type: S3 161 | Properties: 162 | Bucket: !Ref CloudFrontAccessLogsBucket 163 | Events: s3:ObjectCreated:* 164 | Filter: 165 | S3Key: 166 | Rules: 167 | - Name: prefix 168 | Value: !Ref NewKeyPrefix 169 | CloudFrontAccessLogsBucket: 170 | Type: "AWS::S3::Bucket" 171 | Description: "Bucket for Amazon CloudFront access logs" 172 | Properties: 173 | BucketName: !Sub "${ResourcePrefix}-${AWS::AccountId}-cf-access-logs" 174 | LifecycleConfiguration: 175 | Rules: 176 | - Id: ExpireAthenaQueryResults 177 | Prefix: athena-query-results/ 178 | Status: Enabled 179 | ExpirationInDays: 1 180 | BucketEncryption: 181 | ServerSideEncryptionConfiguration: 182 | - ServerSideEncryptionByDefault: 183 | SSEAlgorithm: AES256 184 | PublicAccessBlockConfiguration: 185 | BlockPublicAcls: Yes 186 | BlockPublicPolicy: Yes 187 | IgnorePublicAcls: Yes 188 | RestrictPublicBuckets: Yes 189 | CloudFrontAccessLogsBucketPolicy: 190 | Type: "AWS::S3::BucketPolicy" 191 | Properties: 192 | Bucket: !Ref CloudFrontAccessLogsBucket 193 | PolicyDocument: 194 | Version: "2012-10-17" 195 | Statement: 196 | - Effect: Deny 197 | Principal: "*" 198 | Action: s3:* 199 | Resource: 200 | - !Sub "${CloudFrontAccessLogsBucket.Arn}" 201 | - !Sub "${CloudFrontAccessLogsBucket.Arn}/*" 202 | Condition: 203 | Bool: 204 | "aws:SecureTransport": "false" 205 | 206 | # Glue Resources 207 | # - Database 208 | # - Partitioned Gzip Table 209 | # - Partitioned Parquet Table 210 | # - Combined view of both tables 211 | 212 | CfLogsDatabase: 213 | Type: AWS::Glue::Database 214 | Properties: 215 | CatalogId: !Ref 
AWS::AccountId
216 |       DatabaseInput:
217 |         Name: !Sub "${ResourcePrefix}_cf_access_logs_db"
218 |   PartitionedGzTable:
219 |     Type: AWS::Glue::Table
220 |     Properties:
221 |       CatalogId: !Ref AWS::AccountId
222 |       DatabaseName: !Ref CfLogsDatabase
223 |       TableInput:
224 |         Name: 'partitioned_gz'
225 |         Description: 'Gzip logs delivered by Amazon CloudFront partitioned'
226 |         TableType: EXTERNAL_TABLE
227 |         Parameters: { "skip.header.line.count": "2" }
228 |         PartitionKeys:
229 |           - Name: year
230 |             Type: string
231 |           - Name: month
232 |             Type: string
233 |           - Name: day
234 |             Type: string
235 |           - Name: hour
236 |             Type: string
237 |         StorageDescriptor:
238 |           OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
239 |           Columns:
240 |             - Name: date
241 |               Type: date
242 |             - Name: time
243 |               Type: string
244 |             - Name: location
245 |               Type: string
246 |             - Name: bytes
247 |               Type: bigint
248 |             - Name: request_ip
249 |               Type: string
250 |             - Name: method
251 |               Type: string
252 |             - Name: host
253 |               Type: string
254 |             - Name: uri
255 |               Type: string
256 |             - Name: status
257 |               Type: int
258 |             - Name: referrer
259 |               Type: string
260 |             - Name: user_agent
261 |               Type: string
262 |             - Name: query_string
263 |               Type: string
264 |             - Name: cookie
265 |               Type: string
266 |             - Name: result_type
267 |               Type: string
268 |             - Name: request_id
269 |               Type: string
270 |             - Name: host_header
271 |               Type: string
272 |             - Name: request_protocol
273 |               Type: string
274 |             - Name: request_bytes
275 |               Type: bigint
276 |             - Name: time_taken
277 |               Type: float
278 |             - Name: xforwarded_for
279 |               Type: string
280 |             - Name: ssl_protocol
281 |               Type: string
282 |             - Name: ssl_cipher
283 |               Type: string
284 |             - Name: response_result_type
285 |               Type: string
286 |             - Name: http_version
287 |               Type: string
288 |             - Name: fle_status
289 |               Type: string
290 |             - Name: fle_encrypted_fields
291 |               Type: int
292 |             - Name: c_port
293 |               Type: int
294 |             - Name: time_to_first_byte
295 |               Type: float
296 |             - Name: x_edge_detailed_result_type
297 |               Type: string
298 |             - Name: sc_content_type
299 |               Type: string
300 |             - Name: sc_content_len
301 |               Type: bigint
302 |             - Name: sc_range_start
303 |               Type: bigint
304 |             - Name: sc_range_end
305 |               Type: bigint
306 |           InputFormat: org.apache.hadoop.mapred.TextInputFormat
307 |           Location: !Sub "s3://${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/${GzKeyPrefix}"
308 |           SerdeInfo:
309 |             Parameters:
310 |               field.delim: "\t"
311 |               serialization.format: "\t"
312 |             SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
313 |   PartitionedParquetTable:
314 |     Type: AWS::Glue::Table
315 |     Properties:
316 |       CatalogId: !Ref AWS::AccountId
317 |       DatabaseName: !Ref CfLogsDatabase
318 |       TableInput:
319 |         Name: 'partitioned_parquet'
320 |         Description: 'Parquet format access logs as transformed from gzip version'
321 |         TableType: EXTERNAL_TABLE
322 |         Parameters: { 'has_encrypted_data': 'false', 'parquet.compression': 'SNAPPY' }
323 |         PartitionKeys:
324 |           - Name: year
325 |             Type: string
326 |           - Name: month
327 |             Type: string
328 |           - Name: day
329 |             Type: string
330 |           - Name: hour
331 |             Type: string
332 |         StorageDescriptor:
333 |           OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
334 |           Columns:
335 |             - Name: date
336 |               Type: date
337 |             - Name: time
338 |               Type: string
339 |             - Name: location
340 |               Type: string
341 |             - Name: bytes
342 |               Type: bigint
343 |             - Name: request_ip
344 |               Type: string
345 |             - Name: method
346 |               Type: string
347 |             - Name: host
348 |               Type: string
349 |             - Name: uri
350 |               Type: string
351 | 
- Name: status 352 | Type: int 353 | - Name: referrer 354 | Type: string 355 | - Name: user_agent 356 | Type: string 357 | - Name: query_string 358 | Type: string 359 | - Name: cookie 360 | Type: string 361 | - Name: result_type 362 | Type: string 363 | - Name: request_id 364 | Type: string 365 | - Name: host_header 366 | Type: string 367 | - Name: request_protocol 368 | Type: string 369 | - Name: request_bytes 370 | Type: bigint 371 | - Name: time_taken 372 | Type: float 373 | - Name: xforwarded_for 374 | Type: string 375 | - Name: ssl_protocol 376 | Type: string 377 | - Name: ssl_cipher 378 | Type: string 379 | - Name: response_result_type 380 | Type: string 381 | - Name: http_version 382 | Type: string 383 | - Name: fle_status 384 | Type: string 385 | - Name: fle_encrypted_fields 386 | Type: int 387 | - Name: c_port 388 | Type: int 389 | - Name: time_to_first_byte 390 | Type: float 391 | - Name: x_edge_detailed_result_type 392 | Type: string 393 | - Name: sc_content_type 394 | Type: string 395 | - Name: sc_content_len 396 | Type: bigint 397 | - Name: sc_range_start 398 | Type: bigint 399 | - Name: sc_range_end 400 | Type: bigint 401 | InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat 402 | Location: !Sub "s3://${ResourcePrefix}-${AWS::AccountId}-cf-access-logs/${ParquetKeyPrefix}" 403 | SerdeInfo: 404 | SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe 405 | CombinedView: 406 | Type: AWS::Glue::Table 407 | Properties: 408 | CatalogId: !Ref AWS::AccountId 409 | DatabaseName: !Ref CfLogsDatabase 410 | TableInput: 411 | Name: 'combined' 412 | Description: 'combined view over gzip and parquet tables' 413 | TableType: VIRTUAL_VIEW 414 | Parameters: { 'presto_view': 'true' } 415 | PartitionKeys: [] 416 | StorageDescriptor: 417 | Columns: 418 | - Name: date 419 | Type: date 420 | - Name: time 421 | Type: string 422 | - Name: location 423 | Type: string 424 | - Name: bytes 425 | Type: bigint 426 | - Name: request_ip 427 | Type: string 428 | - Name: method 429 | Type: string 430 | - Name: host 431 | Type: string 432 | - Name: uri 433 | Type: string 434 | - Name: status 435 | Type: int 436 | - Name: referrer 437 | Type: string 438 | - Name: user_agent 439 | Type: string 440 | - Name: query_string 441 | Type: string 442 | - Name: cookie 443 | Type: string 444 | - Name: result_type 445 | Type: string 446 | - Name: request_id 447 | Type: string 448 | - Name: host_header 449 | Type: string 450 | - Name: request_protocol 451 | Type: string 452 | - Name: request_bytes 453 | Type: bigint 454 | - Name: time_taken 455 | Type: float 456 | - Name: xforwarded_for 457 | Type: string 458 | - Name: ssl_protocol 459 | Type: string 460 | - Name: ssl_cipher 461 | Type: string 462 | - Name: response_result_type 463 | Type: string 464 | - Name: http_version 465 | Type: string 466 | - Name: fle_status 467 | Type: string 468 | - Name: fle_encrypted_fields 469 | Type: int 470 | - Name: c_port 471 | Type: int 472 | - Name: time_to_first_byte 473 | Type: float 474 | - Name: x_edge_detailed_result_type 475 | Type: string 476 | - Name: sc_content_type 477 | Type: string 478 | - Name: sc_content_len 479 | Type: bigint 480 | - Name: sc_range_start 481 | Type: bigint 482 | - Name: sc_range_end 483 | Type: bigint 484 | - Name: year 485 | Type: string 486 | - Name: month 487 | Type: string 488 | - Name: day 489 | Type: string 490 | - Name: hour 491 | Type: string 492 | - Name: file 493 | Type: string 494 | SerdeInfo: {} 495 | ViewOriginalText: 496 | Fn::Join: 497 | - '' 
498 | - - '/* Presto View: ' 499 | - Fn::Base64: 500 | Fn::Sub: 501 | - |- 502 | { 503 | "originalSql": "SELECT *, \"$path\" as file FROM ${database}.${partitioned_gz_table} WHERE (concat(year, month, day, hour) >= date_format(date_trunc('hour', ((current_timestamp - INTERVAL '15' MINUTE) - INTERVAL '1' HOUR)), '%Y%m%d%H')) UNION ALL SELECT *, \"$path\" as file FROM ${database}.${partitioned_parquet_table} WHERE (concat(year, month, day, hour) < date_format(date_trunc('hour', ((current_timestamp - INTERVAL '15' MINUTE) - INTERVAL '1' HOUR)), '%Y%m%d%H'))", 504 | "catalog": "awsdatacatalog", 505 | "schema": "${database}", 506 | "columns": [ 507 | {"name": "date", "type": "date"}, 508 | {"name": "time", "type": "varchar"}, 509 | {"name": "location", "type": "varchar"}, 510 | {"name": "bytes", "type": "bigint"}, 511 | {"name": "request_ip", "type": "varchar"}, 512 | {"name": "method", "type": "varchar"}, 513 | {"name": "host", "type": "varchar"}, 514 | {"name": "uri", "type": "varchar"}, 515 | {"name": "status", "type": "integer"}, 516 | {"name": "referrer", "type": "varchar"}, 517 | {"name": "user_agent", "type": "varchar"}, 518 | {"name": "query_string", "type": "varchar"}, 519 | {"name": "cookie", "type": "varchar"}, 520 | {"name": "result_type", "type": "varchar"}, 521 | {"name": "request_id", "type": "varchar"}, 522 | {"name": "host_header", "type": "varchar"}, 523 | {"name": "request_protocol", "type": "varchar"}, 524 | {"name": "request_bytes", "type": "bigint"}, 525 | {"name": "time_taken", "type": "real"}, 526 | {"name": "xforwarded_for", "type": "varchar"}, 527 | {"name": "ssl_protocol", "type": "varchar"}, 528 | {"name": "ssl_cipher", "type": "varchar"}, 529 | {"name": "response_result_type", "type": "varchar"}, 530 | {"name": "http_version", "type": "varchar"}, 531 | {"name": "fle_status", "type": "varchar"}, 532 | {"name": "fle_encrypted_fields", "type": "integer"}, 533 | {"name": "c_port", "type": "integer"}, 534 | {"name": "time_to_first_byte", "type": "real"}, 535 | {"name": "x_edge_detailed_result_type", "type": "varchar"}, 536 | {"name": "sc_content_type", "type": "varchar"}, 537 | {"name": "sc_content_len", "type": "bigint"}, 538 | {"name": "sc_range_start", "type": "bigint"}, 539 | {"name": "sc_range_end", "type": "bigint"}, 540 | {"name": "year", "type": "varchar"}, 541 | {"name": "month", "type": "varchar"}, 542 | {"name": "day", "type": "varchar"}, 543 | {"name": "hour", "type": "varchar"}, 544 | {"name": "file", "type": "varchar"} 545 | ] 546 | } 547 | - { database: !Ref CfLogsDatabase, 548 | partitioned_gz_table: !Ref PartitionedGzTable, 549 | partitioned_parquet_table: !Ref PartitionedParquetTable } 550 | - ' */' 551 | --------------------------------------------------------------------------------