├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── cfn └── samples3syncenv.template.yaml ├── images ├── awsBatchS3SyncArch.png ├── batchComputeEnvironment.png ├── batchJobDef.png ├── batchJobQueue.png ├── cloudTrailS3DataEvents.png └── s3InputSample.png └── src ├── Dockerfile └── s3CopySyncScript.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.swp 3 | package-lock.json 4 | .pytest_cache 5 | *.egg-info 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # Environments 13 | .env 14 | .venv 15 | env/ 16 | venv/ 17 | ENV/ 18 | env.bak/ 19 | venv.bak/ 20 | 21 | # CDK Context & Staging files 22 | .cdk.staging/ 23 | cdk.out/ 24 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. 
Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | - [Using AWS Batch jobs to bulk copy/sync files in S3](#using-aws-batch-jobs-to-bulk-copysync-files-in-s3) 4 | - [Use Cases](#use-cases) 5 | - [Overview](#overview) 6 | - [Architecture](#architecture) 7 | - [Demo Environment](#demo-environment) 8 | - [Prerequisites](#prerequisites) 9 | - [Cloudformation Template](#cloudformation-template) 10 | - [ECR Image](#ecr-image) 11 | - [S3 Input Location](#s3-input-location) 12 | - [CSV File Format Details](#csv-file-format-details) 13 | - [Cloudtrail Monitoring](#cloudtrail-monitoring) 14 | - [AWS Batch](#aws-batch) 15 | - [Compute Environment](#compute-environment) 16 | - [Job Queue](#job-queue) 17 | - [Job Definition](#job-definition) 18 | - [IAM Roles](#iam-roles) 19 | - [Job Invocation via an Eventbridge Rule](#job-invocation-via-an-eventbridge-rule) 20 | 21 | 22 | 23 | 24 | # Using AWS Batch jobs to bulk copy/sync files in S3 25 | 26 | ## Use Cases 27 | This utility is aimed at technical teams who are looking for a scalable way to copy/sync files within S3. In certain cases, teams may want to replicate certain "prod" data into "dev" to have more robust development environments. Teams may also look to transfer data to other buckets to share with external teams or publicly. In both scenarios, having control over what data is copied/synced becomes paramount to ensure sensitive data does not make its way to an environment that could cause security concerns. 28 | 29 | An existing solution is [S3 replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html), but that comes with certain limitations. Managing replication rules across multiple directories can get cumbersome, the same prefix must be kept, source and destination must be different buckets, and already existing data cannot be backfilled using replication. As a workaround, teams have had to manage their own environments to run custom scripts built on the AWS SDK to overcome such roadblocks. 30 | 31 | 32 | ## Overview 33 | This guide details how to use [AWS Batch](https://aws.amazon.com/batch/) to perform bulk copy/sync activities on files in [S3](https://aws.amazon.com/s3/). Batch allows users to run massively scalable computing jobs on AWS by provisioning the optimal compute resources needed. The user is able to focus on configuring what the job should be doing instead of provisioning infrastructure. 34 | 35 | Here we use Batch's [Managed Compute environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html#managed_compute_environments) with the [Fargate provisioning model](https://docs.aws.amazon.com/batch/latest/userguide/fargate.html), where Batch runs containers without the user having to manage the underlying EC2 instances. We will package our application within an [ECR](https://aws.amazon.com/ecr/) container image that Fargate will use to launch the environment where the job runs. 36 | 37 | With Batch configured, the user uploads a .csv file to an S3 location that has its events logged in [Cloudtrail](https://aws.amazon.com/cloudtrail/). That location is monitored by [Eventbridge](https://aws.amazon.com/eventbridge/), which kicks off the Batch job once the appropriate file is uploaded. The .csv file contains a list of S3 source/destination pairs to be copied/synced in the job as detailed below. For accessing S3 resources in different AWS accounts, be sure to look at the [IAM Roles section below](#iam-roles). 38 | 
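To make the flow concrete, here is a minimal sketch of what kicking off a run could look like once the environment described below is in place: build the two-column CSV of source/destination pairs and drop it in the monitored location. The bucket name is a placeholder, and the `input/` prefix and `s3_batch_sync_input-` file naming follow the conventions described in the [S3 Input Location](#s3-input-location) section.

```python
import csv
import io

import boto3

# Placeholder names: substitute your own bucket and S3 paths.
INPUT_BUCKET = "my-s3-copy-sync-bucket"
INPUT_KEY = "input/s3_batch_sync_input-example.csv"

rows = [
    ["source", "destination"],  # header row; the sample job definition passes True for the header flag
    ["s3://some-bucket/source/", "s3://some-bucket/dest/"],
    ["s3://some-bucket/sourceindiv/individualFile.txt", "s3://some-bucket/dest/individualFile.txt"],
]

# Write the CSV in memory and upload it. The resulting PutObject data event is what
# Cloudtrail records and what the Eventbridge rule described later matches on.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
boto3.client("s3").put_object(Bucket=INPUT_BUCKET, Key=INPUT_KEY, Body=buf.getvalue().encode("utf-8"))
```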
39 | ## Architecture 40 | This is an overview of the architecture described above: 41 | 42 | ![awsBatchS3SyncArch](images/awsBatchS3SyncArch.png) 43 | 44 | ## Demo Environment 45 | 46 | ### Prerequisites 47 | **Default (or other) VPC/Subnet/Security Group** 48 | * To set up the environment from the cloudformation template, you will need a VPC/subnets to launch fargate resources in. A default VPC is suitable for demo purposes. 49 | * AWS accounts created after `2013-12-04` have a default VPC created. Otherwise, you can [follow these instructions](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-vpc) to create it. 50 | * Make note of the subnet IDs and security group IDs, or retrieve them via the CLI 51 | - Default VPC ID 52 | + `aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]"` 53 | - Get all Subnet IDs for the default VPC 54 | + `aws ec2 describe-subnets --filters Name=vpc-id,Values=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]" ) --query "Subnets[].SubnetId"` 55 | - Get all Security Group IDs for the default VPC 56 | + `aws ec2 describe-security-groups --filters Name=vpc-id,Values=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]" ) --query "SecurityGroups[].GroupId"` 57 | 58 | ### Cloudformation Template 59 | * Create a stack from the cloudformation template in the `cfn` directory. This creates an environment identical to the one in this guide. 60 | * From the commands above, you will need to pass subnet ID(s) and security group ID(s) as parameters to create the stack. 61 | * Once creation is complete, you can follow the rest of the guide below, namely: 62 | - You can build and push the Docker image to ECR. 63 | - Once pushed, you can run Batch sync/copy jobs for objects within the S3 bucket the cfn creates. 64 | 65 | ## ECR Image 66 | The ECR image contains our application logic to sync/copy S3 files based on the csv input. This is done in the python script `s3CopySyncScript.py`. Based on the CSV input, it will perform a [managed transfer using the copy API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy) if a file is given as a source/destination. If a prefix is given as source/destination, it will use the [AWS CLI to perform an aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html). 67 | 68 | The `Dockerfile` builds an image based on [AL2](https://hub.docker.com/_/amazonlinux), installing the AWS CLI, python3, and boto3, and setting other S3 configuration for optimal transfers. 69 | 70 | Create an ECR repository and use these commands to build and push the image as `latest` using the CLI. 71 | 72 | ``` 73 | aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com 74 | 75 | docker build -t <<ECR_REPO_NAME>> . 76 | 77 | docker tag <<ECR_REPO_NAME>>:latest <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com/<<ECR_REPO_NAME>>:latest 78 | 79 | docker push <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com/<<ECR_REPO_NAME>>:latest 80 | ``` 81 | 
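As a side note on transfer tuning: the `aws configure set default.s3.*` values baked into the image apply to the CLI-driven `aws s3 sync` path, while boto3's managed copy uses its own transfer defaults. If you want comparable tuning on the copy path, a `TransferConfig` can be passed explicitly. A minimal sketch of what that could look like (bucket and key names are placeholders; the script in `src/` currently relies on boto3's defaults):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource("s3")

# Roughly mirrors the CLI settings set in the Dockerfile:
# multipart_threshold 64MB, multipart_chunksize 16MB, max_concurrent_requests 20.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=20,
)

s3.meta.client.copy(
    CopySource={"Bucket": "example-source-bucket", "Key": "source/path/file.txt"},
    Bucket="example-dest-bucket",
    Key="dest/path/file.txt",
    ExtraArgs={"ACL": "bucket-owner-full-control"},
    Config=config,
)
```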
82 | ## S3 Input Location 83 | The `.csv` file will be uploaded to S3 to kick off the job. Designate an area for this. In this example, I've created a sample bucket with the prefix `input` and will be uploading files here. Notice also that the uploaded csv file name is prefixed as such: `s3_batch_sync_input-*.csv`. Using a naming convention like this can simplify Eventbridge monitoring, as we'll see below. 84 | 85 | ![s3InputSample](images/s3InputSample.png) 86 | 87 | ### CSV File Format Details 88 | The csv file should be formatted as follows. 89 | 90 | | source | destination | 91 | |-------------------------------------------------|------------------------------------------| 92 | | s3://some-bucket/source/ | s3://some-bucket/dest/ | 93 | | s3://some-bucket/sourcetwo/ | s3://some-bucket/desttwo/ | 94 | | s3://some-bucket/sourceindiv/individualFile.txt | s3://some-bucket/dest/individualFile.txt | 95 | 96 | The first 2 rows in this example are prefixes where we want an s3 sync to occur. The last row is a specific object we want to copy directly. The bucket/AWS account does not need to be the same, as long as IAM permissions are [properly applied as noted below](#iam-roles). 97 | 98 | ## Cloudtrail Monitoring 99 | Set up a trail that will monitor the S3 location where the input file will land. In setting up the trail, you can set these options as you see fit: a name, an S3 location for the logs, log encryption using KMS, and log validation. At a minimum, you will need to make sure that S3 data events are enabled for the location where the input file lands: 100 | 101 | ![cloudTrailS3DataEvents](images/cloudTrailS3DataEvents.png) 102 | 103 | 104 | ## AWS Batch 105 | 106 | ### Compute Environment 107 | This will serve as a pool that our batch jobs can pull resources from. Create a `managed environment` with the instance configuration set to `fargate` for this example. Set the max vCPUs value to put an upper limit on the concurrent Fargate resources being used. Other configuration options for an AWS Batch [Compute Environment are detailed here](https://docs.aws.amazon.com/batch/latest/userguide/create-compute-environment.html). 108 | Lastly, pick the VPC and subnets your environment will be located in, and any security groups that need to be attached to instances. If you're using S3 VPC gateway endpoints, this choice is key. In our example, we're using the default VPC since we're accessing S3 through the public internet. 109 | Once complete, the environment state will be `ENABLED`. 110 | 111 | ![batchComputeEnvironment](images/batchComputeEnvironment.png) 112 | 113 | 114 | ### Job Queue 115 | AWS Batch jobs are submitted to a job queue, where they wait until compute environment resources are available for the job to run. You can have multiple queues with different priorities which pull from different compute environments. [More details are here](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html). 116 | For this example, create a queue and attach it to the compute environment previously made. 117 | 118 | ![batchJobQueue](images/batchJobQueue.png) 119 | 
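With the compute environment and job queue in place, you can optionally confirm that both are healthy before moving on to the job definition. A minimal boto3 sketch (the environment and queue names are placeholders; use the ones you created):

```python
import boto3

batch = boto3.client("batch")

# Both calls return an empty list if the given name doesn't exist in this account/region.
envs = batch.describe_compute_environments(computeEnvironments=["s3-sync-compute-env"])
queues = batch.describe_job_queues(jobQueues=["s3-sync-batch-job-queue"])

for env in envs["computeEnvironments"]:
    print(env["computeEnvironmentName"], env["state"], env["status"])  # expect ENABLED / VALID

for queue in queues["jobQueues"]:
    print(queue["jobQueueName"], queue["state"], queue["status"])  # expect ENABLED / VALID
```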
120 | ### Job Definition 121 | The Job Definition acts as a template from which to launch our individual jobs. [Detailed instructions are here for additional configuration](https://docs.aws.amazon.com/batch/latest/userguide/create-job-definition.html). 122 | 123 | * **Basics** 124 | - Enter a `name` for the template and pick `fargate` for the platform. 125 | - Depending on your anticipated use, set a `retry strategy` and `timeout`. For this example we set 2 job attempts and a timeout of 120 seconds. 126 | - We'll also put job logs in the default AWS Batch log group in CloudWatch, but this [can be customized as detailed here](https://docs.aws.amazon.com/batch/latest/userguide/using_awslogs.html). 127 | 128 | * **Python Script Usage** 129 | - Note the usage of the script described here, then set the container properties in the next step as required. 130 | - Syntax: `python s3CopySyncScript.py <input_s3_bucket> <input_csv_key> <header> <sync_delete>` 131 | + *header* indicates whether the input csv has a header row 132 | + *sync_delete* indicates whether the [--delete flag is used in case of an aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) 133 | + E.g.: `python s3CopySyncScript.py my-s3-bucket s3_batch_sync_input-my-sample.csv True True` 134 | 135 | * **Container Properties** 136 | - In the `image` box, put the URI of the ECR image that was created. 137 | - The `Command` is used as the [CMD instruction to execute our container](https://docs.docker.com/engine/reference/builder/#cmd). In our case, we want to execute the python script and pass it our input file details. 138 | + In JSON form we enter: `["python3","s3CopySyncScript.py","Ref::s3_bucket","Ref::s3_key", "True", "True"]` 139 | * In this example, I have a header in the input and am using the --delete flag for an aws s3 sync 140 | - For `vCPUs` and `memory`, we set 1 vCPU and 2 GB to be conservative for this example. Set these as needed. 141 | - `Job Role` and `Execution Role` are detailed below. 142 | - We ticked the box for `assign public IP` since we're accessing S3 through the public internet and are using `Fargate platform version 1.4.0`. 143 | 144 | * **Parameters** 145 | - In the python command above, notice the `"Ref::s3_bucket","Ref::s3_key"`. These are parameters to be substituted when a job is invoked through Eventbridge. 146 | - In this section, we could set defaults for them or other parameters. [See more details here](https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html). 147 | 148 | ![batchJobDef](images/batchJobDef.png) 149 | 
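Although the Eventbridge rule described later submits jobs automatically, the job definition can also be exercised by hand, which is handy for testing the `Ref::` parameter substitution. A minimal boto3 sketch (the job, queue, definition, bucket, and key names are placeholders):

```python
import boto3

batch = boto3.client("batch")

# The two parameters feed the Ref::s3_bucket / Ref::s3_key placeholders in the
# job definition's command, just as the Eventbridge input transformer does below.
response = batch.submit_job(
    jobName="s3-batch-sync-manual-test",
    jobQueue="s3-sync-batch-job-queue",        # placeholder: your queue name
    jobDefinition="s3-sync-batch-job-def",     # placeholder: your job definition name
    parameters={
        "s3_bucket": "my-s3-copy-sync-bucket",
        "s3_key": "input/s3_batch_sync_input-example.csv",
    },
)
print(response["jobId"])
```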
150 | #### IAM Roles 151 | 152 | **Execution Role** 153 | The Execution Role is used to set up the individual ECS tasks where the Batch jobs run and for logging. The role should have a trust relationship with `ecs-tasks.amazonaws.com`. In our example, the AWS managed policy `AmazonECSTaskExecutionRolePolicy` is attached along with an inline policy giving it permission to create log groups if needed. 154 | ```json 155 | { 156 | "Version": "2012-10-17", 157 | "Statement": [ 158 | { 159 | "Effect": "Allow", 160 | "Action": [ 161 | "logs:CreateLogGroup", 162 | "logs:CreateLogStream", 163 | "logs:PutLogEvents", 164 | "logs:DescribeLogStreams" 165 | ], 166 | "Resource": [ 167 | "arn:aws:logs:*:*:*" 168 | ] 169 | } 170 | ] 171 | } 172 | ``` 173 | More details about [ECS task execution roles are here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html). 174 | 175 | **Job Role** 176 | The Job Role is an IAM role that is used to provide AWS API access to individual running jobs. Here we configure access to the AWS resources the job needs, in our case the files in S3. In this example we're only accessing resources in one bucket, but be sure to configure this as needed depending on your sources/destinations. Again, the role should have a trust relationship with `ecs-tasks.amazonaws.com`. 177 | ```json 178 | { 179 | "Version": "2012-10-17", 180 | "Statement": [ 181 | { 182 | "Sid": "VisualEditor0", 183 | "Effect": "Allow", 184 | "Action": [ 185 | "s3:PutObject", 186 | "s3:GetObject", 187 | "s3:ListBucketMultipartUploads", 188 | "s3:ListBucket", 189 | "s3:ListMultipartUploadParts" 190 | ], 191 | "Resource": [ 192 | "arn:aws:s3:::s3-batch-sync-article/*", 193 | "arn:aws:s3:::s3-batch-sync-article" 194 | ] 195 | } 196 | ] 197 | } 198 | 199 | ``` 200 | If you're accessing S3 objects from, or syncing to destinations in, multiple accounts, [Cross Account S3 Resource Access would need to be configured as detailed here](https://aws.amazon.com/premiumsupport/knowledge-center/s3-cross-account-upload-access/). The account where the Batch jobs run can be considered Account A, where a policy providing access to resources in AWS Account B's buckets is attached to the IAM role. In Account B, the bucket policy would be modified to allow access from Account A's IAM role. 201 | More details about these [task IAM roles can be found here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html). 202 | 203 | 204 | 205 | 206 | 207 | ## Job Invocation via an Eventbridge Rule 208 | Create an Eventbridge rule that will invoke the AWS Batch job. 209 | 210 | Here, the S3 uploads are being logged in Cloudtrail. An Eventbridge rule will invoke the job when an appropriate file is uploaded. Using the naming convention mentioned above, we can use a custom event pattern match and [content filtering](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#filtering-prefix-matching) to only trigger on certain uploads. 211 | 212 | ```json 213 | { 214 | "source": ["aws.s3"], 215 | "detail-type": ["AWS API Call via CloudTrail"], 216 | "detail": { 217 | "eventSource": ["s3.amazonaws.com"], 218 | "eventName": ["PutObject", "CompleteMultipartUpload"], 219 | "requestParameters": { 220 | "bucketName": ["s3-batch-sync-article"], 221 | "key": [{ 222 | "prefix": "input/s3_batch_sync_input-" 223 | }] 224 | } 225 | } 226 | } 227 | 228 | ``` 229 | 230 | This rule's target will be triggered when a file lands in the appropriate location with the required prefix. 231 | 232 | **AWS Batch Target** 233 | * Set the target for this rule to a `Batch job queue`. 234 | * Give it the job queue and job definition set above. Provide a name for the jobs that will run. 235 | * Use `configure input` to pass details about the input file to the job. In our job, the bucket and key are required as arguments to the python script, which we supply as Job Parameters. 236 | - Use the first `input path` box to get the bucket and key from the event that triggered the Eventbridge rule 237 | + `{"S3BucketValue":"$.detail.requestParameters.bucketName","S3KeyValue":"$.detail.requestParameters.key"}` 238 | - The `input template` box lets you pass parameters or other arguments to the job that is to be invoked. 
Here we pass the `s3_bucket` and `s3_key` job parameters 239 | + `{"Parameters" : {"s3_bucket": , "s3_key": }}` 240 | - See more details about [AWS Batch Jobs as CloudWatch Events Targets here](https://docs.aws.amazon.com/batch/latest/userguide/batch-cwe-target.html) 241 | -------------------------------------------------------------------------------- /cfn/samples3syncenv.template.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Description: Sample Template for AWS S3 Copy and Sync using Batch utility 3 | 4 | Parameters: 5 | S3BucketName: 6 | Type: String 7 | Default: "mysample-s3copysync-bucket" 8 | SecurityGroups: 9 | Type: CommaDelimitedList 10 | Description: Securty groups attached to farage instances 11 | Subnets: 12 | Type: CommaDelimitedList 13 | Description: Subnets where fargate instances can be launched 14 | 15 | 16 | Resources: 17 | InputS3Bucket: 18 | Type: "AWS::S3::Bucket" 19 | DeletionPolicy : "Delete" 20 | Properties: 21 | AccessControl: "BucketOwnerFullControl" 22 | BucketName: !Sub '${S3BucketName}-${AWS::AccountId}' 23 | 24 | InputS3BucketPolicy: 25 | Type: "AWS::S3::BucketPolicy" 26 | Properties: 27 | Bucket: !Ref InputS3Bucket 28 | PolicyDocument: !Sub '{ 29 | "Version": "2012-10-17", 30 | "Statement": [ 31 | { 32 | "Sid": "AWSCloudTrailAclCheck20150319", 33 | "Effect": "Allow", 34 | "Principal": { 35 | "Service": "cloudtrail.amazonaws.com" 36 | }, 37 | "Action": "s3:GetBucketAcl", 38 | "Resource": "arn:aws:s3:::${InputS3Bucket}" 39 | }, 40 | { 41 | "Sid": "AWSCloudTrailWrite20150319", 42 | "Effect": "Allow", 43 | "Principal": { 44 | "Service": "cloudtrail.amazonaws.com" 45 | }, 46 | "Action": "s3:PutObject", 47 | "Resource": "arn:aws:s3:::${InputS3Bucket}/AWSLogs/${AWS::AccountId}/*", 48 | "Condition": { 49 | "StringEquals": { 50 | "s3:x-amz-acl": "bucket-owner-full-control" 51 | } 52 | } 53 | } 54 | ] 55 | }' 56 | DependsOn: InputS3Bucket 57 | 58 | InputBucketCloudtrail: 59 | Type: "AWS::CloudTrail::Trail" 60 | Properties: 61 | S3BucketName: !Ref InputS3Bucket 62 | IsLogging: True 63 | TrailName: !Sub 's3-batch-sync-input-trail-${AWS::AccountId}' 64 | EventSelectors: 65 | - IncludeManagementEvents: False 66 | DataResources: 67 | - Type: AWS::S3::Object 68 | Values: 69 | - !Sub "arn:aws:s3:::${InputS3Bucket}/input" 70 | ReadWriteType: "WriteOnly" 71 | DependsOn: InputS3BucketPolicy 72 | 73 | EventsSubmitBatchJobRole: 74 | Type: AWS::IAM::Role 75 | Properties: 76 | AssumeRolePolicyDocument: 77 | Statement: 78 | - Action: ['sts:AssumeRole'] 79 | Effect: Allow 80 | Principal: 81 | Service: [events.amazonaws.com] 82 | Version: '2012-10-17' 83 | Policies: 84 | - PolicyDocument: '{ 85 | "Version": "2012-10-17", 86 | "Statement": [ 87 | { 88 | "Effect": "Allow", 89 | "Action": [ 90 | "batch:SubmitJob" 91 | ], 92 | "Resource": "*" 93 | } 94 | ] 95 | }' 96 | PolicyName: !Sub 'SubmitBatchJob-${AWS::AccountId}' 97 | RoleName: !Sub 'EventsInvokeS3SyncBatchJob-${AWS::AccountId}' 98 | 99 | BatchJobRole: 100 | Type: AWS::IAM::Role 101 | Properties: 102 | # ManagedPolicyArns: 103 | # - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy 104 | AssumeRolePolicyDocument: 105 | Statement: 106 | - Action: ['sts:AssumeRole'] 107 | Effect: Allow 108 | Principal: 109 | Service: [ecs-tasks.amazonaws.com] 110 | Version: '2012-10-17' 111 | Policies: 112 | - PolicyDocument: !Sub '{ 113 | "Version": "2012-10-17", 114 | "Statement": [ 115 | { 116 | "Sid": "VisualEditor0", 117 | "Effect": 
"Allow", 118 | "Action": [ 119 | "s3:PutObject", 120 | "s3:GetObject", 121 | "s3:ListBucketMultipartUploads", 122 | "s3:ListBucket", 123 | "s3:ListMultipartUploadParts" 124 | ], 125 | "Resource": [ 126 | "arn:aws:s3:::${InputS3Bucket}/*", 127 | "arn:aws:s3:::${InputS3Bucket}" 128 | ] 129 | } 130 | ] 131 | }' 132 | PolicyName: !Sub 'S3SyncBatchJobRunPolicy-${AWS::AccountId}' 133 | RoleName: !Sub 'S3SyncBatchJobRunRole-${AWS::AccountId}' 134 | 135 | BatchExecRole: 136 | Type: AWS::IAM::Role 137 | Properties: 138 | ManagedPolicyArns: 139 | - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy 140 | AssumeRolePolicyDocument: 141 | Statement: 142 | - Action: ['sts:AssumeRole'] 143 | Effect: Allow 144 | Principal: 145 | Service: [ecs-tasks.amazonaws.com] 146 | Version: '2012-10-17' 147 | Policies: 148 | - PolicyDocument: !Sub '{ 149 | "Version": "2012-10-17", 150 | "Statement": [ 151 | { 152 | "Effect": "Allow", 153 | "Action": [ 154 | "logs:CreateLogGroup", 155 | "logs:CreateLogStream", 156 | "logs:PutLogEvents", 157 | "logs:DescribeLogStreams" 158 | ], 159 | "Resource": [ 160 | "arn:aws:logs:*:*:*" 161 | ] 162 | } 163 | ] 164 | }' 165 | PolicyName: !Sub 'S3SyncBatchExecutionPolicy-${AWS::AccountId}' 166 | RoleName: !Sub 'S3SyncBatchExecRole-${AWS::AccountId}' 167 | 168 | ScriptECRRepo: 169 | Type: "AWS::ECR::Repository" 170 | Properties: 171 | RepositoryName: !Sub 'al-s3-sync-repo-${AWS::AccountId}' 172 | 173 | S3SyncBatchComputeEnv: 174 | Type: "AWS::Batch::ComputeEnvironment" 175 | Properties: 176 | ComputeEnvironmentName: !Sub 's3-sync-compute-env-${AWS::AccountId}' 177 | Type: 'MANAGED' 178 | State: 'Enabled' 179 | ComputeResources: 180 | Type: 'FARGATE' 181 | MaxvCpus: 8 182 | SecurityGroupIds: !Ref SecurityGroups 183 | Subnets: !Ref Subnets 184 | 185 | S3SyncBatchJobQueue: 186 | Type: "AWS::Batch::JobQueue" 187 | Properties: 188 | JobQueueName: !Sub 's3-sync-batch-job-queue-${AWS::AccountId}' 189 | Priority: 1 190 | State: "ENABLED" 191 | ComputeEnvironmentOrder: 192 | - ComputeEnvironment: !Ref S3SyncBatchComputeEnv 193 | Order: 1 194 | DependsOn: S3SyncBatchComputeEnv 195 | 196 | S3SyncBatchJobDef: 197 | Type: "AWS::Batch::JobDefinition" 198 | Properties: 199 | JobDefinitionName: !Sub 's3-sync-batch-job-def-${AWS::AccountId}' 200 | PlatformCapabilities: ["FARGATE"] 201 | Type: "container" 202 | ContainerProperties: 203 | Command: ["python3","s3CopySyncScript.py","Ref::s3_bucket","Ref::s3_key","True","True"] 204 | ExecutionRoleArn: !GetAtt BatchExecRole.Arn 205 | FargatePlatformConfiguration: 206 | PlatformVersion: "1.4.0" 207 | Image: !Sub '${ScriptECRRepo.RepositoryUri}:latest' 208 | JobRoleArn: !GetAtt BatchJobRole.Arn 209 | NetworkConfiguration: 210 | AssignPublicIp: "ENABLED" 211 | ResourceRequirements: 212 | - Type: "VCPU" 213 | Value: "1" 214 | - Type: "MEMORY" 215 | Value: "2048" 216 | DependsOn: ['BatchJobRole','BatchExecRole','ScriptECRRepo'] 217 | 218 | InputEventbridgeRule: 219 | Type: "AWS::Events::Rule" 220 | Properties: 221 | Name: !Sub 's3-batch-sync-invoke-${AWS::AccountId}' 222 | EventPattern: !Sub '{ 223 | "source": ["aws.s3"], 224 | "detail-type": ["AWS API Call via CloudTrail"], 225 | "detail": { 226 | "eventSource": ["s3.amazonaws.com"], 227 | "eventName": ["PutObject", "CompleteMultipartUpload"], 228 | "requestParameters": { 229 | "bucketName": ["${InputS3Bucket}"], 230 | "key": [{ 231 | "prefix": "input/s3_batch_sync_input-" 232 | }] 233 | } 234 | } 235 | }' 236 | Targets: 237 | - Arn: !Ref S3SyncBatchJobQueue 238 | Id: S3CopySyncJobTarget 239 | 
BatchParameters: 240 | JobDefinition: !Ref S3SyncBatchJobDef 241 | JobName: "s3-batch-sync-eventbridge-invoked-job" 242 | RetryStrategy: 243 | Attempts: 2 244 | InputTransformer: 245 | InputPathsMap: 246 | "S3BucketValue" : "$.detail.requestParameters.bucketName" 247 | "S3KeyValue" : "$.detail.requestParameters.key" 248 | InputTemplate: | 249 | {"Parameters" : {"s3_bucket": <S3BucketValue>, "s3_key": <S3KeyValue>}} 250 | RoleArn: !GetAtt EventsSubmitBatchJobRole.Arn 251 | DependsOn: ['InputBucketCloudtrail','S3SyncBatchJobQueue','S3SyncBatchJobDef','EventsSubmitBatchJobRole'] 252 | 253 | Outputs: 254 | S3Bucket: 255 | Description: Name of the S3 Bucket where logs are stored and where data can be synced to/from 256 | Value: !Ref InputS3Bucket 257 | Cloudtrail: 258 | Description: Trail that captures events in the S3 Bucket 259 | Value: !Ref InputBucketCloudtrail 260 | ECRRepo: 261 | Description: Repo where the docker image with application logic can be stored. Build the dockerfile and push here 262 | Value: !Ref ScriptECRRepo 263 | BatchComputeEnv: 264 | Description: AWS Batch compute environment for managed fargate resources 265 | Value: !Ref S3SyncBatchComputeEnv 266 | BatchJobQueue: 267 | Description: AWS Batch job queue for submitted demo jobs 268 | Value: !Ref S3SyncBatchJobQueue 269 | BatchJobDefinition: 270 | Description: AWS Batch job definition. References the application logic in ECR 271 | Value: !Ref S3SyncBatchJobDef 272 | EventbridgeRule: 273 | Description: Eventbridge rule that matches input events in S3 and triggers the batch job 274 | Value: !Ref InputEventbridgeRule 275 | -------------------------------------------------------------------------------- /images/awsBatchS3SyncArch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/awsBatchS3SyncArch.png -------------------------------------------------------------------------------- /images/batchComputeEnvironment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchComputeEnvironment.png -------------------------------------------------------------------------------- /images/batchJobDef.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchJobDef.png -------------------------------------------------------------------------------- /images/batchJobQueue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchJobQueue.png -------------------------------------------------------------------------------- /images/cloudTrailS3DataEvents.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/cloudTrailS3DataEvents.png -------------------------------------------------------------------------------- /images/s3InputSample.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/s3InputSample.png -------------------------------------------------------------------------------- /src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM amazonlinux:2.0.20210617.0 2 | 3 | # prereqs 4 | RUN yum update -y && \ 5 | yum install python3 -y && \ 6 | yum install unzip -y 7 | 8 | 9 | # install aws cli 10 | RUN /usr/bin/curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" &&\ 11 | /usr/bin/unzip awscliv2.zip &&\ 12 | ./aws/install && \ 13 | rm awscliv2.zip 14 | 15 | RUN /usr/local/bin/aws configure set default.s3.max_concurrent_requests 20 && \ 16 | /usr/local/bin/aws configure set default.s3.max_queue_size 10000 && \ 17 | /usr/local/bin/aws configure set default.s3.multipart_threshold 64MB && \ 18 | /usr/local/bin/aws configure set default.s3.multipart_chunksize 16MB && \ 19 | # /usr/local/bin/aws configure set default.s3.use_accelerate_endpoint true && \ 20 | /usr/local/bin/aws configure set default.s3.addressing_style path 21 | 22 | # boto 3 23 | RUN /usr/bin/pip3 install boto3 24 | 25 | # cleanup 26 | RUN yum clean all && \ 27 | rm -rf /var/cache/yum 28 | 29 | 30 | COPY s3CopySyncScript.py . 31 | 32 | # CMD is set in the batch job definition 33 | # CMD ["python3","s3CopySyncScript.py"] 34 | -------------------------------------------------------------------------------- /src/s3CopySyncScript.py: -------------------------------------------------------------------------------- 1 | import ast 2 | import boto3 3 | import subprocess 4 | import sys 5 | import csv 6 | from io import StringIO 7 | from urllib.parse import urlparse 8 | import logging 9 | 10 | logging.basicConfig( 11 | level=logging.INFO, 12 | format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s', 13 | datefmt='%Y-%m-%d %H:%M:%S', 14 | ) 15 | 16 | LOGGER = logging.getLogger() 17 | 18 | 19 | def get_bool_input(input_parameter): 20 | """Returns a boolean corresponding to a string "True/False" input 21 | 22 | Args: 23 | input_parameter (str): String input for a boolean 24 | 25 | Returns: 26 | bool: True/False input 27 | """ 28 | if input_parameter.lower().capitalize() in ["True","False"]: 29 | return ast.literal_eval(input_parameter.lower().capitalize()) 30 | else: 31 | return None 32 | 33 | class S3SyncUtils: 34 | 35 | def __init__(self): 36 | self.s3_resource = boto3.resource('s3') 37 | 38 | def __copy_if_exists(self,src_bucket, src_key, src_path,dst_path,delete_destination): 39 | s3_head_object_response = self.s3_resource.meta.client.head_object( 40 | Bucket=src_bucket, 41 | Key=src_key) 42 | if s3_head_object_response['ContentType'].startswith('application/x-directory'): # sync 43 | LOGGER.info('Source is a directory! Syncing entire paths.') 44 | self.__s3_sync_helper(src_path,dst_path,delete_destination) 45 | else: 46 | LOGGER.info('File Exists! 
Overwriting file') 47 | copy_source = { 48 | 'Bucket': src_bucket, 49 | 'Key': src_key 50 | } 51 | dst_bucket_key = dst_path.split('//')[1].split('/',1) 52 | self.s3_resource.meta.client.copy(CopySource=copy_source, 53 | Bucket=dst_bucket_key[0], 54 | Key=dst_bucket_key[1], 55 | ExtraArgs={ 56 | 'ACL':'bucket-owner-full-control' 57 | }) 58 | 59 | def __s3_sync_helper(self, src_path, dst_path, delete_destination): 60 | sync_cmd = ['/usr/local/bin/aws', 61 | 's3', 62 | 'sync', 63 | src_path, 64 | dst_path, 65 | '--acl','bucket-owner-full-control'] 66 | if delete_destination: 67 | sync_cmd.append('--delete') 68 | out = subprocess.run(sync_cmd, capture_output=True) 69 | out = out.stdout.decode() 70 | LOGGER.info(out) 71 | 72 | 73 | def s3_copy_sync(self,input_s3_bucket,input_s3_key, skip_header=True, delete_destination=False): 74 | """ 75 | :param input_s3_bucket: bucket name of the input file 76 | :param input_s3_key: name of the file 77 | :param skip_header: boolean : skip first row if true 78 | :param delete_destination: boolean : delete destination if true 79 | :return: None 80 | """ 81 | csv_str = self.s3_resource.Object(input_s3_bucket,input_s3_key).get()['Body'].read().decode('utf-8') 82 | rdr = csv.reader(StringIO(csv_str), delimiter=',') 83 | if skip_header: 84 | next(rdr) #skip the header row 85 | #loop over each file/path given as src/destination. 86 | for row in rdr: 87 | source=row[0] 88 | dest=row[1] 89 | LOGGER.info(f'Source:: {source} Dest:: {dest}') 90 | # split the S3 path for source into what would be bucket and key 91 | # EG: s3://bucket-name/some/path/file.txt --> bucket-name, some/path/file.txt 92 | parsed_s3_url = urlparse(source) 93 | bucket_src = parsed_s3_url.netloc 94 | key_src = parsed_s3_url.path[1:] # remove first "/" for a valid s3 key 95 | self.__copy_if_exists(bucket_src,key_src,source,dest,delete_destination) 96 | 97 | def main(argv): 98 | if len(argv) != 5: 99 | LOGGER.info("Syntax: python s3CopySyncScript.py <> <> <
> <>") 100 | sys.exit(1) 101 | 102 | LOGGER.info(f"Received {len(argv)} arguments:\n" 103 | f"\tInput csv file bucket name ::: {argv[1]}\n" 104 | f"\tInput csv file file name ::: {argv[2]}\n" 105 | f"\tFile containing header ::: {argv[3]}\n" 106 | f"\tDelete destination files ::: {argv[4]}" ) 107 | 108 | S3_UTILS = S3SyncUtils() 109 | header_input = get_bool_input(argv[3]) 110 | delete_input = get_bool_input(argv[4]) 111 | if header_input is None or delete_input is None: 112 | LOGGER.error("Incorrect input, the HEADER must be true/false and delete files must be true or false") 113 | sys.exit(2) 114 | S3_UTILS.s3_copy_sync(argv[1],argv[2],header_input,delete_input) 115 | 116 | 117 | if __name__ == '__main__': 118 | main(sys.argv) 119 | --------------------------------------------------------------------------------