├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── cfn └── samples3syncenv.template.yaml ├── images ├── awsBatchS3SyncArch.png ├── batchComputeEnvironment.png ├── batchJobDef.png ├── batchJobQueue.png ├── cloudTrailS3DataEvents.png └── s3InputSample.png └── src ├── Dockerfile └── s3CopySyncScript.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.swp 3 | package-lock.json 4 | .pytest_cache 5 | *.egg-info 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # Environments 13 | .env 14 | .venv 15 | env/ 16 | venv/ 17 | ENV/ 18 | env.bak/ 19 | venv.bak/ 20 | 21 | # CDK Context & Staging files 22 | .cdk.staging/ 23 | cdk.out/ 24 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. 
Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | - [Using AWS Batch jobs to bulk copy/sync files in S3](#using-aws-batch-jobs-to-bulk-copysync-files-in-s3) 4 | - [Use Cases](#use-cases) 5 | - [Overview](#overview) 6 | - [Architecture](#architecture) 7 | - [Demo Environment](#demo-environment) 8 | - [Prerequisites](#prerequisites) 9 | - [Cloudformation Template](#cloudformation-template) 10 | - [ECR Image](#ecr-image) 11 | - [S3 Input Location](#s3-input-location) 12 | - [CSV File Format Details](#csv-file-format-details) 13 | - [Cloudtrail Monitoring](#cloudtrail-monitoring) 14 | - [AWS Batch](#aws-batch) 15 | - [Compute Environment](#compute-environment) 16 | - [Job Queue](#job-queue) 17 | - [Job Definition](#job-definition) 18 | - [IAM Roles](#iam-roles) 19 | - [Job Invocation via an Eventbridge Rule](#job-invocation-via-an-eventbridge-rule) 20 | 21 | 22 | 23 | 24 | # Using AWS Batch jobs to bulk copy/sync files in S3 25 | 26 | ## Use Cases 27 | This utility is aimed at technical teams who are looking for a scalable way to copy/sync files within S3. In certain cases, teams may want to replicate certain "prod" data into "dev" to have more robust development environments. Teams may also look to transfer data to other buckets to share with external teams or publicly. In both scenarios, having control over what data is copied/synced becomes paramount to ensure sensitive data does not make its way to an environment that could cause security concerns. 28 | 29 | An existing solution is [S3 replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html), but that comes with certain limitations. Managing replication rules across multiple directories can get cumbersome, the same prefix must be kept, source and destination must be different buckets, and already existing data cannot be backfilled using replication. As a workaround, teams have had to manage their own environments to run custom scripts built on the AWS SDK to overcome such roadblocks. 30 | 31 | 32 | ## Overview 33 | This guide details how to use [AWS Batch](https://aws.amazon.com/batch/) to perform bulk copy/sync activities on files in [S3](https://aws.amazon.com/s3/). Batch allows users to run massively scalable computing jobs on AWS by provisioning the optimal compute resources needed. The user is able to focus on configuring what the job should be doing instead of provisioning infrastructure. 34 | 35 | Here we use Batch's [Managed Compute environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html#managed_compute_environments) with the [Fargate provisioning model](https://docs.aws.amazon.com/batch/latest/userguide/fargate.html), where Batch runs containers without the user having to manage the underlying EC2 instances. We will package our application within an [ECR](https://aws.amazon.com/ecr/) container image that Fargate will use to launch the environment where the job runs. 36 | 37 | With Batch configured, the user uploads a .csv file to an S3 location that has its events logged in [Cloudtrail](https://aws.amazon.com/cloudtrail/). That location is monitored by [Eventbridge](https://aws.amazon.com/eventbridge/), which kicks off the Batch job once the appropriate file is uploaded. The .csv file contains a list of S3 source/destination pairs to be copied/synced in the job as detailed below. For accessing S3 resources in different AWS accounts, be sure to look at the [IAM Roles section below](#iam-roles). 38 | 
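To make the flow concrete, here is a minimal sketch of what kicking off a run could look like once the environment described below is in place: build the two-column CSV of source/destination pairs and drop it in the monitored location. The bucket name is a placeholder, and the `input/` prefix and `s3_batch_sync_input-` file naming follow the conventions described in the [S3 Input Location](#s3-input-location) section.

```python
import csv
import io

import boto3

# Placeholder names: substitute your own bucket and S3 paths.
INPUT_BUCKET = "my-s3-copy-sync-bucket"
INPUT_KEY = "input/s3_batch_sync_input-example.csv"

rows = [
    ["source", "destination"],  # header row; the sample job definition passes True for the header flag
    ["s3://some-bucket/source/", "s3://some-bucket/dest/"],
    ["s3://some-bucket/sourceindiv/individualFile.txt", "s3://some-bucket/dest/individualFile.txt"],
]

# Write the CSV in memory and upload it. The resulting PutObject data event is what
# Cloudtrail records and what the Eventbridge rule described later matches on.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
boto3.client("s3").put_object(Bucket=INPUT_BUCKET, Key=INPUT_KEY, Body=buf.getvalue().encode("utf-8"))
```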
39 | ## Architecture 40 | This is an overview of the architecture described above: 41 | 42 | ![awsBatchS3SyncArch](images/awsBatchS3SyncArch.png) 43 | 44 | ## Demo Environment 45 | 46 | ### Prerequisites 47 | **Default (or other) VPC/Subnet/Security Group** 48 | * To set up the environment from the cloudformation template, you will need a VPC/subnets to launch fargate resources in. A default VPC is suitable for demo purposes. 49 | * AWS accounts created after `2013-12-04` have a default VPC created. Otherwise, you can [follow these instructions](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-vpc) to create it. 50 | * Make note of the subnet IDs and security group IDs, or retrieve them via the CLI 51 | - Default VPC ID 52 | + `aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]"` 53 | - Get all Subnet IDs for the default VPC 54 | + `aws ec2 describe-subnets --filters Name=vpc-id,Values=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]" ) --query "Subnets[].SubnetId"` 55 | - Get all Security Group IDs for the default VPC 56 | + `aws ec2 describe-security-groups --filters Name=vpc-id,Values=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query "Vpcs[].VpcId | [0]" ) --query "SecurityGroups[].GroupId"` 57 | 58 | ### Cloudformation Template 59 | * Create a stack from the cloudformation template in the `cfn` directory. This creates an environment identical to the one in this guide. 60 | * From the commands above, you will need to pass subnet ID(s) and security group ID(s) as parameters to create the stack. 61 | * Once creation is complete, you can follow the rest of the guide below, namely: 62 | - You can build and push the Docker image to ECR. 63 | - Once pushed, you can run Batch sync/copy jobs for objects within the S3 bucket the cfn creates. 64 | 65 | ## ECR Image 66 | The ECR image contains our application logic to sync/copy S3 files based on the csv input. This is done in the python script `s3CopySyncScript.py`. Based on the CSV input, it will perform a [managed transfer using the copy API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy) if a file is given as a source/destination. If a prefix is given as source/destination, it will use the [AWS CLI to perform an aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html). 67 | 68 | The `Dockerfile` builds an image based on [AL2](https://hub.docker.com/_/amazonlinux), installing the AWS CLI, python3, and boto3, and setting other S3 configuration for optimal transfers. 69 | 70 | Create an ECR repository and use these commands to build and push the image as `latest` using the CLI. 71 | 72 | ``` 73 | aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com 74 | 75 | docker build -t <<ECR_REPO_NAME>> . 76 | 77 | docker tag <<ECR_REPO_NAME>>:latest <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com/<<ECR_REPO_NAME>>:latest 78 | 79 | docker push <<ACCOUNT_ID>>.dkr.ecr.us-east-1.amazonaws.com/<<ECR_REPO_NAME>>:latest 80 | ``` 81 | 
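As a side note on transfer tuning: the `aws configure set default.s3.*` values baked into the image apply to the CLI-driven `aws s3 sync` path, while boto3's managed copy uses its own transfer defaults. If you want comparable tuning on the copy path, a `TransferConfig` can be passed explicitly. A minimal sketch of what that could look like (bucket and key names are placeholders; the script in `src/` currently relies on boto3's defaults):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource("s3")

# Roughly mirrors the CLI settings set in the Dockerfile:
# multipart_threshold 64MB, multipart_chunksize 16MB, max_concurrent_requests 20.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=20,
)

s3.meta.client.copy(
    CopySource={"Bucket": "example-source-bucket", "Key": "source/path/file.txt"},
    Bucket="example-dest-bucket",
    Key="dest/path/file.txt",
    ExtraArgs={"ACL": "bucket-owner-full-control"},
    Config=config,
)
```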
82 | ## S3 Input Location 83 | The `.csv` file will be uploaded to S3 to kick off the job. Designate an area for this. In this example, I've created a sample bucket with the prefix `input` and will be uploading files here. Notice also that the uploaded csv file name is prefixed as such: `s3_batch_sync_input-*.csv`. Using a naming convention like this can simplify Eventbridge monitoring, as we'll see below. 84 | 85 | ![s3InputSample](images/s3InputSample.png) 86 | 87 | ### CSV File Format Details 88 | The csv file should be formatted as follows. 89 | 90 | | source | destination | 91 | |-------------------------------------------------|------------------------------------------| 92 | | s3://some-bucket/source/ | s3://some-bucket/dest/ | 93 | | s3://some-bucket/sourcetwo/ | s3://some-bucket/desttwo/ | 94 | | s3://some-bucket/sourceindiv/individualFile.txt | s3://some-bucket/dest/individualFile.txt | 95 | 96 | The first 2 rows in this example are prefixes where we want an s3 sync to occur. The last row is a specific object we want to copy directly. The bucket/AWS account does not need to be the same, as long as IAM permissions are [properly applied as noted below](#iam-roles). 97 | 98 | ## Cloudtrail Monitoring 99 | Set up a trail that will monitor the S3 location where the input file will land. In setting up the trail, you can set these options as you see fit: a name, an S3 location for the logs, log encryption using KMS, and log validation. At a minimum, you will need to make sure that S3 data events are enabled for the location where the input file lands: 100 | 101 | ![cloudTrailS3DataEvents](images/cloudTrailS3DataEvents.png) 102 | 103 | 104 | ## AWS Batch 105 | 106 | ### Compute Environment 107 | This will serve as a pool that our batch jobs can pull resources from. Create a `managed environment` with the instance configuration set to `fargate` for this example. Set the max vCPUs value to put an upper limit on the concurrent Fargate resources being used. Other configuration options for an AWS Batch [Compute Environment are detailed here](https://docs.aws.amazon.com/batch/latest/userguide/create-compute-environment.html). 108 | Lastly, pick the VPC and subnets your environment will be located in, and any security groups that need to be attached to instances. If you're using S3 VPC gateway endpoints, this choice is key. In our example, we're using the default VPC since we're accessing S3 through the public internet. 109 | Once complete, the environment state will be `ENABLED`. 110 | 111 | ![batchComputeEnvironment](images/batchComputeEnvironment.png) 112 | 113 | 114 | ### Job Queue 115 | AWS Batch jobs are submitted to a job queue, where they wait until compute environment resources are available for the job to run. You can have multiple queues with different priorities which pull from different compute environments. [More details are here](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html). 116 | For this example, create a queue and attach it to the compute environment previously made. 117 | 118 | ![batchJobQueue](images/batchJobQueue.png) 119 | 
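With the compute environment and job queue in place, you can optionally confirm that both are healthy before moving on to the job definition. A minimal boto3 sketch (the environment and queue names are placeholders; use the ones you created):

```python
import boto3

batch = boto3.client("batch")

# Both calls return an empty list if the given name doesn't exist in this account/region.
envs = batch.describe_compute_environments(computeEnvironments=["s3-sync-compute-env"])
queues = batch.describe_job_queues(jobQueues=["s3-sync-batch-job-queue"])

for env in envs["computeEnvironments"]:
    print(env["computeEnvironmentName"], env["state"], env["status"])  # expect ENABLED / VALID

for queue in queues["jobQueues"]:
    print(queue["jobQueueName"], queue["state"], queue["status"])  # expect ENABLED / VALID
```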
120 | ### Job Definition 121 | The Job Definition acts as a template from which to launch our individual jobs. [Detailed instructions are here for additional configuration](https://docs.aws.amazon.com/batch/latest/userguide/create-job-definition.html). 122 | 123 | * **Basics** 124 | - Enter a `name` for the template and pick `fargate` for the platform. 125 | - Depending on your anticipated use, set a `retry strategy` and `timeout`. For this example we set 2 job attempts and a timeout of 120 seconds. 126 | - We'll also put job logs in the default AWS Batch log group in CloudWatch, but this [can be customized as detailed here](https://docs.aws.amazon.com/batch/latest/userguide/using_awslogs.html). 127 | 128 | * **Python Script Usage** 129 | - Note the usage of the script described here, then set the container properties in the next step as required. 130 | - Syntax: `python s3CopySyncScript.py <input_s3_bucket> <input_csv_key> <header> <sync_delete>` 131 | + *header* indicates whether the input csv has a header row 132 | + *sync_delete* indicates whether the [--delete flag is used in case of an aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) 133 | + E.g.: `python s3CopySyncScript.py my-s3-bucket s3_batch_sync_input-my-sample.csv True True` 134 | 135 | * **Container Properties** 136 | - In the `image` box, put the URI of the ECR image that was created. 137 | - The `Command` is used as the [CMD instruction to execute our container](https://docs.docker.com/engine/reference/builder/#cmd). In our case, we want to execute the python script and pass it our input file details. 138 | + In JSON form we enter: `["python3","s3CopySyncScript.py","Ref::s3_bucket","Ref::s3_key", "True", "True"]` 139 | * In this example, I have a header in the input and am using the --delete flag for an aws s3 sync 140 | - For `vCPUs` and `memory`, we set 1 vCPU and 2 GB to be conservative for this example. Set these as needed. 141 | - `Job Role` and `Execution Role` are detailed below. 142 | - We ticked the box for `assign public IP` since we're accessing S3 through the public internet and are using `Fargate platform version 1.4.0`. 143 | 144 | * **Parameters** 145 | - In the python command above, notice the `"Ref::s3_bucket","Ref::s3_key"`. These are parameters to be substituted when a job is invoked through Eventbridge. 146 | - In this section, we could set defaults for them or other parameters. [See more details here](https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html). 147 | 148 | ![batchJobDef](images/batchJobDef.png) 149 | 
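Although the Eventbridge rule described later submits jobs automatically, the job definition can also be exercised by hand, which is handy for testing the `Ref::` parameter substitution. A minimal boto3 sketch (the job, queue, definition, bucket, and key names are placeholders):

```python
import boto3

batch = boto3.client("batch")

# The two parameters feed the Ref::s3_bucket / Ref::s3_key placeholders in the
# job definition's command, just as the Eventbridge input transformer does below.
response = batch.submit_job(
    jobName="s3-batch-sync-manual-test",
    jobQueue="s3-sync-batch-job-queue",        # placeholder: your queue name
    jobDefinition="s3-sync-batch-job-def",     # placeholder: your job definition name
    parameters={
        "s3_bucket": "my-s3-copy-sync-bucket",
        "s3_key": "input/s3_batch_sync_input-example.csv",
    },
)
print(response["jobId"])
```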
150 | #### IAM Roles 151 | 152 | **Execution Role** 153 | The Execution Role is used to set up the individual ECS tasks where the Batch jobs run and for logging. The role should have a trust relationship with `ecs-tasks.amazonaws.com`. In our example, the AWS managed policy `AmazonECSTaskExecutionRolePolicy` is attached along with an inline policy giving it permission to create log groups if needed. 154 | ```json 155 | { 156 | "Version": "2012-10-17", 157 | "Statement": [ 158 | { 159 | "Effect": "Allow", 160 | "Action": [ 161 | "logs:CreateLogGroup", 162 | "logs:CreateLogStream", 163 | "logs:PutLogEvents", 164 | "logs:DescribeLogStreams" 165 | ], 166 | "Resource": [ 167 | "arn:aws:logs:*:*:*" 168 | ] 169 | } 170 | ] 171 | } 172 | ``` 173 | More details about [ECS task execution roles are here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html). 174 | 175 | **Job Role** 176 | The Job Role is an IAM role that is used to provide AWS API access to individual running jobs. Here we configure access to the AWS resources the job needs, in our case the files in S3. In this example we're only accessing resources in one bucket, but be sure to configure this as needed depending on your sources/destinations. Again, the role should have a trust relationship with `ecs-tasks.amazonaws.com`. 177 | ```json 178 | { 179 | "Version": "2012-10-17", 180 | "Statement": [ 181 | { 182 | "Sid": "VisualEditor0", 183 | "Effect": "Allow", 184 | "Action": [ 185 | "s3:PutObject", 186 | "s3:GetObject", 187 | "s3:ListBucketMultipartUploads", 188 | "s3:ListBucket", 189 | "s3:ListMultipartUploadParts" 190 | ], 191 | "Resource": [ 192 | "arn:aws:s3:::s3-batch-sync-article/*", 193 | "arn:aws:s3:::s3-batch-sync-article" 194 | ] 195 | } 196 | ] 197 | } 198 | 199 | ``` 200 | If you're accessing S3 objects from, or syncing to destinations in, multiple accounts, [Cross Account S3 Resource Access would need to be configured as detailed here](https://aws.amazon.com/premiumsupport/knowledge-center/s3-cross-account-upload-access/). The account where the Batch jobs run can be considered Account A, where a policy providing access to resources in AWS Account B's buckets is attached to the IAM role. In Account B, the bucket policy would be modified to allow access from Account A's IAM role. 201 | More details about these [task IAM roles can be found here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html). 202 | 203 | 204 | 205 | 206 | 207 | ## Job Invocation via an Eventbridge Rule 208 | Create an Eventbridge rule that will invoke the AWS Batch job. 209 | 210 | Here, the S3 uploads are being logged in Cloudtrail. An Eventbridge rule will invoke the job when an appropriate file is uploaded. Using the naming convention mentioned above, we can use a custom event pattern match and [content filtering](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#filtering-prefix-matching) to only trigger on certain uploads. 211 | 212 | ```json 213 | { 214 | "source": ["aws.s3"], 215 | "detail-type": ["AWS API Call via CloudTrail"], 216 | "detail": { 217 | "eventSource": ["s3.amazonaws.com"], 218 | "eventName": ["PutObject", "CompleteMultipartUpload"], 219 | "requestParameters": { 220 | "bucketName": ["s3-batch-sync-article"], 221 | "key": [{ 222 | "prefix": "input/s3_batch_sync_input-" 223 | }] 224 | } 225 | } 226 | } 227 | 228 | ``` 229 | 230 | This rule's target will be triggered when a file lands in the appropriate location with the required prefix. 231 | 232 | **AWS Batch Target** 233 | * Set the target for this rule to a `Batch job queue`. 234 | * Give it the job queue and job definition set above. Provide a name for the jobs that will run. 235 | * Use `configure input` to pass details about the input file to the job. In our job, the bucket and key are required as arguments to the python script, which we supply as Job Parameters. 236 | - Use the first `input path` box to get the bucket and key from the event that triggered the Eventbridge rule 237 | + `{"S3BucketValue":"$.detail.requestParameters.bucketName","S3KeyValue":"$.detail.requestParameters.key"}` 238 | - The `input template` box lets you pass parameters or other arguments to the job that is to be invoked. 
Here we pass the `s3_bucket` and `s3_key` job parameters 239 | + `{"Parameters" : {"s3_bucket": , "s3_key": }}` 240 | - See more details about [AWS Batch Jobs as CloudWatch Events Targets here](https://docs.aws.amazon.com/batch/latest/userguide/batch-cwe-target.html) 241 | -------------------------------------------------------------------------------- /cfn/samples3syncenv.template.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Description: Sample Template for AWS S3 Copy and Sync using Batch utility 3 | 4 | Parameters: 5 | S3BucketName: 6 | Type: String 7 | Default: "mysample-s3copysync-bucket" 8 | SecurityGroups: 9 | Type: CommaDelimitedList 10 | Description: Securty groups attached to farage instances 11 | Subnets: 12 | Type: CommaDelimitedList 13 | Description: Subnets where fargate instances can be launched 14 | 15 | 16 | Resources: 17 | InputS3Bucket: 18 | Type: "AWS::S3::Bucket" 19 | DeletionPolicy : "Delete" 20 | Properties: 21 | AccessControl: "BucketOwnerFullControl" 22 | BucketName: !Sub '${S3BucketName}-${AWS::AccountId}' 23 | 24 | InputS3BucketPolicy: 25 | Type: "AWS::S3::BucketPolicy" 26 | Properties: 27 | Bucket: !Ref InputS3Bucket 28 | PolicyDocument: !Sub '{ 29 | "Version": "2012-10-17", 30 | "Statement": [ 31 | { 32 | "Sid": "AWSCloudTrailAclCheck20150319", 33 | "Effect": "Allow", 34 | "Principal": { 35 | "Service": "cloudtrail.amazonaws.com" 36 | }, 37 | "Action": "s3:GetBucketAcl", 38 | "Resource": "arn:aws:s3:::${InputS3Bucket}" 39 | }, 40 | { 41 | "Sid": "AWSCloudTrailWrite20150319", 42 | "Effect": "Allow", 43 | "Principal": { 44 | "Service": "cloudtrail.amazonaws.com" 45 | }, 46 | "Action": "s3:PutObject", 47 | "Resource": "arn:aws:s3:::${InputS3Bucket}/AWSLogs/${AWS::AccountId}/*", 48 | "Condition": { 49 | "StringEquals": { 50 | "s3:x-amz-acl": "bucket-owner-full-control" 51 | } 52 | } 53 | } 54 | ] 55 | }' 56 | DependsOn: InputS3Bucket 57 | 58 | InputBucketCloudtrail: 59 | Type: "AWS::CloudTrail::Trail" 60 | Properties: 61 | S3BucketName: !Ref InputS3Bucket 62 | IsLogging: True 63 | TrailName: !Sub 's3-batch-sync-input-trail-${AWS::AccountId}' 64 | EventSelectors: 65 | - IncludeManagementEvents: False 66 | DataResources: 67 | - Type: AWS::S3::Object 68 | Values: 69 | - !Sub "arn:aws:s3:::${InputS3Bucket}/input" 70 | ReadWriteType: "WriteOnly" 71 | DependsOn: InputS3BucketPolicy 72 | 73 | EventsSubmitBatchJobRole: 74 | Type: AWS::IAM::Role 75 | Properties: 76 | AssumeRolePolicyDocument: 77 | Statement: 78 | - Action: ['sts:AssumeRole'] 79 | Effect: Allow 80 | Principal: 81 | Service: [events.amazonaws.com] 82 | Version: '2012-10-17' 83 | Policies: 84 | - PolicyDocument: '{ 85 | "Version": "2012-10-17", 86 | "Statement": [ 87 | { 88 | "Effect": "Allow", 89 | "Action": [ 90 | "batch:SubmitJob" 91 | ], 92 | "Resource": "*" 93 | } 94 | ] 95 | }' 96 | PolicyName: !Sub 'SubmitBatchJob-${AWS::AccountId}' 97 | RoleName: !Sub 'EventsInvokeS3SyncBatchJob-${AWS::AccountId}' 98 | 99 | BatchJobRole: 100 | Type: AWS::IAM::Role 101 | Properties: 102 | # ManagedPolicyArns: 103 | # - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy 104 | AssumeRolePolicyDocument: 105 | Statement: 106 | - Action: ['sts:AssumeRole'] 107 | Effect: Allow 108 | Principal: 109 | Service: [ecs-tasks.amazonaws.com] 110 | Version: '2012-10-17' 111 | Policies: 112 | - PolicyDocument: !Sub '{ 113 | "Version": "2012-10-17", 114 | "Statement": [ 115 | { 116 | "Sid": "VisualEditor0", 117 | "Effect": 
"Allow", 118 | "Action": [ 119 | "s3:PutObject", 120 | "s3:GetObject", 121 | "s3:ListBucketMultipartUploads", 122 | "s3:ListBucket", 123 | "s3:ListMultipartUploadParts" 124 | ], 125 | "Resource": [ 126 | "arn:aws:s3:::${InputS3Bucket}/*", 127 | "arn:aws:s3:::${InputS3Bucket}" 128 | ] 129 | } 130 | ] 131 | }' 132 | PolicyName: !Sub 'S3SyncBatchJobRunPolicy-${AWS::AccountId}' 133 | RoleName: !Sub 'S3SyncBatchJobRunRole-${AWS::AccountId}' 134 | 135 | BatchExecRole: 136 | Type: AWS::IAM::Role 137 | Properties: 138 | ManagedPolicyArns: 139 | - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy 140 | AssumeRolePolicyDocument: 141 | Statement: 142 | - Action: ['sts:AssumeRole'] 143 | Effect: Allow 144 | Principal: 145 | Service: [ecs-tasks.amazonaws.com] 146 | Version: '2012-10-17' 147 | Policies: 148 | - PolicyDocument: !Sub '{ 149 | "Version": "2012-10-17", 150 | "Statement": [ 151 | { 152 | "Effect": "Allow", 153 | "Action": [ 154 | "logs:CreateLogGroup", 155 | "logs:CreateLogStream", 156 | "logs:PutLogEvents", 157 | "logs:DescribeLogStreams" 158 | ], 159 | "Resource": [ 160 | "arn:aws:logs:*:*:*" 161 | ] 162 | } 163 | ] 164 | }' 165 | PolicyName: !Sub 'S3SyncBatchExecutionPolicy-${AWS::AccountId}' 166 | RoleName: !Sub 'S3SyncBatchExecRole-${AWS::AccountId}' 167 | 168 | ScriptECRRepo: 169 | Type: "AWS::ECR::Repository" 170 | Properties: 171 | RepositoryName: !Sub 'al-s3-sync-repo-${AWS::AccountId}' 172 | 173 | S3SyncBatchComputeEnv: 174 | Type: "AWS::Batch::ComputeEnvironment" 175 | Properties: 176 | ComputeEnvironmentName: !Sub 's3-sync-compute-env-${AWS::AccountId}' 177 | Type: 'MANAGED' 178 | State: 'Enabled' 179 | ComputeResources: 180 | Type: 'FARGATE' 181 | MaxvCpus: 8 182 | SecurityGroupIds: !Ref SecurityGroups 183 | Subnets: !Ref Subnets 184 | 185 | S3SyncBatchJobQueue: 186 | Type: "AWS::Batch::JobQueue" 187 | Properties: 188 | JobQueueName: !Sub 's3-sync-batch-job-queue-${AWS::AccountId}' 189 | Priority: 1 190 | State: "ENABLED" 191 | ComputeEnvironmentOrder: 192 | - ComputeEnvironment: !Ref S3SyncBatchComputeEnv 193 | Order: 1 194 | DependsOn: S3SyncBatchComputeEnv 195 | 196 | S3SyncBatchJobDef: 197 | Type: "AWS::Batch::JobDefinition" 198 | Properties: 199 | JobDefinitionName: !Sub 's3-sync-batch-job-def-${AWS::AccountId}' 200 | PlatformCapabilities: ["FARGATE"] 201 | Type: "container" 202 | ContainerProperties: 203 | Command: ["python3","s3CopySyncScript.py","Ref::s3_bucket","Ref::s3_key","True","True"] 204 | ExecutionRoleArn: !GetAtt BatchExecRole.Arn 205 | FargatePlatformConfiguration: 206 | PlatformVersion: "1.4.0" 207 | Image: !Sub '${ScriptECRRepo.RepositoryUri}:latest' 208 | JobRoleArn: !GetAtt BatchJobRole.Arn 209 | NetworkConfiguration: 210 | AssignPublicIp: "ENABLED" 211 | ResourceRequirements: 212 | - Type: "VCPU" 213 | Value: "1" 214 | - Type: "MEMORY" 215 | Value: "2048" 216 | DependsOn: ['BatchJobRole','BatchExecRole','ScriptECRRepo'] 217 | 218 | InputEventbridgeRule: 219 | Type: "AWS::Events::Rule" 220 | Properties: 221 | Name: !Sub 's3-batch-sync-invoke-${AWS::AccountId}' 222 | EventPattern: !Sub '{ 223 | "source": ["aws.s3"], 224 | "detail-type": ["AWS API Call via CloudTrail"], 225 | "detail": { 226 | "eventSource": ["s3.amazonaws.com"], 227 | "eventName": ["PutObject", "CompleteMultipartUpload"], 228 | "requestParameters": { 229 | "bucketName": ["${InputS3Bucket}"], 230 | "key": [{ 231 | "prefix": "input/s3_batch_sync_input-" 232 | }] 233 | } 234 | } 235 | }' 236 | Targets: 237 | - Arn: !Ref S3SyncBatchJobQueue 238 | Id: S3CopySyncJobTarget 239 | 
BatchParameters: 240 | JobDefinition: !Ref S3SyncBatchJobDef 241 | JobName: "s3-batch-sync-eventbridge-invoked-job" 242 | RetryStrategy: 243 | Attempts: 2 244 | InputTransformer: 245 | InputPathsMap: 246 | "S3BucketValue" : "$.detail.requestParameters.bucketName" 247 | "S3KeyValue" : "$.detail.requestParameters.key" 248 | InputTemplate: | 249 | {"Parameters" : {"s3_bucket": <S3BucketValue>, "s3_key": <S3KeyValue>}} 250 | RoleArn: !GetAtt EventsSubmitBatchJobRole.Arn 251 | DependsOn: ['InputBucketCloudtrail','S3SyncBatchJobQueue','S3SyncBatchJobDef','EventsSubmitBatchJobRole'] 252 | 253 | Outputs: 254 | S3Bucket: 255 | Description: Name of the S3 Bucket where logs are stored and where data can be synced to/from 256 | Value: !Ref InputS3Bucket 257 | Cloudtrail: 258 | Description: Trail that captures events in the S3 Bucket 259 | Value: !Ref InputBucketCloudtrail 260 | ECRRepo: 261 | Description: Repo where the docker image with application logic can be stored. Build the dockerfile and push here 262 | Value: !Ref ScriptECRRepo 263 | BatchComputeEnv: 264 | Description: AWS Batch compute environment for managed fargate resources 265 | Value: !Ref S3SyncBatchComputeEnv 266 | BatchJobQueue: 267 | Description: AWS Batch job queue for submitted demo jobs 268 | Value: !Ref S3SyncBatchJobQueue 269 | BatchJobDefinition: 270 | Description: AWS Batch job definition. References the application logic in ECR 271 | Value: !Ref S3SyncBatchJobDef 272 | EventbridgeRule: 273 | Description: Eventbridge rule that matches input events in S3 and triggers the batch job 274 | Value: !Ref InputEventbridgeRule 275 | -------------------------------------------------------------------------------- /images/awsBatchS3SyncArch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/awsBatchS3SyncArch.png -------------------------------------------------------------------------------- /images/batchComputeEnvironment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchComputeEnvironment.png -------------------------------------------------------------------------------- /images/batchJobDef.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchJobDef.png -------------------------------------------------------------------------------- /images/batchJobQueue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/batchJobQueue.png -------------------------------------------------------------------------------- /images/cloudTrailS3DataEvents.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/cloudTrailS3DataEvents.png -------------------------------------------------------------------------------- /images/s3InputSample.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-s3-copy-sync-using-batch/d318dd4be33a3d6e1683a885ba181a28bc121efb/images/s3InputSample.png -------------------------------------------------------------------------------- /src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM amazonlinux:2.0.20210617.0 2 | 3 | # prereqs 4 | RUN yum update -y && \ 5 | yum install python3 -y && \ 6 | yum install unzip -y 7 | 8 | 9 | # install aws cli 10 | RUN /usr/bin/curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" &&\ 11 | /usr/bin/unzip awscliv2.zip &&\ 12 | ./aws/install && \ 13 | rm awscliv2.zip 14 | 15 | RUN /usr/local/bin/aws configure set default.s3.max_concurrent_requests 20 && \ 16 | /usr/local/bin/aws configure set default.s3.max_queue_size 10000 && \ 17 | /usr/local/bin/aws configure set default.s3.multipart_threshold 64MB && \ 18 | /usr/local/bin/aws configure set default.s3.multipart_chunksize 16MB && \ 19 | # /usr/local/bin/aws configure set default.s3.use_accelerate_endpoint true && \ 20 | /usr/local/bin/aws configure set default.s3.addressing_style path 21 | 22 | # boto 3 23 | RUN /usr/bin/pip3 install boto3 24 | 25 | # cleanup 26 | RUN yum clean all && \ 27 | rm -rf /var/cache/yum 28 | 29 | 30 | COPY s3CopySyncScript.py . 31 | 32 | # CMD is set in the batch job definition 33 | # CMD ["python3","s3CopySyncScript.py"] 34 | -------------------------------------------------------------------------------- /src/s3CopySyncScript.py: -------------------------------------------------------------------------------- 1 | import ast 2 | import boto3 3 | import subprocess 4 | import sys 5 | import csv 6 | from io import StringIO 7 | from urllib.parse import urlparse 8 | import logging 9 | 10 | logging.basicConfig( 11 | level=logging.INFO, 12 | format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s', 13 | datefmt='%Y-%m-%d %H:%M:%S', 14 | ) 15 | 16 | LOGGER = logging.getLogger() 17 | 18 | 19 | def get_bool_input(input_parameter): 20 | """Returns a boolean corresponding to a string "True/False" input 21 | 22 | Args: 23 | input_parameter (str): String input for a boolean 24 | 25 | Returns: 26 | bool: True/False input 27 | """ 28 | if input_parameter.lower().capitalize() in ["True","False"]: 29 | return ast.literal_eval(input_parameter.lower().capitalize()) 30 | else: 31 | return None 32 | 33 | class S3SyncUtils: 34 | 35 | def __init__(self): 36 | self.s3_resource = boto3.resource('s3') 37 | 38 | def __copy_if_exists(self,src_bucket, src_key, src_path,dst_path,delete_destination): 39 | s3_head_object_response = self.s3_resource.meta.client.head_object( 40 | Bucket=src_bucket, 41 | Key=src_key) 42 | if s3_head_object_response['ContentType'].startswith('application/x-directory'): # sync 43 | LOGGER.info('Source is a directory! Syncing entire paths.') 44 | self.__s3_sync_helper(src_path,dst_path,delete_destination) 45 | else: 46 | LOGGER.info('File Exists! 
Overwriting file') 47 | copy_source = { 48 | 'Bucket': src_bucket, 49 | 'Key': src_key 50 | } 51 | dst_bucket_key = dst_path.split('//')[1].split('/',1) 52 | self.s3_resource.meta.client.copy(CopySource=copy_source, 53 | Bucket=dst_bucket_key[0], 54 | Key=dst_bucket_key[1], 55 | ExtraArgs={ 56 | 'ACL':'bucket-owner-full-control' 57 | }) 58 | 59 | def __s3_sync_helper(self, src_path, dst_path, delete_destination): 60 | sync_cmd = ['/usr/local/bin/aws', 61 | 's3', 62 | 'sync', 63 | src_path, 64 | dst_path, 65 | '--acl','bucket-owner-full-control'] 66 | if delete_destination: 67 | sync_cmd.append('--delete') 68 | out = subprocess.run(sync_cmd, capture_output=True) 69 | out = out.stdout.decode() 70 | LOGGER.info(out) 71 | 72 | 73 | def s3_copy_sync(self,input_s3_bucket,input_s3_key, skip_header=True, delete_destination=False): 74 | """ 75 | :param input_s3_bucket: bucket name of the input file 76 | :param input_s3_key: name of the file 77 | :param skip_header: boolean : skip first row if true 78 | :param delete_destination: boolean : delete destination if true 79 | :return: None 80 | """ 81 | csv_str = self.s3_resource.Object(input_s3_bucket,input_s3_key).get()['Body'].read().decode('utf-8') 82 | rdr = csv.reader(StringIO(csv_str), delimiter=',') 83 | if skip_header: 84 | next(rdr) #skip the header row 85 | #loop over each file/path given as src/destination. 86 | for row in rdr: 87 | source=row[0] 88 | dest=row[1] 89 | LOGGER.info(f'Source:: {source} Dest:: {dest}') 90 | # split the S3 path for source into what would be bucket and key 91 | # EG: s3://bucket-name/some/path/file.txt --> bucket-name, some/path/file.txt 92 | parsed_s3_url = urlparse(source) 93 | bucket_src = parsed_s3_url.netloc 94 | key_src = parsed_s3_url.path[1:] # remove first "/" for a valid s3 key 95 | self.__copy_if_exists(bucket_src,key_src,source,dest,delete_destination) 96 | 97 | def main(argv): 98 | if len(argv) != 5: 99 | LOGGER.info("Syntax: python s3CopySyncScript.py <> <> <
> <>") 100 | sys.exit(1) 101 | 102 | LOGGER.info(f"Received {len(argv)} arguments:\n" 103 | f"\tInput csv file bucket name ::: {argv[1]}\n" 104 | f"\tInput csv file file name ::: {argv[2]}\n" 105 | f"\tFile containing header ::: {argv[3]}\n" 106 | f"\tDelete destination files ::: {argv[4]}" ) 107 | 108 | S3_UTILS = S3SyncUtils() 109 | header_input = get_bool_input(argv[3]) 110 | delete_input = get_bool_input(argv[4]) 111 | if header_input is None or delete_input is None: 112 | LOGGER.error("Incorrect input, the HEADER must be true/false and delete files must be true or false") 113 | sys.exit(2) 114 | S3_UTILS.s3_copy_sync(argv[1],argv[2],header_input,delete_input) 115 | 116 | 117 | if __name__ == '__main__': 118 | main(sys.argv) 119 | --------------------------------------------------------------------------------