├── .github └── workflows │ └── github-actions.yml ├── CHANGELOG.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── DMSCDC_CloudTemplate_Reusable.yaml ├── DMSCDC_CloudTemplate_Source.yaml ├── DMSCDC_Controller.py ├── DMSCDC_LoadIncremental.py ├── DMSCDC_LoadInitial.py ├── DMSCDC_ProcessTable.py ├── DMSCDC_SampleDB.yaml ├── DMSCDC_SampleDB_Incremental.sql ├── DMSCDC_SampleDB_Initial.sql ├── LICENSE ├── README.md ├── launch-stack.svg └── pymysql.zip /.github/workflows/github-actions.yml: -------------------------------------------------------------------------------- 1 | name: DeployToS3 2 | on: 3 | push: 4 | branches: 5 | - master 6 | jobs: 7 | DeployCode: 8 | runs-on: ubuntu-latest 9 | steps: 10 | - name: Check out repository code 11 | uses: actions/checkout@v2 12 | - name: Configure AWS credentials 13 | uses: aws-actions/configure-aws-credentials@v1 14 | with: 15 | aws-access-key-id: ${{ secrets.AWS_ID }} 16 | aws-secret-access-key: ${{ secrets.AWS_KEY }} 17 | aws-region: ${{ secrets.REGION }} 18 | - run: | 19 | aws s3 sync . s3://dmscdc-files --acl public-read --exclude ".*" 20 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | 4 | ## [1.1.0] - 2022-01-22 5 | ### Added 6 | * Can use wild-card for schema and load multiple schema's at once 7 | * Parallel processing of tables. 8 | * Smarter repartition logic for tables with many partitions. 9 | * Upgrade to Glue v3.0 (Faster start/run times) 10 | * VPC options added to [DMSCDC_CloudTemplate_Reusable.yaml](DMSCDC_CloudTemplate_Reusable.yaml) to allow users to specify VPC settings and ensure replication instance can connect to the sources DB. 11 | * [DMSCDC_SampleDB.yaml](DMSCDC_SampleDB.yaml) to easily test a source DB with a data load. 12 | 13 | ### Fixed 14 | - Initial run of [DMSCDC_Controller.py](DMSCDC_Controller.py) was failing if DMS replication task was not loading 15 | - New tables added after the initial deployment were not loading 16 | - Incremental processing was not picking up new files correctly and was causing some files to be re-processed 17 | - New columns added to tables were not being added to the data lake 18 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 
8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /DMSCDC_CloudTemplate_Reusable.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | SecurityGroup: 4 | Description: Sec Group for Replication Instance with rules to allow it to connect to your source DB. 
5 | Type: AWS::EC2::SecurityGroup::Id 6 | Subnet1: 7 | Description: Subnet-1 for Replication Instance with network path to source DB. 8 | Type: AWS::EC2::Subnet::Id 9 | Subnet2: 10 | Description: Subnet-2 for Replication Instance with network path to source DB. 11 | Type: AWS::EC2::Subnet::Id 12 | Resources: 13 | LambdaRoleToInitDMS: 14 | Type: AWS::IAM::Role 15 | Properties : 16 | AssumeRolePolicyDocument: 17 | Version : 2012-10-17 18 | Statement : 19 | - 20 | Effect : Allow 21 | Principal : 22 | Service : 23 | - lambda.amazonaws.com 24 | Action : 25 | - sts:AssumeRole 26 | Path : / 27 | Policies: 28 | - 29 | PolicyName: LambdaCloudFormationPolicy 30 | PolicyDocument: 31 | Version: 2012-10-17 32 | Statement: 33 | - 34 | Effect: Allow 35 | Action: 36 | - s3:* 37 | Resource: 38 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}" 39 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}" 40 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}/*" 41 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}/*" 42 | ManagedPolicyArns: 43 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 44 | - arn:aws:iam::aws:policy/AmazonS3FullAccess 45 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess 46 | - arn:aws:iam::aws:policy/AmazonRDSDataFullAccess 47 | - arn:aws:iam::aws:policy/IAMFullAccess 48 | - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess 49 | LambdaFunctionDMSRoles: 50 | Type: "AWS::Lambda::Function" 51 | Properties: 52 | Timeout: 30 53 | Code: 54 | ZipFile: | 55 | import json 56 | import boto3 57 | import cfnresponse 58 | import logging 59 | import time 60 | client = boto3.client('iam') 61 | logging.basicConfig() 62 | logger = logging.getLogger(__name__) 63 | logger.setLevel(logging.INFO) 64 | 65 | def handler(event, context): 66 | logger.info(json.dumps(event)) 67 | 68 | if event['RequestType'] == 'Delete': 69 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Delete complete'}) 70 | else: 71 | try: 72 | response = client.get_role(RoleName='dms-cloudwatch-logs-role') 73 | except: 74 | try: 75 | role_policy_document = { 76 | "Version": "2012-10-17", 77 | "Statement": [ 78 | { 79 | "Effect": "Allow", 80 | "Principal": { 81 | "Service": [ 82 | "dms.amazonaws.com" 83 | ] 84 | }, 85 | "Action": "sts:AssumeRole" 86 | } 87 | ] 88 | } 89 | client.create_role( 90 | RoleName='dms-cloudwatch-logs-role', 91 | AssumeRolePolicyDocument=json.dumps(role_policy_document) 92 | ) 93 | client.attach_role_policy( 94 | RoleName='dms-cloudwatch-logs-role', 95 | PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonDMSCloudWatchLogsRole' 96 | ) 97 | except Exception as e: 98 | logger.error(e) 99 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': 'Create failed'}) 100 | try: 101 | response = client.get_role(RoleName='dms-vpc-role') 102 | except: 103 | try: 104 | role_policy_document = { 105 | "Version": "2012-10-17", 106 | "Statement": [ 107 | { 108 | "Effect": "Allow", 109 | "Principal": { 110 | "Service": [ 111 | "dms.amazonaws.com" 112 | ] 113 | }, 114 | "Action": "sts:AssumeRole" 115 | } 116 | ] 117 | } 118 | client.create_role( 119 | RoleName='dms-vpc-role', 120 | AssumeRolePolicyDocument=json.dumps(role_policy_document) 121 | ) 122 | client.attach_role_policy( 123 | RoleName='dms-vpc-role', 124 | PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonDMSVPCManagementRole' 125 | ) 126 | time.sleep(30) 127 | except Exception as e: 128 | logger.error(e) 129 | cfnresponse.send(event, context, 
cfnresponse.FAILED, {'Data': 'Create failed'}) 130 | 131 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Create complete'}) 132 | Handler: index.handler 133 | Role: 134 | Fn::GetAtt: [LambdaRoleToInitDMS, Arn] 135 | Runtime: python3.7 136 | InitDMSRoles: 137 | Type: Custom::InitDMSRoles 138 | DependsOn: 139 | - LambdaFunctionDMSRoles 140 | Properties: 141 | ServiceToken: !GetAtt 'LambdaFunctionDMSRoles.Arn' 142 | DMSSubnetGroup: 143 | Type: AWS::DMS::ReplicationSubnetGroup 144 | DependsOn: 145 | - InitDMSRoles 146 | Properties: 147 | ReplicationSubnetGroupDescription: Subnet group for DMSCDC pipeline. 148 | SubnetIds: 149 | - !Ref Subnet1 150 | - !Ref Subnet2 151 | DMSReplicationInstance: 152 | Type: AWS::DMS::ReplicationInstance 153 | DependsOn: 154 | - InitDMSRoles 155 | Properties: 156 | ReplicationInstanceIdentifier: DMSCDC-ReplicationInstance 157 | ReplicationInstanceClass: dms.c4.large 158 | ReplicationSubnetGroupIdentifier: !Ref DMSSubnetGroup 159 | VpcSecurityGroupIds: 160 | - !Ref SecurityGroup 161 | ExecutionRole: 162 | Type: AWS::IAM::Role 163 | Properties : 164 | RoleName: DMSCDC_Execution_Role 165 | AssumeRolePolicyDocument: 166 | Version : 2012-10-17 167 | Statement : 168 | - 169 | Effect : Allow 170 | Principal : 171 | Service : 172 | - lambda.amazonaws.com 173 | - glue.amazonaws.com 174 | - dms.amazonaws.com 175 | Action : 176 | - sts:AssumeRole 177 | Path : / 178 | ManagedPolicyArns: 179 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 180 | - arn:aws:iam::aws:policy/AmazonS3FullAccess 181 | - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole 182 | Policies : 183 | - 184 | PolicyName: DMSCDCExecutionPolicy 185 | PolicyDocument : 186 | Version: 2012-10-17 187 | Statement: 188 | - 189 | Effect: Allow 190 | Action: 191 | - lambda:InvokeFunction 192 | - dynamodb:PutItem 193 | - dynamodb:CreateTable 194 | - dynamodb:UpdateItem 195 | - dynamodb:UpdateTable 196 | - dynamodb:GetItem 197 | - dynamodb:DescribeTable 198 | - iam:GetRole 199 | - iam:PassRole 200 | - dms:StartReplicationTask 201 | - dms:TestConnection 202 | - dms:StopReplicationTask 203 | Resource: 204 | - !Sub 'arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/DMSCDC_Controller' 205 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitController' 206 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitDMS' 207 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_ImportGlueScript' 208 | - !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_ExecutionRole' 209 | - !Sub 'arn:aws:dms:${AWS::Region}:${AWS::AccountId}:endpoint:*' 210 | - !Sub 'arn:aws:dms:${AWS::Region}:${AWS::AccountId}:task:*' 211 | - 212 | Effect: Allow 213 | Action: 214 | - dms:DescribeConnections 215 | - dms:DescribeReplicationTasks 216 | Resource: "*" 217 | LambdaFunctionImportGlueScript: 218 | Type: "AWS::Lambda::Function" 219 | Properties: 220 | FunctionName: DMSCDC_ImportGlueScript 221 | Timeout: 30 222 | Code: 223 | ZipFile: | 224 | import json 225 | import boto3 226 | import cfnresponse 227 | import logging 228 | 229 | logging.basicConfig() 230 | logger = logging.getLogger(__name__) 231 | logger.setLevel(logging.INFO) 232 | 233 | def handler(event, context): 234 | logger.info(json.dumps(event)) 235 | s3 = boto3.client('s3') 236 | glueBucket = event['ResourceProperties']['glueBucket'] 237 | 238 | if event['RequestType'] == 'Delete': 239 | try: 240 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py') 241 | 
s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py.temp') 242 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py') 243 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py.temp') 244 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py') 245 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py.temp') 246 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py') 247 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py.temp') 248 | except Exception as e: 249 | logger.info(e.message) 250 | 251 | logger.info('Delete Complete') 252 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Delete Complete'}) 253 | 254 | else: 255 | try: 256 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py') 257 | except Exception as e: 258 | logger.info(e.message) 259 | try: 260 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py') 261 | except Exception as e: 262 | logger.info(e.message) 263 | try: 264 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py') 265 | except Exception as e: 266 | logger.info(e.message) 267 | try: 268 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py') 269 | except Exception as e: 270 | logger.info(e.message) 271 | try: 272 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_LoadIncremental.py", Key="DMSCDC_LoadIncremental.py") 273 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_LoadInitial.py", Key="DMSCDC_LoadInitial.py") 274 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_Controller.py", Key="DMSCDC_Controller.py") 275 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_ProcessTable.py", Key="DMSCDC_ProcessTable.py") 276 | 277 | logger.info('Copy Complete') 278 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Copy complete'}) 279 | 280 | except Exception as e: 281 | message = 'S3 Issue: ' + str(e) 282 | logger.error(message) 283 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 284 | 285 | Handler: index.handler 286 | Role: 287 | Fn::GetAtt: [ExecutionRole, Arn] 288 | Runtime: python3.7 289 | DependsOn: 290 | - ExecutionRole 291 | ImportScript: 292 | Type: Custom::CopyScript 293 | DependsOn: 294 | - LambdaFunctionImportGlueScript 295 | - GlueBucket 296 | Properties: 297 | ServiceToken: 298 | Fn::GetAtt : [LambdaFunctionImportGlueScript, Arn] 299 | glueBucket: 300 | Ref: GlueBucket 301 | LambdaFunctionInitController: 302 | Type: "AWS::Lambda::Function" 303 | Properties: 304 | FunctionName: DMSCDC_InitController 305 | Timeout: 600 306 | Code: 307 | ZipFile: | 308 | import json 309 | import boto3 310 | import cfnresponse 311 | import time 312 | import logging 313 | 314 | logging.basicConfig() 315 | logger = logging.getLogger(__name__) 316 | logger.setLevel(logging.INFO) 317 | 318 | glue = boto3.client('glue') 319 | 320 | def testGlueJob(jobId, count, sec): 321 | i = 0 322 | while i < count: 323 | response = glue.get_job_run(JobName='DMSCDC_Controller', RunId=jobId) 324 | status = response['JobRun']['JobRunState'] 325 | if (status == 'SUCCEEDED' or status == 'STARTING' or status == 'STOPPING'): 326 | return 1 327 | elif status == 'RUNNING' : 328 | time.sleep(sec) 329 | i+=1 330 | else: 331 | return 0 332 | if i == count: 333 | return 0 334 | 335 | def handler(event, context): 336 | logger.info(json.dumps(event)) 337 | if event['RequestType'] != 'Create': 338 | logger.info('No action necessary') 339 | 
cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'NA'}) 340 | else: 341 | try: 342 | dmsBucket = event['ResourceProperties']['dmsBucket'] 343 | dmsPath = event['ResourceProperties']['dmsPath'] 344 | lakeBucket = event['ResourceProperties']['lakeBucket'] 345 | lakePath = event['ResourceProperties']['lakePath'] 346 | lakeDBName = event['ResourceProperties']['lakeDBName'] 347 | 348 | response = glue.start_job_run( 349 | JobName='DMSCDC_Controller', 350 | Arguments={ 351 | '--bucket':dmsBucket, 352 | '--prefix':dmsPath, 353 | '--out_bucket':lakeBucket, 354 | '--out_prefix':lakePath}) 355 | 356 | if testGlueJob(response['JobRunId'], 20, 30) != 1: 357 | message = 'Error during Controller execution' 358 | logger.info(message) 359 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 360 | else: 361 | try: 362 | message = 'Load Complete: https://console.aws.amazon.com/dynamodb/home?#tables:selected=DMSCDC_Controller;tab=items' 363 | response = glue.start_trigger(Name='DMSCDC-Trigger-'+lakeDBName) 364 | logger.info(message) 365 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 366 | except: 367 | message = 'Load Complete, but trigger failed to initialize' 368 | logger.info(message) 369 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 370 | except Exception as e: 371 | message = 'Glue Job Issue: ' + str(e) 372 | logger.info(message) 373 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 374 | Handler: index.handler 375 | Role: 376 | Fn::GetAtt: [ExecutionRole, Arn] 377 | Runtime: python3.7 378 | DependsOn: 379 | - ExecutionRole 380 | LambdaFunctionInitDMS: 381 | Type: "AWS::Lambda::Function" 382 | Properties: 383 | FunctionName: DMSCDC_InitDMS 384 | Timeout: 600 385 | Code: 386 | ZipFile: | 387 | import json 388 | import boto3 389 | import cfnresponse 390 | import time 391 | import logging 392 | 393 | logging.basicConfig() 394 | logger = logging.getLogger(__name__) 395 | logger.setLevel(logging.INFO) 396 | 397 | dms = boto3.client('dms') 398 | s3 = boto3.resource('s3') 399 | 400 | def testConnection(endpointArn, count): 401 | i = 0 402 | while i < count: 403 | connections = dms.describe_connections(Filters=[{'Name':'endpoint-arn','Values':[endpointArn]}]) 404 | for connection in connections['Connections']: 405 | if connection['Status'] == 'successful': 406 | return 1 407 | elif connection['Status'] == 'testing': 408 | time.sleep(30) 409 | i += 1 410 | elif connection['Status'] == 'failed': 411 | return 0 412 | if i == count: 413 | return 0 414 | 415 | def testRepTask(taskArn, count): 416 | i = 0 417 | while i < count: 418 | replicationTasks = dms.describe_replication_tasks(Filters=[{'Name':'replication-task-arn','Values':[taskArn]}]) 419 | for replicationTask in replicationTasks['ReplicationTasks']: 420 | if replicationTask['Status'] == 'running': 421 | return 1 422 | elif replicationTask['Status'] == 'starting': 423 | time.sleep(30) 424 | i += 1 425 | else: 426 | return 0 427 | if i == count: 428 | return 0 429 | 430 | def handler(event, context): 431 | logger.info(json.dumps(event)) 432 | taskArn = event['ResourceProperties']['taskArn'] 433 | sourceArn = event['ResourceProperties']['sourceArn'] 434 | targetArn = event['ResourceProperties']['targetArn'] 435 | instanceArn = event['ResourceProperties']['instanceArn'] 436 | dmsBucket = event['ResourceProperties']['dmsBucket'] 437 | 438 | if event['RequestType'] == 'Delete': 439 | try: 440 | dms.stop_replication_task(ReplicationTaskArn=taskArn) 
441 | time.sleep(30) 442 | s3.Bucket(dmsBucket).objects.all().delete() 443 | logger.info('Task Stopped, Bucket Emptied') 444 | except Exception as e: 445 | message = 'Issue with Stopping Task or Emptying Bucket:' + str(e) 446 | logger.error(message) 447 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'NA'}) 448 | else: 449 | taskArn = event['ResourceProperties']['taskArn'] 450 | sourceArn = event['ResourceProperties']['sourceArn'] 451 | targetArn = event['ResourceProperties']['targetArn'] 452 | instanceArn = event['ResourceProperties']['instanceArn'] 453 | 454 | try: 455 | if testConnection(sourceArn, 1) != 1: 456 | dms.test_connection( 457 | ReplicationInstanceArn=instanceArn, 458 | EndpointArn=sourceArn 459 | ) 460 | if testConnection(sourceArn,10) != 1: 461 | message = 'Source Endpoint Connection Failure' 462 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 463 | return {'message': message} 464 | if testConnection(targetArn, 1) != 1: 465 | dms.test_connection( 466 | ReplicationInstanceArn=instanceArn, 467 | EndpointArn=targetArn 468 | ) 469 | if testConnection(targetArn,10) != 1: 470 | message = 'Target Endpoint Connection Failure' 471 | logger.error(message) 472 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 473 | 474 | except Exception as e: 475 | message = 'DMS Endpoint Issue: ' + str(e) 476 | logger.error(message) 477 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 478 | 479 | try: 480 | if testRepTask(taskArn, 1) != 1: 481 | dms.start_replication_task( 482 | ReplicationTaskArn=taskArn, 483 | StartReplicationTaskType='start-replication') 484 | if testRepTask(taskArn, 10) != 1: 485 | message = 'Error during DMS execution' 486 | logger.error(message) 487 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 488 | else: 489 | message = 'DMS Execution Successful' 490 | logger.info(message) 491 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 492 | except Exception as e: 493 | message = 'DMS Task Issue: ' + str(e) 494 | logger.error(message) 495 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 496 | 497 | Handler: index.handler 498 | Role: 499 | Fn::GetAtt: [ExecutionRole, Arn] 500 | Runtime: python3.7 501 | DependsOn: 502 | - ExecutionRole 503 | GlueBucket: 504 | Type: AWS::S3::Bucket 505 | GlueJobLoadIncremental: 506 | Type: AWS::Glue::Job 507 | Properties: 508 | Name: DMSCDC_LoadIncremental 509 | Role: 510 | Fn::GetAtt: [ExecutionRole, Arn] 511 | ExecutionProperty: 512 | MaxConcurrentRuns: 50 513 | AllocatedCapacity: 2 514 | GlueVersion: 3.0 515 | Command: 516 | Name: glueetl 517 | ScriptLocation: !Sub 518 | - s3://${bucket}/DMSCDC_LoadIncremental.py 519 | - {bucket: !Ref GlueBucket} 520 | DefaultArguments: 521 | "--job-bookmark-option" : "job-bookmark-disable" 522 | "--TempDir" : !Sub 523 | - s3://${bucket} 524 | - {bucket: !Ref GlueBucket} 525 | "--enable-metrics" : "" 526 | DependsOn: 527 | - ExecutionRole 528 | GlueJobLoadInitial: 529 | Type: AWS::Glue::Job 530 | Properties: 531 | Name: DMSCDC_LoadInitial 532 | Role: 533 | Fn::GetAtt: [ExecutionRole, Arn] 534 | ExecutionProperty: 535 | MaxConcurrentRuns: 50 536 | AllocatedCapacity: 2 537 | GlueVersion: 3.0 538 | Command: 539 | Name: glueetl 540 | ScriptLocation: !Sub 541 | - s3://${bucket}/DMSCDC_LoadInitial.py 542 | - {bucket: !Ref GlueBucket} 543 | DefaultArguments: 544 | "--job-bookmark-option" : "job-bookmark-disable" 545 | "--TempDir" : !Sub 546 | - 
s3://${bucket} 547 | - {bucket: !Ref GlueBucket} 548 | "--enable-metrics" : "" 549 | DependsOn: 550 | - ExecutionRole 551 | GlueJobController: 552 | Type: AWS::Glue::Job 553 | Properties: 554 | Name: DMSCDC_Controller 555 | Role: 556 | Fn::GetAtt: [ExecutionRole, Arn] 557 | ExecutionProperty: 558 | MaxConcurrentRuns: 50 559 | Command: 560 | Name: pythonshell 561 | PythonVersion: 3 562 | ScriptLocation: !Sub 563 | - s3://${bucket}/DMSCDC_Controller.py 564 | - {bucket: !Ref GlueBucket} 565 | DefaultArguments: 566 | "--job-bookmark-option" : "job-bookmark-disable" 567 | "--TempDir" : !Sub 568 | - s3://${bucket} 569 | - {bucket: !Ref GlueBucket} 570 | "--enable-metrics" : "" 571 | DependsOn: 572 | - ExecutionRole 573 | GlueJobProcessTable: 574 | Type: AWS::Glue::Job 575 | Properties: 576 | Name: DMSCDC_ProcessTable 577 | Role: 578 | Fn::GetAtt: [ExecutionRole, Arn] 579 | ExecutionProperty: 580 | MaxConcurrentRuns: 50 581 | Command: 582 | Name: pythonshell 583 | PythonVersion: 3 584 | ScriptLocation: !Sub 585 | - s3://${bucket}/DMSCDC_ProcessTable.py 586 | - {bucket: !Ref GlueBucket} 587 | DefaultArguments: 588 | "--job-bookmark-option" : "job-bookmark-disable" 589 | "--TempDir" : !Sub 590 | - s3://${bucket} 591 | - {bucket: !Ref GlueBucket} 592 | "--enable-metrics" : "" 593 | DependsOn: 594 | - ExecutionRole 595 | -------------------------------------------------------------------------------- /DMSCDC_CloudTemplate_Source.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | DBEngine: 4 | Description: Choose the source DB engine to stream changes. 5 | Type: String 6 | AllowedValues: 7 | - mysql 8 | - oracle 9 | - postgres 10 | - mariadb 11 | - aurora 12 | - aurora-postgresql 13 | - db2 14 | - sybase 15 | - sqlserver 16 | DBServer: 17 | Description: Hostname of the source DB. 18 | Type: String 19 | DBPort: 20 | Description: Port of the source DB. 21 | Type: Number 22 | DBUser: 23 | Description: Username for the source DB. 24 | Type: String 25 | DBPassword: 26 | Description: Password for the source DB. 27 | Type: String 28 | NoEcho: true 29 | DBName: 30 | Description: Name of the source DB. 31 | Type: String 32 | DMSReplicationInstanceArn: 33 | Description: Enter the ARN for the DMS Replication Instance which has been configured with network connectivity to your Source DB. 34 | Type: String 35 | DMSTableFilter: 36 | Description: Choose the DMS table filter criteria. 37 | Type: String 38 | Default: "%" 39 | AllowedPattern: ^[a-zA-Z0-9-_%]*$ 40 | ConstraintDescription: Must only contain upper/lower-case letters, numbers, -, _, or %. 41 | DMSSchemaFilter: 42 | Description: Choose the DMS schema filter criteria. 43 | Type: String 44 | Default: "%" 45 | AllowedPattern: ^[a-zA-Z0-9-_%]*$ 46 | ConstraintDescription: Must only contain upper/lower-case letters, numbers, -, or _. 47 | CreateDMSBucket: 48 | Description: Select Yes to create a new bucket and No to use an existing bucket 49 | Type: String 50 | Default: 'Yes' 51 | AllowedValues: 52 | - 'Yes' 53 | - 'No' 54 | DMSBucket: 55 | Description: Name of the Bucket to land the DMS data. 56 | Type: String 57 | DMSFolder: 58 | Description: (Optional) Prefix within bucket to store DMS data. 
Please omit the trailing / 59 | Type: String 60 | CreateLakeBucket: 61 | Description: Select Yes to create a new bucket and No to use an existing bucket 62 | Type: String 63 | Default: 'Yes' 64 | AllowedValues: 65 | - 'Yes' 66 | - 'No' 67 | LakeBucket: 68 | Description: Name of the Bucket to land the processed data. 69 | Type: String 70 | LakeFolder: 71 | Description: (Optional) Prefix within bucket to store processed data. Please omit the trailing / 72 | Type: String 73 | LakeDBName: 74 | Description: Name to be used in analytical tools such as Athena and Redshift to reference these tables within the data lake. 75 | Type: String 76 | CDCSchedule: 77 | Description: "CDC Execution Schedule Cron Expression. Default is every hour on the hour. For more info: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html" 78 | Type: String 79 | Default: cron(0 0-23 * * ? *) 80 | CrawlerSchedule: 81 | Description: "Crawler Execution Schedule Cron Expression. Default is daily at 12:30 AM GMT: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html" 82 | Type: String 83 | Default: cron(30 0 * * ? *) 84 | Conditions: 85 | IsDMSFolderEmpty: !Equals [!Ref "DMSFolder", ""] 86 | IsLakeFolderEmpty: !Equals [!Ref "LakeFolder", ""] 87 | CreateDMSBucket: !Equals 88 | - !Ref CreateDMSBucket 89 | - 'Yes' 90 | CreateLakeBucket: !Equals 91 | - !Ref CreateLakeBucket 92 | - 'Yes' 93 | Metadata: 94 | AWS::CloudFormation::Interface: 95 | ParameterGroups: 96 | - 97 | Label: 98 | default: "DMS Source Database Configuration" 99 | Parameters: 100 | - DBEngine 101 | - DBServer 102 | - DBPort 103 | - DBUser 104 | - DBPassword 105 | - DBName 106 | - 107 | Label: 108 | default: "DMS Task Configuration" 109 | Parameters: 110 | - DMSReplicationInstanceArn 111 | - DMSTableFilter 112 | - DMSSchemaFilter 113 | - CreateDMSBucket 114 | - DMSBucket 115 | - DMSFolder 116 | - 117 | Label: 118 | default: "Data Lake Configuration" 119 | Parameters: 120 | - CreateLakeBucket 121 | - LakeBucket 122 | - LakeFolder 123 | - LakeDBName 124 | - CDCSchedule 125 | - CrawlerSchedule 126 | Resources: 127 | DMSSourceEndpoint: 128 | Type: AWS::DMS::Endpoint 129 | Properties: 130 | EndpointType: source 131 | EngineName: 132 | Ref: DBEngine 133 | ServerName: 134 | Ref: DBServer 135 | Port: 136 | Ref: DBPort 137 | Username: 138 | Ref: DBUser 139 | Password: 140 | Ref: DBPassword 141 | DatabaseName: 142 | Ref: DBName 143 | DMSRawBucket: 144 | Condition: CreateDMSBucket 145 | Type: AWS::S3::Bucket 146 | Properties: 147 | BucketName: 148 | Ref: DMSBucket 149 | DMSTargetEndpoint: 150 | Type: AWS::DMS::Endpoint 151 | Properties: 152 | EndpointType: target 153 | EngineName: s3 154 | ExtraConnectionAttributes : "dataFormat=parquet;includeOpForFullLoad=true;;parquetTimestampInMillisecond=true;" 155 | S3Settings : 156 | BucketFolder: 157 | Ref: DMSFolder 158 | BucketName: 159 | Ref: DMSBucket 160 | ServiceAccessRoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_Execution_Role' 161 | DMSReplicationTask: 162 | Type: AWS::DMS::ReplicationTask 163 | Properties: 164 | MigrationType: full-load-and-cdc 165 | ReplicationInstanceArn: 166 | Ref: DMSReplicationInstanceArn 167 | SourceEndpointArn: 168 | Ref: DMSSourceEndpoint 169 | TableMappings: 170 | Fn::Sub: 171 | - "{\"rules\": [{\"rule-type\": \"selection\", \"rule-id\": \"1\", \"rule-action\": \"include\", \"object-locator\": {\"schema-name\": \"${schema}\", \"table-name\": \"${tables}\"}, \"rule-name\": \"1\"}]}" 172 | - {schema: !Ref DMSSchemaFilter, tables: !Ref 
DMSTableFilter } 173 | TargetEndpointArn: 174 | Ref: DMSTargetEndpoint 175 | ReplicationTaskSettings: "{\"Logging\": {\"EnableLogging\": true }}" 176 | DataLakeBucket: 177 | Condition: CreateDMSBucket 178 | Type: 'AWS::S3::Bucket' 179 | Properties: 180 | BucketName: 181 | Ref: LakeBucket 182 | Trigger: 183 | Type: AWS::Glue::Trigger 184 | Properties: 185 | Type: SCHEDULED 186 | Name: !Sub 187 | - "DMSCDC-Trigger-${dbname}" 188 | - {dbname: !Ref LakeDBName} 189 | Schedule: 190 | Ref: CDCSchedule 191 | Actions: 192 | - JobName: DMSCDC_Controller 193 | Arguments: 194 | '--bucket': !Ref DMSBucket 195 | '--prefix': 196 | Fn::If: 197 | - IsDMSFolderEmpty 198 | - !Ref DMSFolder 199 | - !Sub 200 | - '${prefix}/' 201 | - prefix: !Ref DMSFolder 202 | '--out_bucket': !Ref LakeBucket 203 | '--out_prefix': 204 | Fn::If: 205 | - IsLakeFolderEmpty 206 | - !Ref LakeFolder 207 | - !Sub 208 | - '${prefix}/' 209 | - prefix: !Ref LakeFolder 210 | GlueCrawler: 211 | Type: AWS::Glue::Crawler 212 | Properties: 213 | Role: !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_Execution_Role' 214 | Schedule: 215 | ScheduleExpression: 216 | Ref: CrawlerSchedule 217 | DatabaseName: 218 | Ref: LakeDBName 219 | Targets: 220 | S3Targets: 221 | - 222 | Path: 223 | Fn::Join : 224 | - "/" 225 | - 226 | - Ref : LakeBucket 227 | - Fn::If: 228 | - IsLakeFolderEmpty 229 | - !Ref LakeFolder 230 | - !Sub 231 | - '${prefix}/' 232 | - prefix: !Ref LakeFolder 233 | InitDMS: 234 | Type: Custom::InitDMS 235 | DependsOn: 236 | - DMSReplicationTask 237 | Properties: 238 | ServiceToken: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitDMS' 239 | taskArn: !Ref DMSReplicationTask 240 | sourceArn: !Ref DMSSourceEndpoint 241 | targetArn: !Ref DMSTargetEndpoint 242 | instanceArn: !Ref DMSReplicationInstanceArn 243 | dmsBucket: !Ref DMSBucket 244 | InitController: 245 | Type: Custom::InitController 246 | DependsOn: 247 | - InitDMS 248 | - Trigger 249 | Properties: 250 | ServiceToken: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitController' 251 | dmsBucket: !Ref DMSBucket 252 | dmsPath: !Sub 253 | - "${prefix}${schema}/" 254 | - { prefix: !Ref DMSFolder,schema: !Ref DMSSchemaFilter} 255 | lakeBucket: !Ref LakeBucket 256 | lakePath: 257 | Fn::If: 258 | - IsLakeFolderEmpty 259 | - !Ref LakeFolder 260 | - !Sub 261 | - '${prefix}/' 262 | - prefix: !Ref LakeFolder 263 | lakeDBName: !Ref LakeDBName 264 | -------------------------------------------------------------------------------- /DMSCDC_Controller.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | from urllib.parse import urlparse 4 | import urllib 5 | import datetime 6 | import boto3 7 | import time 8 | from awsglue.utils import getResolvedOptions 9 | import logging 10 | 11 | handler = logging.StreamHandler(sys.stdout) 12 | handler.setLevel(logging.INFO) 13 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 14 | handler.setFormatter(formatter) 15 | 16 | logging.basicConfig() 17 | logger = logging.getLogger(__name__) 18 | logger.addHandler(handler) 19 | 20 | ddbconn = boto3.client('dynamodb') 21 | glue = boto3.client('glue') 22 | s3conn = boto3.client('s3') 23 | 24 | prefix = "" 25 | out_prefix = "" 26 | 27 | #Optional Parameters 28 | if ('--prefix' in sys.argv): 29 | nextarg = sys.argv[sys.argv.index('--prefix')+1] 30 | if (nextarg is not None and not(nextarg.startswith('--'))): 31 | prefix = nextarg 32 | if ('--out_prefix' in sys.argv): 33 
| nextarg = sys.argv[sys.argv.index('--out_prefix')+1] 34 | if (nextarg is not None and not(nextarg.startswith('--'))): 35 | out_prefix = nextarg 36 | 37 | #Required Parameters 38 | args = getResolvedOptions(sys.argv, [ 39 | 'bucket', 40 | 'out_bucket']) 41 | 42 | bucket = args['bucket'] 43 | out_bucket = args['out_bucket'] 44 | out_path = out_bucket + '/' + out_prefix 45 | 46 | #get the list of table folders 47 | s3_input = 's3://'+bucket+'/'+prefix 48 | url = urlparse(s3_input) 49 | schemas = s3conn.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/').get('CommonPrefixes') 50 | if schemas is None: 51 | logger.error('DMSBucket: '+bucket+' DMSFolder: ' + prefix + ' is empty. Confirm the source DB has data and the DMS replication task is replicating to this S3 location. Review the documention https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html to ensure the DB is setup for CDC.') 52 | sys.exit() 53 | 54 | index = 0 55 | #get folder metadata 56 | for schema in schemas: 57 | tables = s3conn.list_objects(Bucket=bucket, Prefix=schema['Prefix'], Delimiter='/').get('CommonPrefixes') 58 | for table in tables: 59 | full_folder = table['Prefix'] 60 | foldername = full_folder[len(schema['Prefix']):] 61 | path = bucket + '/' + full_folder 62 | logger.warn('Processing table: ' + path) 63 | item = { 64 | 'path': {'S':path}, 65 | 'bucket': {'S':bucket}, 66 | 'prefix': {'S':schema['Prefix']}, 67 | 'folder': {'S':foldername}, 68 | 'PrimaryKey': {'S':'null'}, 69 | 'PartitionKey': {'S':'null'}, 70 | 'LastFullLoadDate': {'S':'1900-01-01 00:00:00'}, 71 | 'LastIncrementalFile': {'S':path + '0.parquet'}, 72 | 'ActiveFlag': {'S':'false'}, 73 | 'out_path': {'S': out_path}} 74 | 75 | #CreateTable if not already present 76 | try: 77 | response1 = ddbconn.describe_table(TableName='DMSCDC_Controller') 78 | except Exception as e: 79 | ddbconn.create_table( 80 | TableName='DMSCDC_Controller', 81 | KeySchema=[{'AttributeName': 'path','KeyType': 'HASH'}], 82 | AttributeDefinitions=[{'AttributeName': 'path','AttributeType': 'S'}], 83 | BillingMode='PAY_PER_REQUEST') 84 | time.sleep(10) 85 | 86 | #Put Item if not already present 87 | try: 88 | response = ddbconn.get_item( 89 | TableName='DMSCDC_Controller', 90 | Key={'path': {'S':path}}) 91 | if 'Item' in response: 92 | item = response['Item'] 93 | else: 94 | ddbconn.put_item( 95 | TableName='DMSCDC_Controller', 96 | Item=item) 97 | except: 98 | ddbconn.put_item( 99 | TableName='DMSCDC_Controller', 100 | Item=item) 101 | 102 | logger.debug(json.dumps(item)) 103 | logger.warn('Bucket: '+bucket+' Path: ' + full_folder) 104 | 105 | 106 | #determine if need to run incremental --> Run incremental --> Update DDB 107 | if item['ActiveFlag']['S'] == 'true': 108 | logger.warn('starting processTable, args: ' + json.dumps(item)) 109 | response = glue.start_job_run(JobName='DMSCDC_ProcessTable',Arguments={'--item':json.dumps(item)}) 110 | logger.debug(json.dumps(response)) 111 | else: 112 | logger.error('Load is not active. 
Update dynamoDB: ' + full_folder) 113 | -------------------------------------------------------------------------------- /DMSCDC_LoadIncremental.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.job import Job 3 | from pyspark.context import SparkContext 4 | from awsglue.context import GlueContext 5 | from pyspark.sql import SQLContext 6 | from pyspark.sql import SparkSession 7 | from pyspark.sql.functions import * 8 | from pyspark.sql.window import Window 9 | from awsglue.utils import getResolvedOptions 10 | import boto3 11 | import urllib 12 | from urllib import parse 13 | 14 | s3conn = boto3.client('s3') 15 | 16 | sparkContext = SparkContext.getOrCreate() 17 | glueContext = GlueContext(sparkContext) 18 | spark = glueContext.spark_session 19 | job = Job(glueContext) 20 | 21 | args = getResolvedOptions(sys.argv, [ 22 | 'JOB_NAME', 23 | 'bucket', 24 | 'prefix', 25 | 'folder', 26 | 'out_path', 27 | 'lastIncrementalFile', 28 | 'newIncrementalFile', 29 | 'primaryKey', 30 | 'partitionKey']) 31 | 32 | job.init(args['JOB_NAME'], args) 33 | s3_outputpath = 's3://' + args['out_path'] + args['folder'] 34 | 35 | last_file = args['lastIncrementalFile'] 36 | curr_file = args['newIncrementalFile'] 37 | primary_keys = args['primaryKey'] 38 | partition_keys = args['partitionKey'] 39 | 40 | # gets list of files to process 41 | inputFileList = [] 42 | results = s3conn.list_objects_v2(Bucket=args['bucket'], Prefix=args['prefix'] +'2', StartAfter=args['lastIncrementalFile']).get('Contents') 43 | for result in results: 44 | if (args['bucket'] + '/' + result['Key'] != last_file): 45 | inputFileList.append('s3://' + args['bucket'] + '/' + result['Key']) 46 | 47 | inputfile = spark.read.parquet(*inputFileList) 48 | 49 | tgtExists = True 50 | try: 51 | test = spark.read.parquet(s3_outputpath) 52 | except: 53 | tgtExists = False 54 | 55 | #No Primary_Keys implies insert only 56 | if primary_keys == "null" or not(tgtExists): 57 | output = inputfile.filter(inputfile.Op=='I') 58 | filelist = [["null"]] 59 | else: 60 | primaryKeys = primary_keys.split(",") 61 | windowRow = Window.partitionBy(primaryKeys).orderBy("sortpath") 62 | 63 | #Loads the target data adding columns for processing 64 | target = spark.read.parquet(s3_outputpath).withColumn("sortpath", lit("0")).withColumn("tgt_filepath",input_file_name()).withColumn("rownum", lit(1)) 65 | input = inputfile.withColumn("sortpath", input_file_name()).withColumn("src_filepath",input_file_name()).withColumn("rownum", row_number().over(windowRow)) 66 | 67 | #determine impacted files 68 | files = target.join(input, primaryKeys, 'inner').select(col("tgt_filepath").alias("list_filepath")).distinct() 69 | filelist = files.collect() 70 | 71 | uniondata = input.unionByName(target.join(files,files.list_filepath==target.tgt_filepath), allowMissingColumns=True) 72 | window = Window.partitionBy(primaryKeys).orderBy(desc("sortpath"), desc("rownum")) 73 | output = uniondata.withColumn('rnk', rank().over(window)).where(col("rnk")==1).where(col("Op")!="D").coalesce(1).select(inputfile.columns) 74 | 75 | # write data by partitions 76 | if partition_keys != "null" : 77 | partitionKeys = partition_keys.split(",") 78 | partitionCount = output.select(partitionKeys).distinct().count() 79 | output.repartition(partitionCount,partitionKeys).write.mode('append').partitionBy(partitionKeys).parquet(s3_outputpath) 80 | else: 81 | output.write.mode('append').parquet(s3_outputpath) 82 | 83 | #delete old files 84 | for row in 
filelist: 85 | if row[0] != "null": 86 | o = parse.urlparse(row[0]) 87 | s3conn.delete_object(Bucket=o.netloc, Key=parse.unquote(o.path)[1:]) 88 | -------------------------------------------------------------------------------- /DMSCDC_LoadInitial.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.job import Job 3 | from pyspark.context import SparkContext 4 | from pyspark.sql import SQLContext 5 | from pyspark.sql import SparkSession 6 | from pyspark.sql.functions import * 7 | from awsglue.utils import getResolvedOptions 8 | from awsglue.context import GlueContext 9 | 10 | args = getResolvedOptions(sys.argv, [ 11 | 'JOB_NAME', 12 | 'bucket', 13 | 'prefix', 14 | 'folder', 15 | 'out_path', 16 | 'partitionKey']) 17 | 18 | 19 | sparkContext = SparkContext.getOrCreate() 20 | glueContext = GlueContext(sparkContext) 21 | spark = glueContext.spark_session 22 | job = Job(glueContext) 23 | job.init(args['JOB_NAME'], args) 24 | 25 | s3_inputpath = 's3://' + args['bucket'] + '/' + args['prefix'] 26 | s3_outputpath = 's3://' + args['out_path'] + args['folder'] 27 | 28 | input = spark.read.parquet(s3_inputpath+"/LOAD*.parquet").withColumn("Op", lit("I")) 29 | 30 | partition_keys = args['partitionKey'] 31 | if partition_keys != "null" : 32 | partitionKeys = partition_keys.split(",") 33 | partitionCount = input.select(partitionKeys).distinct().count() 34 | input.repartition(partitionCount,partitionKeys).write.mode('overwrite').partitionBy(partitionKeys).parquet(s3_outputpath) 35 | else: 36 | input.write.mode('overwrite').parquet(s3_outputpath) 37 | 38 | job.commit() 39 | -------------------------------------------------------------------------------- /DMSCDC_ProcessTable.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import datetime 4 | import boto3 5 | import time 6 | from awsglue.utils import getResolvedOptions 7 | import logging 8 | 9 | handler = logging.StreamHandler(sys.stdout) 10 | handler.setLevel(logging.INFO) 11 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 12 | handler.setFormatter(formatter) 13 | 14 | logging.basicConfig() 15 | logger = logging.getLogger(__name__) 16 | logger.addHandler(handler) 17 | 18 | ddbconn = boto3.client('dynamodb') 19 | glue = boto3.client('glue') 20 | s3conn = boto3.client('s3') 21 | 22 | #Required Parameters 23 | args = getResolvedOptions(sys.argv, ['item']) 24 | item = json.loads(args['item']) 25 | 26 | partitionKey = item['PartitionKey']['S'] 27 | lastFullLoadDate = item['LastFullLoadDate']['S'] 28 | lastIncrementalFile = item['LastIncrementalFile']['S'] 29 | activeFlag = item['ActiveFlag']['S'] 30 | primaryKey = item['PrimaryKey']['S'] 31 | folder = item['folder']['S'] 32 | prefix = item['prefix']['S'] 33 | full_folder = prefix + folder 34 | path = item['path']['S'] 35 | bucket = item['bucket']['S'] 36 | out_path = item['out_path']['S'] 37 | 38 | def runGlueJob(job, args): 39 | logger.info('starting runGlueJob: ' + job + ' args: ' + json.dumps(args)) 40 | response = glue.start_job_run(JobName=job,Arguments=args) 41 | count=90 42 | sec=10 43 | jobId=response['JobRunId'] 44 | i = 0 45 | while i < count: 46 | response = glue.get_job_run(JobName=job, RunId=jobId) 47 | status = response['JobRun']['JobRunState'] 48 | if status == 'SUCCEEDED': 49 | return 0 50 | elif (status == 'RUNNING' or status == 'STARTING' or status == 'STOPPING'): 51 | time.sleep(sec) 52 | i+=1 53 | else: 54 | 
logger.error('Error during loadIncremental execution: ' + status) 55 | return 1 56 | if i == count: 57 | logger.error('Execution timeout: ' + count*sec + ' seconds') 58 | return 1 59 | 60 | logger.warn('starting processTable: ' + path) 61 | loadInitial = False 62 | #determine if need to run initial --> Run Initial --> Update DDB 63 | initialfiles = s3conn.list_objects(Bucket=bucket, Prefix=full_folder+'LOAD').get('Contents') 64 | if initialfiles is not None : 65 | s3FileTS = initialfiles[0]['LastModified'].replace(tzinfo=None) 66 | ddbFileTS = datetime.datetime.strptime(lastFullLoadDate, '%Y-%m-%d %H:%M:%S') 67 | if s3FileTS > ddbFileTS: 68 | message='Starting to process Initial file: ' + full_folder 69 | loadInitial = True 70 | lastFullLoadDate = datetime.datetime.strftime(s3FileTS,'%Y-%m-%d %H:%M:%S') 71 | else: 72 | message='Intial files already processed: ' + full_folder 73 | else: 74 | message='No initial files to process: ' + full_folder 75 | logger.warn(message) 76 | 77 | #Call Initial Glue Job for this source 78 | if loadInitial: 79 | args={ 80 | '--bucket':bucket, 81 | '--prefix':full_folder, 82 | '--folder':folder, 83 | '--out_path':out_path, 84 | '--partitionKey':partitionKey} 85 | if runGlueJob('DMSCDC_LoadInitial',args) != 1: 86 | ddbconn.update_item( 87 | TableName='DMSCDC_Controller', 88 | Key={"path": {"S":path}}, 89 | AttributeUpdates={"LastFullLoadDate": {"Value": {"S": lastFullLoadDate}}}) 90 | 91 | loadIncremental = False 92 | #Get the latest incremental file 93 | incrementalFiles = s3conn.list_objects_v2(Bucket=bucket, Prefix=full_folder+'2', StartAfter=lastIncrementalFile).get('Contents') 94 | if incrementalFiles is not None: 95 | filecount = len(incrementalFiles) 96 | newIncrementalFile = bucket + '/' + incrementalFiles[filecount-1]['Key'] 97 | if newIncrementalFile != lastIncrementalFile: 98 | loadIncremental = True 99 | message = 'Starting to process incremental files: ' + full_folder 100 | else: 101 | message = 'Incremental files already processed: ' + full_folder 102 | else: 103 | message = 'No incremental files to process: ' + full_folder 104 | logger.warn(message) 105 | 106 | #Call Incremental Glue Job for this source 107 | if loadIncremental: 108 | args = { 109 | '--bucket':bucket, 110 | '--prefix':full_folder, 111 | '--folder':folder, 112 | '--out_path':out_path, 113 | '--partitionKey':partitionKey, 114 | '--lastIncrementalFile' : lastIncrementalFile, 115 | '--newIncrementalFile' : newIncrementalFile, 116 | '--primaryKey' : primaryKey 117 | } 118 | if runGlueJob('DMSCDC_LoadIncremental',args) != 1: 119 | ddbconn.update_item( 120 | TableName='DMSCDC_Controller', 121 | Key={"path": {"S":path}}, 122 | AttributeUpdates={"LastIncrementalFile": {"Value": {"S": newIncrementalFile}}}) 123 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | DBUser: 4 | Description: Username for the source DB. 5 | Type: String 6 | DBPassword: 7 | Description: Password for the source DB. 8 | Type: String 9 | NoEcho: true 10 | DBName: 11 | Description: Name of the source DB. 
12 | Type: String 13 | SecurityGroup: 14 | Description: Sec Group for RDS & Lambda 15 | Type: AWS::EC2::SecurityGroup::Id 16 | Subnet1: 17 | Description: Subnet-1 for RDS & Lambda 18 | Type: AWS::EC2::Subnet::Id 19 | Subnet2: 20 | Description: Subnet-2 for RDS & Lambda 21 | Type: AWS::EC2::Subnet::Id 22 | Resources: 23 | InitRole: 24 | Type: AWS::IAM::Role 25 | Properties : 26 | AssumeRolePolicyDocument: 27 | Version : 2012-10-17 28 | Statement : 29 | - 30 | Effect : Allow 31 | Principal : 32 | Service : 33 | - lambda.amazonaws.com 34 | Action : 35 | - sts:AssumeRole 36 | Path : / 37 | Policies: 38 | - 39 | PolicyName: LambdaCloudFormationPolicy 40 | PolicyDocument: 41 | Version: 2012-10-17 42 | Statement: 43 | - 44 | Effect: Allow 45 | Action: 46 | - s3:* 47 | Resource: 48 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}" 49 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}" 50 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}/*" 51 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}/*" 52 | ManagedPolicyArns: 53 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess 54 | - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole 55 | RDSDBParameterGroup: 56 | Type: AWS::RDS::DBParameterGroup 57 | Properties: 58 | Description: DMDCDC-MySQL 59 | Family: mysql5.7 60 | Parameters: 61 | binlog_format : 'ROW' 62 | binlog_checksum : 'NONE' 63 | RDSSubnetGroup: 64 | Type: AWS::RDS::DBSubnetGroup 65 | Properties: 66 | DBSubnetGroupDescription: Subnet for DMSCDC-MySql 67 | SubnetIds: 68 | - !Ref Subnet1 69 | - !Ref Subnet2 70 | RDSDB: 71 | Type: AWS::RDS::DBInstance 72 | Properties: 73 | DBParameterGroupName: !Ref RDSDBParameterGroup 74 | DBName: !Ref DBName 75 | Engine: mysql 76 | EngineVersion: 5.7 77 | DBInstanceClass: db.t2.micro 78 | StorageType: gp2 79 | PubliclyAccessible: True 80 | AllocatedStorage: 20 81 | MasterUserPassword: !Ref DBPassword 82 | MasterUsername: !Ref DBUser 83 | DBSubnetGroupName: !Ref RDSSubnetGroup 84 | VPCSecurityGroups: 85 | - !Ref SecurityGroup 86 | RDSLoadLambda: 87 | Type: AWS::Lambda::Function 88 | Properties: 89 | Role: !GetAtt 'InitRole.Arn' 90 | Layers: 91 | - !Ref PymysqlLayer 92 | Handler: index.handler 93 | Runtime: python3.9 94 | Timeout: 30 95 | VpcConfig: 96 | SecurityGroupIds: 97 | - !Ref SecurityGroup 98 | SubnetIds: 99 | - !Ref Subnet1 100 | - !Ref Subnet2 101 | Code: 102 | ZipFile: | 103 | import sys, logging, pymysql, json, urllib3 104 | from pymysql.constants import CLIENT 105 | http = urllib3.PoolManager() 106 | logger = logging.getLogger() 107 | logger.setLevel(logging.INFO) 108 | 109 | def send(event, context, responseStatus, responseData): 110 | responseUrl = event['ResponseURL'] 111 | responseBody = { 112 | 'Status' : responseStatus, 113 | 'Reason' : "See the details in CloudWatch Log Stream: {}".format(context.log_stream_name), 114 | 'PhysicalResourceId' : context.log_stream_name, 115 | 'StackId' : event['StackId'], 116 | 'RequestId' : event['RequestId'], 117 | 'LogicalResourceId' : event['LogicalResourceId'], 118 | 'NoEcho' : False, 119 | 'Data' : responseData 120 | } 121 | json_responseBody = json.dumps(responseBody) 122 | headers = { 123 | 'content-type' : '', 124 | 'content-length' : str(len(json_responseBody)) 125 | } 126 | try: 127 | response = http.request('PUT', responseUrl, headers=headers, body=json_responseBody) 128 | except Exception as e: 129 | print("send(..) 
failed executing http.request(..):", e) 130 | 131 | def handler(event, context): 132 | print("Received event: " + json.dumps(event, indent=2)) 133 | 134 | if event['RequestType'] == 'Delete': 135 | send(event, context, 'SUCCESS', {'Data': 'Delete complete'}) 136 | else: 137 | host = event['ResourceProperties']['host'] 138 | username = event['ResourceProperties']['username'] 139 | password = event['ResourceProperties']['password'] 140 | dbname = event['ResourceProperties']['dbname'] 141 | script = event['ResourceProperties']['script'] 142 | 143 | try: 144 | conn = pymysql.connect(host=host, user=username, passwd=password, db=dbname, connect_timeout=5, client_flag= CLIENT.MULTI_STATEMENTS) 145 | except Exception as e: 146 | print(str(e)) 147 | send(event, context, 'FAILED', {'Data': 'Connect failed'}) 148 | sys.exit() 149 | 150 | try: 151 | response = http.request('GET', script, preload_content=False) 152 | script_content = response.read() 153 | except Exception as e: 154 | print(str(e)) 155 | send(event, context, 'FAILED', {'Data': 'Open script failed'}) 156 | sys.exit() 157 | 158 | try: 159 | cur = conn.cursor() 160 | response = cur.execute(script_content.decode('utf-8')) 161 | conn.commit() 162 | send(event, context, 'SUCCESS', {'Data': 'Insert complete'}) 163 | except Exception as e: 164 | print(str(e)) 165 | send(event, context, 'FAILED', {'Data': 'Insert failed'}) 166 | RDSLoadInit: 167 | Type: Custom::RDSLoadInit 168 | DependsOn: 169 | - RDSLoadLambda 170 | Properties: 171 | ServiceToken: !GetAtt 'RDSLoadLambda.Arn' 172 | host: !GetAtt 'RDSDB.Endpoint.Address' 173 | username: !Ref DBUser 174 | password: !Ref DBPassword 175 | dbname: !Ref DBName 176 | script: https://dmscdc-files.s3.us-west-2.amazonaws.com/DMSCDC_SampleDB_Initial.sql 177 | PymysqlLayer: 178 | Type: AWS::Lambda::LayerVersion 179 | Properties: 180 | CompatibleRuntimes: 181 | - python3.8 182 | - python3.9 183 | Content: 184 | S3Bucket: dmscdc-files 185 | S3Key: pymysql.zip 186 | Outputs: 187 | RDSEndpoint: 188 | Description: Cluster endpoint 189 | Value: !Sub "${RDSDB.Endpoint.Address}:${RDSDB.Endpoint.Port}" 190 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB_Incremental.sql: -------------------------------------------------------------------------------- 1 | use sampledb; 2 | 3 | update product set name = 'Sample Product', dept = 'Sample Dept', category = 'Sample Category' where id = 1001; 4 | delete from product where id = 1002; 5 | call loadorders(1345, current_date); 6 | insert into store values(1009, '125 Technology Dr.', 'Irvine', 'CA', 'US', '92618'); 7 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB_Initial.sql: -------------------------------------------------------------------------------- 1 | create schema if not exists sampledb; 2 | 3 | drop table if exists sampledb.store; 4 | create table sampledb.store( 5 | id int, 6 | address1 varchar(1024), 7 | city varchar(255), 8 | state varchar(2), 9 | countrycode varchar(2), 10 | postcode varchar(10)); 11 | 12 | insert into sampledb.store (id, address1, city, state, countrycode, postcode) values 13 | (1001, '320 W. 
100th Ave, 100, Southgate Shopping Ctr - Anchorage','Anchorage','AK','US','99515'), 14 | (1002, '1005 E Dimond Blvd','Anchorage','AK','US','99515'), 15 | (1003, '1771 East Parks Hwy, Unit #4','Wasilla','AK','US','99654'), 16 | (1004, '345 South Colonial Dr','Alabaster','AL','US','35007'), 17 | (1005, '700 Montgomery Hwy, Suite 100','Vestavia Hills','AL','US','35216'), 18 | (1006, '20701 I-30','Benton','AR','US','72015'), 19 | (1007, '2034 Fayetteville Rd','Van Buren','AR','US','72956'), 20 | (1008, '3640 W. Anthem Way','Anthem','AZ','US','85086'); 21 | 22 | drop table if exists sampledb.product; 23 | create table sampledb.product( 24 | id int , 25 | name varchar(255), 26 | dept varchar(100), 27 | category varchar(100), 28 | price decimal(10,2)); 29 | 30 | insert into sampledb.product (id, name, dept, category, price) values 31 | (1001,'Fire 7','Amazon Devices','Fire Tablets',39), 32 | (1002,'Fire HD 8','Amazon Devices','Fire Tablets',89), 33 | (1003,'Fire HD 10','Amazon Devices','Fire Tablets',119), 34 | (1004,'Fire 7 Kids Edition','Amazon Devices','Fire Tablets',79), 35 | (1005,'Fire 8 Kids Edition','Amazon Devices','Fire Tablets',99), 36 | (1006,'Fire HD 10 Kids Edition','Amazon Devices','Fire Tablets',159), 37 | (1007,'Fire TV Stick','Amazon Devices','Fire TV',49), 38 | (1008,'Fire TV Stick 4K','Amazon Devices','Fire TV',49), 39 | (1009,'Fire TV Cube','Amazon Devices','Fire TV',119), 40 | (1010,'Kindle','Amazon Devices','Kindle E-readers',79), 41 | (1011,'Kindle Paperwhite','Amazon Devices','Kindle E-readers',129), 42 | (1012,'Kindle Oasis','Amazon Devices','Kindle E-readers',279), 43 | (1013,'Echo Dot (3rd Gen)','Amazon Devices','Echo and Alexa',49), 44 | (1014,'Echo Auto','Amazon Devices','Echo and Alexa',24), 45 | (1015,'Echo Show (2nd Gen)','Amazon Devices','Echo and Alexa',229), 46 | (1016,'Echo Plus (2nd Gen)','Amazon Devices','Echo and Alexa',149), 47 | (1017,'Fire TV Recast','Amazon Devices','Echo and Alexa',229), 48 | (1018,'EchoSub Bundle with 2 Echo (2nd Gen)','Amazon Devices','Echo and Alexa',279), 49 | (1019,'Mini Projector','Electronics','Projectors',288), 50 | (1020,'TOUMEI Mini Projector','Electronics','Projectors',300), 51 | (1021,'InFocus IN114XA Projector, DLP XGA 3600 Lumens 3D Ready 2HDMI Speakers','Electronics','Projectors',350), 52 | (1022,'Optoma X343 3600 Lumens XGA DLP Projector with 15,000-hour Lamp Life','Electronics','Projectors',339), 53 | (1023,'TCL 55S517 55-Inch 4K Ultra HD Roku Smart LED TV (2018 Model)','Electronics','TV',379), 54 | (1024,'Samsung 55NU7100 Flat 55” 4K UHD 7 Series Smart TV 2018','Electronics','TV',547), 55 | (1025,'LG Electronics 55UK6300PUE 55-Inch 4K Ultra HD Smart LED TV (2018 Model)','Electronics','TV',496); 56 | 57 | 58 | drop table if exists sampledb.productorder; 59 | create table sampledb.productorder ( 60 | id int NOT NULL, 61 | productid int, 62 | storeid int, 63 | qty int, 64 | soldprice decimal(10,2), 65 | create_dt date); 66 | 67 | DROP PROCEDURE IF EXISTS sampledb.loadorders; 68 | CREATE PROCEDURE sampledb.loadorders ( 69 | IN OrderCnt int, 70 | IN create_dt date 71 | ) 72 | BEGIN 73 | DECLARE i INT DEFAULT 0; 74 | DECLARE maxid INT Default 0; 75 | SELECT coalesce(max(id),0) INTO maxid from sampledb.productorder; 76 | helper: LOOP 77 | IF i 01/22/2022: This solution was updated to resolve some key bugs and implement certain performance related enhancements. See the [change log](CHANGELOG.md) for more details. 
9 |
10 | [Glue Jobs](#glue-jobs)
11 | * [Controller](#controller) - [DMSCDC_Controller.py](./DMSCDC_Controller.py)
12 | * [ProcessTable](#processtable) - [DMSCDC_ProcessTable.py](./DMSCDC_ProcessTable.py)
13 | * [LoadInitial](#loadinitial) - [DMSCDC_LoadInitial.py](./DMSCDC_LoadInitial.py)
14 | * [LoadIncremental](#loadincremental) - [DMSCDC_LoadIncremental.py](./DMSCDC_LoadIncremental.py)
15 |
16 | [Sample Database](#sample-database)
17 | * [Source DB Setup](#source-db-setup) - [DMSCDC_SampleDB.yaml](./DMSCDC_SampleDB.yaml)
18 | * [Incremental Load Script](#incremental-load-script) - [DMSCDC_SampleDB_Incremental.sql](./DMSCDC_SampleDB_Incremental.sql)
19 |
20 | ## Glue Jobs
21 | This solution uses a Python shell job for the Controller. The Controller runs each time the workflow is triggered, but the LoadInitial and LoadIncremental jobs are executed only as needed.
22 |
23 | ### Controller
24 | The Controller job leverages the state table stored in DynamoDB and compares that state with the files in S3. It determines which tables/files need to be processed.
25 |
26 | The first step in the processing is to determine the *schemas* and *tables* located in the DMS raw S3 location and to construct an *item* object representing the default values for each *table*.
27 | ```python
28 | #get the list of table folders
29 | s3_input = 's3://'+bucket+'/'+prefix
30 | url = urlparse(s3_input)
31 | schemas = s3conn.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/').get('CommonPrefixes')
32 | if schemas is None:
33 |     logger.error('DMSBucket: '+bucket+' DMSFolder: ' + prefix + ' is empty. Confirm the source DB has data and the DMS replication task is replicating to this S3 location. Review the documentation https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html to ensure the DB is set up for CDC.')
34 |     sys.exit()
35 |
36 | index = 0
37 | #get folder metadata
38 | for schema in schemas:
39 |     tables = s3conn.list_objects(Bucket=bucket, Prefix=schema['Prefix'], Delimiter='/').get('CommonPrefixes')
40 |     for table in tables:
41 |         full_folder = table['Prefix']
42 |         foldername = full_folder[len(schema['Prefix']):]
43 |         path = bucket + '/' + full_folder
44 |         print('Processing table: ' + path)
45 |         item = {
46 |             'path': {'S':path},
47 |             'bucket': {'S':bucket},
48 |             'prefix': {'S':schema['Prefix']},
49 |             'folder': {'S':foldername},
50 |             'PrimaryKey': {'S':'null'},
51 |             'PartitionKey': {'S':'null'},
52 |             'LastFullLoadDate': {'S':'1900-01-01 00:00:00'},
53 |             'LastIncrementalFile': {'S':path + '0.parquet'},
54 |             'ActiveFlag': {'S':'false'}}
55 | ```
56 |
57 | The next step is to pull the state of each table from the DMSCDC_Controller DynamoDB table. If the DynamoDB table is not present, it will be created. If the row is not present, it will be created with the default values set earlier.
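Once this default row exists, a user reviews the table and activates it by setting `ActiveFlag` (and, optionally, `PrimaryKey` and `PartitionKey`) directly on the DynamoDB item, as described in the note further below. A minimal, hypothetical boto3 example (the bucket and folder values are placeholders, not part of the solution code) is shown here, followed by the state-lookup code itself:

```python
import boto3

ddbconn = boto3.client('dynamodb')

# Hypothetical example: activate the sampledb.product table, keyed on "id",
# with no partition key. The 'path' value must match the item written by the
# controller (bucket + '/' + schema prefix + table folder).
ddbconn.update_item(
    TableName='DMSCDC_Controller',
    Key={'path': {'S': 'my-dms-bucket/sampledb/product/'}},
    AttributeUpdates={
        'ActiveFlag':   {'Value': {'S': 'true'}},
        'PrimaryKey':   {'Value': {'S': 'id'}},
        'PartitionKey': {'Value': {'S': 'null'}}})
```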
58 | ```python
59 | try:
60 |     response1 = ddbconn.describe_table(TableName='DMSCDC_Controller')
61 | except Exception as e:
62 |     ddbconn.create_table(
63 |         TableName='DMSCDC_Controller',
64 |         KeySchema=[{'AttributeName': 'path','KeyType': 'HASH'}],
65 |         AttributeDefinitions=[{'AttributeName': 'path','AttributeType': 'S'}],
66 |         BillingMode='PAY_PER_REQUEST')
67 |     time.sleep(10)
68 |
69 | try:
70 |     response = ddbconn.get_item(
71 |         TableName='DMSCDC_Controller',
72 |         Key={'path': {'S':path}})
73 |     if 'Item' in response:
74 |         item = response['Item']
75 |     else:
76 |         ddbconn.put_item(
77 |             TableName='DMSCDC_Controller',
78 |             Item=item)
79 | except:
80 |     ddbconn.put_item(
81 |         TableName='DMSCDC_Controller',
82 |         Item=item)
83 | ```
84 |
85 | Next, the Controller starts a new Python shell job, `DMSCDC_ProcessTable`, for each table in parallel. That job determines whether an initial and/or incremental load is needed.
86 |
87 | > Note: If the DynamoDB table has ActiveFlag != 'true', the table will be skipped. This is to ensure that a user has reviewed the table and set the appropriate keys. If the PrimaryKey has been set, the incremental load will conduct change detection logic. If the PrimaryKey has NOT been set, the incremental load will only load rows marked for insert. If the PartitionKey has been set, the Spark job will partition the data before it is written.
88 |
89 | ```python
90 | #determine if need to run incremental --> Run incremental --> Update DDB
91 | if item['ActiveFlag']['S'] == 'true':
92 |     logger.warn('starting processTable, args: ' + json.dumps(item))
93 |     response = glue.start_job_run(JobName='DMSCDC_ProcessTable',Arguments={'--item':json.dumps(item)})
94 |     logger.debug(json.dumps(response))
95 | else:
96 |     logger.error('Load is not active. Update dynamoDB: ' + full_folder)
97 | ```
98 |
99 | ### ProcessTable
100 | The first step of the `DMSCDC_ProcessTable` job is to determine whether the initial file needs to be processed by comparing the timestamp of the initial load file stored in the DynamoDB table with the timestamp retrieved from S3. If it is determined that the initial file should be processed, the [LoadInitial](#loadinitial) Glue job is triggered. The ProcessTable job waits for its completion using the `runGlueJob` function. If that job completes successfully, the DynamoDB table is updated with the timestamp of the file which was processed.
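The `runGlueJob` helper referenced here is not reproduced in this walkthrough; its implementation lives in [DMSCDC_ProcessTable.py](./DMSCDC_ProcessTable.py). As a rough sketch only, assuming a boto3 Glue client and inferring the return convention from the calls below (a return value of 1 signals failure), such a helper might look like this:

```python
import time
import boto3

glue = boto3.client('glue')

def runGlueJob(jobName, arguments):
    # Start the job run and poll until it reaches a terminal state.
    run_id = glue.start_job_run(JobName=jobName, Arguments=arguments)['JobRunId']
    while True:
        state = glue.get_job_run(JobName=jobName, RunId=run_id)['JobRun']['JobRunState']
        if state == 'SUCCEEDED':
            return 0
        if state in ('FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
            return 1          # callers test `runGlueJob(...) != 1` before updating DynamoDB
        time.sleep(30)        # still STARTING/RUNNING/STOPPING; wait and poll again
```

The actual helper may differ (for example, it may enforce a maximum wait time); see the source file for the authoritative version.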
101 |
102 | ```python
103 | logger.warn('starting processTable: ' + path)
104 | loadInitial = False
105 | #determine if need to run initial --> Run Initial --> Update DDB
106 | initialfiles = s3conn.list_objects(Bucket=bucket, Prefix=full_folder+'LOAD').get('Contents')
107 | if initialfiles is not None :
108 |     s3FileTS = initialfiles[0]['LastModified'].replace(tzinfo=None)
109 |     ddbFileTS = datetime.datetime.strptime(lastFullLoadDate, '%Y-%m-%d %H:%M:%S')
110 |     if s3FileTS > ddbFileTS:
111 |         message='Starting to process Initial file: ' + full_folder
112 |         loadInitial = True
113 |         lastFullLoadDate = datetime.datetime.strftime(s3FileTS,'%Y-%m-%d %H:%M:%S')
114 |     else:
115 |         message='Initial files already processed: ' + full_folder
116 | else:
117 |     message='No initial files to process: ' + full_folder
118 | logger.warn(message)
119 |
120 | #Call Initial Glue Job for this source
121 | if loadInitial:
122 |     args={
123 |         '--bucket':bucket,
124 |         '--prefix':full_folder,
125 |         '--folder':folder,
126 |         '--out_path':out_path,
127 |         '--partitionKey':partitionKey}
128 |     if runGlueJob('DMSCDC_LoadInitial',args) != 1:
129 |         ddbconn.update_item(
130 |             TableName='DMSCDC_Controller',
131 |             Key={"path": {"S":path}},
132 |             AttributeUpdates={"LastFullLoadDate": {"Value": {"S": lastFullLoadDate}}})
133 | ```
134 | The second step of the `DMSCDC_ProcessTable` job is to determine whether any incremental files need to be processed by comparing the *lastIncrementalFile* stored in the DynamoDB table with the *newIncrementalFile* retrieved from S3. If those two values are different, there are new incremental files and the [LoadIncremental](#loadincremental) Glue job is triggered. The ProcessTable job waits for its completion using the `runGlueJob` function. If that job completes successfully, the DynamoDB table is updated with the *newIncrementalFile*.
135 |
136 | ```python
137 | loadIncremental = False
138 | #Get the latest incremental file
139 | incrementalFiles = s3conn.list_objects_v2(Bucket=bucket, Prefix=full_folder+'2', StartAfter=lastIncrementalFile).get('Contents')
140 | if incrementalFiles is not None:
141 |     filecount = len(incrementalFiles)
142 |     newIncrementalFile = bucket + '/' + incrementalFiles[filecount-1]['Key']
143 |     if newIncrementalFile != lastIncrementalFile:
144 |         loadIncremental = True
145 |         message = 'Starting to process incremental files: ' + full_folder
146 |     else:
147 |         message = 'Incremental files already processed: ' + full_folder
148 | else:
149 |     message = 'No incremental files to process: ' + full_folder
150 | logger.warn(message)
151 |
152 | #Call Incremental Glue Job for this source
153 | if loadIncremental:
154 |     args = {
155 |         '--bucket':bucket,
156 |         '--prefix':full_folder,
157 |         '--folder':foldername,
158 |         '--out_path':out_path,
159 |         '--partitionKey':partitionKey,
160 |         '--lastIncrementalFile' : lastIncrementalFile,
161 |         '--newIncrementalFile' : newIncrementalFile,
162 |         '--primaryKey' : primaryKey
163 |     }
164 |     if runGlueJob('DMSCDC_LoadIncremental',args) != 1:
165 |         ddbconn.update_item(
166 |             TableName='DMSCDC_Controller',
167 |             Key={"path": {"S":path}},
168 |             AttributeUpdates={"LastIncrementalFile": {"Value": {"S": newIncrementalFile}}})
169 |
170 | ```
171 |
172 | ### LoadInitial
173 | The initial load will first read the initial load file into a Spark DataFrame. An additional column, *"Op"*, will be added to the DataFrame with a default value of *I*, representing *inserts*.
This field is added to provide parity between the initial load and the incremental load.
174 | ```python
175 | input = spark.read.parquet(s3_inputpath+"/LOAD*.parquet").withColumn("Op", lit("I"))
176 | ```
177 | If a partition key has been set for this table, the data will be written to the Data Lake in the appropriate partitions.
178 |
179 | > Note: When the initial job runs, any previously loaded data is overwritten.
180 | ```python
181 | partition_keys = args['partitionKey']
182 | if partition_keys != "null" :
183 |     partitionKeys = partition_keys.split(",")
184 |     partitionCount = input.select(partitionKeys).distinct().count()
185 |     input.repartition(partitionCount,partitionKeys).write.mode('overwrite').partitionBy(partitionKeys).parquet(s3_outputpath)
186 | else:
187 |     input.write.mode('overwrite').parquet(s3_outputpath)
188 | ```
189 |
190 | ### LoadIncremental
191 | The incremental load will first read only the new incremental files into a Spark DataFrame.
192 | ```python
193 | last_file = args['lastIncrementalFile']
194 | curr_file = args['newIncrementalFile']
195 | primary_keys = args['primaryKey']
196 | partition_keys = args['partitionKey']
197 |
198 | # gets list of files to process
199 | inputFileList = []
200 | results = s3conn.list_objects_v2(Bucket=args['bucket'], Prefix=args['prefix'] +'2', StartAfter=args['lastIncrementalFile']).get('Contents')
201 | for result in results:
202 |     if (args['bucket'] + '/' + result['Key'] != last_file):
203 |         inputFileList.append('s3://' + args['bucket'] + '/' + result['Key'])
204 |
205 | inputfile = spark.read.parquet(*inputFileList)
206 | ```
207 |
208 | Next, the process will determine whether a primary key has been set in the DynamoDB table. If it has been set, the change detection process will be executed. If it has not, or if this is the first time this table is being loaded, only records marked for insert (where *Op* = 'I') will be prepared to be written.
209 |
210 | ```python
211 | tgtExists = True
212 | try:
213 |     test = spark.read.parquet(s3_outputpath)
214 | except:
215 |     tgtExists = False
216 |
217 | if primary_keys == "null" or not(tgtExists):
218 |     output = inputfile.filter(inputfile.Op=='I')
219 |     filelist = [["null"]]
220 | ```
221 |
222 | Otherwise, to process the changed records, the process will de-duplicate the input records and determine the impacted data lake files.
223 |
224 | First, a few metadata fields are added to the input data and the already loaded data:
225 | * sortpath - For already loaded data, this is set to "0" to ensure that newly loaded data with the same primary key will override it. For new data, this is the filename generated by DMS, which will have a higher sort order than "0".
226 | * filepath - For already loaded data, this is the file name of the data lake file. For new data, this is the filename generated by DMS.
227 | * rownum - This is the relative row number for each primary key. For already loaded data, this will always be 1, as there should only ever be one record per primary key. For new data, this will be a value from 1 to N for each operation against a primary key. When there are multiple operations on the same primary key, the latest operation will have the greatest rownum value (see the short illustration after this list).
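To make the ordering concrete, here is a small, self-contained illustration (toy rows and literal file names, not part of the solution code; it assumes Spark 3.1+ for `unionByName(..., allowMissingColumns=True)`) of how `sortpath` and `rownum`, together with the `rnk` ranking applied later, resolve multiple operations against the same primary key. The real job derives these columns as shown in the next block.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, desc, lit, rank, row_number

spark = SparkSession.builder.getOrCreate()

# One row already in the data lake for id=1001 ...
target = (spark.createDataFrame(
        [("I", 1001, "Fire 7", "part-0000.parquet")],
        ["Op", "id", "name", "tgt_filepath"])
    .withColumn("sortpath", lit("0"))
    .withColumn("rownum", lit(1)))

# ... and two new DMS changes for the same id: an update followed by a delete.
incoming = spark.createDataFrame(
    [("U", 1001, "Fire 7 Tablet", "20220101-001.parquet"),
     ("D", 1001, "Fire 7 Tablet", "20220101-002.parquet")],
    ["Op", "id", "name", "src_filepath"])

windowRow = Window.partitionBy("id").orderBy("src_filepath")
incoming = (incoming
    .withColumn("sortpath", col("src_filepath"))
    .withColumn("rownum", row_number().over(windowRow)))   # update -> 1, delete -> 2

# Union, rank the latest operation first, keep rnk=1, and drop deletes.
window = Window.partitionBy("id").orderBy(desc("sortpath"), desc("rownum"))
resolved = (incoming.unionByName(target, allowMissingColumns=True)
    .withColumn("rnk", rank().over(window))
    .where(col("rnk") == 1)
    .where(col("Op") != "D"))

resolved.show()   # empty: the last operation for id 1001 was a delete
```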
228 |
229 | ```python
230 | primaryKeys = primary_keys.split(",")
231 | windowRow = Window.partitionBy(primaryKeys).orderBy("sortpath")
232 |
233 | #Loads the target data adding columns for processing
234 | target = spark.read.parquet(s3_outputpath).withColumn("sortpath", lit("0")).withColumn("tgt_filepath",input_file_name()).withColumn("rownum", lit(1))
235 | input = inputfile.withColumn("sortpath", input_file_name()).withColumn("src_filepath",input_file_name()).withColumn("rownum", row_number().over(windowRow))
236 |
237 | ```
238 |
239 | To determine which data lake files will be impacted, the input data is joined to the target data and a distinct list of files is captured.
240 |
241 | ```python
242 | #determine impacted files
243 | files = target.join(input, primaryKeys, 'inner').select(col("tgt_filepath").alias("list_filepath")).distinct()
244 | filelist = files.collect()
245 | ```
246 |
247 | Since S3 data is immutable, not only the records which have changed need to be written, but also all the unchanged records from the impacted data lake files. To accomplish this task, a new DataFrame is created which is a union of the source and target data.
248 |
249 | ```python
250 | uniondata = input.unionByName(target.join(files,files.list_filepath==target.tgt_filepath), allowMissingColumns=True)
251 | ```
252 |
253 | To determine which rows to write from the union DataFrame, a *rnk* column is added indicating the last change for a given primary key, leveraging the *sortpath* and *rownum* fields which were added. In addition, only records whose last operation is not a delete (*Op != D*) are kept.
254 |
255 | ```python
256 | window = Window.partitionBy(primaryKeys).orderBy(desc("sortpath"), desc("rownum"))
257 | output = uniondata.withColumn('rnk', rank().over(window)).where(col("rnk")==1).where(col("Op")!="D").coalesce(1).select(inputfile.columns)
258 | ```
259 |
260 | If a partition key has been set for this table, the data will be written to the Data Lake in the appropriate partitions.
261 |
262 | ```python
263 | # write data by partitions
264 | if partition_keys != "null" :
265 |     partitionKeys = partition_keys.split(",")
266 |     partitionCount = output.select(partitionKeys).distinct().count()
267 |     output.repartition(partitionCount,partitionKeys).write.mode('append').partitionBy(partitionKeys).parquet(s3_outputpath)
268 | else:
269 |     output.write.mode('append').parquet(s3_outputpath)
270 | ```
271 |
272 | Finally, the impacted data lake files containing the changed records need to be deleted.
273 |
274 | > Note: Because the updated data is appended before the impacted files are deleted, there is a possibility of users temporarily seeing duplicate data until the old files are actually removed from S3.
275 |
276 | ```python
277 | #delete old files
278 | for row in filelist:
279 |     if row[0] != "null":
280 |         o = parse.urlparse(row[0])
281 |         s3conn.delete_object(Bucket=o.netloc, Key=parse.unquote(o.path)[1:])
282 | ```
283 |
284 | ## Sample Database
285 | Below is a sample MySQL database you can build and use in conjunction with this project. It can be used to demonstrate the end-to-end flow of data into your Data Lake.
286 |
287 | ### Source DB Setup
288 | In order for a DB to be a candidate DMS source, it needs to meet certain [requirements](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html), whether it is managed by Amazon through RDS or is an EC2 or on-premises installation.
Launch the following CloudFormation stack to deploy a MySQL DB that has been pre-configured to meet these requirements and which is *pre-loaded* with the [initial load script](DMSCDC_SampleDB_Initial.sql). Be sure to choose the *same VPC/Subnet/Security Group* as your DMS Replication Instance to ensure they have network connectivity.
289 |
290 | [![](launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=DMSCDCSampleDB&templateURL=https://s3-us-west-2.amazonaws.com/dmscdc-files/DMSCDC_SampleDB.yaml)
291 |
292 | ### Incremental Load Script
293 | Before proceeding with the incremental load, ensure that:
294 | 1. The initial data has arrived in the DMS bucket.
295 | 1. You have configured your DynamoDB table to set the appropriate ActiveFlag, PrimaryKey, and PartitionKey values.
296 | 1. The initial data has arrived in the LAKE bucket. Note: By default the solution runs on an hourly basis, at minute 00. To change the load frequency or minute, modify the **[Glue Trigger](https://console.aws.amazon.com/glue/home?#etl:tab=triggers)**.
297 |
298 | Once the seed data has been loaded, execute the following script in the MySQL database to simulate changed data.
299 | ```sql
300 | update product set name = 'Sample Product', dept = 'Sample Dept', category = 'Sample Category' where id = 1001;
301 | delete from product where id = 1002;
302 | call loadorders(1345, current_date);
303 | insert into store values(1009, '125 Technology Dr.', 'Irvine', 'CA', 'US', '92618');
304 | ```
305 |
306 | Once DMS has processed these changes, you will see new incremental files for `store`, `product`, and `productorder` in the DMS bucket. On the next execution of the Glue job, you will see new and updated files in the LAKE bucket.
307 |
--------------------------------------------------------------------------------
/launch-stack.svg:
--------------------------------------------------------------------------------
1 | Launch Stack
--------------------------------------------------------------------------------
/pymysql.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-big-data-blog-dmscdc-walkthrough/024e354e35747828eadc334fd0b3c626f1055162/pymysql.zip
--------------------------------------------------------------------------------