├── .github └── workflows │ └── github-actions.yml ├── CHANGELOG.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── DMSCDC_CloudTemplate_Reusable.yaml ├── DMSCDC_CloudTemplate_Source.yaml ├── DMSCDC_Controller.py ├── DMSCDC_LoadIncremental.py ├── DMSCDC_LoadInitial.py ├── DMSCDC_ProcessTable.py ├── DMSCDC_SampleDB.yaml ├── DMSCDC_SampleDB_Incremental.sql ├── DMSCDC_SampleDB_Initial.sql ├── LICENSE ├── README.md ├── launch-stack.svg └── pymysql.zip /.github/workflows/github-actions.yml: -------------------------------------------------------------------------------- 1 | name: DeployToS3 2 | on: 3 | push: 4 | branches: 5 | - master 6 | jobs: 7 | DeployCode: 8 | runs-on: ubuntu-latest 9 | steps: 10 | - name: Check out repository code 11 | uses: actions/checkout@v2 12 | - name: Configure AWS credentials 13 | uses: aws-actions/configure-aws-credentials@v1 14 | with: 15 | aws-access-key-id: ${{ secrets.AWS_ID }} 16 | aws-secret-access-key: ${{ secrets.AWS_KEY }} 17 | aws-region: ${{ secrets.REGION }} 18 | - run: | 19 | aws s3 sync . s3://dmscdc-files --acl public-read --exclude ".*" 20 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | 4 | ## [1.1.0] - 2022-01-22 5 | ### Added 6 | * Can use wild-card for schema and load multiple schema's at once 7 | * Parallel processing of tables. 8 | * Smarter repartition logic for tables with many partitions. 9 | * Upgrade to Glue v3.0 (Faster start/run times) 10 | * VPC options added to [DMSCDC_CloudTemplate_Reusable.yaml](DMSCDC_CloudTemplate_Reusable.yaml) to allow users to specify VPC settings and ensure replication instance can connect to the sources DB. 11 | * [DMSCDC_SampleDB.yaml](DMSCDC_SampleDB.yaml) to easily test a source DB with a data load. 12 | 13 | ### Fixed 14 | - Initial run of [DMSCDC_Controller.py](DMSCDC_Controller.py) was failing if DMS replication task was not loading 15 | - New tables added after the initial deployment were not loading 16 | - Incremental processing was not picking up new files correctly and was causing some files to be re-processed 17 | - New columns added to tables were not being added to the data lake 18 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 
8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /DMSCDC_CloudTemplate_Reusable.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | SecurityGroup: 4 | Description: Sec Group for Replication Instance with rules to allow it to connect to your source DB. 
5 | Type: AWS::EC2::SecurityGroup::Id 6 | Subnet1: 7 | Description: Subnet-1 for Replication Instance with network path to source DB. 8 | Type: AWS::EC2::Subnet::Id 9 | Subnet2: 10 | Description: Subnet-2 for Replication Instance with network path to source DB. 11 | Type: AWS::EC2::Subnet::Id 12 | Resources: 13 | LambdaRoleToInitDMS: 14 | Type: AWS::IAM::Role 15 | Properties : 16 | AssumeRolePolicyDocument: 17 | Version : 2012-10-17 18 | Statement : 19 | - 20 | Effect : Allow 21 | Principal : 22 | Service : 23 | - lambda.amazonaws.com 24 | Action : 25 | - sts:AssumeRole 26 | Path : / 27 | Policies: 28 | - 29 | PolicyName: LambdaCloudFormationPolicy 30 | PolicyDocument: 31 | Version: 2012-10-17 32 | Statement: 33 | - 34 | Effect: Allow 35 | Action: 36 | - s3:* 37 | Resource: 38 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}" 39 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}" 40 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}/*" 41 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}/*" 42 | ManagedPolicyArns: 43 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 44 | - arn:aws:iam::aws:policy/AmazonS3FullAccess 45 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess 46 | - arn:aws:iam::aws:policy/AmazonRDSDataFullAccess 47 | - arn:aws:iam::aws:policy/IAMFullAccess 48 | - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess 49 | LambdaFunctionDMSRoles: 50 | Type: "AWS::Lambda::Function" 51 | Properties: 52 | Timeout: 30 53 | Code: 54 | ZipFile: | 55 | import json 56 | import boto3 57 | import cfnresponse 58 | import logging 59 | import time 60 | client = boto3.client('iam') 61 | logging.basicConfig() 62 | logger = logging.getLogger(__name__) 63 | logger.setLevel(logging.INFO) 64 | 65 | def handler(event, context): 66 | logger.info(json.dumps(event)) 67 | 68 | if event['RequestType'] == 'Delete': 69 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Delete complete'}) 70 | else: 71 | try: 72 | response = client.get_role(RoleName='dms-cloudwatch-logs-role') 73 | except: 74 | try: 75 | role_policy_document = { 76 | "Version": "2012-10-17", 77 | "Statement": [ 78 | { 79 | "Effect": "Allow", 80 | "Principal": { 81 | "Service": [ 82 | "dms.amazonaws.com" 83 | ] 84 | }, 85 | "Action": "sts:AssumeRole" 86 | } 87 | ] 88 | } 89 | client.create_role( 90 | RoleName='dms-cloudwatch-logs-role', 91 | AssumeRolePolicyDocument=json.dumps(role_policy_document) 92 | ) 93 | client.attach_role_policy( 94 | RoleName='dms-cloudwatch-logs-role', 95 | PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonDMSCloudWatchLogsRole' 96 | ) 97 | except Exception as e: 98 | logger.error(e) 99 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': 'Create failed'}) 100 | try: 101 | response = client.get_role(RoleName='dms-vpc-role') 102 | except: 103 | try: 104 | role_policy_document = { 105 | "Version": "2012-10-17", 106 | "Statement": [ 107 | { 108 | "Effect": "Allow", 109 | "Principal": { 110 | "Service": [ 111 | "dms.amazonaws.com" 112 | ] 113 | }, 114 | "Action": "sts:AssumeRole" 115 | } 116 | ] 117 | } 118 | client.create_role( 119 | RoleName='dms-vpc-role', 120 | AssumeRolePolicyDocument=json.dumps(role_policy_document) 121 | ) 122 | client.attach_role_policy( 123 | RoleName='dms-vpc-role', 124 | PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonDMSVPCManagementRole' 125 | ) 126 | time.sleep(30) 127 | except Exception as e: 128 | logger.error(e) 129 | cfnresponse.send(event, context, 
cfnresponse.FAILED, {'Data': 'Create failed'}) 130 | 131 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Create complete'}) 132 | Handler: index.handler 133 | Role: 134 | Fn::GetAtt: [LambdaRoleToInitDMS, Arn] 135 | Runtime: python3.7 136 | InitDMSRoles: 137 | Type: Custom::InitDMSRoles 138 | DependsOn: 139 | - LambdaFunctionDMSRoles 140 | Properties: 141 | ServiceToken: !GetAtt 'LambdaFunctionDMSRoles.Arn' 142 | DMSSubnetGroup: 143 | Type: AWS::DMS::ReplicationSubnetGroup 144 | DependsOn: 145 | - InitDMSRoles 146 | Properties: 147 | ReplicationSubnetGroupDescription: Subnet group for DMSCDC pipeline. 148 | SubnetIds: 149 | - !Ref Subnet1 150 | - !Ref Subnet2 151 | DMSReplicationInstance: 152 | Type: AWS::DMS::ReplicationInstance 153 | DependsOn: 154 | - InitDMSRoles 155 | Properties: 156 | ReplicationInstanceIdentifier: DMSCDC-ReplicationInstance 157 | ReplicationInstanceClass: dms.c4.large 158 | ReplicationSubnetGroupIdentifier: !Ref DMSSubnetGroup 159 | VpcSecurityGroupIds: 160 | - !Ref SecurityGroup 161 | ExecutionRole: 162 | Type: AWS::IAM::Role 163 | Properties : 164 | RoleName: DMSCDC_Execution_Role 165 | AssumeRolePolicyDocument: 166 | Version : 2012-10-17 167 | Statement : 168 | - 169 | Effect : Allow 170 | Principal : 171 | Service : 172 | - lambda.amazonaws.com 173 | - glue.amazonaws.com 174 | - dms.amazonaws.com 175 | Action : 176 | - sts:AssumeRole 177 | Path : / 178 | ManagedPolicyArns: 179 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 180 | - arn:aws:iam::aws:policy/AmazonS3FullAccess 181 | - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole 182 | Policies : 183 | - 184 | PolicyName: DMSCDCExecutionPolicy 185 | PolicyDocument : 186 | Version: 2012-10-17 187 | Statement: 188 | - 189 | Effect: Allow 190 | Action: 191 | - lambda:InvokeFunction 192 | - dynamodb:PutItem 193 | - dynamodb:CreateTable 194 | - dynamodb:UpdateItem 195 | - dynamodb:UpdateTable 196 | - dynamodb:GetItem 197 | - dynamodb:DescribeTable 198 | - iam:GetRole 199 | - iam:PassRole 200 | - dms:StartReplicationTask 201 | - dms:TestConnection 202 | - dms:StopReplicationTask 203 | Resource: 204 | - !Sub 'arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/DMSCDC_Controller' 205 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitController' 206 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitDMS' 207 | - !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_ImportGlueScript' 208 | - !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_ExecutionRole' 209 | - !Sub 'arn:aws:dms:${AWS::Region}:${AWS::AccountId}:endpoint:*' 210 | - !Sub 'arn:aws:dms:${AWS::Region}:${AWS::AccountId}:task:*' 211 | - 212 | Effect: Allow 213 | Action: 214 | - dms:DescribeConnections 215 | - dms:DescribeReplicationTasks 216 | Resource: "*" 217 | LambdaFunctionImportGlueScript: 218 | Type: "AWS::Lambda::Function" 219 | Properties: 220 | FunctionName: DMSCDC_ImportGlueScript 221 | Timeout: 30 222 | Code: 223 | ZipFile: | 224 | import json 225 | import boto3 226 | import cfnresponse 227 | import logging 228 | 229 | logging.basicConfig() 230 | logger = logging.getLogger(__name__) 231 | logger.setLevel(logging.INFO) 232 | 233 | def handler(event, context): 234 | logger.info(json.dumps(event)) 235 | s3 = boto3.client('s3') 236 | glueBucket = event['ResourceProperties']['glueBucket'] 237 | 238 | if event['RequestType'] == 'Delete': 239 | try: 240 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py') 241 | 
s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py.temp') 242 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py') 243 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py.temp') 244 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py') 245 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py.temp') 246 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py') 247 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py.temp') 248 | except Exception as e: 249 | logger.info(e.message) 250 | 251 | logger.info('Delete Complete') 252 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Delete Complete'}) 253 | 254 | else: 255 | try: 256 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadIncremental.py') 257 | except Exception as e: 258 | logger.info(e.message) 259 | try: 260 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_LoadInitial.py') 261 | except Exception as e: 262 | logger.info(e.message) 263 | try: 264 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_Controller.py') 265 | except Exception as e: 266 | logger.info(e.message) 267 | try: 268 | s3.delete_object(Bucket=glueBucket, Key='DMSCDC_ProcessTable.py') 269 | except Exception as e: 270 | logger.info(e.message) 271 | try: 272 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_LoadIncremental.py", Key="DMSCDC_LoadIncremental.py") 273 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_LoadInitial.py", Key="DMSCDC_LoadInitial.py") 274 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_Controller.py", Key="DMSCDC_Controller.py") 275 | s3.copy_object(Bucket=glueBucket, CopySource="dmscdc-files/DMSCDC_ProcessTable.py", Key="DMSCDC_ProcessTable.py") 276 | 277 | logger.info('Copy Complete') 278 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Copy complete'}) 279 | 280 | except Exception as e: 281 | message = 'S3 Issue: ' + str(e) 282 | logger.error(message) 283 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 284 | 285 | Handler: index.handler 286 | Role: 287 | Fn::GetAtt: [ExecutionRole, Arn] 288 | Runtime: python3.7 289 | DependsOn: 290 | - ExecutionRole 291 | ImportScript: 292 | Type: Custom::CopyScript 293 | DependsOn: 294 | - LambdaFunctionImportGlueScript 295 | - GlueBucket 296 | Properties: 297 | ServiceToken: 298 | Fn::GetAtt : [LambdaFunctionImportGlueScript, Arn] 299 | glueBucket: 300 | Ref: GlueBucket 301 | LambdaFunctionInitController: 302 | Type: "AWS::Lambda::Function" 303 | Properties: 304 | FunctionName: DMSCDC_InitController 305 | Timeout: 600 306 | Code: 307 | ZipFile: | 308 | import json 309 | import boto3 310 | import cfnresponse 311 | import time 312 | import logging 313 | 314 | logging.basicConfig() 315 | logger = logging.getLogger(__name__) 316 | logger.setLevel(logging.INFO) 317 | 318 | glue = boto3.client('glue') 319 | 320 | def testGlueJob(jobId, count, sec): 321 | i = 0 322 | while i < count: 323 | response = glue.get_job_run(JobName='DMSCDC_Controller', RunId=jobId) 324 | status = response['JobRun']['JobRunState'] 325 | if (status == 'SUCCEEDED' or status == 'STARTING' or status == 'STOPPING'): 326 | return 1 327 | elif status == 'RUNNING' : 328 | time.sleep(sec) 329 | i+=1 330 | else: 331 | return 0 332 | if i == count: 333 | return 0 334 | 335 | def handler(event, context): 336 | logger.info(json.dumps(event)) 337 | if event['RequestType'] != 'Create': 338 | logger.info('No action necessary') 339 | 
cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'NA'}) 340 | else: 341 | try: 342 | dmsBucket = event['ResourceProperties']['dmsBucket'] 343 | dmsPath = event['ResourceProperties']['dmsPath'] 344 | lakeBucket = event['ResourceProperties']['lakeBucket'] 345 | lakePath = event['ResourceProperties']['lakePath'] 346 | lakeDBName = event['ResourceProperties']['lakeDBName'] 347 | 348 | response = glue.start_job_run( 349 | JobName='DMSCDC_Controller', 350 | Arguments={ 351 | '--bucket':dmsBucket, 352 | '--prefix':dmsPath, 353 | '--out_bucket':lakeBucket, 354 | '--out_prefix':lakePath}) 355 | 356 | if testGlueJob(response['JobRunId'], 20, 30) != 1: 357 | message = 'Error during Controller execution' 358 | logger.info(message) 359 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 360 | else: 361 | try: 362 | message = 'Load Complete: https://console.aws.amazon.com/dynamodb/home?#tables:selected=DMSCDC_Controller;tab=items' 363 | response = glue.start_trigger(Name='DMSCDC-Trigger-'+lakeDBName) 364 | logger.info(message) 365 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 366 | except: 367 | message = 'Load Complete, but trigger failed to initialize' 368 | logger.info(message) 369 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 370 | except Exception as e: 371 | message = 'Glue Job Issue: ' + str(e) 372 | logger.info(message) 373 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 374 | Handler: index.handler 375 | Role: 376 | Fn::GetAtt: [ExecutionRole, Arn] 377 | Runtime: python3.7 378 | DependsOn: 379 | - ExecutionRole 380 | LambdaFunctionInitDMS: 381 | Type: "AWS::Lambda::Function" 382 | Properties: 383 | FunctionName: DMSCDC_InitDMS 384 | Timeout: 600 385 | Code: 386 | ZipFile: | 387 | import json 388 | import boto3 389 | import cfnresponse 390 | import time 391 | import logging 392 | 393 | logging.basicConfig() 394 | logger = logging.getLogger(__name__) 395 | logger.setLevel(logging.INFO) 396 | 397 | dms = boto3.client('dms') 398 | s3 = boto3.resource('s3') 399 | 400 | def testConnection(endpointArn, count): 401 | i = 0 402 | while i < count: 403 | connections = dms.describe_connections(Filters=[{'Name':'endpoint-arn','Values':[endpointArn]}]) 404 | for connection in connections['Connections']: 405 | if connection['Status'] == 'successful': 406 | return 1 407 | elif connection['Status'] == 'testing': 408 | time.sleep(30) 409 | i += 1 410 | elif connection['Status'] == 'failed': 411 | return 0 412 | if i == count: 413 | return 0 414 | 415 | def testRepTask(taskArn, count): 416 | i = 0 417 | while i < count: 418 | replicationTasks = dms.describe_replication_tasks(Filters=[{'Name':'replication-task-arn','Values':[taskArn]}]) 419 | for replicationTask in replicationTasks['ReplicationTasks']: 420 | if replicationTask['Status'] == 'running': 421 | return 1 422 | elif replicationTask['Status'] == 'starting': 423 | time.sleep(30) 424 | i += 1 425 | else: 426 | return 0 427 | if i == count: 428 | return 0 429 | 430 | def handler(event, context): 431 | logger.info(json.dumps(event)) 432 | taskArn = event['ResourceProperties']['taskArn'] 433 | sourceArn = event['ResourceProperties']['sourceArn'] 434 | targetArn = event['ResourceProperties']['targetArn'] 435 | instanceArn = event['ResourceProperties']['instanceArn'] 436 | dmsBucket = event['ResourceProperties']['dmsBucket'] 437 | 438 | if event['RequestType'] == 'Delete': 439 | try: 440 | dms.stop_replication_task(ReplicationTaskArn=taskArn) 
441 | time.sleep(30) 442 | s3.Bucket(dmsBucket).objects.all().delete() 443 | logger.info('Task Stopped, Bucket Emptied') 444 | except Exception as e: 445 | message = 'Issue with Stopping Task or Emptying Bucket:' + str(e) 446 | logger.error(message) 447 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'NA'}) 448 | else: 449 | taskArn = event['ResourceProperties']['taskArn'] 450 | sourceArn = event['ResourceProperties']['sourceArn'] 451 | targetArn = event['ResourceProperties']['targetArn'] 452 | instanceArn = event['ResourceProperties']['instanceArn'] 453 | 454 | try: 455 | if testConnection(sourceArn, 1) != 1: 456 | dms.test_connection( 457 | ReplicationInstanceArn=instanceArn, 458 | EndpointArn=sourceArn 459 | ) 460 | if testConnection(sourceArn,10) != 1: 461 | message = 'Source Endpoint Connection Failure' 462 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 463 | return {'message': message} 464 | if testConnection(targetArn, 1) != 1: 465 | dms.test_connection( 466 | ReplicationInstanceArn=instanceArn, 467 | EndpointArn=targetArn 468 | ) 469 | if testConnection(targetArn,10) != 1: 470 | message = 'Target Endpoint Connection Failure' 471 | logger.error(message) 472 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 473 | 474 | except Exception as e: 475 | message = 'DMS Endpoint Issue: ' + str(e) 476 | logger.error(message) 477 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 478 | 479 | try: 480 | if testRepTask(taskArn, 1) != 1: 481 | dms.start_replication_task( 482 | ReplicationTaskArn=taskArn, 483 | StartReplicationTaskType='start-replication') 484 | if testRepTask(taskArn, 10) != 1: 485 | message = 'Error during DMS execution' 486 | logger.error(message) 487 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 488 | else: 489 | message = 'DMS Execution Successful' 490 | logger.info(message) 491 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': message}) 492 | except Exception as e: 493 | message = 'DMS Task Issue: ' + str(e) 494 | logger.error(message) 495 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': message}) 496 | 497 | Handler: index.handler 498 | Role: 499 | Fn::GetAtt: [ExecutionRole, Arn] 500 | Runtime: python3.7 501 | DependsOn: 502 | - ExecutionRole 503 | GlueBucket: 504 | Type: AWS::S3::Bucket 505 | GlueJobLoadIncremental: 506 | Type: AWS::Glue::Job 507 | Properties: 508 | Name: DMSCDC_LoadIncremental 509 | Role: 510 | Fn::GetAtt: [ExecutionRole, Arn] 511 | ExecutionProperty: 512 | MaxConcurrentRuns: 50 513 | AllocatedCapacity: 2 514 | GlueVersion: 3.0 515 | Command: 516 | Name: glueetl 517 | ScriptLocation: !Sub 518 | - s3://${bucket}/DMSCDC_LoadIncremental.py 519 | - {bucket: !Ref GlueBucket} 520 | DefaultArguments: 521 | "--job-bookmark-option" : "job-bookmark-disable" 522 | "--TempDir" : !Sub 523 | - s3://${bucket} 524 | - {bucket: !Ref GlueBucket} 525 | "--enable-metrics" : "" 526 | DependsOn: 527 | - ExecutionRole 528 | GlueJobLoadInitial: 529 | Type: AWS::Glue::Job 530 | Properties: 531 | Name: DMSCDC_LoadInitial 532 | Role: 533 | Fn::GetAtt: [ExecutionRole, Arn] 534 | ExecutionProperty: 535 | MaxConcurrentRuns: 50 536 | AllocatedCapacity: 2 537 | GlueVersion: 3.0 538 | Command: 539 | Name: glueetl 540 | ScriptLocation: !Sub 541 | - s3://${bucket}/DMSCDC_LoadInitial.py 542 | - {bucket: !Ref GlueBucket} 543 | DefaultArguments: 544 | "--job-bookmark-option" : "job-bookmark-disable" 545 | "--TempDir" : !Sub 546 | - 
s3://${bucket} 547 | - {bucket: !Ref GlueBucket} 548 | "--enable-metrics" : "" 549 | DependsOn: 550 | - ExecutionRole 551 | GlueJobController: 552 | Type: AWS::Glue::Job 553 | Properties: 554 | Name: DMSCDC_Controller 555 | Role: 556 | Fn::GetAtt: [ExecutionRole, Arn] 557 | ExecutionProperty: 558 | MaxConcurrentRuns: 50 559 | Command: 560 | Name: pythonshell 561 | PythonVersion: 3 562 | ScriptLocation: !Sub 563 | - s3://${bucket}/DMSCDC_Controller.py 564 | - {bucket: !Ref GlueBucket} 565 | DefaultArguments: 566 | "--job-bookmark-option" : "job-bookmark-disable" 567 | "--TempDir" : !Sub 568 | - s3://${bucket} 569 | - {bucket: !Ref GlueBucket} 570 | "--enable-metrics" : "" 571 | DependsOn: 572 | - ExecutionRole 573 | GlueJobProcessTable: 574 | Type: AWS::Glue::Job 575 | Properties: 576 | Name: DMSCDC_ProcessTable 577 | Role: 578 | Fn::GetAtt: [ExecutionRole, Arn] 579 | ExecutionProperty: 580 | MaxConcurrentRuns: 50 581 | Command: 582 | Name: pythonshell 583 | PythonVersion: 3 584 | ScriptLocation: !Sub 585 | - s3://${bucket}/DMSCDC_ProcessTable.py 586 | - {bucket: !Ref GlueBucket} 587 | DefaultArguments: 588 | "--job-bookmark-option" : "job-bookmark-disable" 589 | "--TempDir" : !Sub 590 | - s3://${bucket} 591 | - {bucket: !Ref GlueBucket} 592 | "--enable-metrics" : "" 593 | DependsOn: 594 | - ExecutionRole 595 | -------------------------------------------------------------------------------- /DMSCDC_CloudTemplate_Source.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | DBEngine: 4 | Description: Choose the source DB engine to stream changes. 5 | Type: String 6 | AllowedValues: 7 | - mysql 8 | - oracle 9 | - postgres 10 | - mariadb 11 | - aurora 12 | - aurora-postgresql 13 | - db2 14 | - sybase 15 | - sqlserver 16 | DBServer: 17 | Description: Hostname of the source DB. 18 | Type: String 19 | DBPort: 20 | Description: Port of the source DB. 21 | Type: Number 22 | DBUser: 23 | Description: Username for the source DB. 24 | Type: String 25 | DBPassword: 26 | Description: Password for the source DB. 27 | Type: String 28 | NoEcho: true 29 | DBName: 30 | Description: Name of the source DB. 31 | Type: String 32 | DMSReplicationInstanceArn: 33 | Description: Enter the ARN for the DMS Replication Instance which has been configured with network connectivity to your Source DB. 34 | Type: String 35 | DMSTableFilter: 36 | Description: Choose the DMS table filter criteria. 37 | Type: String 38 | Default: "%" 39 | AllowedPattern: ^[a-zA-Z0-9-_%]*$ 40 | ConstraintDescription: Must only contain upper/lower-case letters, numbers, -, _, or %. 41 | DMSSchemaFilter: 42 | Description: Choose the DMS schema filter criteria. 43 | Type: String 44 | Default: "%" 45 | AllowedPattern: ^[a-zA-Z0-9-_%]*$ 46 | ConstraintDescription: Must only contain upper/lower-case letters, numbers, -, or _. 47 | CreateDMSBucket: 48 | Description: Select Yes to create a new bucket and No to use an existing bucket 49 | Type: String 50 | Default: 'Yes' 51 | AllowedValues: 52 | - 'Yes' 53 | - 'No' 54 | DMSBucket: 55 | Description: Name of the Bucket to land the DMS data. 56 | Type: String 57 | DMSFolder: 58 | Description: (Optional) Prefix within bucket to store DMS data. 
Please omit the trailing / 59 | Type: String 60 | CreateLakeBucket: 61 | Description: Select Yes to create a new bucket and No to use an existing bucket 62 | Type: String 63 | Default: 'Yes' 64 | AllowedValues: 65 | - 'Yes' 66 | - 'No' 67 | LakeBucket: 68 | Description: Name of the Bucket to land the processed data. 69 | Type: String 70 | LakeFolder: 71 | Description: (Optional) Prefix within bucket to store processed data. Please omit the trailing / 72 | Type: String 73 | LakeDBName: 74 | Description: Name to be used in analytical tools such as Athena and Redshift to reference these tables within the data lake. 75 | Type: String 76 | CDCSchedule: 77 | Description: "CDC Execution Schedule Cron Expression. Default is every hour on the hour. For more info: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html" 78 | Type: String 79 | Default: cron(0 0-23 * * ? *) 80 | CrawlerSchedule: 81 | Description: "Crawler Execution Schedule Cron Expression. Default is daily at 12:30 AM GMT: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html" 82 | Type: String 83 | Default: cron(30 0 * * ? *) 84 | Conditions: 85 | IsDMSFolderEmpty: !Equals [!Ref "DMSFolder", ""] 86 | IsLakeFolderEmpty: !Equals [!Ref "LakeFolder", ""] 87 | CreateDMSBucket: !Equals 88 | - !Ref CreateDMSBucket 89 | - 'Yes' 90 | CreateLakeBucket: !Equals 91 | - !Ref CreateLakeBucket 92 | - 'Yes' 93 | Metadata: 94 | AWS::CloudFormation::Interface: 95 | ParameterGroups: 96 | - 97 | Label: 98 | default: "DMS Source Database Configuration" 99 | Parameters: 100 | - DBEngine 101 | - DBServer 102 | - DBPort 103 | - DBUser 104 | - DBPassword 105 | - DBName 106 | - 107 | Label: 108 | default: "DMS Task Configuration" 109 | Parameters: 110 | - DMSReplicationInstanceArn 111 | - DMSTableFilter 112 | - DMSSchemaFilter 113 | - CreateDMSBucket 114 | - DMSBucket 115 | - DMSFolder 116 | - 117 | Label: 118 | default: "Data Lake Configuration" 119 | Parameters: 120 | - CreateLakeBucket 121 | - LakeBucket 122 | - LakeFolder 123 | - LakeDBName 124 | - CDCSchedule 125 | - CrawlerSchedule 126 | Resources: 127 | DMSSourceEndpoint: 128 | Type: AWS::DMS::Endpoint 129 | Properties: 130 | EndpointType: source 131 | EngineName: 132 | Ref: DBEngine 133 | ServerName: 134 | Ref: DBServer 135 | Port: 136 | Ref: DBPort 137 | Username: 138 | Ref: DBUser 139 | Password: 140 | Ref: DBPassword 141 | DatabaseName: 142 | Ref: DBName 143 | DMSRawBucket: 144 | Condition: CreateDMSBucket 145 | Type: AWS::S3::Bucket 146 | Properties: 147 | BucketName: 148 | Ref: DMSBucket 149 | DMSTargetEndpoint: 150 | Type: AWS::DMS::Endpoint 151 | Properties: 152 | EndpointType: target 153 | EngineName: s3 154 | ExtraConnectionAttributes : "dataFormat=parquet;includeOpForFullLoad=true;;parquetTimestampInMillisecond=true;" 155 | S3Settings : 156 | BucketFolder: 157 | Ref: DMSFolder 158 | BucketName: 159 | Ref: DMSBucket 160 | ServiceAccessRoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_Execution_Role' 161 | DMSReplicationTask: 162 | Type: AWS::DMS::ReplicationTask 163 | Properties: 164 | MigrationType: full-load-and-cdc 165 | ReplicationInstanceArn: 166 | Ref: DMSReplicationInstanceArn 167 | SourceEndpointArn: 168 | Ref: DMSSourceEndpoint 169 | TableMappings: 170 | Fn::Sub: 171 | - "{\"rules\": [{\"rule-type\": \"selection\", \"rule-id\": \"1\", \"rule-action\": \"include\", \"object-locator\": {\"schema-name\": \"${schema}\", \"table-name\": \"${tables}\"}, \"rule-name\": \"1\"}]}" 172 | - {schema: !Ref DMSSchemaFilter, tables: !Ref 
DMSTableFilter } 173 | TargetEndpointArn: 174 | Ref: DMSTargetEndpoint 175 | ReplicationTaskSettings: "{\"Logging\": {\"EnableLogging\": true }}" 176 | DataLakeBucket: 177 | Condition: CreateDMSBucket 178 | Type: 'AWS::S3::Bucket' 179 | Properties: 180 | BucketName: 181 | Ref: LakeBucket 182 | Trigger: 183 | Type: AWS::Glue::Trigger 184 | Properties: 185 | Type: SCHEDULED 186 | Name: !Sub 187 | - "DMSCDC-Trigger-${dbname}" 188 | - {dbname: !Ref LakeDBName} 189 | Schedule: 190 | Ref: CDCSchedule 191 | Actions: 192 | - JobName: DMSCDC_Controller 193 | Arguments: 194 | '--bucket': !Ref DMSBucket 195 | '--prefix': 196 | Fn::If: 197 | - IsDMSFolderEmpty 198 | - !Ref DMSFolder 199 | - !Sub 200 | - '${prefix}/' 201 | - prefix: !Ref DMSFolder 202 | '--out_bucket': !Ref LakeBucket 203 | '--out_prefix': 204 | Fn::If: 205 | - IsLakeFolderEmpty 206 | - !Ref LakeFolder 207 | - !Sub 208 | - '${prefix}/' 209 | - prefix: !Ref LakeFolder 210 | GlueCrawler: 211 | Type: AWS::Glue::Crawler 212 | Properties: 213 | Role: !Sub 'arn:aws:iam::${AWS::AccountId}:role/DMSCDC_Execution_Role' 214 | Schedule: 215 | ScheduleExpression: 216 | Ref: CrawlerSchedule 217 | DatabaseName: 218 | Ref: LakeDBName 219 | Targets: 220 | S3Targets: 221 | - 222 | Path: 223 | Fn::Join : 224 | - "/" 225 | - 226 | - Ref : LakeBucket 227 | - Fn::If: 228 | - IsLakeFolderEmpty 229 | - !Ref LakeFolder 230 | - !Sub 231 | - '${prefix}/' 232 | - prefix: !Ref LakeFolder 233 | InitDMS: 234 | Type: Custom::InitDMS 235 | DependsOn: 236 | - DMSReplicationTask 237 | Properties: 238 | ServiceToken: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitDMS' 239 | taskArn: !Ref DMSReplicationTask 240 | sourceArn: !Ref DMSSourceEndpoint 241 | targetArn: !Ref DMSTargetEndpoint 242 | instanceArn: !Ref DMSReplicationInstanceArn 243 | dmsBucket: !Ref DMSBucket 244 | InitController: 245 | Type: Custom::InitController 246 | DependsOn: 247 | - InitDMS 248 | - Trigger 249 | Properties: 250 | ServiceToken: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:DMSCDC_InitController' 251 | dmsBucket: !Ref DMSBucket 252 | dmsPath: !Sub 253 | - "${prefix}${schema}/" 254 | - { prefix: !Ref DMSFolder,schema: !Ref DMSSchemaFilter} 255 | lakeBucket: !Ref LakeBucket 256 | lakePath: 257 | Fn::If: 258 | - IsLakeFolderEmpty 259 | - !Ref LakeFolder 260 | - !Sub 261 | - '${prefix}/' 262 | - prefix: !Ref LakeFolder 263 | lakeDBName: !Ref LakeDBName 264 | -------------------------------------------------------------------------------- /DMSCDC_Controller.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | from urllib.parse import urlparse 4 | import urllib 5 | import datetime 6 | import boto3 7 | import time 8 | from awsglue.utils import getResolvedOptions 9 | import logging 10 | 11 | handler = logging.StreamHandler(sys.stdout) 12 | handler.setLevel(logging.INFO) 13 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 14 | handler.setFormatter(formatter) 15 | 16 | logging.basicConfig() 17 | logger = logging.getLogger(__name__) 18 | logger.addHandler(handler) 19 | 20 | ddbconn = boto3.client('dynamodb') 21 | glue = boto3.client('glue') 22 | s3conn = boto3.client('s3') 23 | 24 | prefix = "" 25 | out_prefix = "" 26 | 27 | #Optional Parameters 28 | if ('--prefix' in sys.argv): 29 | nextarg = sys.argv[sys.argv.index('--prefix')+1] 30 | if (nextarg is not None and not(nextarg.startswith('--'))): 31 | prefix = nextarg 32 | if ('--out_prefix' in sys.argv): 33 
| nextarg = sys.argv[sys.argv.index('--out_prefix')+1] 34 | if (nextarg is not None and not(nextarg.startswith('--'))): 35 | out_prefix = nextarg 36 | 37 | #Required Parameters 38 | args = getResolvedOptions(sys.argv, [ 39 | 'bucket', 40 | 'out_bucket']) 41 | 42 | bucket = args['bucket'] 43 | out_bucket = args['out_bucket'] 44 | out_path = out_bucket + '/' + out_prefix 45 | 46 | #get the list of table folders 47 | s3_input = 's3://'+bucket+'/'+prefix 48 | url = urlparse(s3_input) 49 | schemas = s3conn.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/').get('CommonPrefixes') 50 | if schemas is None: 51 | logger.error('DMSBucket: '+bucket+' DMSFolder: ' + prefix + ' is empty. Confirm the source DB has data and the DMS replication task is replicating to this S3 location. Review the documention https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html to ensure the DB is setup for CDC.') 52 | sys.exit() 53 | 54 | index = 0 55 | #get folder metadata 56 | for schema in schemas: 57 | tables = s3conn.list_objects(Bucket=bucket, Prefix=schema['Prefix'], Delimiter='/').get('CommonPrefixes') 58 | for table in tables: 59 | full_folder = table['Prefix'] 60 | foldername = full_folder[len(schema['Prefix']):] 61 | path = bucket + '/' + full_folder 62 | logger.warn('Processing table: ' + path) 63 | item = { 64 | 'path': {'S':path}, 65 | 'bucket': {'S':bucket}, 66 | 'prefix': {'S':schema['Prefix']}, 67 | 'folder': {'S':foldername}, 68 | 'PrimaryKey': {'S':'null'}, 69 | 'PartitionKey': {'S':'null'}, 70 | 'LastFullLoadDate': {'S':'1900-01-01 00:00:00'}, 71 | 'LastIncrementalFile': {'S':path + '0.parquet'}, 72 | 'ActiveFlag': {'S':'false'}, 73 | 'out_path': {'S': out_path}} 74 | 75 | #CreateTable if not already present 76 | try: 77 | response1 = ddbconn.describe_table(TableName='DMSCDC_Controller') 78 | except Exception as e: 79 | ddbconn.create_table( 80 | TableName='DMSCDC_Controller', 81 | KeySchema=[{'AttributeName': 'path','KeyType': 'HASH'}], 82 | AttributeDefinitions=[{'AttributeName': 'path','AttributeType': 'S'}], 83 | BillingMode='PAY_PER_REQUEST') 84 | time.sleep(10) 85 | 86 | #Put Item if not already present 87 | try: 88 | response = ddbconn.get_item( 89 | TableName='DMSCDC_Controller', 90 | Key={'path': {'S':path}}) 91 | if 'Item' in response: 92 | item = response['Item'] 93 | else: 94 | ddbconn.put_item( 95 | TableName='DMSCDC_Controller', 96 | Item=item) 97 | except: 98 | ddbconn.put_item( 99 | TableName='DMSCDC_Controller', 100 | Item=item) 101 | 102 | logger.debug(json.dumps(item)) 103 | logger.warn('Bucket: '+bucket+' Path: ' + full_folder) 104 | 105 | 106 | #determine if need to run incremental --> Run incremental --> Update DDB 107 | if item['ActiveFlag']['S'] == 'true': 108 | logger.warn('starting processTable, args: ' + json.dumps(item)) 109 | response = glue.start_job_run(JobName='DMSCDC_ProcessTable',Arguments={'--item':json.dumps(item)}) 110 | logger.debug(json.dumps(response)) 111 | else: 112 | logger.error('Load is not active. 
Update dynamoDB: ' + full_folder) 113 | -------------------------------------------------------------------------------- /DMSCDC_LoadIncremental.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.job import Job 3 | from pyspark.context import SparkContext 4 | from awsglue.context import GlueContext 5 | from pyspark.sql import SQLContext 6 | from pyspark.sql import SparkSession 7 | from pyspark.sql.functions import * 8 | from pyspark.sql.window import Window 9 | from awsglue.utils import getResolvedOptions 10 | import boto3 11 | import urllib 12 | from urllib import parse 13 | 14 | s3conn = boto3.client('s3') 15 | 16 | sparkContext = SparkContext.getOrCreate() 17 | glueContext = GlueContext(sparkContext) 18 | spark = glueContext.spark_session 19 | job = Job(glueContext) 20 | 21 | args = getResolvedOptions(sys.argv, [ 22 | 'JOB_NAME', 23 | 'bucket', 24 | 'prefix', 25 | 'folder', 26 | 'out_path', 27 | 'lastIncrementalFile', 28 | 'newIncrementalFile', 29 | 'primaryKey', 30 | 'partitionKey']) 31 | 32 | job.init(args['JOB_NAME'], args) 33 | s3_outputpath = 's3://' + args['out_path'] + args['folder'] 34 | 35 | last_file = args['lastIncrementalFile'] 36 | curr_file = args['newIncrementalFile'] 37 | primary_keys = args['primaryKey'] 38 | partition_keys = args['partitionKey'] 39 | 40 | # gets list of files to process 41 | inputFileList = [] 42 | results = s3conn.list_objects_v2(Bucket=args['bucket'], Prefix=args['prefix'] +'2', StartAfter=args['lastIncrementalFile']).get('Contents') 43 | for result in results: 44 | if (args['bucket'] + '/' + result['Key'] != last_file): 45 | inputFileList.append('s3://' + args['bucket'] + '/' + result['Key']) 46 | 47 | inputfile = spark.read.parquet(*inputFileList) 48 | 49 | tgtExists = True 50 | try: 51 | test = spark.read.parquet(s3_outputpath) 52 | except: 53 | tgtExists = False 54 | 55 | #No Primary_Keys implies insert only 56 | if primary_keys == "null" or not(tgtExists): 57 | output = inputfile.filter(inputfile.Op=='I') 58 | filelist = [["null"]] 59 | else: 60 | primaryKeys = primary_keys.split(",") 61 | windowRow = Window.partitionBy(primaryKeys).orderBy("sortpath") 62 | 63 | #Loads the target data adding columns for processing 64 | target = spark.read.parquet(s3_outputpath).withColumn("sortpath", lit("0")).withColumn("tgt_filepath",input_file_name()).withColumn("rownum", lit(1)) 65 | input = inputfile.withColumn("sortpath", input_file_name()).withColumn("src_filepath",input_file_name()).withColumn("rownum", row_number().over(windowRow)) 66 | 67 | #determine impacted files 68 | files = target.join(input, primaryKeys, 'inner').select(col("tgt_filepath").alias("list_filepath")).distinct() 69 | filelist = files.collect() 70 | 71 | uniondata = input.unionByName(target.join(files,files.list_filepath==target.tgt_filepath), allowMissingColumns=True) 72 | window = Window.partitionBy(primaryKeys).orderBy(desc("sortpath"), desc("rownum")) 73 | output = uniondata.withColumn('rnk', rank().over(window)).where(col("rnk")==1).where(col("Op")!="D").coalesce(1).select(inputfile.columns) 74 | 75 | # write data by partitions 76 | if partition_keys != "null" : 77 | partitionKeys = partition_keys.split(",") 78 | partitionCount = output.select(partitionKeys).distinct().count() 79 | output.repartition(partitionCount,partitionKeys).write.mode('append').partitionBy(partitionKeys).parquet(s3_outputpath) 80 | else: 81 | output.write.mode('append').parquet(s3_outputpath) 82 | 83 | #delete old files 84 | for row in 
filelist: 85 | if row[0] != "null": 86 | o = parse.urlparse(row[0]) 87 | s3conn.delete_object(Bucket=o.netloc, Key=parse.unquote(o.path)[1:]) 88 | -------------------------------------------------------------------------------- /DMSCDC_LoadInitial.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.job import Job 3 | from pyspark.context import SparkContext 4 | from pyspark.sql import SQLContext 5 | from pyspark.sql import SparkSession 6 | from pyspark.sql.functions import * 7 | from awsglue.utils import getResolvedOptions 8 | from awsglue.context import GlueContext 9 | 10 | args = getResolvedOptions(sys.argv, [ 11 | 'JOB_NAME', 12 | 'bucket', 13 | 'prefix', 14 | 'folder', 15 | 'out_path', 16 | 'partitionKey']) 17 | 18 | 19 | sparkContext = SparkContext.getOrCreate() 20 | glueContext = GlueContext(sparkContext) 21 | spark = glueContext.spark_session 22 | job = Job(glueContext) 23 | job.init(args['JOB_NAME'], args) 24 | 25 | s3_inputpath = 's3://' + args['bucket'] + '/' + args['prefix'] 26 | s3_outputpath = 's3://' + args['out_path'] + args['folder'] 27 | 28 | input = spark.read.parquet(s3_inputpath+"/LOAD*.parquet").withColumn("Op", lit("I")) 29 | 30 | partition_keys = args['partitionKey'] 31 | if partition_keys != "null" : 32 | partitionKeys = partition_keys.split(",") 33 | partitionCount = input.select(partitionKeys).distinct().count() 34 | input.repartition(partitionCount,partitionKeys).write.mode('overwrite').partitionBy(partitionKeys).parquet(s3_outputpath) 35 | else: 36 | input.write.mode('overwrite').parquet(s3_outputpath) 37 | 38 | job.commit() 39 | -------------------------------------------------------------------------------- /DMSCDC_ProcessTable.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import datetime 4 | import boto3 5 | import time 6 | from awsglue.utils import getResolvedOptions 7 | import logging 8 | 9 | handler = logging.StreamHandler(sys.stdout) 10 | handler.setLevel(logging.INFO) 11 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 12 | handler.setFormatter(formatter) 13 | 14 | logging.basicConfig() 15 | logger = logging.getLogger(__name__) 16 | logger.addHandler(handler) 17 | 18 | ddbconn = boto3.client('dynamodb') 19 | glue = boto3.client('glue') 20 | s3conn = boto3.client('s3') 21 | 22 | #Required Parameters 23 | args = getResolvedOptions(sys.argv, ['item']) 24 | item = json.loads(args['item']) 25 | 26 | partitionKey = item['PartitionKey']['S'] 27 | lastFullLoadDate = item['LastFullLoadDate']['S'] 28 | lastIncrementalFile = item['LastIncrementalFile']['S'] 29 | activeFlag = item['ActiveFlag']['S'] 30 | primaryKey = item['PrimaryKey']['S'] 31 | folder = item['folder']['S'] 32 | prefix = item['prefix']['S'] 33 | full_folder = prefix + folder 34 | path = item['path']['S'] 35 | bucket = item['bucket']['S'] 36 | out_path = item['out_path']['S'] 37 | 38 | def runGlueJob(job, args): 39 | logger.info('starting runGlueJob: ' + job + ' args: ' + json.dumps(args)) 40 | response = glue.start_job_run(JobName=job,Arguments=args) 41 | count=90 42 | sec=10 43 | jobId=response['JobRunId'] 44 | i = 0 45 | while i < count: 46 | response = glue.get_job_run(JobName=job, RunId=jobId) 47 | status = response['JobRun']['JobRunState'] 48 | if status == 'SUCCEEDED': 49 | return 0 50 | elif (status == 'RUNNING' or status == 'STARTING' or status == 'STOPPING'): 51 | time.sleep(sec) 52 | i+=1 53 | else: 54 | 
logger.error('Error during loadIncremental execution: ' + status) 55 | return 1 56 | if i == count: 57 | logger.error('Execution timeout: ' + count*sec + ' seconds') 58 | return 1 59 | 60 | logger.warn('starting processTable: ' + path) 61 | loadInitial = False 62 | #determine if need to run initial --> Run Initial --> Update DDB 63 | initialfiles = s3conn.list_objects(Bucket=bucket, Prefix=full_folder+'LOAD').get('Contents') 64 | if initialfiles is not None : 65 | s3FileTS = initialfiles[0]['LastModified'].replace(tzinfo=None) 66 | ddbFileTS = datetime.datetime.strptime(lastFullLoadDate, '%Y-%m-%d %H:%M:%S') 67 | if s3FileTS > ddbFileTS: 68 | message='Starting to process Initial file: ' + full_folder 69 | loadInitial = True 70 | lastFullLoadDate = datetime.datetime.strftime(s3FileTS,'%Y-%m-%d %H:%M:%S') 71 | else: 72 | message='Intial files already processed: ' + full_folder 73 | else: 74 | message='No initial files to process: ' + full_folder 75 | logger.warn(message) 76 | 77 | #Call Initial Glue Job for this source 78 | if loadInitial: 79 | args={ 80 | '--bucket':bucket, 81 | '--prefix':full_folder, 82 | '--folder':folder, 83 | '--out_path':out_path, 84 | '--partitionKey':partitionKey} 85 | if runGlueJob('DMSCDC_LoadInitial',args) != 1: 86 | ddbconn.update_item( 87 | TableName='DMSCDC_Controller', 88 | Key={"path": {"S":path}}, 89 | AttributeUpdates={"LastFullLoadDate": {"Value": {"S": lastFullLoadDate}}}) 90 | 91 | loadIncremental = False 92 | #Get the latest incremental file 93 | incrementalFiles = s3conn.list_objects_v2(Bucket=bucket, Prefix=full_folder+'2', StartAfter=lastIncrementalFile).get('Contents') 94 | if incrementalFiles is not None: 95 | filecount = len(incrementalFiles) 96 | newIncrementalFile = bucket + '/' + incrementalFiles[filecount-1]['Key'] 97 | if newIncrementalFile != lastIncrementalFile: 98 | loadIncremental = True 99 | message = 'Starting to process incremental files: ' + full_folder 100 | else: 101 | message = 'Incremental files already processed: ' + full_folder 102 | else: 103 | message = 'No incremental files to process: ' + full_folder 104 | logger.warn(message) 105 | 106 | #Call Incremental Glue Job for this source 107 | if loadIncremental: 108 | args = { 109 | '--bucket':bucket, 110 | '--prefix':full_folder, 111 | '--folder':folder, 112 | '--out_path':out_path, 113 | '--partitionKey':partitionKey, 114 | '--lastIncrementalFile' : lastIncrementalFile, 115 | '--newIncrementalFile' : newIncrementalFile, 116 | '--primaryKey' : primaryKey 117 | } 118 | if runGlueJob('DMSCDC_LoadIncremental',args) != 1: 119 | ddbconn.update_item( 120 | TableName='DMSCDC_Controller', 121 | Key={"path": {"S":path}}, 122 | AttributeUpdates={"LastIncrementalFile": {"Value": {"S": newIncrementalFile}}}) 123 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Parameters: 3 | DBUser: 4 | Description: Username for the source DB. 5 | Type: String 6 | DBPassword: 7 | Description: Password for the source DB. 8 | Type: String 9 | NoEcho: true 10 | DBName: 11 | Description: Name of the source DB. 
12 | Type: String 13 | SecurityGroup: 14 | Description: Sec Group for RDS & Lambda 15 | Type: AWS::EC2::SecurityGroup::Id 16 | Subnet1: 17 | Description: Subnet-1 for RDS & Lambda 18 | Type: AWS::EC2::Subnet::Id 19 | Subnet2: 20 | Description: Subnet-2 for RDS & Lambda 21 | Type: AWS::EC2::Subnet::Id 22 | Resources: 23 | InitRole: 24 | Type: AWS::IAM::Role 25 | Properties : 26 | AssumeRolePolicyDocument: 27 | Version : 2012-10-17 28 | Statement : 29 | - 30 | Effect : Allow 31 | Principal : 32 | Service : 33 | - lambda.amazonaws.com 34 | Action : 35 | - sts:AssumeRole 36 | Path : / 37 | Policies: 38 | - 39 | PolicyName: LambdaCloudFormationPolicy 40 | PolicyDocument: 41 | Version: 2012-10-17 42 | Statement: 43 | - 44 | Effect: Allow 45 | Action: 46 | - s3:* 47 | Resource: 48 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}" 49 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}" 50 | - !Sub "arn:aws:s3:::cloudformation-custom-resource-response-${AWS::Region}/*" 51 | - !Sub "arn:aws:s3:::cloudformation-waitcondition-${AWS::Region}/*" 52 | ManagedPolicyArns: 53 | - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess 54 | - arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole 55 | RDSDBParameterGroup: 56 | Type: AWS::RDS::DBParameterGroup 57 | Properties: 58 | Description: DMDCDC-MySQL 59 | Family: mysql5.7 60 | Parameters: 61 | binlog_format : 'ROW' 62 | binlog_checksum : 'NONE' 63 | RDSSubnetGroup: 64 | Type: AWS::RDS::DBSubnetGroup 65 | Properties: 66 | DBSubnetGroupDescription: Subnet for DMSCDC-MySql 67 | SubnetIds: 68 | - !Ref Subnet1 69 | - !Ref Subnet2 70 | RDSDB: 71 | Type: AWS::RDS::DBInstance 72 | Properties: 73 | DBParameterGroupName: !Ref RDSDBParameterGroup 74 | DBName: !Ref DBName 75 | Engine: mysql 76 | EngineVersion: 5.7 77 | DBInstanceClass: db.t2.micro 78 | StorageType: gp2 79 | PubliclyAccessible: True 80 | AllocatedStorage: 20 81 | MasterUserPassword: !Ref DBPassword 82 | MasterUsername: !Ref DBUser 83 | DBSubnetGroupName: !Ref RDSSubnetGroup 84 | VPCSecurityGroups: 85 | - !Ref SecurityGroup 86 | RDSLoadLambda: 87 | Type: AWS::Lambda::Function 88 | Properties: 89 | Role: !GetAtt 'InitRole.Arn' 90 | Layers: 91 | - !Ref PymysqlLayer 92 | Handler: index.handler 93 | Runtime: python3.9 94 | Timeout: 30 95 | VpcConfig: 96 | SecurityGroupIds: 97 | - !Ref SecurityGroup 98 | SubnetIds: 99 | - !Ref Subnet1 100 | - !Ref Subnet2 101 | Code: 102 | ZipFile: | 103 | import sys, logging, pymysql, json, urllib3 104 | from pymysql.constants import CLIENT 105 | http = urllib3.PoolManager() 106 | logger = logging.getLogger() 107 | logger.setLevel(logging.INFO) 108 | 109 | def send(event, context, responseStatus, responseData): 110 | responseUrl = event['ResponseURL'] 111 | responseBody = { 112 | 'Status' : responseStatus, 113 | 'Reason' : "See the details in CloudWatch Log Stream: {}".format(context.log_stream_name), 114 | 'PhysicalResourceId' : context.log_stream_name, 115 | 'StackId' : event['StackId'], 116 | 'RequestId' : event['RequestId'], 117 | 'LogicalResourceId' : event['LogicalResourceId'], 118 | 'NoEcho' : False, 119 | 'Data' : responseData 120 | } 121 | json_responseBody = json.dumps(responseBody) 122 | headers = { 123 | 'content-type' : '', 124 | 'content-length' : str(len(json_responseBody)) 125 | } 126 | try: 127 | response = http.request('PUT', responseUrl, headers=headers, body=json_responseBody) 128 | except Exception as e: 129 | print("send(..) 
failed executing http.request(..):", e) 130 | 131 | def handler(event, context): 132 | print("Received event: " + json.dumps(event, indent=2)) 133 | 134 | if event['RequestType'] == 'Delete': 135 | send(event, context, 'SUCCESS', {'Data': 'Delete complete'}) 136 | else: 137 | host = event['ResourceProperties']['host'] 138 | username = event['ResourceProperties']['username'] 139 | password = event['ResourceProperties']['password'] 140 | dbname = event['ResourceProperties']['dbname'] 141 | script = event['ResourceProperties']['script'] 142 | 143 | try: 144 | conn = pymysql.connect(host=host, user=username, passwd=password, db=dbname, connect_timeout=5, client_flag= CLIENT.MULTI_STATEMENTS) 145 | except Exception as e: 146 | print(str(e)) 147 | send(event, context, 'FAILED', {'Data': 'Connect failed'}) 148 | sys.exit() 149 | 150 | try: 151 | response = http.request('GET', script, preload_content=False) 152 | script_content = response.read() 153 | except Exception as e: 154 | print(str(e)) 155 | send(event, context, 'FAILED', {'Data': 'Open script failed'}) 156 | sys.exit() 157 | 158 | try: 159 | cur = conn.cursor() 160 | response = cur.execute(script_content.decode('utf-8')) 161 | conn.commit() 162 | send(event, context, 'SUCCESS', {'Data': 'Insert complete'}) 163 | except Exception as e: 164 | print(str(e)) 165 | send(event, context, 'FAILED', {'Data': 'Insert failed'}) 166 | RDSLoadInit: 167 | Type: Custom::RDSLoadInit 168 | DependsOn: 169 | - RDSLoadLambda 170 | Properties: 171 | ServiceToken: !GetAtt 'RDSLoadLambda.Arn' 172 | host: !GetAtt 'RDSDB.Endpoint.Address' 173 | username: !Ref DBUser 174 | password: !Ref DBPassword 175 | dbname: !Ref DBName 176 | script: https://dmscdc-files.s3.us-west-2.amazonaws.com/DMSCDC_SampleDB_Initial.sql 177 | PymysqlLayer: 178 | Type: AWS::Lambda::LayerVersion 179 | Properties: 180 | CompatibleRuntimes: 181 | - python3.8 182 | - python3.9 183 | Content: 184 | S3Bucket: dmscdc-files 185 | S3Key: pymysql.zip 186 | Outputs: 187 | RDSEndpoint: 188 | Description: Cluster endpoint 189 | Value: !Sub "${RDSDB.Endpoint.Address}:${RDSDB.Endpoint.Port}" 190 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB_Incremental.sql: -------------------------------------------------------------------------------- 1 | use sampledb; 2 | 3 | update product set name = 'Sample Product', dept = 'Sample Dept', category = 'Sample Category' where id = 1001; 4 | delete from product where id = 1002; 5 | call loadorders(1345, current_date); 6 | insert into store values(1009, '125 Technology Dr.', 'Irvine', 'CA', 'US', '92618'); 7 | -------------------------------------------------------------------------------- /DMSCDC_SampleDB_Initial.sql: -------------------------------------------------------------------------------- 1 | create schema if not exists sampledb; 2 | 3 | drop table if exists sampledb.store; 4 | create table sampledb.store( 5 | id int, 6 | address1 varchar(1024), 7 | city varchar(255), 8 | state varchar(2), 9 | countrycode varchar(2), 10 | postcode varchar(10)); 11 | 12 | insert into sampledb.store (id, address1, city, state, countrycode, postcode) values 13 | (1001, '320 W. 
100th Ave, 100, Southgate Shopping Ctr - Anchorage','Anchorage','AK','US','99515'), 14 | (1002, '1005 E Dimond Blvd','Anchorage','AK','US','99515'), 15 | (1003, '1771 East Parks Hwy, Unit #4','Wasilla','AK','US','99654'), 16 | (1004, '345 South Colonial Dr','Alabaster','AL','US','35007'), 17 | (1005, '700 Montgomery Hwy, Suite 100','Vestavia Hills','AL','US','35216'), 18 | (1006, '20701 I-30','Benton','AR','US','72015'), 19 | (1007, '2034 Fayetteville Rd','Van Buren','AR','US','72956'), 20 | (1008, '3640 W. Anthem Way','Anthem','AZ','US','85086'); 21 | 22 | drop table if exists sampledb.product; 23 | create table sampledb.product( 24 | id int , 25 | name varchar(255), 26 | dept varchar(100), 27 | category varchar(100), 28 | price decimal(10,2)); 29 | 30 | insert into sampledb.product (id, name, dept, category, price) values 31 | (1001,'Fire 7','Amazon Devices','Fire Tablets',39), 32 | (1002,'Fire HD 8','Amazon Devices','Fire Tablets',89), 33 | (1003,'Fire HD 10','Amazon Devices','Fire Tablets',119), 34 | (1004,'Fire 7 Kids Edition','Amazon Devices','Fire Tablets',79), 35 | (1005,'Fire 8 Kids Edition','Amazon Devices','Fire Tablets',99), 36 | (1006,'Fire HD 10 Kids Edition','Amazon Devices','Fire Tablets',159), 37 | (1007,'Fire TV Stick','Amazon Devices','Fire TV',49), 38 | (1008,'Fire TV Stick 4K','Amazon Devices','Fire TV',49), 39 | (1009,'Fire TV Cube','Amazon Devices','Fire TV',119), 40 | (1010,'Kindle','Amazon Devices','Kindle E-readers',79), 41 | (1011,'Kindle Paperwhite','Amazon Devices','Kindle E-readers',129), 42 | (1012,'Kindle Oasis','Amazon Devices','Kindle E-readers',279), 43 | (1013,'Echo Dot (3rd Gen)','Amazon Devices','Echo and Alexa',49), 44 | (1014,'Echo Auto','Amazon Devices','Echo and Alexa',24), 45 | (1015,'Echo Show (2nd Gen)','Amazon Devices','Echo and Alexa',229), 46 | (1016,'Echo Plus (2nd Gen)','Amazon Devices','Echo and Alexa',149), 47 | (1017,'Fire TV Recast','Amazon Devices','Echo and Alexa',229), 48 | (1018,'EchoSub Bundle with 2 Echo (2nd Gen)','Amazon Devices','Echo and Alexa',279), 49 | (1019,'Mini Projector','Electronics','Projectors',288), 50 | (1020,'TOUMEI Mini Projector','Electronics','Projectors',300), 51 | (1021,'InFocus IN114XA Projector, DLP XGA 3600 Lumens 3D Ready 2HDMI Speakers','Electronics','Projectors',350), 52 | (1022,'Optoma X343 3600 Lumens XGA DLP Projector with 15,000-hour Lamp Life','Electronics','Projectors',339), 53 | (1023,'TCL 55S517 55-Inch 4K Ultra HD Roku Smart LED TV (2018 Model)','Electronics','TV',379), 54 | (1024,'Samsung 55NU7100 Flat 55” 4K UHD 7 Series Smart TV 2018','Electronics','TV',547), 55 | (1025,'LG Electronics 55UK6300PUE 55-Inch 4K Ultra HD Smart LED TV (2018 Model)','Electronics','TV',496); 56 | 57 | 58 | drop table if exists sampledb.productorder; 59 | create table sampledb.productorder ( 60 | id int NOT NULL, 61 | productid int, 62 | storeid int, 63 | qty int, 64 | soldprice decimal(10,2), 65 | create_dt date); 66 | 67 | DROP PROCEDURE IF EXISTS sampledb.loadorders; 68 | CREATE PROCEDURE sampledb.loadorders ( 69 | IN OrderCnt int, 70 | IN create_dt date 71 | ) 72 | BEGIN 73 | DECLARE i INT DEFAULT 0; 74 | DECLARE maxid INT Default 0; 75 | SELECT coalesce(max(id),0) INTO maxid from sampledb.productorder; 76 | helper: LOOP 77 | IF i 01/22/2022: This solution was updated to resolve some key bugs and implement certain performance related enhancements. See the [change log](CHANGELOG.md) for more details. 
9 |
10 | [Glue Jobs](#glue-jobs)
11 | * [Controller](#controller) - [DMSCDC_Controller.py](./DMSCDC_Controller.py)
12 | * [ProcessTable](#processtable) - [DMSCDC_ProcessTable.py](./DMSCDC_ProcessTable.py)
13 | * [LoadInitial](#loadinitial) - [DMSCDC_LoadInitial.py](./DMSCDC_LoadInitial.py)
14 | * [LoadIncremental](#loadincremental) - [DMSCDC_LoadIncremental.py](./DMSCDC_LoadIncremental.py)
15 |
16 | [Sample Database](#sample-database)
17 | * [Source DB Setup](#source-db-setup) - [DMSCDC_SampleDB.yaml](./DMSCDC_SampleDB.yaml)
18 | * [Incremental Load Script](#incremental-load-script) - [DMSCDC_SampleDB_Incremental.sql](./DMSCDC_SampleDB_Incremental.sql)
19 |
20 | ## Glue Jobs
21 | This solution uses a Python shell job for the Controller. The Controller runs each time the workflow is triggered, but the LoadInitial and LoadIncremental jobs are executed only as needed.
22 |
23 | ### Controller
24 | The Controller job leverages the state table stored in DynamoDB and compares that state with the files in S3. It determines which tables/files need to be processed.
25 |
26 | The first step in the processing is to determine the *schemas* and *tables* located in the DMS raw S3 location and to construct an *item* object representing the default values for each *table*.
27 | ```python
28 | #get the list of table folders
29 | s3_input = 's3://'+bucket+'/'+prefix
30 | url = urlparse(s3_input)
31 | schemas = s3conn.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/').get('CommonPrefixes')
32 | if schemas is None:
33 |     logger.error('DMSBucket: '+bucket+' DMSFolder: ' + prefix + ' is empty. Confirm the source DB has data and the DMS replication task is replicating to this S3 location. Review the documentation https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html to ensure the DB is set up for CDC.')
34 |     sys.exit()
35 |
36 | index = 0
37 | #get folder metadata
38 | for schema in schemas:
39 |     tables = s3conn.list_objects(Bucket=bucket, Prefix=schema['Prefix'], Delimiter='/').get('CommonPrefixes')
40 |     for table in tables:
41 |         full_folder = table['Prefix']
42 |         foldername = full_folder[len(schema['Prefix']):]
43 |         path = bucket + '/' + full_folder
44 |         print('Processing table: ' + path)
45 |         item = {
46 |             'path': {'S':path},
47 |             'bucket': {'S':bucket},
48 |             'prefix': {'S':schema['Prefix']},
49 |             'folder': {'S':foldername},
50 |             'PrimaryKey': {'S':'null'},
51 |             'PartitionKey': {'S':'null'},
52 |             'LastFullLoadDate': {'S':'1900-01-01 00:00:00'},
53 |             'LastIncrementalFile': {'S':path + '0.parquet'},
54 |             'ActiveFlag': {'S':'false'}}
55 | ```
56 |
57 | The next step is to pull the state of each table from the DMSCDC_Controller DynamoDB table. If the DynamoDB table is not present, it will be created. If the row is not present, it will be created with the default values set earlier.
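Once this default row exists, a user reviews the table and activates it by setting `ActiveFlag` (and, optionally, `PrimaryKey` and `PartitionKey`) directly on the DynamoDB item, as described in the note further below. A minimal, hypothetical boto3 example (the bucket and folder values are placeholders, not part of the solution code) is shown here, followed by the state-lookup code itself:

```python
import boto3

ddbconn = boto3.client('dynamodb')

# Hypothetical example: activate the sampledb.product table, keyed on "id",
# with no partition key. The 'path' value must match the item written by the
# controller (bucket + '/' + schema prefix + table folder).
ddbconn.update_item(
    TableName='DMSCDC_Controller',
    Key={'path': {'S': 'my-dms-bucket/sampledb/product/'}},
    AttributeUpdates={
        'ActiveFlag':   {'Value': {'S': 'true'}},
        'PrimaryKey':   {'Value': {'S': 'id'}},
        'PartitionKey': {'Value': {'S': 'null'}}})
```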
58 | ```python
59 | try:
60 |     response1 = ddbconn.describe_table(TableName='DMSCDC_Controller')
61 | except Exception as e:
62 |     ddbconn.create_table(
63 |         TableName='DMSCDC_Controller',
64 |         KeySchema=[{'AttributeName': 'path','KeyType': 'HASH'}],
65 |         AttributeDefinitions=[{'AttributeName': 'path','AttributeType': 'S'}],
66 |         BillingMode='PAY_PER_REQUEST')
67 |     time.sleep(10)
68 |
69 | try:
70 |     response = ddbconn.get_item(
71 |         TableName='DMSCDC_Controller',
72 |         Key={'path': {'S':path}})
73 |     if 'Item' in response:
74 |         item = response['Item']
75 |     else:
76 |         ddbconn.put_item(
77 |             TableName='DMSCDC_Controller',
78 |             Item=item)
79 | except:
80 |     ddbconn.put_item(
81 |         TableName='DMSCDC_Controller',
82 |         Item=item)
83 | ```
84 |
85 | Next, the Controller starts a new Python shell job, `DMSCDC_ProcessTable`, for each table in parallel. That job determines whether an initial and/or incremental load is needed.
86 |
87 | > Note: If the DynamoDB table has ActiveFlag != 'true', the table will be skipped. This is to ensure that a user has reviewed the table and set the appropriate keys. If the PrimaryKey has been set, the incremental load will conduct change detection logic. If the PrimaryKey has NOT been set, the incremental load will only load rows marked for insert. If the PartitionKey has been set, the Spark job will partition the data before it is written.
88 |
89 | ```python
90 | #determine if need to run incremental --> Run incremental --> Update DDB
91 | if item['ActiveFlag']['S'] == 'true':
92 |     logger.warn('starting processTable, args: ' + json.dumps(item))
93 |     response = glue.start_job_run(JobName='DMSCDC_ProcessTable',Arguments={'--item':json.dumps(item)})
94 |     logger.debug(json.dumps(response))
95 | else:
96 |     logger.error('Load is not active. Update dynamoDB: ' + full_folder)
97 | ```
98 |
99 | ### ProcessTable
100 | The first step of the `DMSCDC_ProcessTable` job is to determine whether the initial file needs to be processed by comparing the timestamp of the initial load file stored in the DynamoDB table with the timestamp retrieved from S3. If it is determined that the initial file should be processed, the [LoadInitial](#loadinitial) Glue job is triggered. The ProcessTable job waits for its completion using the `runGlueJob` function. If that job completes successfully, the DynamoDB table is updated with the timestamp of the file which was processed.
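The `runGlueJob` helper referenced here is not reproduced in this walkthrough; its implementation lives in [DMSCDC_ProcessTable.py](./DMSCDC_ProcessTable.py). As a rough sketch only, assuming a boto3 Glue client and inferring the return convention from the calls below (a return value of 1 signals failure), such a helper might look like this:

```python
import time
import boto3

glue = boto3.client('glue')

def runGlueJob(jobName, arguments):
    # Start the job run and poll until it reaches a terminal state.
    run_id = glue.start_job_run(JobName=jobName, Arguments=arguments)['JobRunId']
    while True:
        state = glue.get_job_run(JobName=jobName, RunId=run_id)['JobRun']['JobRunState']
        if state == 'SUCCEEDED':
            return 0
        if state in ('FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
            return 1          # callers test `runGlueJob(...) != 1` before updating DynamoDB
        time.sleep(30)        # still STARTING/RUNNING/STOPPING; wait and poll again
```

The actual helper may differ (for example, it may enforce a maximum wait time); see the source file for the authoritative version.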
101 |
102 | ```python
103 | logger.warn('starting processTable: ' + path)
104 | loadInitial = False
105 | #determine if need to run initial --> Run Initial --> Update DDB
106 | initialfiles = s3conn.list_objects(Bucket=bucket, Prefix=full_folder+'LOAD').get('Contents')
107 | if initialfiles is not None :
108 |     s3FileTS = initialfiles[0]['LastModified'].replace(tzinfo=None)
109 |     ddbFileTS = datetime.datetime.strptime(lastFullLoadDate, '%Y-%m-%d %H:%M:%S')
110 |     if s3FileTS > ddbFileTS:
111 |         message='Starting to process Initial file: ' + full_folder
112 |         loadInitial = True
113 |         lastFullLoadDate = datetime.datetime.strftime(s3FileTS,'%Y-%m-%d %H:%M:%S')
114 |     else:
115 |         message='Initial files already processed: ' + full_folder
116 | else:
117 |     message='No initial files to process: ' + full_folder
118 | logger.warn(message)
119 |
120 | #Call Initial Glue Job for this source
121 | if loadInitial:
122 |     args={
123 |         '--bucket':bucket,
124 |         '--prefix':full_folder,
125 |         '--folder':folder,
126 |         '--out_path':out_path,
127 |         '--partitionKey':partitionKey}
128 |     if runGlueJob('DMSCDC_LoadInitial',args) != 1:
129 |         ddbconn.update_item(
130 |             TableName='DMSCDC_Controller',
131 |             Key={"path": {"S":path}},
132 |             AttributeUpdates={"LastFullLoadDate": {"Value": {"S": lastFullLoadDate}}})
133 | ```
134 | The second step of the `DMSCDC_ProcessTable` job is to determine whether any incremental files need to be processed by comparing the *lastIncrementalFile* stored in the DynamoDB table with the *newIncrementalFile* retrieved from S3. If those two values are different, there are new incremental files and the [LoadIncremental](#loadincremental) Glue job is triggered. The ProcessTable job waits for its completion using the `runGlueJob` function. If that job completes successfully, the DynamoDB table is updated with the *newIncrementalFile*.
135 |
136 | ```python
137 | loadIncremental = False
138 | #Get the latest incremental file
139 | incrementalFiles = s3conn.list_objects_v2(Bucket=bucket, Prefix=full_folder+'2', StartAfter=lastIncrementalFile).get('Contents')
140 | if incrementalFiles is not None:
141 |     filecount = len(incrementalFiles)
142 |     newIncrementalFile = bucket + '/' + incrementalFiles[filecount-1]['Key']
143 |     if newIncrementalFile != lastIncrementalFile:
144 |         loadIncremental = True
145 |         message = 'Starting to process incremental files: ' + full_folder
146 |     else:
147 |         message = 'Incremental files already processed: ' + full_folder
148 | else:
149 |     message = 'No incremental files to process: ' + full_folder
150 | logger.warn(message)
151 |
152 | #Call Incremental Glue Job for this source
153 | if loadIncremental:
154 |     args = {
155 |         '--bucket':bucket,
156 |         '--prefix':full_folder,
157 |         '--folder':foldername,
158 |         '--out_path':out_path,
159 |         '--partitionKey':partitionKey,
160 |         '--lastIncrementalFile' : lastIncrementalFile,
161 |         '--newIncrementalFile' : newIncrementalFile,
162 |         '--primaryKey' : primaryKey
163 |     }
164 |     if runGlueJob('DMSCDC_LoadIncremental',args) != 1:
165 |         ddbconn.update_item(
166 |             TableName='DMSCDC_Controller',
167 |             Key={"path": {"S":path}},
168 |             AttributeUpdates={"LastIncrementalFile": {"Value": {"S": newIncrementalFile}}})
169 |
170 | ```
171 |
172 | ### LoadInitial
173 | The initial load will first read the initial load file into a Spark DataFrame. An additional column, *"Op"*, will be added to the DataFrame with a default value of *I*, representing *inserts*.
This field is added to provide parity between the initial load and the incremental load.
174 | ```python
175 | input = spark.read.parquet(s3_inputpath+"/LOAD*.parquet").withColumn("Op", lit("I"))
176 | ```
177 | If a partition key has been set for this table, the data will be written to the Data Lake in the appropriate partitions.
178 |
179 | > Note: When the initial job runs, any previously loaded data is overwritten.
180 | ```python
181 | partition_keys = args['partitionKey']
182 | if partition_keys != "null" :
183 |     partitionKeys = partition_keys.split(",")
184 |     partitionCount = input.select(partitionKeys).distinct().count()
185 |     input.repartition(partitionCount,partitionKeys).write.mode('overwrite').partitionBy(partitionKeys).parquet(s3_outputpath)
186 | else:
187 |     input.write.mode('overwrite').parquet(s3_outputpath)
188 | ```
189 |
190 | ### LoadIncremental
191 | The incremental load will first read only the new incremental files into a Spark DataFrame.
192 | ```python
193 | last_file = args['lastIncrementalFile']
194 | curr_file = args['newIncrementalFile']
195 | primary_keys = args['primaryKey']
196 | partition_keys = args['partitionKey']
197 |
198 | # gets list of files to process
199 | inputFileList = []
200 | results = s3conn.list_objects_v2(Bucket=args['bucket'], Prefix=args['prefix'] +'2', StartAfter=args['lastIncrementalFile']).get('Contents')
201 | for result in results:
202 |     if (args['bucket'] + '/' + result['Key'] != last_file):
203 |         inputFileList.append('s3://' + args['bucket'] + '/' + result['Key'])
204 |
205 | inputfile = spark.read.parquet(*inputFileList)
206 | ```
207 |
208 | Next, the process will determine whether a primary key has been set in the DynamoDB table. If it has been set, the change detection process will be executed. If it has not, or if this is the first time this table is being loaded, only records marked for insert (where *Op* = 'I') will be prepared to be written.
209 |
210 | ```python
211 | tgtExists = True
212 | try:
213 |     test = spark.read.parquet(s3_outputpath)
214 | except:
215 |     tgtExists = False
216 |
217 | if primary_keys == "null" or not(tgtExists):
218 |     output = inputfile.filter(inputfile.Op=='I')
219 |     filelist = [["null"]]
220 | ```
221 |
222 | Otherwise, to process the changed records, the process will de-duplicate the input records and determine the impacted data lake files.
223 |
224 | First, a few metadata fields are added to the input data and the already loaded data:
225 | * sortpath - For already loaded data, this is set to "0" to ensure that newly loaded data with the same primary key will override it. For new data, this is the filename generated by DMS, which will have a higher sort order than "0".
226 | * filepath - For already loaded data, this is the file name of the data lake file. For new data, this is the filename generated by DMS.
227 | * rownum - This is the relative row number for each primary key. For already loaded data, this will always be 1, as there should only ever be one record per primary key. For new data, this will be a value from 1 to N for each operation against a primary key. When there are multiple operations on the same primary key, the latest operation will have the greatest rownum value (see the short illustration after this list).
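To make the ordering concrete, here is a small, self-contained illustration (toy rows and literal file names, not part of the solution code; it assumes Spark 3.1+ for `unionByName(..., allowMissingColumns=True)`) of how `sortpath` and `rownum`, together with the `rnk` ranking applied later, resolve multiple operations against the same primary key. The real job derives these columns as shown in the next block.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, desc, lit, rank, row_number

spark = SparkSession.builder.getOrCreate()

# One row already in the data lake for id=1001 ...
target = (spark.createDataFrame(
        [("I", 1001, "Fire 7", "part-0000.parquet")],
        ["Op", "id", "name", "tgt_filepath"])
    .withColumn("sortpath", lit("0"))
    .withColumn("rownum", lit(1)))

# ... and two new DMS changes for the same id: an update followed by a delete.
incoming = spark.createDataFrame(
    [("U", 1001, "Fire 7 Tablet", "20220101-001.parquet"),
     ("D", 1001, "Fire 7 Tablet", "20220101-002.parquet")],
    ["Op", "id", "name", "src_filepath"])

windowRow = Window.partitionBy("id").orderBy("src_filepath")
incoming = (incoming
    .withColumn("sortpath", col("src_filepath"))
    .withColumn("rownum", row_number().over(windowRow)))   # update -> 1, delete -> 2

# Union, rank the latest operation first, keep rnk=1, and drop deletes.
window = Window.partitionBy("id").orderBy(desc("sortpath"), desc("rownum"))
resolved = (incoming.unionByName(target, allowMissingColumns=True)
    .withColumn("rnk", rank().over(window))
    .where(col("rnk") == 1)
    .where(col("Op") != "D"))

resolved.show()   # empty: the last operation for id 1001 was a delete
```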
228 |
229 | ```python
230 | primaryKeys = primary_keys.split(",")
231 | windowRow = Window.partitionBy(primaryKeys).orderBy("sortpath")
232 |
233 | #Loads the target data adding columns for processing
234 | target = spark.read.parquet(s3_outputpath).withColumn("sortpath", lit("0")).withColumn("tgt_filepath",input_file_name()).withColumn("rownum", lit(1))
235 | input = inputfile.withColumn("sortpath", input_file_name()).withColumn("src_filepath",input_file_name()).withColumn("rownum", row_number().over(windowRow))
236 |
237 | ```
238 |
239 | To determine which data lake files will be impacted, the input data is joined to the target data and a distinct list of files is captured.
240 |
241 | ```python
242 | #determine impacted files
243 | files = target.join(input, primaryKeys, 'inner').select(col("tgt_filepath").alias("list_filepath")).distinct()
244 | filelist = files.collect()
245 | ```
246 |
247 | Since S3 data is immutable, not only the records which have changed need to be written, but also all the unchanged records from the impacted data lake files. To accomplish this task, a new DataFrame is created which is a union of the source and target data.
248 |
249 | ```python
250 | uniondata = input.unionByName(target.join(files,files.list_filepath==target.tgt_filepath), allowMissingColumns=True)
251 | ```
252 |
253 | To determine which rows to write from the union DataFrame, a *rnk* column is added indicating the last change for a given primary key, leveraging the *sortpath* and *rownum* fields which were added. In addition, only records whose last operation is not a delete (*Op != D*) are kept.
254 |
255 | ```python
256 | window = Window.partitionBy(primaryKeys).orderBy(desc("sortpath"), desc("rownum"))
257 | output = uniondata.withColumn('rnk', rank().over(window)).where(col("rnk")==1).where(col("Op")!="D").coalesce(1).select(inputfile.columns)
258 | ```
259 |
260 | If a partition key has been set for this table, the data will be written to the Data Lake in the appropriate partitions.
261 |
262 | ```python
263 | # write data by partitions
264 | if partition_keys != "null" :
265 |     partitionKeys = partition_keys.split(",")
266 |     partitionCount = output.select(partitionKeys).distinct().count()
267 |     output.repartition(partitionCount,partitionKeys).write.mode('append').partitionBy(partitionKeys).parquet(s3_outputpath)
268 | else:
269 |     output.write.mode('append').parquet(s3_outputpath)
270 | ```
271 |
272 | Finally, the impacted data lake files containing the changed records need to be deleted.
273 |
274 | > Note: Because the updated data is appended before the impacted files are deleted, there is a possibility of users temporarily seeing duplicate data until the old files are actually removed from S3.
275 |
276 | ```python
277 | #delete old files
278 | for row in filelist:
279 |     if row[0] != "null":
280 |         o = parse.urlparse(row[0])
281 |         s3conn.delete_object(Bucket=o.netloc, Key=parse.unquote(o.path)[1:])
282 | ```
283 |
284 | ## Sample Database
285 | Below is a sample MySQL database you can build and use in conjunction with this project. It can be used to demonstrate the end-to-end flow of data into your Data Lake.
286 |
287 | ### Source DB Setup
288 | In order for a DB to be a candidate DMS source, it needs to meet certain [requirements](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html), whether it is managed by Amazon through RDS or is an EC2 or on-premises installation.
Launch the following CloudFormation stack to deploy a MySQL DB that has been pre-configured to meet these requirements and which is *pre-loaded* with the [initial load script](DMSCDC_SampleDB_Initial.sql). Be sure to choose the *same VPC/Subnet/Security Group* as your DMS Replication Instance to ensure they have network connectivity.
289 |
290 | [![](launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=DMSCDCSampleDB&templateURL=https://s3-us-west-2.amazonaws.com/dmscdc-files/DMSCDC_SampleDB.yaml)
291 |
292 | ### Incremental Load Script
293 | Before proceeding with the incremental load, ensure that:
294 | 1. The initial data has arrived in the DMS bucket.
295 | 1. You have configured your DynamoDB table to set the appropriate ActiveFlag, PrimaryKey, and PartitionKey values.
296 | 1. The initial data has arrived in the LAKE bucket. Note: By default the solution runs on an hourly basis, at minute 00. To change the load frequency or minute, modify the **[Glue Trigger](https://console.aws.amazon.com/glue/home?#etl:tab=triggers)**.
297 |
298 | Once the seed data has been loaded, execute the following script in the MySQL database to simulate changed data.
299 | ```sql
300 | update product set name = 'Sample Product', dept = 'Sample Dept', category = 'Sample Category' where id = 1001;
301 | delete from product where id = 1002;
302 | call loadorders(1345, current_date);
303 | insert into store values(1009, '125 Technology Dr.', 'Irvine', 'CA', 'US', '92618');
304 | ```
305 |
306 | Once DMS has processed these changes, you will see new incremental files for `store`, `product`, and `productorder` in the DMS bucket. On the next execution of the Glue job, you will see new and updated files in the LAKE bucket.
307 |
--------------------------------------------------------------------------------
/launch-stack.svg:
--------------------------------------------------------------------------------
1 | Launch Stack
--------------------------------------------------------------------------------
/pymysql.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-big-data-blog-dmscdc-walkthrough/024e354e35747828eadc334fd0b3c626f1055162/pymysql.zip
--------------------------------------------------------------------------------